kvm.vger.kernel.org archive mirror
* [PATCH v5 00/10] Enable shared device assignment
@ 2025-05-20 10:28 Chenyi Qiang
  2025-05-20 10:28 ` [PATCH v5 01/10] memory: Export a helper to get intersection of a MemoryRegionSection with a given range Chenyi Qiang
                   ` (10 more replies)
  0 siblings, 11 replies; 51+ messages in thread
From: Chenyi Qiang @ 2025-05-20 10:28 UTC (permalink / raw)
  To: David Hildenbrand, Alexey Kardashevskiy, Peter Xu, Gupta Pankaj,
	Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth
  Cc: Chenyi Qiang, qemu-devel, kvm, Williams Dan J, Zhao Liu, Baolu Lu,
	Gao Chao, Xu Yilun, Li Xiaoyao

This is the v5 series of the shared device assignment support.

As discussed in the v4 series [1], the GenericStateManager parent class
and PrivateSharedManager child interface were deemed to be the wrong
direction. This series reverts to the original single RamDiscardManager
interface and leaves the co-existence of multiple pairs of state
management as future work. For example, allowing virtio-mem to co-exist
with guest_memfd would require a new framework to combine the
private/shared/discard states [2].

Another change since the last version is the error handling of memory
conversion. Currently, a failure of kvm_convert_memory() causes QEMU to
quit instead of resuming the guest. The complex rollback operation adds
little value and merely introduces code that is difficult to test,
although more errors are likely to appear on conversion paths in the
future, e.g., an unmap failure on a shared-to-private in-place
conversion. This series keeps complex error handling out of the picture
for now and attaches the related handling at the end of the series for
future extension.

Apart from the two items above deferred to future work, there is also
potential optimization work, i.e., using a more memory-efficient
mechanism to track ranges of contiguous states instead of a bitmap [3].
This series still uses a bitmap for simplicity.
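As a rough illustration of the bitmap approach, the sketch below tracks
per-4K-block shared/private state in plain C. The names (StateBitmap,
state_set_range, state_is_shared) are illustrative only, not the series'
actual API:

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative sketch (not QEMU code): one bit per 4K block, 1 = shared. */
#define BLOCK_SIZE 4096UL
#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

typedef struct {
    unsigned long *bitmap;  /* one bit per BLOCK_SIZE block */
    uint64_t nr_blocks;
} StateBitmap;

static void state_set_range(StateBitmap *s, uint64_t offset, uint64_t size,
                            bool shared)
{
    uint64_t first = offset / BLOCK_SIZE;
    uint64_t last = (offset + size - 1) / BLOCK_SIZE;

    for (uint64_t i = first; i <= last; i++) {
        if (shared) {
            s->bitmap[i / BITS_PER_WORD] |= 1UL << (i % BITS_PER_WORD);
        } else {
            s->bitmap[i / BITS_PER_WORD] &= ~(1UL << (i % BITS_PER_WORD));
        }
    }
}

static bool state_is_shared(const StateBitmap *s, uint64_t offset)
{
    uint64_t i = offset / BLOCK_SIZE;
    return s->bitmap[i / BITS_PER_WORD] & (1UL << (i % BITS_PER_WORD));
}
```

A range-based structure (e.g. an interval list) would use less memory for
large runs of identical state, which is what [3] hints at.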
 
The overview of this series:
- Patch 1-3: Preparation patches. These include function exposure and
  some definition changes to return values.
- Patch 4-5: Introduce a new object implementing the RamDiscardManager
  interface and a helper to notify of shared/private state changes.
- Patch 6: Store the new object, including guest_memfd information, in
  RAMBlock. Register the RamDiscardManager instance with the target
  RAMBlock's MemoryRegion so that RamDiscardManager users take the
  RDM-aware path.
- Patch 7: Unlock the coordinated discard so that shared device
  assignment (VFIO) can work with guest_memfd. After this patch, the
  basic device assignment functionality works properly.
- Patch 8-9: Some cleanup work. Move the state-change handling into a
  RamDiscardListener so that it can be invoked together with the VFIO
  listener by the state_change() call. This series drops the priority
  support added in v4 (required by in-place conversion), because the
  conversion path will likely change.
- Patch 10: More complex error handling, including rollback and the
  mixed-state conversion case.

More small changes or details can be found in the individual patches.

---
Original cover letter:

Background
==========
Confidential VMs have two classes of memory: shared and private memory.
Shared memory is accessible from the host/VMM while private memory is
not. Confidential VMs can decide which memory is shared/private and
convert memory between shared/private at runtime.

"guest_memfd" is a new kind of fd whose primary goal is to serve guest
private memory. In the current implementation, shared memory is allocated
with normal methods (e.g. mmap or fallocate) while private memory is
allocated from guest_memfd. When a VM performs memory conversions, QEMU
frees pages on one side, via madvise() or via PUNCH_HOLE on the memfd or
guest_memfd, and allocates new pages on the other side. This causes the
stale IOMMU mapping issue mentioned in [4] when we try to enable shared
device assignment in confidential VMs.

Solution
========
The key to enable shared device assignment is to update the IOMMU mappings
on page conversion. RamDiscardManager, an existing interface currently
utilized by virtio-mem, offers a means to modify IOMMU mappings in
accordance with VM page assignment. Although the operations required in
VFIO for page conversion are similar to those for memory plug/unplug, the
private/shared states differ from discard/populated. We want a mechanism
similar to RamDiscardManager, but one that manages the private and shared
states.

This series introduces a new object, RamBlockAttribute, which implements
the RamDiscardManager interface to manage the private/shared states of
guest_memfd-backed RAMBlocks and notify VFIO of page conversions.

Relationship with in-place page conversion
==========================================
To support 1G pages for guest_memfd [5], the current direction is to
allow mmap() of guest_memfd to userspace so that both private and shared
memory can use the same physical pages as the backend. This in-place page
conversion design eliminates the need to discard pages during
shared/private conversions. However, device assignment will still be
blocked because the in-place page conversion will reject the conversion
when the page is pinned by VFIO.

To address this, the key lies in the sequence of the VFIO map/unmap
operations relative to the page conversion. It can be adjusted to achieve
unmap-before-conversion-to-private and map-after-conversion-to-shared,
ensuring compatibility with guest_memfd.
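A toy model of that ordering (not QEMU/VFIO code; the names are
illustrative and the VFIO hooks are stubs standing in for real iommufd
operations):

```c
#include <stdbool.h>

/* Illustrative model of the required sequencing:
 * unmap-before-conversion-to-private, map-after-conversion-to-shared. */
typedef enum { PAGE_SHARED, PAGE_PRIVATE } PageState;

typedef struct {
    PageState state;
    bool dma_mapped;   /* VFIO DMA mapping present */
} Page;

/* Stub VFIO hooks; a real implementation would issue iommufd ioctls. */
static int vfio_dma_unmap(Page *p)
{
    p->dma_mapped = false;
    return 0;
}

static int vfio_dma_map(Page *p)
{
    if (p->state != PAGE_SHARED) {
        return -1;     /* mapping private memory is rejected */
    }
    p->dma_mapped = true;
    return 0;
}

static int convert_to_private(Page *p)
{
    if (p->dma_mapped && vfio_dma_unmap(p)) {
        return -1;     /* pinned pages would block the conversion */
    }
    p->state = PAGE_PRIVATE;
    return 0;
}

static int convert_to_shared(Page *p)
{
    p->state = PAGE_SHARED;   /* convert first, then map */
    return vfio_dma_map(p);
}
```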

Limitation
==========
One limitation is that VFIO expects the DMA mapping for a specific IOVA
to be mapped and unmapped with the same granularity. The guest may
perform partial conversions, such as converting a small region within a
larger region. To prevent such invalid cases, all operations are
performed at 4K granularity. This could be optimized once the
cut_mapping operation [6] is introduced in the future. We can always
perform a split-before-unmap if a partial conversion happens. If the
split succeeds, the unmap will succeed and be atomic; if the split
fails, the unmap fails.
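A toy model of why 4K granularity sidesteps the problem (not VFIO code;
dma_map/dma_unmap/map_range_4k are illustrative names): an unmap must
match a prior map exactly, so mapping per 4K block lets any 4K-aligned
subrange be unmapped on a partial conversion.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define GRAN 4096ULL
#define MAX_MAPS 64

typedef struct { uint64_t iova, size; bool live; } Mapping;
static Mapping maps[MAX_MAPS];

static int dma_map(uint64_t iova, uint64_t size)
{
    for (int i = 0; i < MAX_MAPS; i++) {
        if (!maps[i].live) {
            maps[i] = (Mapping){ iova, size, true };
            return 0;
        }
    }
    return -1;
}

static int dma_unmap(uint64_t iova, uint64_t size)
{
    /* The unmap must match an existing mapping exactly. */
    for (int i = 0; i < MAX_MAPS; i++) {
        if (maps[i].live && maps[i].iova == iova && maps[i].size == size) {
            maps[i].live = false;
            return 0;
        }
    }
    return -1;
}

static int map_range_4k(uint64_t iova, uint64_t size)
{
    /* Map per 4K block so any aligned subrange can later be unmapped. */
    for (uint64_t off = 0; off < size; off += GRAN) {
        if (dma_map(iova + off, GRAN)) {
            return -1;
        }
    }
    return 0;
}
```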

Testing
=======
This patch series is tested based on TDX patches available at:
KVM: https://github.com/intel/tdx/tree/kvm-coco-queue-snapshot/kvm-coco-queue-snapshot-20250408
QEMU: https://github.com/intel-staging/qemu-tdx/tree/tdx-upstream-snapshot-2025-05-20

Because new features like the cut_mapping operation will only be
supported in iommufd, it is recommended to use the iommufd-backed VFIO
with the following QEMU command line:

qemu-system-x86_64 [...]
    -object iommufd,id=iommufd0 \
    -device vfio-pci,host=XX:XX.X,iommufd=iommufd0

Following the bootup of the TD guest, the guest's IP address becomes
visible, and iperf is able to successfully send and receive data.

Related link
============
[1] https://lore.kernel.org/qemu-devel/20250407074939.18657-1-chenyi.qiang@intel.com/
[2] https://lore.kernel.org/qemu-devel/d1a71e00-243b-4751-ab73-c05a4e090d58@redhat.com/
[3] https://lore.kernel.org/qemu-devel/96ab7fa9-bd7a-444d-aef8-8c9c30439044@redhat.com/
[4] https://lore.kernel.org/qemu-devel/20240423150951.41600-54-pbonzini@redhat.com/
[5] https://lore.kernel.org/kvm/cover.1747264138.git.ackerleytng@google.com/
[6] https://lore.kernel.org/linux-iommu/0-v2-5c26bde5c22d+58b-iommu_pt_jgg@nvidia.com/


Chenyi Qiang (10):
  memory: Export a helper to get intersection of a MemoryRegionSection
    with a given range
  memory: Change memory_region_set_ram_discard_manager() to return the
    result
  memory: Unify the definition of ReplayRamPopulate() and
    ReplayRamDiscard()
  ram-block-attribute: Introduce RamBlockAttribute to manage RAMBlock
    with guest_memfd
  ram-block-attribute: Introduce a helper to notify shared/private state
    changes
  memory: Attach RamBlockAttribute to guest_memfd-backed RAMBlocks
  RAMBlock: Make guest_memfd require coordinate discard
  memory: Change NotifyRamDiscard() definition to return the result
  KVM: Introduce RamDiscardListener for attribute changes during memory
    conversions
  ram-block-attribute: Add more error handling during state changes

 MAINTAINERS                                 |   1 +
 accel/kvm/kvm-all.c                         |  79 ++-
 hw/vfio/listener.c                          |   6 +-
 hw/virtio/virtio-mem.c                      |  83 ++--
 include/system/confidential-guest-support.h |   9 +
 include/system/memory.h                     |  76 ++-
 include/system/ramblock.h                   |  22 +
 migration/ram.c                             |  33 +-
 system/memory.c                             |  22 +-
 system/meson.build                          |   1 +
 system/physmem.c                            |  18 +-
 system/ram-block-attribute.c                | 514 ++++++++++++++++++++
 target/i386/kvm/tdx.c                       |   1 +
 target/i386/sev.c                           |   1 +
 14 files changed, 770 insertions(+), 96 deletions(-)
 create mode 100644 system/ram-block-attribute.c

-- 
2.43.5


^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCH v5 01/10] memory: Export a helper to get intersection of a MemoryRegionSection with a given range
  2025-05-20 10:28 [PATCH v5 00/10] Enable shared device assignment Chenyi Qiang
@ 2025-05-20 10:28 ` Chenyi Qiang
  2025-05-20 10:28 ` [PATCH v5 02/10] memory: Change memory_region_set_ram_discard_manager() to return the result Chenyi Qiang
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 51+ messages in thread
From: Chenyi Qiang @ 2025-05-20 10:28 UTC (permalink / raw)
  To: David Hildenbrand, Alexey Kardashevskiy, Peter Xu, Gupta Pankaj,
	Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth
  Cc: Chenyi Qiang, qemu-devel, kvm, Williams Dan J, Zhao Liu, Baolu Lu,
	Gao Chao, Xu Yilun, Li Xiaoyao

Rename the helper to memory_region_section_intersect_range() to make it
more generic. Meanwhile, define @end as Int128 and replace the related
operations with the int128_*() API, since the helper is exported as a
wider API.

Suggested-by: Alexey Kardashevskiy <aik@amd.com>
Reviewed-by: Alexey Kardashevskiy <aik@amd.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Zhao Liu <zhao1.liu@intel.com>
Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
---
Changes in v5:
    - Indentation change for int128 ops to avoid lines over 80 characters
    - Add two Reviewed-by tags, from Alexey and Zhao

Changes in v4:
    - No change.

Changes in v3:
    - No change

Changes in v2:
    - Make memory_region_section_intersect_range() an inline function.
    - Add Reviewed-by from David
    - Define the @end as Int128 and use the related int128_* ops as a wider
      API (Alexey)
---
 hw/virtio/virtio-mem.c  | 32 +++++---------------------------
 include/system/memory.h | 30 ++++++++++++++++++++++++++++++
 2 files changed, 35 insertions(+), 27 deletions(-)

diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c
index a3d1a676e7..b3c126ea1e 100644
--- a/hw/virtio/virtio-mem.c
+++ b/hw/virtio/virtio-mem.c
@@ -244,28 +244,6 @@ static int virtio_mem_for_each_plugged_range(VirtIOMEM *vmem, void *arg,
     return ret;
 }
 
-/*
- * Adjust the memory section to cover the intersection with the given range.
- *
- * Returns false if the intersection is empty, otherwise returns true.
- */
-static bool virtio_mem_intersect_memory_section(MemoryRegionSection *s,
-                                                uint64_t offset, uint64_t size)
-{
-    uint64_t start = MAX(s->offset_within_region, offset);
-    uint64_t end = MIN(s->offset_within_region + int128_get64(s->size),
-                       offset + size);
-
-    if (end <= start) {
-        return false;
-    }
-
-    s->offset_within_address_space += start - s->offset_within_region;
-    s->offset_within_region = start;
-    s->size = int128_make64(end - start);
-    return true;
-}
-
 typedef int (*virtio_mem_section_cb)(MemoryRegionSection *s, void *arg);
 
 static int virtio_mem_for_each_plugged_section(const VirtIOMEM *vmem,
@@ -287,7 +265,7 @@ static int virtio_mem_for_each_plugged_section(const VirtIOMEM *vmem,
                                       first_bit + 1) - 1;
         size = (last_bit - first_bit + 1) * vmem->block_size;
 
-        if (!virtio_mem_intersect_memory_section(&tmp, offset, size)) {
+        if (!memory_region_section_intersect_range(&tmp, offset, size)) {
             break;
         }
         ret = cb(&tmp, arg);
@@ -319,7 +297,7 @@ static int virtio_mem_for_each_unplugged_section(const VirtIOMEM *vmem,
                                  first_bit + 1) - 1;
         size = (last_bit - first_bit + 1) * vmem->block_size;
 
-        if (!virtio_mem_intersect_memory_section(&tmp, offset, size)) {
+        if (!memory_region_section_intersect_range(&tmp, offset, size)) {
             break;
         }
         ret = cb(&tmp, arg);
@@ -355,7 +333,7 @@ static void virtio_mem_notify_unplug(VirtIOMEM *vmem, uint64_t offset,
     QLIST_FOREACH(rdl, &vmem->rdl_list, next) {
         MemoryRegionSection tmp = *rdl->section;
 
-        if (!virtio_mem_intersect_memory_section(&tmp, offset, size)) {
+        if (!memory_region_section_intersect_range(&tmp, offset, size)) {
             continue;
         }
         rdl->notify_discard(rdl, &tmp);
@@ -371,7 +349,7 @@ static int virtio_mem_notify_plug(VirtIOMEM *vmem, uint64_t offset,
     QLIST_FOREACH(rdl, &vmem->rdl_list, next) {
         MemoryRegionSection tmp = *rdl->section;
 
-        if (!virtio_mem_intersect_memory_section(&tmp, offset, size)) {
+        if (!memory_region_section_intersect_range(&tmp, offset, size)) {
             continue;
         }
         ret = rdl->notify_populate(rdl, &tmp);
@@ -388,7 +366,7 @@ static int virtio_mem_notify_plug(VirtIOMEM *vmem, uint64_t offset,
             if (rdl2 == rdl) {
                 break;
             }
-            if (!virtio_mem_intersect_memory_section(&tmp, offset, size)) {
+            if (!memory_region_section_intersect_range(&tmp, offset, size)) {
                 continue;
             }
             rdl2->notify_discard(rdl2, &tmp);
diff --git a/include/system/memory.h b/include/system/memory.h
index fbbf4cf911..b961c4076a 100644
--- a/include/system/memory.h
+++ b/include/system/memory.h
@@ -1211,6 +1211,36 @@ MemoryRegionSection *memory_region_section_new_copy(MemoryRegionSection *s);
  */
 void memory_region_section_free_copy(MemoryRegionSection *s);
 
+/**
+ * memory_region_section_intersect_range: Adjust the memory section to cover
+ * the intersection with the given range.
+ *
+ * @s: the #MemoryRegionSection to be adjusted
+ * @offset: the offset of the given range in the memory region
+ * @size: the size of the given range
+ *
+ * Returns false if the intersection is empty, otherwise returns true.
+ */
+static inline bool memory_region_section_intersect_range(MemoryRegionSection *s,
+                                                         uint64_t offset,
+                                                         uint64_t size)
+{
+    uint64_t start = MAX(s->offset_within_region, offset);
+    Int128 end = int128_min(int128_add(int128_make64(s->offset_within_region),
+                                       s->size),
+                            int128_add(int128_make64(offset),
+                                       int128_make64(size)));
+
+    if (int128_le(end, int128_make64(start))) {
+        return false;
+    }
+
+    s->offset_within_address_space += start - s->offset_within_region;
+    s->offset_within_region = start;
+    s->size = int128_sub(end, int128_make64(start));
+    return true;
+}
+
 /**
  * memory_region_init: Initialize a memory region
  *
-- 
2.43.5
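To make the intersection semantics above concrete, here is a standalone
model of the helper using plain uint64_t in place of QEMU's
MemoryRegionSection and Int128 (so the Int128 overflow handling is not
reproduced; the Section type is illustrative):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for MemoryRegionSection. */
typedef struct {
    uint64_t offset_within_region;
    uint64_t offset_within_address_space;
    uint64_t size;
} Section;

/* Shrink the section to its intersection with [offset, offset + size);
 * returns false if the intersection is empty. */
static bool section_intersect_range(Section *s, uint64_t offset,
                                    uint64_t size)
{
    uint64_t start = s->offset_within_region > offset
                     ? s->offset_within_region : offset;
    uint64_t end1 = s->offset_within_region + s->size;
    uint64_t end2 = offset + size;
    uint64_t end = end1 < end2 ? end1 : end2;

    if (end <= start) {
        return false;
    }
    s->offset_within_address_space += start - s->offset_within_region;
    s->offset_within_region = start;
    s->size = end - start;
    return true;
}
```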



* [PATCH v5 02/10] memory: Change memory_region_set_ram_discard_manager() to return the result
  2025-05-20 10:28 [PATCH v5 00/10] Enable shared device assignment Chenyi Qiang
  2025-05-20 10:28 ` [PATCH v5 01/10] memory: Export a helper to get intersection of a MemoryRegionSection with a given range Chenyi Qiang
@ 2025-05-20 10:28 ` Chenyi Qiang
  2025-05-26  8:40   ` David Hildenbrand
  2025-05-27  6:56   ` Alexey Kardashevskiy
  2025-05-20 10:28 ` [PATCH v5 03/10] memory: Unify the definition of ReplayRamPopulate() and ReplayRamDiscard() Chenyi Qiang
                   ` (8 subsequent siblings)
  10 siblings, 2 replies; 51+ messages in thread
From: Chenyi Qiang @ 2025-05-20 10:28 UTC (permalink / raw)
  To: David Hildenbrand, Alexey Kardashevskiy, Peter Xu, Gupta Pankaj,
	Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth
  Cc: Chenyi Qiang, qemu-devel, kvm, Williams Dan J, Zhao Liu, Baolu Lu,
	Gao Chao, Xu Yilun, Li Xiaoyao

Modify memory_region_set_ram_discard_manager() to return -EBUSY if a
RamDiscardManager is already set in the MemoryRegion. The caller must
handle this failure, such as having virtio-mem undo its actions and fail
the realize() process. Opportunistically move the call earlier to avoid
complex error handling.

This change is beneficial when introducing a new RamDiscardManager
instance besides virtio-mem. After
ram_block_coordinated_discard_require(true) unlocks all
RamDiscardManager instances, only one instance is allowed to be set per
MemoryRegion at present.

Suggested-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
---
Changes in v5:
    - Nit in commit message (return false -> -EBUSY)
    - Add set_ram_discard_manager(NULL) when ram_block_discard_range()
      fails.

Changes in v4:
    - No change.

Changes in v3:
    - Move set_ram_discard_manager() up to avoid a g_free()
    - Clean up set_ram_discard_manager() definition

Changes in v2:
    - newly added.
---
 hw/virtio/virtio-mem.c  | 30 +++++++++++++++++-------------
 include/system/memory.h |  6 +++---
 system/memory.c         | 10 +++++++---
 3 files changed, 27 insertions(+), 19 deletions(-)

diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c
index b3c126ea1e..2e491e8c44 100644
--- a/hw/virtio/virtio-mem.c
+++ b/hw/virtio/virtio-mem.c
@@ -1047,6 +1047,17 @@ static void virtio_mem_device_realize(DeviceState *dev, Error **errp)
         return;
     }
 
+    /*
+     * Set ourselves as RamDiscardManager before the plug handler maps the
+     * memory region and exposes it via an address space.
+     */
+    if (memory_region_set_ram_discard_manager(&vmem->memdev->mr,
+                                              RAM_DISCARD_MANAGER(vmem))) {
+        error_setg(errp, "Failed to set RamDiscardManager");
+        ram_block_coordinated_discard_require(false);
+        return;
+    }
+
     /*
      * We don't know at this point whether shared RAM is migrated using
      * QEMU or migrated using the file content. "x-ignore-shared" will be
@@ -1061,6 +1072,7 @@ static void virtio_mem_device_realize(DeviceState *dev, Error **errp)
         ret = ram_block_discard_range(rb, 0, qemu_ram_get_used_length(rb));
         if (ret) {
             error_setg_errno(errp, -ret, "Unexpected error discarding RAM");
+            memory_region_set_ram_discard_manager(&vmem->memdev->mr, NULL);
             ram_block_coordinated_discard_require(false);
             return;
         }
@@ -1122,13 +1134,6 @@ static void virtio_mem_device_realize(DeviceState *dev, Error **errp)
     vmem->system_reset = VIRTIO_MEM_SYSTEM_RESET(obj);
     vmem->system_reset->vmem = vmem;
     qemu_register_resettable(obj);
-
-    /*
-     * Set ourselves as RamDiscardManager before the plug handler maps the
-     * memory region and exposes it via an address space.
-     */
-    memory_region_set_ram_discard_manager(&vmem->memdev->mr,
-                                          RAM_DISCARD_MANAGER(vmem));
 }
 
 static void virtio_mem_device_unrealize(DeviceState *dev)
@@ -1136,12 +1141,6 @@ static void virtio_mem_device_unrealize(DeviceState *dev)
     VirtIODevice *vdev = VIRTIO_DEVICE(dev);
     VirtIOMEM *vmem = VIRTIO_MEM(dev);
 
-    /*
-     * The unplug handler unmapped the memory region, it cannot be
-     * found via an address space anymore. Unset ourselves.
-     */
-    memory_region_set_ram_discard_manager(&vmem->memdev->mr, NULL);
-
     qemu_unregister_resettable(OBJECT(vmem->system_reset));
     object_unref(OBJECT(vmem->system_reset));
 
@@ -1154,6 +1153,11 @@ static void virtio_mem_device_unrealize(DeviceState *dev)
     virtio_del_queue(vdev, 0);
     virtio_cleanup(vdev);
     g_free(vmem->bitmap);
+    /*
+     * The unplug handler unmapped the memory region, it cannot be
+     * found via an address space anymore. Unset ourselves.
+     */
+    memory_region_set_ram_discard_manager(&vmem->memdev->mr, NULL);
     ram_block_coordinated_discard_require(false);
 }
 
diff --git a/include/system/memory.h b/include/system/memory.h
index b961c4076a..896948deb1 100644
--- a/include/system/memory.h
+++ b/include/system/memory.h
@@ -2499,13 +2499,13 @@ static inline bool memory_region_has_ram_discard_manager(MemoryRegion *mr)
  *
  * This function must not be called for a mapped #MemoryRegion, a #MemoryRegion
  * that does not cover RAM, or a #MemoryRegion that already has a
- * #RamDiscardManager assigned.
+ * #RamDiscardManager assigned. Return 0 if the rdm is set successfully.
  *
  * @mr: the #MemoryRegion
  * @rdm: #RamDiscardManager to set
  */
-void memory_region_set_ram_discard_manager(MemoryRegion *mr,
-                                           RamDiscardManager *rdm);
+int memory_region_set_ram_discard_manager(MemoryRegion *mr,
+                                          RamDiscardManager *rdm);
 
 /**
  * memory_region_find: translate an address/size relative to a
diff --git a/system/memory.c b/system/memory.c
index 63b983efcd..b45b508dce 100644
--- a/system/memory.c
+++ b/system/memory.c
@@ -2106,12 +2106,16 @@ RamDiscardManager *memory_region_get_ram_discard_manager(MemoryRegion *mr)
     return mr->rdm;
 }
 
-void memory_region_set_ram_discard_manager(MemoryRegion *mr,
-                                           RamDiscardManager *rdm)
+int memory_region_set_ram_discard_manager(MemoryRegion *mr,
+                                          RamDiscardManager *rdm)
 {
     g_assert(memory_region_is_ram(mr));
-    g_assert(!rdm || !mr->rdm);
+    if (mr->rdm && rdm) {
+        return -EBUSY;
+    }
+
     mr->rdm = rdm;
+    return 0;
 }
 
 uint64_t ram_discard_manager_get_min_granularity(const RamDiscardManager *rdm,
-- 
2.43.5
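The new set/unset contract above can be modeled in a few lines (not QEMU
code; Region and region_set_rdm are illustrative stand-ins for
MemoryRegion and memory_region_set_ram_discard_manager()): setting a
second manager fails with -EBUSY, while clearing with NULL and then
setting again succeeds.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Simplified stand-in for MemoryRegion. */
typedef struct { void *rdm; } Region;

static int region_set_rdm(Region *r, void *rdm)
{
    if (r->rdm && rdm) {
        return -EBUSY;   /* only one RamDiscardManager per region */
    }
    r->rdm = rdm;
    return 0;
}
```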



* [PATCH v5 03/10] memory: Unify the definition of ReplayRamPopulate() and ReplayRamDiscard()
  2025-05-20 10:28 [PATCH v5 00/10] Enable shared device assignment Chenyi Qiang
  2025-05-20 10:28 ` [PATCH v5 01/10] memory: Export a helper to get intersection of a MemoryRegionSection with a given range Chenyi Qiang
  2025-05-20 10:28 ` [PATCH v5 02/10] memory: Change memory_region_set_ram_discard_manager() to return the result Chenyi Qiang
@ 2025-05-20 10:28 ` Chenyi Qiang
  2025-05-26  8:42   ` David Hildenbrand
                     ` (2 more replies)
  2025-05-20 10:28 ` [PATCH v5 04/10] ram-block-attribute: Introduce RamBlockAttribute to manage RAMBlock with guest_memfd Chenyi Qiang
                   ` (7 subsequent siblings)
  10 siblings, 3 replies; 51+ messages in thread
From: Chenyi Qiang @ 2025-05-20 10:28 UTC (permalink / raw)
  To: David Hildenbrand, Alexey Kardashevskiy, Peter Xu, Gupta Pankaj,
	Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth
  Cc: Chenyi Qiang, qemu-devel, kvm, Williams Dan J, Zhao Liu, Baolu Lu,
	Gao Chao, Xu Yilun, Li Xiaoyao

Update the ReplayRamDiscard() function to return a result, and unify
ReplayRamPopulate() and ReplayRamDiscard() into ReplayRamDiscardState(),
since their definitions are now identical. This unification simplifies
related structures such as VirtIOMEMReplayData, making the code cleaner.

Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
---
Changes in v5:
    - Rename ReplayRamStateChange to ReplayRamDiscardState (David)
    - return data->fn(s, data->opaque) instead of 0 in
      virtio_mem_rdm_replay_discarded_cb(). (Alexey)

Changes in v4:
    - Modify the commit message. We won't use Replay() operation when
      doing the attribute change like v3.

Changes in v3:
    - Newly added.
---
 hw/virtio/virtio-mem.c  | 21 ++++++++++-----------
 include/system/memory.h | 36 +++++++++++++++++++-----------------
 migration/ram.c         |  5 +++--
 system/memory.c         | 12 ++++++------
 4 files changed, 38 insertions(+), 36 deletions(-)

diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c
index 2e491e8c44..c46f6f9c3e 100644
--- a/hw/virtio/virtio-mem.c
+++ b/hw/virtio/virtio-mem.c
@@ -1732,7 +1732,7 @@ static bool virtio_mem_rdm_is_populated(const RamDiscardManager *rdm,
 }
 
 struct VirtIOMEMReplayData {
-    void *fn;
+    ReplayRamDiscardState fn;
     void *opaque;
 };
 
@@ -1740,12 +1740,12 @@ static int virtio_mem_rdm_replay_populated_cb(MemoryRegionSection *s, void *arg)
 {
     struct VirtIOMEMReplayData *data = arg;
 
-    return ((ReplayRamPopulate)data->fn)(s, data->opaque);
+    return data->fn(s, data->opaque);
 }
 
 static int virtio_mem_rdm_replay_populated(const RamDiscardManager *rdm,
                                            MemoryRegionSection *s,
-                                           ReplayRamPopulate replay_fn,
+                                           ReplayRamDiscardState replay_fn,
                                            void *opaque)
 {
     const VirtIOMEM *vmem = VIRTIO_MEM(rdm);
@@ -1764,14 +1764,13 @@ static int virtio_mem_rdm_replay_discarded_cb(MemoryRegionSection *s,
 {
     struct VirtIOMEMReplayData *data = arg;
 
-    ((ReplayRamDiscard)data->fn)(s, data->opaque);
-    return 0;
+    return data->fn(s, data->opaque);
 }
 
-static void virtio_mem_rdm_replay_discarded(const RamDiscardManager *rdm,
-                                            MemoryRegionSection *s,
-                                            ReplayRamDiscard replay_fn,
-                                            void *opaque)
+static int virtio_mem_rdm_replay_discarded(const RamDiscardManager *rdm,
+                                           MemoryRegionSection *s,
+                                           ReplayRamDiscardState replay_fn,
+                                           void *opaque)
 {
     const VirtIOMEM *vmem = VIRTIO_MEM(rdm);
     struct VirtIOMEMReplayData data = {
@@ -1780,8 +1779,8 @@ static void virtio_mem_rdm_replay_discarded(const RamDiscardManager *rdm,
     };
 
     g_assert(s->mr == &vmem->memdev->mr);
-    virtio_mem_for_each_unplugged_section(vmem, s, &data,
-                                          virtio_mem_rdm_replay_discarded_cb);
+    return virtio_mem_for_each_unplugged_section(vmem, s, &data,
+                                                 virtio_mem_rdm_replay_discarded_cb);
 }
 
 static void virtio_mem_rdm_register_listener(RamDiscardManager *rdm,
diff --git a/include/system/memory.h b/include/system/memory.h
index 896948deb1..83b28551c4 100644
--- a/include/system/memory.h
+++ b/include/system/memory.h
@@ -575,8 +575,8 @@ static inline void ram_discard_listener_init(RamDiscardListener *rdl,
     rdl->double_discard_supported = double_discard_supported;
 }
 
-typedef int (*ReplayRamPopulate)(MemoryRegionSection *section, void *opaque);
-typedef void (*ReplayRamDiscard)(MemoryRegionSection *section, void *opaque);
+typedef int (*ReplayRamDiscardState)(MemoryRegionSection *section,
+                                     void *opaque);
 
 /*
  * RamDiscardManagerClass:
@@ -650,36 +650,38 @@ struct RamDiscardManagerClass {
     /**
      * @replay_populated:
      *
-     * Call the #ReplayRamPopulate callback for all populated parts within the
-     * #MemoryRegionSection via the #RamDiscardManager.
+     * Call the #ReplayRamDiscardState callback for all populated parts within
+     * the #MemoryRegionSection via the #RamDiscardManager.
      *
      * In case any call fails, no further calls are made.
      *
      * @rdm: the #RamDiscardManager
      * @section: the #MemoryRegionSection
-     * @replay_fn: the #ReplayRamPopulate callback
+     * @replay_fn: the #ReplayRamDiscardState callback
      * @opaque: pointer to forward to the callback
      *
      * Returns 0 on success, or a negative error if any notification failed.
      */
     int (*replay_populated)(const RamDiscardManager *rdm,
                             MemoryRegionSection *section,
-                            ReplayRamPopulate replay_fn, void *opaque);
+                            ReplayRamDiscardState replay_fn, void *opaque);
 
     /**
      * @replay_discarded:
      *
-     * Call the #ReplayRamDiscard callback for all discarded parts within the
-     * #MemoryRegionSection via the #RamDiscardManager.
+     * Call the #ReplayRamDiscardState callback for all discarded parts within
+     * the #MemoryRegionSection via the #RamDiscardManager.
      *
      * @rdm: the #RamDiscardManager
      * @section: the #MemoryRegionSection
-     * @replay_fn: the #ReplayRamDiscard callback
+     * @replay_fn: the #ReplayRamDiscardState callback
      * @opaque: pointer to forward to the callback
+     *
+     * Returns 0 on success, or a negative error if any notification failed.
      */
-    void (*replay_discarded)(const RamDiscardManager *rdm,
-                             MemoryRegionSection *section,
-                             ReplayRamDiscard replay_fn, void *opaque);
+    int (*replay_discarded)(const RamDiscardManager *rdm,
+                            MemoryRegionSection *section,
+                            ReplayRamDiscardState replay_fn, void *opaque);
 
     /**
      * @register_listener:
@@ -722,13 +724,13 @@ bool ram_discard_manager_is_populated(const RamDiscardManager *rdm,
 
 int ram_discard_manager_replay_populated(const RamDiscardManager *rdm,
                                          MemoryRegionSection *section,
-                                         ReplayRamPopulate replay_fn,
+                                         ReplayRamDiscardState replay_fn,
                                          void *opaque);
 
-void ram_discard_manager_replay_discarded(const RamDiscardManager *rdm,
-                                          MemoryRegionSection *section,
-                                          ReplayRamDiscard replay_fn,
-                                          void *opaque);
+int ram_discard_manager_replay_discarded(const RamDiscardManager *rdm,
+                                         MemoryRegionSection *section,
+                                         ReplayRamDiscardState replay_fn,
+                                         void *opaque);
 
 void ram_discard_manager_register_listener(RamDiscardManager *rdm,
                                            RamDiscardListener *rdl,
diff --git a/migration/ram.c b/migration/ram.c
index e12913b43e..c004f37060 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -848,8 +848,8 @@ static inline bool migration_bitmap_clear_dirty(RAMState *rs,
     return ret;
 }
 
-static void dirty_bitmap_clear_section(MemoryRegionSection *section,
-                                       void *opaque)
+static int dirty_bitmap_clear_section(MemoryRegionSection *section,
+                                      void *opaque)
 {
     const hwaddr offset = section->offset_within_region;
     const hwaddr size = int128_get64(section->size);
@@ -868,6 +868,7 @@ static void dirty_bitmap_clear_section(MemoryRegionSection *section,
     }
     *cleared_bits += bitmap_count_one_with_offset(rb->bmap, start, npages);
     bitmap_clear(rb->bmap, start, npages);
+    return 0;
 }
 
 /*
diff --git a/system/memory.c b/system/memory.c
index b45b508dce..de45fbdd3f 100644
--- a/system/memory.c
+++ b/system/memory.c
@@ -2138,7 +2138,7 @@ bool ram_discard_manager_is_populated(const RamDiscardManager *rdm,
 
 int ram_discard_manager_replay_populated(const RamDiscardManager *rdm,
                                          MemoryRegionSection *section,
-                                         ReplayRamPopulate replay_fn,
+                                         ReplayRamDiscardState replay_fn,
                                          void *opaque)
 {
     RamDiscardManagerClass *rdmc = RAM_DISCARD_MANAGER_GET_CLASS(rdm);
@@ -2147,15 +2147,15 @@ int ram_discard_manager_replay_populated(const RamDiscardManager *rdm,
     return rdmc->replay_populated(rdm, section, replay_fn, opaque);
 }
 
-void ram_discard_manager_replay_discarded(const RamDiscardManager *rdm,
-                                          MemoryRegionSection *section,
-                                          ReplayRamDiscard replay_fn,
-                                          void *opaque)
+int ram_discard_manager_replay_discarded(const RamDiscardManager *rdm,
+                                         MemoryRegionSection *section,
+                                         ReplayRamDiscardState replay_fn,
+                                         void *opaque)
 {
     RamDiscardManagerClass *rdmc = RAM_DISCARD_MANAGER_GET_CLASS(rdm);
 
     g_assert(rdmc->replay_discarded);
-    rdmc->replay_discarded(rdm, section, replay_fn, opaque);
+    return rdmc->replay_discarded(rdm, section, replay_fn, opaque);
 }
 
 void ram_discard_manager_register_listener(RamDiscardManager *rdm,
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v5 04/10] ram-block-attribute: Introduce RamBlockAttribute to manage RAMBlock with guest_memfd
  2025-05-20 10:28 [PATCH v5 00/10] Enable shared device assignment Chenyi Qiang
                   ` (2 preceding siblings ...)
  2025-05-20 10:28 ` [PATCH v5 03/10] memory: Unify the definition of ReplayRamPopulate() and ReplayRamDiscard() Chenyi Qiang
@ 2025-05-20 10:28 ` Chenyi Qiang
  2025-05-26  9:01   ` David Hildenbrand
  2025-05-20 10:28 ` [PATCH v5 05/10] ram-block-attribute: Introduce a helper to notify shared/private state changes Chenyi Qiang
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 51+ messages in thread
From: Chenyi Qiang @ 2025-05-20 10:28 UTC (permalink / raw)
  To: David Hildenbrand, Alexey Kardashevskiy, Peter Xu, Gupta Pankaj,
	Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth
  Cc: Chenyi Qiang, qemu-devel, kvm, Williams Dan J, Zhao Liu, Baolu Lu,
	Gao Chao, Xu Yilun, Li Xiaoyao

Commit 852f0048f3 ("RAMBlock: make guest_memfd require uncoordinated
discard") highlighted that subsystems like VFIO may disable RAM block
discard. However, guest_memfd relies on discard operations for page
conversion between private and shared memory, which can leave stale
IOMMU mappings when hardware devices are assigned to confidential VMs
via shared memory. To address this and allow shared device assignment,
it is crucial to ensure that the VFIO system refreshes its IOMMU
mappings.

RamDiscardManager is an existing interface (used by virtio-mem) to
adjust VFIO mappings in relation to VM page assignment. Effectively,
page conversion is similar to hot-removing a page in one mode and
adding it back in the other, so similar actions are required for page
conversion events. Introduce the RamDiscardManager to guest_memfd to
facilitate this process.

Since guest_memfd is not an object, it cannot directly implement the
RamDiscardManager interface. Implementing it in HostMemoryBackend is
not appropriate because guest_memfd is per RAMBlock, and some RAMBlocks
have a memory backend while others do not. Notably, virtual BIOS
RAMBlocks using memory_region_init_ram_guest_memfd() do not have a
backend.

To manage RAMBlocks with guest_memfd, define a new object named
RamBlockAttribute to implement the RamDiscardManager interface. This
object stores guest_memfd information, such as the bitmap for shared
memory, and handles page conversion notifications. In the context of
RamDiscardManager, the shared state is analogous to populated, and the
private state is treated as discarded. The memory state is tracked at
host page size granularity, as the minimum memory conversion size can
be one page per request. Additionally, VFIO expects the DMA mapping for
a specific iova to be mapped and unmapped with the same granularity.
Confidential VMs may perform partial conversions, such as conversions
on small regions within larger regions. To prevent such invalid cases,
and until cut_mapping operation support is available, all operations
are performed with 4K granularity.
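
The per-host-page bitmap tracking described above can be illustrated
with a minimal sketch. Note this is plain C with hand-rolled helpers,
not QEMU code: BLOCK_SIZE stands in for qemu_real_host_page_size(), and
bitmap_set_range()/bitmap_all_set() stand in for QEMU's bitmap API.

```c
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative stand-ins, not QEMU APIs. One bit per host page:
 * 1 = shared (populated), 0 = private (discarded). */
#define BLOCK_SIZE 4096UL
#define BITS_PER_ULONG (sizeof(unsigned long) * CHAR_BIT)

/* Mark pages [start, start + n) as shared. */
static void bitmap_set_range(unsigned long *map, uint64_t start, uint64_t n)
{
    for (uint64_t i = start; i < start + n; i++) {
        map[i / BITS_PER_ULONG] |= 1UL << (i % BITS_PER_ULONG);
    }
}

/* True iff every page in [start, start + n) is shared. */
static bool bitmap_all_set(const unsigned long *map, uint64_t start,
                           uint64_t n)
{
    for (uint64_t i = start; i < start + n; i++) {
        if (!(map[i / BITS_PER_ULONG] & (1UL << (i % BITS_PER_ULONG)))) {
            return false;
        }
    }
    return true;
}
```

For example, converting a 16K range at offset 8K to shared sets bits
2..5 while the surrounding pages stay private, which is the per-page
granularity the conversion and VFIO DMA map/unmap paths rely on.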

Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
---
Changes in v5:
    - Revert to using the RamDiscardManager interface instead of
      introducing a new class hierarchy to manage private/shared state,
      and keep the new name RamBlockAttribute (compared with
      MemoryAttributeManager in v3).
    - Use the *simple* version of object_define and object_declare,
      since the state_change() function is changed to an exported
      function instead of a virtual function in a later patch.
    - Move the introduction of the RamBlockAttribute field to this
      patch and rename it to ram_shared. (Alexey)
    - Call exit() when register/unregister fails. (Zhao)
    - Add the ram-block-attribute.c to Memory API related part in
      MAINTAINERS.

Changes in v4:
    - Change the name from memory-attribute-manager to
      ram-block-attribute.
    - Implement the newly-introduced PrivateSharedManager instead of
      RamDiscardManager and change related commit message.
    - Define the new object in ramblock.h instead of adding a new file.

Changes in v3:
    - Some rename (bitmap_size->shared_bitmap_size,
      first_one/zero_bit->first_bit, etc.)
    - Change shared_bitmap_size from uint32_t to unsigned
    - Return mgr->mr->ram_block->page_size in get_block_size()
    - Move set_ram_discard_manager() up to avoid a g_free() in failure
      case.
    - Add const for the memory_attribute_manager_get_block_size()
    - Unify the ReplayRamPopulate and ReplayRamDiscard and related
      callback.

Changes in v2:
    - Rename the object name to MemoryAttributeManager
    - Rename the bitmap to shared_bitmap to make it more clear.
    - Remove block_size field and get it from a helper. In future, we
      can get the page_size from RAMBlock if necessary.
    - Remove the unnecessary "struct" before GuestMemfdReplayData
    - Remove the unnecessary g_free() for the bitmap
    - Add some error reporting when the callback fails for a
      populated/discarded section.
    - Move the realize()/unrealize() definition to this patch.
---
 MAINTAINERS                  |   1 +
 include/system/ramblock.h    |  20 +++
 system/meson.build           |   1 +
 system/ram-block-attribute.c | 311 +++++++++++++++++++++++++++++++++++
 4 files changed, 333 insertions(+)
 create mode 100644 system/ram-block-attribute.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 6dacd6d004..3b4947dc74 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3149,6 +3149,7 @@ F: system/memory.c
 F: system/memory_mapping.c
 F: system/physmem.c
 F: system/memory-internal.h
+F: system/ram-block-attribute.c
 F: scripts/coccinelle/memory-region-housekeeping.cocci
 
 Memory devices
diff --git a/include/system/ramblock.h b/include/system/ramblock.h
index d8a116ba99..09255e8495 100644
--- a/include/system/ramblock.h
+++ b/include/system/ramblock.h
@@ -22,6 +22,10 @@
 #include "exec/cpu-common.h"
 #include "qemu/rcu.h"
 #include "exec/ramlist.h"
+#include "system/hostmem.h"
+
+#define TYPE_RAM_BLOCK_ATTRIBUTE "ram-block-attribute"
+OBJECT_DECLARE_SIMPLE_TYPE(RamBlockAttribute, RAM_BLOCK_ATTRIBUTE)
 
 struct RAMBlock {
     struct rcu_head rcu;
@@ -42,6 +46,8 @@ struct RAMBlock {
     int fd;
     uint64_t fd_offset;
     int guest_memfd;
+    /* A set bit in the ram_shared bitmap means the RAM is shared */
+    RamBlockAttribute *ram_shared;
     size_t page_size;
     /* dirty bitmap used during migration */
     unsigned long *bmap;
@@ -91,4 +97,18 @@ struct RAMBlock {
     ram_addr_t postcopy_length;
 };
 
+struct RamBlockAttribute {
+    Object parent;
+
+    MemoryRegion *mr;
+
+    unsigned bitmap_size;
+    unsigned long *bitmap;
+
+    QLIST_HEAD(, RamDiscardListener) rdl_list;
+};
+
+RamBlockAttribute *ram_block_attribute_create(MemoryRegion *mr);
+void ram_block_attribute_destroy(RamBlockAttribute *attr);
+
 #endif
diff --git a/system/meson.build b/system/meson.build
index c2f0082766..107596ce86 100644
--- a/system/meson.build
+++ b/system/meson.build
@@ -17,6 +17,7 @@ libsystem_ss.add(files(
   'dma-helpers.c',
   'globals.c',
   'ioport.c',
+  'ram-block-attribute.c',
   'memory_mapping.c',
   'memory.c',
   'physmem.c',
diff --git a/system/ram-block-attribute.c b/system/ram-block-attribute.c
new file mode 100644
index 0000000000..8d4a24738c
--- /dev/null
+++ b/system/ram-block-attribute.c
@@ -0,0 +1,311 @@
+/*
+ * QEMU ram block attribute
+ *
+ * Copyright Intel
+ *
+ * Author:
+ *      Chenyi Qiang <chenyi.qiang@intel.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory
+ *
+ */
+
+#include "qemu/osdep.h"
+#include "qemu/error-report.h"
+#include "system/ramblock.h"
+
+OBJECT_DEFINE_SIMPLE_TYPE_WITH_INTERFACES(RamBlockAttribute,
+                                          ram_block_attribute,
+                                          RAM_BLOCK_ATTRIBUTE,
+                                          OBJECT,
+                                          { TYPE_RAM_DISCARD_MANAGER },
+                                          { })
+
+static size_t ram_block_attribute_get_block_size(const RamBlockAttribute *attr)
+{
+    /*
+     * Page conversions are performed with a granularity of at least 4K,
+     * or 4K-aligned sizes, so use the host page size as the granularity
+     * to track the memory attribute.
+     */
+    g_assert(attr && attr->mr && attr->mr->ram_block);
+    g_assert(attr->mr->ram_block->page_size == qemu_real_host_page_size());
+    return attr->mr->ram_block->page_size;
+}
+
+
+static bool
+ram_block_attribute_rdm_is_populated(const RamDiscardManager *rdm,
+                                     const MemoryRegionSection *section)
+{
+    const RamBlockAttribute *attr = RAM_BLOCK_ATTRIBUTE(rdm);
+    const int block_size = ram_block_attribute_get_block_size(attr);
+    uint64_t first_bit = section->offset_within_region / block_size;
+    uint64_t last_bit = first_bit + int128_get64(section->size) / block_size - 1;
+    unsigned long first_discard_bit;
+
+    first_discard_bit = find_next_zero_bit(attr->bitmap, last_bit + 1,
+                                           first_bit);
+    return first_discard_bit > last_bit;
+}
+
+typedef int (*ram_block_attribute_section_cb)(MemoryRegionSection *s,
+                                              void *arg);
+
+static int ram_block_attribute_notify_populate_cb(MemoryRegionSection *section,
+                                                   void *arg)
+{
+    RamDiscardListener *rdl = arg;
+
+    return rdl->notify_populate(rdl, section);
+}
+
+static int ram_block_attribute_notify_discard_cb(MemoryRegionSection *section,
+                                                 void *arg)
+{
+    RamDiscardListener *rdl = arg;
+
+    rdl->notify_discard(rdl, section);
+    return 0;
+}
+
+static int
+ram_block_attribute_for_each_populated_section(const RamBlockAttribute *attr,
+                                               MemoryRegionSection *section,
+                                               void *arg,
+                                               ram_block_attribute_section_cb cb)
+{
+    unsigned long first_bit, last_bit;
+    uint64_t offset, size;
+    const int block_size = ram_block_attribute_get_block_size(attr);
+    int ret = 0;
+
+    first_bit = section->offset_within_region / block_size;
+    first_bit = find_next_bit(attr->bitmap, attr->bitmap_size,
+                              first_bit);
+
+    while (first_bit < attr->bitmap_size) {
+        MemoryRegionSection tmp = *section;
+
+        offset = first_bit * block_size;
+        last_bit = find_next_zero_bit(attr->bitmap, attr->bitmap_size,
+                                      first_bit + 1) - 1;
+        size = (last_bit - first_bit + 1) * block_size;
+
+        if (!memory_region_section_intersect_range(&tmp, offset, size)) {
+            break;
+        }
+
+        ret = cb(&tmp, arg);
+        if (ret) {
+            error_report("%s: Failed to notify RAM discard listener: %s",
+                         __func__, strerror(-ret));
+            break;
+        }
+
+        first_bit = find_next_bit(attr->bitmap, attr->bitmap_size,
+                                  last_bit + 2);
+    }
+
+    return ret;
+}
+
+static int
+ram_block_attribute_for_each_discard_section(const RamBlockAttribute *attr,
+                                             MemoryRegionSection *section,
+                                             void *arg,
+                                             ram_block_attribute_section_cb cb)
+{
+    unsigned long first_bit, last_bit;
+    uint64_t offset, size;
+    const int block_size = ram_block_attribute_get_block_size(attr);
+    int ret = 0;
+
+    first_bit = section->offset_within_region / block_size;
+    first_bit = find_next_zero_bit(attr->bitmap, attr->bitmap_size,
+                                   first_bit);
+
+    while (first_bit < attr->bitmap_size) {
+        MemoryRegionSection tmp = *section;
+
+        offset = first_bit * block_size;
+        last_bit = find_next_bit(attr->bitmap, attr->bitmap_size,
+                                 first_bit + 1) - 1;
+        size = (last_bit - first_bit + 1) * block_size;
+
+        if (!memory_region_section_intersect_range(&tmp, offset, size)) {
+            break;
+        }
+
+        ret = cb(&tmp, arg);
+        if (ret) {
+            error_report("%s: Failed to notify RAM discard listener: %s",
+                         __func__, strerror(-ret));
+            break;
+        }
+
+        first_bit = find_next_zero_bit(attr->bitmap,
+                                       attr->bitmap_size,
+                                       last_bit + 2);
+    }
+
+    return ret;
+}
+
+static uint64_t
+ram_block_attribute_rdm_get_min_granularity(const RamDiscardManager *rdm,
+                                            const MemoryRegion *mr)
+{
+    const RamBlockAttribute *attr = RAM_BLOCK_ATTRIBUTE(rdm);
+
+    g_assert(mr == attr->mr);
+    return ram_block_attribute_get_block_size(attr);
+}
+
+static void
+ram_block_attribute_rdm_register_listener(RamDiscardManager *rdm,
+                                          RamDiscardListener *rdl,
+                                          MemoryRegionSection *section)
+{
+    RamBlockAttribute *attr = RAM_BLOCK_ATTRIBUTE(rdm);
+    int ret;
+
+    g_assert(section->mr == attr->mr);
+    rdl->section = memory_region_section_new_copy(section);
+
+    QLIST_INSERT_HEAD(&attr->rdl_list, rdl, next);
+
+    ret = ram_block_attribute_for_each_populated_section(attr, section, rdl,
+                                    ram_block_attribute_notify_populate_cb);
+    if (ret) {
+        error_report("%s: Failed to register RAM discard listener: %s",
+                     __func__, strerror(-ret));
+        exit(1);
+    }
+}
+
+static void
+ram_block_attribute_rdm_unregister_listener(RamDiscardManager *rdm,
+                                            RamDiscardListener *rdl)
+{
+    RamBlockAttribute *attr = RAM_BLOCK_ATTRIBUTE(rdm);
+    int ret;
+
+    g_assert(rdl->section);
+    g_assert(rdl->section->mr == attr->mr);
+
+    if (rdl->double_discard_supported) {
+        rdl->notify_discard(rdl, rdl->section);
+    } else {
+        ret = ram_block_attribute_for_each_populated_section(attr,
+                rdl->section, rdl, ram_block_attribute_notify_discard_cb);
+        if (ret) {
+            error_report("%s: Failed to unregister RAM discard listener: %s",
+                         __func__, strerror(-ret));
+            exit(1);
+        }
+    }
+
+    memory_region_section_free_copy(rdl->section);
+    rdl->section = NULL;
+    QLIST_REMOVE(rdl, next);
+}
+
+typedef struct RamBlockAttributeReplayData {
+    ReplayRamDiscardState fn;
+    void *opaque;
+} RamBlockAttributeReplayData;
+
+static int ram_block_attribute_rdm_replay_cb(MemoryRegionSection *section,
+                                             void *arg)
+{
+    RamBlockAttributeReplayData *data = arg;
+
+    return data->fn(section, data->opaque);
+}
+
+static int
+ram_block_attribute_rdm_replay_populated(const RamDiscardManager *rdm,
+                                         MemoryRegionSection *section,
+                                         ReplayRamDiscardState replay_fn,
+                                         void *opaque)
+{
+    RamBlockAttribute *attr = RAM_BLOCK_ATTRIBUTE(rdm);
+    RamBlockAttributeReplayData data = { .fn = replay_fn, .opaque = opaque };
+
+    g_assert(section->mr == attr->mr);
+    return ram_block_attribute_for_each_populated_section(attr, section, &data,
+                                            ram_block_attribute_rdm_replay_cb);
+}
+
+static int
+ram_block_attribute_rdm_replay_discard(const RamDiscardManager *rdm,
+                                       MemoryRegionSection *section,
+                                       ReplayRamDiscardState replay_fn,
+                                       void *opaque)
+{
+    RamBlockAttribute *attr = RAM_BLOCK_ATTRIBUTE(rdm);
+    RamBlockAttributeReplayData data = { .fn = replay_fn, .opaque = opaque };
+
+    g_assert(section->mr == attr->mr);
+    return ram_block_attribute_for_each_discard_section(attr, section, &data,
+                                            ram_block_attribute_rdm_replay_cb);
+}
+
+RamBlockAttribute *ram_block_attribute_create(MemoryRegion *mr)
+{
+    uint64_t bitmap_size;
+    const int block_size  = qemu_real_host_page_size();
+    RamBlockAttribute *attr;
+    int ret;
+
+    attr = RAM_BLOCK_ATTRIBUTE(object_new(TYPE_RAM_BLOCK_ATTRIBUTE));
+
+    attr->mr = mr;
+    ret = memory_region_set_ram_discard_manager(mr, RAM_DISCARD_MANAGER(attr));
+    if (ret) {
+        object_unref(OBJECT(attr));
+        return NULL;
+    }
+    bitmap_size = ROUND_UP(mr->size, block_size) / block_size;
+    attr->bitmap_size = bitmap_size;
+    attr->bitmap = bitmap_new(bitmap_size);
+
+    return attr;
+}
+
+void ram_block_attribute_destroy(RamBlockAttribute *attr)
+{
+    if (!attr) {
+        return;
+    }
+
+    g_free(attr->bitmap);
+    memory_region_set_ram_discard_manager(attr->mr, NULL);
+    object_unref(OBJECT(attr));
+}
+
+static void ram_block_attribute_init(Object *obj)
+{
+    RamBlockAttribute *attr = RAM_BLOCK_ATTRIBUTE(obj);
+
+    QLIST_INIT(&attr->rdl_list);
+}
+
+static void ram_block_attribute_finalize(Object *obj)
+{
+}
+
+static void ram_block_attribute_class_init(ObjectClass *klass,
+                                           const void *data)
+{
+    RamDiscardManagerClass *rdmc = RAM_DISCARD_MANAGER_CLASS(klass);
+
+    rdmc->get_min_granularity = ram_block_attribute_rdm_get_min_granularity;
+    rdmc->register_listener = ram_block_attribute_rdm_register_listener;
+    rdmc->unregister_listener = ram_block_attribute_rdm_unregister_listener;
+    rdmc->is_populated = ram_block_attribute_rdm_is_populated;
+    rdmc->replay_populated = ram_block_attribute_rdm_replay_populated;
+    rdmc->replay_discarded = ram_block_attribute_rdm_replay_discard;
+}
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v5 05/10] ram-block-attribute: Introduce a helper to notify shared/private state changes
  2025-05-20 10:28 [PATCH v5 00/10] Enable shared device assignment Chenyi Qiang
                   ` (3 preceding siblings ...)
  2025-05-20 10:28 ` [PATCH v5 04/10] ram-block-attribute: Introduce RamBlockAttribute to manage RAMBlock with guest_memfd Chenyi Qiang
@ 2025-05-20 10:28 ` Chenyi Qiang
  2025-05-26  9:02   ` David Hildenbrand
  2025-05-27  7:35   ` Alexey Kardashevskiy
  2025-05-20 10:28 ` [PATCH v5 06/10] memory: Attach RamBlockAttribute to guest_memfd-backed RAMBlocks Chenyi Qiang
                   ` (5 subsequent siblings)
  10 siblings, 2 replies; 51+ messages in thread
From: Chenyi Qiang @ 2025-05-20 10:28 UTC (permalink / raw)
  To: David Hildenbrand, Alexey Kardashevskiy, Peter Xu, Gupta Pankaj,
	Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth
  Cc: Chenyi Qiang, qemu-devel, kvm, Williams Dan J, Zhao Liu, Baolu Lu,
	Gao Chao, Xu Yilun, Li Xiaoyao

A new state_change() helper is introduced for RamBlockAttribute
to efficiently notify all registered RamDiscardListeners, including
VFIO listeners, about memory conversion events in guest_memfd. The VFIO
listener can dynamically DMA map/unmap shared pages based on conversion
types:
- For conversions from shared to private, the VFIO system ensures that
  the shared mapping is removed from the IOMMU.
- For conversions from private to shared, it triggers the population of
  the shared mapping into the IOMMU.

Currently, memory conversion failures cause QEMU to quit instead of
resuming the guest or retrying the operation. Adding more error
handling or rollback mechanisms once conversion failures are allowed
is left as future work. For example, in-place conversion of
guest_memfd could retry the unmap operation during a shared-to-private
conversion. For now, keep the complex error handling out of the
picture, as it is not required:

- If a conversion request is made for a page already in the desired
  state, the helper simply returns success.
- For requests involving a range only partially in the desired state,
  no such scenario exists in practice at present, so simply return an
  error.
- If a conversion request is declined by another system, such as a
  failure from VFIO during notify_to_populated(), the failure is
  returned directly. As for notify_to_discard(), VFIO unmap/unpin
  cannot fail, so no error is returned.

Note that the bitmap status is updated before the callbacks are
invoked, allowing listeners to handle memory based on the latest
status.
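
The decision ladder in the bullets above can be distilled into a small
sketch. This is a hypothetical summary in plain C, not QEMU code: the
RangeState enum and state_change() stand in for the range checks that
the real helper performs against the shared bitmap.

```c
#include <stdbool.h>

/* Hypothetical classification of the requested range against the
 * bitmap, computed before anything is flipped or notified. */
typedef enum { ALL_SHARED, ALL_PRIVATE, MIXED } RangeState;

/* Returns 0 on success (including the no-op case), -1 on error. */
static int state_change(RangeState current, bool to_private)
{
    /* Already in the desired state: success, nothing to notify. */
    if ((current == ALL_PRIVATE && to_private) ||
        (current == ALL_SHARED && !to_private)) {
        return 0;
    }
    /* Partially converted ranges are rejected outright. */
    if (current == MIXED) {
        return -1;
    }
    /* Otherwise: update the bitmap first, then notify listeners
     * (to-discard cannot fail; to-populate may propagate an error). */
    return 0;
}
```

The no-op and rejection branches are checked before the bitmap is
touched, so a failed request leaves the tracked state unchanged.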

Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
---
Changes in v5:
    - Move state_change() back to a helper instead of a class callback,
      since RamBlockAttributeClass has no child class.
    - Remove the error handling and move it to an individual patch for
      simpler management.

Changes in v4:
    - Add the state_change() callback in PrivateSharedManagerClass
      instead of the RamBlockAttribute.

Changes in v3:
    - Move the bitmap update before notifier callbacks.
    - Call the notifier callbacks directly in notify_discard/populate()
      with the expectation that the request memory range is in the
      desired attribute.
    - For the case where only a partial range is in the desired state,
      handle the range with block_size granularity for ease of rollback
      (https://lore.kernel.org/qemu-devel/812768d7-a02d-4b29-95f3-fb7a125cf54e@redhat.com/)

Changes in v2:
    - Do the alignment changes due to the rename to MemoryAttributeManager
    - Move the state_change() helper definition in this patch.
---
 include/system/ramblock.h    |   2 +
 system/ram-block-attribute.c | 134 +++++++++++++++++++++++++++++++++++
 2 files changed, 136 insertions(+)

diff --git a/include/system/ramblock.h b/include/system/ramblock.h
index 09255e8495..270dffb2f3 100644
--- a/include/system/ramblock.h
+++ b/include/system/ramblock.h
@@ -108,6 +108,8 @@ struct RamBlockAttribute {
     QLIST_HEAD(, RamDiscardListener) rdl_list;
 };
 
+int ram_block_attribute_state_change(RamBlockAttribute *attr, uint64_t offset,
+                                     uint64_t size, bool to_private);
 RamBlockAttribute *ram_block_attribute_create(MemoryRegion *mr);
 void ram_block_attribute_destroy(RamBlockAttribute *attr);
 
diff --git a/system/ram-block-attribute.c b/system/ram-block-attribute.c
index 8d4a24738c..f12dd4b881 100644
--- a/system/ram-block-attribute.c
+++ b/system/ram-block-attribute.c
@@ -253,6 +253,140 @@ ram_block_attribute_rdm_replay_discard(const RamDiscardManager *rdm,
                                             ram_block_attribute_rdm_replay_cb);
 }
 
+static bool ram_block_attribute_is_valid_range(RamBlockAttribute *attr,
+                                               uint64_t offset, uint64_t size)
+{
+    MemoryRegion *mr = attr->mr;
+
+    g_assert(mr);
+
+    uint64_t region_size = memory_region_size(mr);
+    int block_size = ram_block_attribute_get_block_size(attr);
+
+    if (!QEMU_IS_ALIGNED(offset, block_size)) {
+        return false;
+    }
+    if (offset + size < offset || !size) {
+        return false;
+    }
+    if (offset >= region_size || offset + size > region_size) {
+        return false;
+    }
+    return true;
+}
+
+static void ram_block_attribute_notify_to_discard(RamBlockAttribute *attr,
+                                                  uint64_t offset,
+                                                  uint64_t size)
+{
+    RamDiscardListener *rdl;
+
+    QLIST_FOREACH(rdl, &attr->rdl_list, next) {
+        MemoryRegionSection tmp = *rdl->section;
+
+        if (!memory_region_section_intersect_range(&tmp, offset, size)) {
+            continue;
+        }
+        rdl->notify_discard(rdl, &tmp);
+    }
+}
+
+static int
+ram_block_attribute_notify_to_populated(RamBlockAttribute *attr,
+                                        uint64_t offset, uint64_t size)
+{
+    RamDiscardListener *rdl;
+    int ret = 0;
+
+    QLIST_FOREACH(rdl, &attr->rdl_list, next) {
+        MemoryRegionSection tmp = *rdl->section;
+
+        if (!memory_region_section_intersect_range(&tmp, offset, size)) {
+            continue;
+        }
+        ret = rdl->notify_populate(rdl, &tmp);
+        if (ret) {
+            break;
+        }
+    }
+
+    return ret;
+}
+
+static bool ram_block_attribute_is_range_populated(RamBlockAttribute *attr,
+                                                   uint64_t offset,
+                                                   uint64_t size)
+{
+    const int block_size = ram_block_attribute_get_block_size(attr);
+    const unsigned long first_bit = offset / block_size;
+    const unsigned long last_bit = first_bit + (size / block_size) - 1;
+    unsigned long found_bit;
+
+    /* We fake a shorter bitmap to avoid searching too far. */
+    found_bit = find_next_zero_bit(attr->bitmap, last_bit + 1,
+                                   first_bit);
+    return found_bit > last_bit;
+}
+
+static bool
+ram_block_attribute_is_range_discard(RamBlockAttribute *attr,
+                                     uint64_t offset, uint64_t size)
+{
+    const int block_size = ram_block_attribute_get_block_size(attr);
+    const unsigned long first_bit = offset / block_size;
+    const unsigned long last_bit = first_bit + (size / block_size) - 1;
+    unsigned long found_bit;
+
+    /* We fake a shorter bitmap to avoid searching too far. */
+    found_bit = find_next_bit(attr->bitmap, last_bit + 1, first_bit);
+    return found_bit > last_bit;
+}
+
+int ram_block_attribute_state_change(RamBlockAttribute *attr, uint64_t offset,
+                                     uint64_t size, bool to_private)
+{
+    const int block_size = ram_block_attribute_get_block_size(attr);
+    const unsigned long first_bit = offset / block_size;
+    const unsigned long nbits = size / block_size;
+    int ret = 0;
+
+    if (!ram_block_attribute_is_valid_range(attr, offset, size)) {
+        error_report("%s, invalid range: offset 0x%lx, size 0x%lx",
+                     __func__, offset, size);
+        return -1;
+    }
+
+    /* Already discarded/populated */
+    if ((ram_block_attribute_is_range_discard(attr, offset, size) &&
+         to_private) ||
+        (ram_block_attribute_is_range_populated(attr, offset, size) &&
+         !to_private)) {
+        return 0;
+    }
+
+    /* Unexpected mixture */
+    if ((!ram_block_attribute_is_range_populated(attr, offset, size) &&
+         to_private) ||
+        (!ram_block_attribute_is_range_discard(attr, offset, size) &&
+         !to_private)) {
+        error_report("%s, the range is not all in the desired state: "
+                     "(offset 0x%lx, size 0x%lx), %s",
+                     __func__, offset, size,
+                     to_private ? "private" : "shared");
+        return -1;
+    }
+
+    if (to_private) {
+        bitmap_clear(attr->bitmap, first_bit, nbits);
+        ram_block_attribute_notify_to_discard(attr, offset, size);
+    } else {
+        bitmap_set(attr->bitmap, first_bit, nbits);
+        ret = ram_block_attribute_notify_to_populated(attr, offset, size);
+    }
+
+    return ret;
+}
+
 RamBlockAttribute *ram_block_attribute_create(MemoryRegion *mr)
 {
     uint64_t bitmap_size;
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v5 06/10] memory: Attach RamBlockAttribute to guest_memfd-backed RAMBlocks
  2025-05-20 10:28 [PATCH v5 00/10] Enable shared device assignment Chenyi Qiang
                   ` (4 preceding siblings ...)
  2025-05-20 10:28 ` [PATCH v5 05/10] ram-block-attribute: Introduce a helper to notify shared/private state changes Chenyi Qiang
@ 2025-05-20 10:28 ` Chenyi Qiang
  2025-05-26  9:06   ` David Hildenbrand
  2025-05-20 10:28 ` [PATCH v5 07/10] RAMBlock: Make guest_memfd require coordinate discard Chenyi Qiang
                   ` (4 subsequent siblings)
  10 siblings, 1 reply; 51+ messages in thread
From: Chenyi Qiang @ 2025-05-20 10:28 UTC (permalink / raw)
  To: David Hildenbrand, Alexey Kardashevskiy, Peter Xu, Gupta Pankaj,
	Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth
  Cc: Chenyi Qiang, qemu-devel, kvm, Williams Dan J, Zhao Liu, Baolu Lu,
	Gao Chao, Xu Yilun, Li Xiaoyao

A new field, ram_shared, is introduced in RAMBlock to link to a
RamBlockAttribute object, which centralizes all guest_memfd state
information (such as fd and shared_bitmap) within a RAMBlock.

Create and initialize the RamBlockAttribute object upon
ram_block_add(). Meanwhile, register the object in the target
RAMBlock's MemoryRegion. After that, a guest_memfd-backed RAMBlock is
associated with the RamDiscardManager interface, and users will execute
the RamDiscardManager-specific handling. For example, VFIO will
register the RamDiscardListener as expected. The live migration path
needs to be avoided since it is not yet supported in confidential VMs.

Additionally, use the ram_block_attribute_state_change() helper to
notify the registered RamDiscardListener of these changes.

Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
---
Changes in v5:
    - Revert to use RamDiscardManager interface.
    - Move the object_new() into the ram_block_attribute_create()
      helper.
    - Add some checks in the migration path.

Changes in v4:
    - Remove the replay operations for attribute changes which will be
      handled in a listener in following patches.
    - Add some comment in the error path of realize() to remind the
      future development of the unified error path.

Changes in v3:
    - Use ram_discard_manager_replay_populated/discarded() to set the
      memory attribute and add the undo support if state_change()
      failed.
    - Didn't add Reviewed-by from Alexey due to the new changes in this
      commit.

Changes in v2:
    - Introduce a new field memory_attribute_manager in RAMBlock.
    - Move the state_change() handling during page conversion in this patch.
    - Undo what we did if it fails to set.
    - Change the order of close(guest_memfd) and memory_attribute_manager cleanup.
---
 accel/kvm/kvm-all.c |  9 +++++++++
 migration/ram.c     | 28 ++++++++++++++++++++++++++++
 system/physmem.c    | 14 ++++++++++++++
 3 files changed, 51 insertions(+)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 51526d301b..2d7ecaeb6a 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -3089,6 +3089,15 @@ int kvm_convert_memory(hwaddr start, hwaddr size, bool to_private)
     addr = memory_region_get_ram_ptr(mr) + section.offset_within_region;
     rb = qemu_ram_block_from_host(addr, false, &offset);
 
+    ret = ram_block_attribute_state_change(RAM_BLOCK_ATTRIBUTE(mr->rdm),
+                                           offset, size, to_private);
+    if (ret) {
+        error_report("Failed to notify the listener the state change of "
+                     "(0x%"HWADDR_PRIx" + 0x%"HWADDR_PRIx") to %s",
+                     start, size, to_private ? "private" : "shared");
+        goto out_unref;
+    }
+
     if (to_private) {
         if (rb->page_size != qemu_real_host_page_size()) {
             /*
diff --git a/migration/ram.c b/migration/ram.c
index c004f37060..69c9a42f16 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -890,6 +890,13 @@ static uint64_t ramblock_dirty_bitmap_clear_discarded_pages(RAMBlock *rb)
 
     if (rb->mr && rb->bmap && memory_region_has_ram_discard_manager(rb->mr)) {
         RamDiscardManager *rdm = memory_region_get_ram_discard_manager(rb->mr);
+
+        if (object_dynamic_cast(OBJECT(rdm), TYPE_RAM_BLOCK_ATTRIBUTE)) {
+            error_report("%s: Live migration for confidential VM is not "
+                         "supported yet.", __func__);
+            exit(1);
+        }
+
         MemoryRegionSection section = {
             .mr = rb->mr,
             .offset_within_region = 0,
@@ -913,6 +920,13 @@ bool ramblock_page_is_discarded(RAMBlock *rb, ram_addr_t start)
 {
     if (rb->mr && memory_region_has_ram_discard_manager(rb->mr)) {
         RamDiscardManager *rdm = memory_region_get_ram_discard_manager(rb->mr);
+
+        if (object_dynamic_cast(OBJECT(rdm), TYPE_RAM_BLOCK_ATTRIBUTE)) {
+            error_report("%s: Live migration for confidential VM is not "
+                         "supported yet.", __func__);
+            exit(1);
+        }
+
         MemoryRegionSection section = {
             .mr = rb->mr,
             .offset_within_region = start,
@@ -1552,6 +1566,13 @@ static void ram_block_populate_read(RAMBlock *rb)
      */
     if (rb->mr && memory_region_has_ram_discard_manager(rb->mr)) {
         RamDiscardManager *rdm = memory_region_get_ram_discard_manager(rb->mr);
+
+        if (object_dynamic_cast(OBJECT(rdm), TYPE_RAM_BLOCK_ATTRIBUTE)) {
+            error_report("%s: Live migration for confidential VM is not "
+                         "supported yet.", __func__);
+            exit(1);
+        }
+
         MemoryRegionSection section = {
             .mr = rb->mr,
             .offset_within_region = 0,
@@ -1611,6 +1632,13 @@ static int ram_block_uffd_protect(RAMBlock *rb, int uffd_fd)
     /* See ram_block_populate_read() */
     if (rb->mr && memory_region_has_ram_discard_manager(rb->mr)) {
         RamDiscardManager *rdm = memory_region_get_ram_discard_manager(rb->mr);
+
+        if (object_dynamic_cast(OBJECT(rdm), TYPE_RAM_BLOCK_ATTRIBUTE)) {
+            error_report("%s: Live migration for confidential VM is not "
+                         "supported yet.", __func__);
+            exit(1);
+        }
+
         MemoryRegionSection section = {
             .mr = rb->mr,
             .offset_within_region = 0,
diff --git a/system/physmem.c b/system/physmem.c
index a8a9ca309e..f05f7ff09a 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -1931,6 +1931,19 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
             goto out_free;
         }
 
+        new_block->ram_shared = ram_block_attribute_create(new_block->mr);
+        if (!new_block->ram_shared) {
+            error_setg(errp, "Failed to create ram block attribute");
+            /*
+             * The error path could be unified if the rest of ram_block_add()
+             * ever develops a need to check for errors.
+             */
+            close(new_block->guest_memfd);
+            ram_block_discard_require(false);
+            qemu_mutex_unlock_ramlist();
+            goto out_free;
+        }
+
         /*
          * Add a specific guest_memfd blocker if a generic one would not be
          * added by ram_block_add_cpr_blocker.
@@ -2287,6 +2300,7 @@ static void reclaim_ramblock(RAMBlock *block)
     }
 
     if (block->guest_memfd >= 0) {
+        ram_block_attribute_destroy(block->ram_shared);
         close(block->guest_memfd);
         ram_block_discard_require(false);
     }
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v5 07/10] RAMBlock: Make guest_memfd require coordinate discard
  2025-05-20 10:28 [PATCH v5 00/10] Enable shared device assignment Chenyi Qiang
                   ` (5 preceding siblings ...)
  2025-05-20 10:28 ` [PATCH v5 06/10] memory: Attach RamBlockAttribute to guest_memfd-backed RAMBlocks Chenyi Qiang
@ 2025-05-20 10:28 ` Chenyi Qiang
  2025-05-26  9:08   ` David Hildenbrand
  2025-05-20 10:28 ` [PATCH v5 08/10] memory: Change NotifyRamDiscard() definition to return the result Chenyi Qiang
                   ` (3 subsequent siblings)
  10 siblings, 1 reply; 51+ messages in thread
From: Chenyi Qiang @ 2025-05-20 10:28 UTC (permalink / raw)
  To: David Hildenbrand, Alexey Kardashevskiy, Peter Xu, Gupta Pankaj,
	Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth
  Cc: Chenyi Qiang, qemu-devel, kvm, Williams Dan J, Zhao Liu, Baolu Lu,
	Gao Chao, Xu Yilun, Li Xiaoyao

As guest_memfd is now managed by RamBlockAttribute through the
RamDiscardManager interface, only uncoordinated discard needs to be
blocked.

Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
---
Changes in v5:
    - Revert to use RamDiscardManager.

Changes in v4:
    - Modify commit message (RamDiscardManager->PrivateSharedManager).

Changes in v3:
    - No change.

Changes in v2:
    - Change the ram_block_discard_require(false) to
      ram_block_coordinated_discard_require(false).
---
 system/physmem.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/system/physmem.c b/system/physmem.c
index f05f7ff09a..58b7614660 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -1916,7 +1916,7 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
         }
         assert(new_block->guest_memfd < 0);
 
-        ret = ram_block_discard_require(true);
+        ret = ram_block_coordinated_discard_require(true);
         if (ret < 0) {
             error_setg_errno(errp, -ret,
                              "cannot set up private guest memory: discard currently blocked");
@@ -1939,7 +1939,7 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
              * ever develops a need to check for errors.
              */
             close(new_block->guest_memfd);
-            ram_block_discard_require(false);
+            ram_block_coordinated_discard_require(false);
             qemu_mutex_unlock_ramlist();
             goto out_free;
         }
@@ -2302,7 +2302,7 @@ static void reclaim_ramblock(RAMBlock *block)
     if (block->guest_memfd >= 0) {
         ram_block_attribute_destroy(block->ram_shared);
         close(block->guest_memfd);
-        ram_block_discard_require(false);
+        ram_block_coordinated_discard_require(false);
     }
 
     g_free(block);
-- 
2.43.5



* [PATCH v5 08/10] memory: Change NotifyRamDiscard() definition to return the result
  2025-05-20 10:28 [PATCH v5 00/10] Enable shared device assignment Chenyi Qiang
                   ` (6 preceding siblings ...)
  2025-05-20 10:28 ` [PATCH v5 07/10] RAMBlock: Make guest_memfd require coordinate discard Chenyi Qiang
@ 2025-05-20 10:28 ` Chenyi Qiang
  2025-05-26  9:31   ` Philippe Mathieu-Daudé
  2025-05-26 10:36   ` Cédric Le Goater
  2025-05-20 10:28 ` [PATCH v5 09/10] KVM: Introduce RamDiscardListener for attribute changes during memory conversions Chenyi Qiang
                   ` (2 subsequent siblings)
  10 siblings, 2 replies; 51+ messages in thread
From: Chenyi Qiang @ 2025-05-20 10:28 UTC (permalink / raw)
  To: David Hildenbrand, Alexey Kardashevskiy, Peter Xu, Gupta Pankaj,
	Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth
  Cc: Chenyi Qiang, qemu-devel, kvm, Williams Dan J, Zhao Liu, Baolu Lu,
	Gao Chao, Xu Yilun, Li Xiaoyao

This allows the caller to check the result of the NotifyRamDiscard()
handler and react when the operation fails.

Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
---
Changes in v5:
    - Revert to use of NotifyRamDiscard()

Changes in v4:
    - Newly added.
---
 hw/vfio/listener.c           | 6 ++++--
 include/system/memory.h      | 4 ++--
 system/ram-block-attribute.c | 3 +--
 3 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/hw/vfio/listener.c b/hw/vfio/listener.c
index bfacb3d8d9..06454e0584 100644
--- a/hw/vfio/listener.c
+++ b/hw/vfio/listener.c
@@ -190,8 +190,8 @@ out:
     rcu_read_unlock();
 }
 
-static void vfio_ram_discard_notify_discard(RamDiscardListener *rdl,
-                                            MemoryRegionSection *section)
+static int vfio_ram_discard_notify_discard(RamDiscardListener *rdl,
+                                           MemoryRegionSection *section)
 {
     VFIORamDiscardListener *vrdl = container_of(rdl, VFIORamDiscardListener,
                                                 listener);
@@ -206,6 +206,8 @@ static void vfio_ram_discard_notify_discard(RamDiscardListener *rdl,
         error_report("%s: vfio_container_dma_unmap() failed: %s", __func__,
                      strerror(-ret));
     }
+
+    return ret;
 }
 
 static int vfio_ram_discard_notify_populate(RamDiscardListener *rdl,
diff --git a/include/system/memory.h b/include/system/memory.h
index 83b28551c4..e5155120d9 100644
--- a/include/system/memory.h
+++ b/include/system/memory.h
@@ -518,8 +518,8 @@ struct IOMMUMemoryRegionClass {
 typedef struct RamDiscardListener RamDiscardListener;
 typedef int (*NotifyRamPopulate)(RamDiscardListener *rdl,
                                  MemoryRegionSection *section);
-typedef void (*NotifyRamDiscard)(RamDiscardListener *rdl,
-                                 MemoryRegionSection *section);
+typedef int (*NotifyRamDiscard)(RamDiscardListener *rdl,
+                                MemoryRegionSection *section);
 
 struct RamDiscardListener {
     /*
diff --git a/system/ram-block-attribute.c b/system/ram-block-attribute.c
index f12dd4b881..896c3d7543 100644
--- a/system/ram-block-attribute.c
+++ b/system/ram-block-attribute.c
@@ -66,8 +66,7 @@ static int ram_block_attribute_notify_discard_cb(MemoryRegionSection *section,
 {
     RamDiscardListener *rdl = arg;
 
-    rdl->notify_discard(rdl, section);
-    return 0;
+    return rdl->notify_discard(rdl, section);
 }
 
 static int
-- 
2.43.5



* [PATCH v5 09/10] KVM: Introduce RamDiscardListener for attribute changes during memory conversions
  2025-05-20 10:28 [PATCH v5 00/10] Enable shared device assignment Chenyi Qiang
                   ` (7 preceding siblings ...)
  2025-05-20 10:28 ` [PATCH v5 08/10] memory: Change NotifyRamDiscard() definition to return the result Chenyi Qiang
@ 2025-05-20 10:28 ` Chenyi Qiang
  2025-05-26  9:22   ` David Hildenbrand
  2025-05-27  8:01   ` Alexey Kardashevskiy
  2025-05-20 10:28 ` [PATCH v5 10/10] ram-block-attribute: Add more error handling during state changes Chenyi Qiang
  2025-05-26 11:37 ` [PATCH v5 00/10] Enable shared device assignment Cédric Le Goater
  10 siblings, 2 replies; 51+ messages in thread
From: Chenyi Qiang @ 2025-05-20 10:28 UTC (permalink / raw)
  To: David Hildenbrand, Alexey Kardashevskiy, Peter Xu, Gupta Pankaj,
	Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth
  Cc: Chenyi Qiang, qemu-devel, kvm, Williams Dan J, Zhao Liu, Baolu Lu,
	Gao Chao, Xu Yilun, Li Xiaoyao

With the introduction of the RamBlockAttribute object to manage
RAMBlocks with guest_memfd, it is more elegant to move the KVM
attribute-setting operations into a RamDiscardListener.

The KVM attribute change RamDiscardListener is registered/unregistered
for each memory region section during kvm_region_add/del(). The listener
handler performs attribute change upon receiving notifications from
ram_block_attribute_state_change() calls. After this change, the
operations in kvm_convert_memory() can be removed.

Note that errors can be returned by the KVM attribute changes in
ram_block_attribute_notify_to_discard(), although this is currently
unlikely to happen. With in-place guest_memfd conversion in the future,
errors will become more likely and will require proper handling. For
now, simply return the result, and kvm_convert_memory() will cause QEMU
to quit if any issue arises.

Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
---
Changes in v5:
    - Revert to use RamDiscardListener

Changes in v4:
    - Newly added.
---
 accel/kvm/kvm-all.c                         | 72 ++++++++++++++++++---
 include/system/confidential-guest-support.h |  9 +++
 system/ram-block-attribute.c                | 16 +++--
 target/i386/kvm/tdx.c                       |  1 +
 target/i386/sev.c                           |  1 +
 5 files changed, 85 insertions(+), 14 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 2d7ecaeb6a..ca4ef8062b 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -49,6 +49,7 @@
 #include "kvm-cpus.h"
 #include "system/dirtylimit.h"
 #include "qemu/range.h"
+#include "system/confidential-guest-support.h"
 
 #include "hw/boards.h"
 #include "system/stats.h"
@@ -1689,28 +1690,90 @@ static int kvm_dirty_ring_init(KVMState *s)
     return 0;
 }
 
+static int kvm_private_shared_notify(RamDiscardListener *rdl,
+                                     MemoryRegionSection *section,
+                                     bool to_private)
+{
+    hwaddr start = section->offset_within_address_space;
+    hwaddr size = section->size;
+
+    if (to_private) {
+        return kvm_set_memory_attributes_private(start, size);
+    } else {
+        return kvm_set_memory_attributes_shared(start, size);
+    }
+}
+
+static int kvm_ram_discard_notify_to_shared(RamDiscardListener *rdl,
+                                            MemoryRegionSection *section)
+{
+    return kvm_private_shared_notify(rdl, section, false);
+}
+
+static int kvm_ram_discard_notify_to_private(RamDiscardListener *rdl,
+                                             MemoryRegionSection *section)
+{
+    return kvm_private_shared_notify(rdl, section, true);
+}
+
 static void kvm_region_add(MemoryListener *listener,
                            MemoryRegionSection *section)
 {
     KVMMemoryListener *kml = container_of(listener, KVMMemoryListener, listener);
+    ConfidentialGuestSupport *cgs = MACHINE(qdev_get_machine())->cgs;
+    RamDiscardManager *rdm = memory_region_get_ram_discard_manager(section->mr);
     KVMMemoryUpdate *update;
+    CGSRamDiscardListener *crdl;
+    RamDiscardListener *rdl;
+
 
     update = g_new0(KVMMemoryUpdate, 1);
     update->section = *section;
 
     QSIMPLEQ_INSERT_TAIL(&kml->transaction_add, update, next);
+
+    if (!memory_region_has_guest_memfd(section->mr) || !rdm) {
+        return;
+    }
+
+    crdl = g_new0(CGSRamDiscardListener, 1);
+    crdl->mr = section->mr;
+    crdl->offset_within_address_space = section->offset_within_address_space;
+    rdl = &crdl->listener;
+    QLIST_INSERT_HEAD(&cgs->cgs_rdl_list, crdl, next);
+    ram_discard_listener_init(rdl, kvm_ram_discard_notify_to_shared,
+                              kvm_ram_discard_notify_to_private, true);
+    ram_discard_manager_register_listener(rdm, rdl, section);
 }
 
 static void kvm_region_del(MemoryListener *listener,
                            MemoryRegionSection *section)
 {
     KVMMemoryListener *kml = container_of(listener, KVMMemoryListener, listener);
+    ConfidentialGuestSupport *cgs = MACHINE(qdev_get_machine())->cgs;
+    RamDiscardManager *rdm = memory_region_get_ram_discard_manager(section->mr);
     KVMMemoryUpdate *update;
+    CGSRamDiscardListener *crdl;
+    RamDiscardListener *rdl;
 
     update = g_new0(KVMMemoryUpdate, 1);
     update->section = *section;
 
     QSIMPLEQ_INSERT_TAIL(&kml->transaction_del, update, next);
+    if (!memory_region_has_guest_memfd(section->mr) || !rdm) {
+        return;
+    }
+
+    QLIST_FOREACH(crdl, &cgs->cgs_rdl_list, next) {
+        if (crdl->mr == section->mr &&
+            crdl->offset_within_address_space == section->offset_within_address_space) {
+            rdl = &crdl->listener;
+            ram_discard_manager_unregister_listener(rdm, rdl);
+            QLIST_REMOVE(crdl, next);
+            g_free(crdl);
+            break;
+        }
+    }
 }
 
 static void kvm_region_commit(MemoryListener *listener)
@@ -3077,15 +3140,6 @@ int kvm_convert_memory(hwaddr start, hwaddr size, bool to_private)
         goto out_unref;
     }
 
-    if (to_private) {
-        ret = kvm_set_memory_attributes_private(start, size);
-    } else {
-        ret = kvm_set_memory_attributes_shared(start, size);
-    }
-    if (ret) {
-        goto out_unref;
-    }
-
     addr = memory_region_get_ram_ptr(mr) + section.offset_within_region;
     rb = qemu_ram_block_from_host(addr, false, &offset);
 
diff --git a/include/system/confidential-guest-support.h b/include/system/confidential-guest-support.h
index ea46b50c56..974abdbf6b 100644
--- a/include/system/confidential-guest-support.h
+++ b/include/system/confidential-guest-support.h
@@ -19,12 +19,19 @@
 #define QEMU_CONFIDENTIAL_GUEST_SUPPORT_H
 
 #include "qom/object.h"
+#include "system/memory.h"
 
 #define TYPE_CONFIDENTIAL_GUEST_SUPPORT "confidential-guest-support"
 OBJECT_DECLARE_TYPE(ConfidentialGuestSupport,
                     ConfidentialGuestSupportClass,
                     CONFIDENTIAL_GUEST_SUPPORT)
 
+typedef struct CGSRamDiscardListener {
+    MemoryRegion *mr;
+    hwaddr offset_within_address_space;
+    RamDiscardListener listener;
+    QLIST_ENTRY(CGSRamDiscardListener) next;
+} CGSRamDiscardListener;
 
 struct ConfidentialGuestSupport {
     Object parent;
@@ -34,6 +41,8 @@ struct ConfidentialGuestSupport {
      */
     bool require_guest_memfd;
 
+    QLIST_HEAD(, CGSRamDiscardListener) cgs_rdl_list;
+
     /*
      * ready: flag set by CGS initialization code once it's ready to
      *        start executing instructions in a potentially-secure
diff --git a/system/ram-block-attribute.c b/system/ram-block-attribute.c
index 896c3d7543..387501b569 100644
--- a/system/ram-block-attribute.c
+++ b/system/ram-block-attribute.c
@@ -274,11 +274,12 @@ static bool ram_block_attribute_is_valid_range(RamBlockAttribute *attr,
     return true;
 }
 
-static void ram_block_attribute_notify_to_discard(RamBlockAttribute *attr,
-                                                  uint64_t offset,
-                                                  uint64_t size)
+static int ram_block_attribute_notify_to_discard(RamBlockAttribute *attr,
+                                                 uint64_t offset,
+                                                 uint64_t size)
 {
     RamDiscardListener *rdl;
+    int ret = 0;
 
     QLIST_FOREACH(rdl, &attr->rdl_list, next) {
         MemoryRegionSection tmp = *rdl->section;
@@ -286,8 +287,13 @@ static void ram_block_attribute_notify_to_discard(RamBlockAttribute *attr,
         if (!memory_region_section_intersect_range(&tmp, offset, size)) {
             continue;
         }
-        rdl->notify_discard(rdl, &tmp);
+        ret = rdl->notify_discard(rdl, &tmp);
+        if (ret) {
+            break;
+        }
     }
+
+    return ret;
 }
 
 static int
@@ -377,7 +383,7 @@ int ram_block_attribute_state_change(RamBlockAttribute *attr, uint64_t offset,
 
     if (to_private) {
         bitmap_clear(attr->bitmap, first_bit, nbits);
-        ram_block_attribute_notify_to_discard(attr, offset, size);
+        ret = ram_block_attribute_notify_to_discard(attr, offset, size);
     } else {
         bitmap_set(attr->bitmap, first_bit, nbits);
         ret = ram_block_attribute_notify_to_populated(attr, offset, size);
diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c
index 7ef49690bd..17b360059c 100644
--- a/target/i386/kvm/tdx.c
+++ b/target/i386/kvm/tdx.c
@@ -1492,6 +1492,7 @@ static void tdx_guest_init(Object *obj)
     qemu_mutex_init(&tdx->lock);
 
     cgs->require_guest_memfd = true;
+    QLIST_INIT(&cgs->cgs_rdl_list);
     tdx->attributes = TDX_TD_ATTRIBUTES_SEPT_VE_DISABLE;
 
     object_property_add_uint64_ptr(obj, "attributes", &tdx->attributes,
diff --git a/target/i386/sev.c b/target/i386/sev.c
index adf787797e..f1b9c35fc3 100644
--- a/target/i386/sev.c
+++ b/target/i386/sev.c
@@ -2430,6 +2430,7 @@ sev_snp_guest_instance_init(Object *obj)
     SevSnpGuestState *sev_snp_guest = SEV_SNP_GUEST(obj);
 
     cgs->require_guest_memfd = true;
+    QLIST_INIT(&cgs->cgs_rdl_list);
 
     /* default init/start/finish params for kvm */
     sev_snp_guest->kvm_start_conf.policy = DEFAULT_SEV_SNP_POLICY;
-- 
2.43.5



* [PATCH v5 10/10] ram-block-attribute: Add more error handling during state changes
  2025-05-20 10:28 [PATCH v5 00/10] Enable shared device assignment Chenyi Qiang
                   ` (8 preceding siblings ...)
  2025-05-20 10:28 ` [PATCH v5 09/10] KVM: Introduce RamDiscardListener for attribute changes during memory conversions Chenyi Qiang
@ 2025-05-20 10:28 ` Chenyi Qiang
  2025-05-26  9:17   ` David Hildenbrand
  2025-05-27  9:11   ` Alexey Kardashevskiy
  2025-05-26 11:37 ` [PATCH v5 00/10] Enable shared device assignment Cédric Le Goater
  10 siblings, 2 replies; 51+ messages in thread
From: Chenyi Qiang @ 2025-05-20 10:28 UTC (permalink / raw)
  To: David Hildenbrand, Alexey Kardashevskiy, Peter Xu, Gupta Pankaj,
	Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth
  Cc: Chenyi Qiang, qemu-devel, kvm, Williams Dan J, Zhao Liu, Baolu Lu,
	Gao Chao, Xu Yilun, Li Xiaoyao

The current error handling is simple, with the following assumptions:
- QEMU will quit instead of resuming the guest if kvm_convert_memory()
  fails, thus there is no need to roll back.
- The range to convert is required to be entirely in the desired state;
  handling a mixture of states is not allowed.
- The conversion from shared to private is a non-failing operation.

This is sufficient for now, as complex error handling is not required.
For future extension, add some potential error handling:
- For private to shared conversion, perform a rollback operation if
  ram_block_attribute_notify_to_populated() fails.
- For shared to private conversion, still assert it is a non-failing
  operation for now. It could become an easy-to-fail path with in-place
  conversion, which will likely have to retry the conversion until it
  succeeds in the future.
- For the mixture case, process individual blocks for ease of rollback.

Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
---
 system/ram-block-attribute.c | 116 +++++++++++++++++++++++++++--------
 1 file changed, 90 insertions(+), 26 deletions(-)

diff --git a/system/ram-block-attribute.c b/system/ram-block-attribute.c
index 387501b569..0af3396aa4 100644
--- a/system/ram-block-attribute.c
+++ b/system/ram-block-attribute.c
@@ -289,7 +289,12 @@ static int ram_block_attribute_notify_to_discard(RamBlockAttribute *attr,
         }
         ret = rdl->notify_discard(rdl, &tmp);
         if (ret) {
-            break;
+            /*
+             * The current to_private listeners (VFIO dma_unmap and
+             * KVM set_attribute_private) are non-failing operations.
+             * TODO: add rollback operations if they are allowed to fail.
+             */
+            g_assert_not_reached();
         }
     }
 
@@ -300,7 +305,7 @@ static int
 ram_block_attribute_notify_to_populated(RamBlockAttribute *attr,
                                         uint64_t offset, uint64_t size)
 {
-    RamDiscardListener *rdl;
+    RamDiscardListener *rdl, *rdl2;
     int ret = 0;
 
     QLIST_FOREACH(rdl, &attr->rdl_list, next) {
@@ -315,6 +320,20 @@ ram_block_attribute_notify_to_populated(RamBlockAttribute *attr,
         }
     }
 
+    if (ret) {
+        /* Notify all already-notified listeners. */
+        QLIST_FOREACH(rdl2, &attr->rdl_list, next) {
+            MemoryRegionSection tmp = *rdl2->section;
+
+            if (rdl == rdl2) {
+                break;
+            }
+            if (!memory_region_section_intersect_range(&tmp, offset, size)) {
+                continue;
+            }
+            rdl2->notify_discard(rdl2, &tmp);
+        }
+    }
     return ret;
 }
 
@@ -353,6 +372,9 @@ int ram_block_attribute_state_change(RamBlockAttribute *attr, uint64_t offset,
     const int block_size = ram_block_attribute_get_block_size(attr);
     const unsigned long first_bit = offset / block_size;
     const unsigned long nbits = size / block_size;
+    const uint64_t end = offset + size;
+    unsigned long bit;
+    uint64_t cur;
     int ret = 0;
 
     if (!ram_block_attribute_is_valid_range(attr, offset, size)) {
@@ -361,32 +383,74 @@ int ram_block_attribute_state_change(RamBlockAttribute *attr, uint64_t offset,
         return -1;
     }
 
-    /* Already discard/populated */
-    if ((ram_block_attribute_is_range_discard(attr, offset, size) &&
-         to_private) ||
-        (ram_block_attribute_is_range_populated(attr, offset, size) &&
-         !to_private)) {
-        return 0;
-    }
-
-    /* Unexpected mixture */
-    if ((!ram_block_attribute_is_range_populated(attr, offset, size) &&
-         to_private) ||
-        (!ram_block_attribute_is_range_discard(attr, offset, size) &&
-         !to_private)) {
-        error_report("%s, the range is not all in the desired state: "
-                     "(offset 0x%lx, size 0x%lx), %s",
-                     __func__, offset, size,
-                     to_private ? "private" : "shared");
-        return -1;
-    }
-
     if (to_private) {
-        bitmap_clear(attr->bitmap, first_bit, nbits);
-        ret = ram_block_attribute_notify_to_discard(attr, offset, size);
+        if (ram_block_attribute_is_range_discard(attr, offset, size)) {
+            /* Already private */
+        } else if (!ram_block_attribute_is_range_populated(attr, offset,
+                                                           size)) {
+            /* Unexpected mixture: process individual blocks */
+            for (cur = offset; cur < end; cur += block_size) {
+                bit = cur / block_size;
+                if (!test_bit(bit, attr->bitmap)) {
+                    continue;
+                }
+                clear_bit(bit, attr->bitmap);
+                ram_block_attribute_notify_to_discard(attr, cur, block_size);
+            }
+        } else {
+            /* Completely shared */
+            bitmap_clear(attr->bitmap, first_bit, nbits);
+            ram_block_attribute_notify_to_discard(attr, offset, size);
+        }
     } else {
-        bitmap_set(attr->bitmap, first_bit, nbits);
-        ret = ram_block_attribute_notify_to_populated(attr, offset, size);
+        if (ram_block_attribute_is_range_populated(attr, offset, size)) {
+            /* Already shared */
+        } else if (!ram_block_attribute_is_range_discard(attr, offset, size)) {
+            /* Unexpected mixture: process individual blocks */
+            unsigned long *modified_bitmap = bitmap_new(nbits);
+
+            for (cur = offset; cur < end; cur += block_size) {
+                bit = cur / block_size;
+                if (test_bit(bit, attr->bitmap)) {
+                    continue;
+                }
+                set_bit(bit, attr->bitmap);
+                ret = ram_block_attribute_notify_to_populated(attr, cur,
+                                                           block_size);
+                if (!ret) {
+                    set_bit(bit - first_bit, modified_bitmap);
+                    continue;
+                }
+                clear_bit(bit, attr->bitmap);
+                break;
+            }
+
+            if (ret) {
+                /*
+                 * Very unexpected: something went wrong. Revert to the old
+                 * state, marking only the blocks as private that we converted
+                 * to shared.
+                 */
+                for (cur = offset; cur < end; cur += block_size) {
+                    bit = cur / block_size;
+                    if (!test_bit(bit - first_bit, modified_bitmap)) {
+                        continue;
+                    }
+                    assert(test_bit(bit, attr->bitmap));
+                    clear_bit(bit, attr->bitmap);
+                    ram_block_attribute_notify_to_discard(attr, cur,
+                                                          block_size);
+                }
+            }
+            g_free(modified_bitmap);
+        } else {
+            /* Completely private */
+            bitmap_set(attr->bitmap, first_bit, nbits);
+            ret = ram_block_attribute_notify_to_populated(attr, offset, size);
+            if (ret) {
+                bitmap_clear(attr->bitmap, first_bit, nbits);
+            }
+        }
     }
 
     return ret;
-- 
2.43.5



* Re: [PATCH v5 02/10] memory: Change memory_region_set_ram_discard_manager() to return the result
  2025-05-20 10:28 ` [PATCH v5 02/10] memory: Change memory_region_set_ram_discard_manager() to return the result Chenyi Qiang
@ 2025-05-26  8:40   ` David Hildenbrand
  2025-05-27  6:56   ` Alexey Kardashevskiy
  1 sibling, 0 replies; 51+ messages in thread
From: David Hildenbrand @ 2025-05-26  8:40 UTC (permalink / raw)
  To: Chenyi Qiang, Alexey Kardashevskiy, Peter Xu, Gupta Pankaj,
	Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth
  Cc: qemu-devel, kvm, Williams Dan J, Zhao Liu, Baolu Lu, Gao Chao,
	Xu Yilun, Li Xiaoyao

On 20.05.25 12:28, Chenyi Qiang wrote:
> Modify memory_region_set_ram_discard_manager() to return -EBUSY if a
> RamDiscardManager is already set in the MemoryRegion. The caller must
> handle this failure, such as having virtio-mem undo its actions and fail
> the realize() process. Opportunistically move the call earlier to avoid
> complex error handling.
> 
> This change is beneficial when introducing a new RamDiscardManager
> instance besides virtio-mem. After
> ram_block_coordinated_discard_require(true) unlocks all
> RamDiscardManager instances, only one instance is allowed to be set for
> one MemoryRegion at present.
> 
> Suggested-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
> ---

Reviewed-by: David Hildenbrand <david@redhat.com>

-- 
Cheers,

David / dhildenb



* Re: [PATCH v5 03/10] memory: Unify the definiton of ReplayRamPopulate() and ReplayRamDiscard()
  2025-05-20 10:28 ` [PATCH v5 03/10] memory: Unify the definiton of ReplayRamPopulate() and ReplayRamDiscard() Chenyi Qiang
@ 2025-05-26  8:42   ` David Hildenbrand
  2025-05-26  9:35   ` Philippe Mathieu-Daudé
  2025-05-27  6:56   ` Alexey Kardashevskiy
  2 siblings, 0 replies; 51+ messages in thread
From: David Hildenbrand @ 2025-05-26  8:42 UTC (permalink / raw)
  To: Chenyi Qiang, Alexey Kardashevskiy, Peter Xu, Gupta Pankaj,
	Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth
  Cc: qemu-devel, kvm, Williams Dan J, Zhao Liu, Baolu Lu, Gao Chao,
	Xu Yilun, Li Xiaoyao

On 20.05.25 12:28, Chenyi Qiang wrote:
> Update ReplayRamDiscard() function to return the result and unify the
> ReplayRamPopulate() and ReplayRamDiscard() to ReplayRamDiscardState() at
> the same time due to their identical definitions. This unification
> simplifies related structures, such as VirtIOMEMReplayData, which makes
> it cleaner.
> 
> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>

Reviewed-by: David Hildenbrand <david@redhat.com>

-- 
Cheers,

David / dhildenb



* Re: [PATCH v5 04/10] ram-block-attribute: Introduce RamBlockAttribute to manage RAMBlock with guest_memfd
  2025-05-20 10:28 ` [PATCH v5 04/10] ram-block-attribute: Introduce RamBlockAttribute to manage RAMBlock with guest_memfd Chenyi Qiang
@ 2025-05-26  9:01   ` David Hildenbrand
  2025-05-26  9:28     ` Chenyi Qiang
  0 siblings, 1 reply; 51+ messages in thread
From: David Hildenbrand @ 2025-05-26  9:01 UTC (permalink / raw)
  To: Chenyi Qiang, Alexey Kardashevskiy, Peter Xu, Gupta Pankaj,
	Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth
  Cc: qemu-devel, kvm, Williams Dan J, Zhao Liu, Baolu Lu, Gao Chao,
	Xu Yilun, Li Xiaoyao

On 20.05.25 12:28, Chenyi Qiang wrote:
> Commit 852f0048f3 ("RAMBlock: make guest_memfd require uncoordinated
> discard") highlighted that subsystems like VFIO may disable RAM block
> discard. However, guest_memfd relies on discard operations for page
> conversion between private and shared memory, potentially leading to
> stale IOMMU mappings when assigning hardware devices to
> confidential VMs via shared memory. To address this and allow shared
> device assignment, it is crucial to ensure that the VFIO system
> refreshes its IOMMU mappings.
> 
> RamDiscardManager is an existing interface (used by virtio-mem) to
> adjust VFIO mappings in relation to VM page assignment. Effectively,
> page conversion is similar to hot-removing a page in one mode and adding
> it back in the other. Therefore, similar actions are required for page
> conversion events. Introduce the RamDiscardManager to guest_memfd to
> facilitate this process.
> 
> Since guest_memfd is not an object, it cannot directly implement the
> RamDiscardManager interface. Implementing it in HostMemoryBackend is
> not appropriate because guest_memfd is per RAMBlock, and some RAMBlocks
> have a memory backend while others do not. Notably, virtual BIOS
> RAMBlocks using memory_region_init_ram_guest_memfd() do not have a
> backend.
> 
> To manage RAMBlocks with guest_memfd, define a new object named
> RamBlockAttribute to implement the RamDiscardManager interface. This
> object can store guest_memfd information such as the bitmap for shared
> memory, and handle page conversion notifications. In the context of
> RamDiscardManager, shared state is analogous to populated, and private
> state is treated as discarded. The memory state is tracked at host
> page size granularity, as the minimum memory conversion size can be one
> page per request. Additionally, VFIO expects the DMA mapping for a
> specific iova to be mapped and unmapped with the same granularity.
> Confidential VMs may perform partial conversions, such as conversions
> on small regions within larger regions. To prevent such invalid cases,
> and until cut_mapping operation support is available, all operations are
> performed with 4K granularity.
> 
> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
> ---
> Changes in v5:
>      - Revert to use RamDiscardManager interface instead of introducing
>        new hierarchy of class to manage private/shared state, and keep
>        using the new name of RamBlockAttribute compared with the
>        MemoryAttributeManager in v3.
>      - Use *simple* version of object_define and object_declare since the
>        state_change() function is changed as an exported function instead
>        of a virtual function in later patch.
>      - Move the introduction of RamBlockAttribute field to this patch and
>        rename it to ram_shared. (Alexey)
>      - call the exit() when register/unregister failed. (Zhao)
>      - Add the ram-block-attribute.c to Memory API related part in
>        MAINTAINERS.
> 
> Changes in v4:
>      - Change the name from memory-attribute-manager to
>        ram-block-attribute.
>      - Implement the newly-introduced PrivateSharedManager instead of
>        RamDiscardManager and change related commit message.
>      - Define the new object in ramblock.h instead of adding a new file.
> 
> Changes in v3:
>      - Some rename (bitmap_size->shared_bitmap_size,
>        first_one/zero_bit->first_bit, etc.)
>      - Change shared_bitmap_size from uint32_t to unsigned
>      - Return mgr->mr->ram_block->page_size in get_block_size()
>      - Move set_ram_discard_manager() up to avoid a g_free() in failure
>        case.
>      - Add const for the memory_attribute_manager_get_block_size()
>      - Unify the ReplayRamPopulate and ReplayRamDiscard and related
>        callback.
> 
> Changes in v2:
>      - Rename the object name to MemoryAttributeManager
>      - Rename the bitmap to shared_bitmap to make it more clear.
>      - Remove block_size field and get it from a helper. In future, we
>        can get the page_size from RAMBlock if necessary.
>      - Remove the unnecessary "struct" before GuestMemfdReplayData
>      - Remove the unnecessary g_free() for the bitmap
>      - Add some error reporting when the callback fails for a
>        populated/discarded section.
>      - Move the realize()/unrealize() definition to this patch.
> ---
>   MAINTAINERS                  |   1 +
>   include/system/ramblock.h    |  20 +++
>   system/meson.build           |   1 +
>   system/ram-block-attribute.c | 311 +++++++++++++++++++++++++++++++++++
>   4 files changed, 333 insertions(+)
>   create mode 100644 system/ram-block-attribute.c
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 6dacd6d004..3b4947dc74 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -3149,6 +3149,7 @@ F: system/memory.c
>   F: system/memory_mapping.c
>   F: system/physmem.c
>   F: system/memory-internal.h
> +F: system/ram-block-attribute.c
>   F: scripts/coccinelle/memory-region-housekeeping.cocci
>   
>   Memory devices
> diff --git a/include/system/ramblock.h b/include/system/ramblock.h
> index d8a116ba99..09255e8495 100644
> --- a/include/system/ramblock.h
> +++ b/include/system/ramblock.h
> @@ -22,6 +22,10 @@
>   #include "exec/cpu-common.h"
>   #include "qemu/rcu.h"
>   #include "exec/ramlist.h"
> +#include "system/hostmem.h"
> +
> +#define TYPE_RAM_BLOCK_ATTRIBUTE "ram-block-attribute"
> +OBJECT_DECLARE_SIMPLE_TYPE(RamBlockAttribute, RAM_BLOCK_ATTRIBUTE)
>   
>   struct RAMBlock {
>       struct rcu_head rcu;
> @@ -42,6 +46,8 @@ struct RAMBlock {
>       int fd;
>       uint64_t fd_offset;
>       int guest_memfd;
> +    /* 1-setting of the bitmap in ram_shared represents ram is shared */

That comment looks misplaced, and the variable misnamed.

The comment should go into RamBlockAttribute and the variable should
likely be named "attributes".

Also, "ram_shared" is not used at all in this patch, it should be moved 
into the corresponding patch.

> +    RamBlockAttribute *ram_shared;
>       size_t page_size;
>       /* dirty bitmap used during migration */
>       unsigned long *bmap;
> @@ -91,4 +97,18 @@ struct RAMBlock {
>       ram_addr_t postcopy_length;
>   };
>   
> +struct RamBlockAttribute {

Should this actually be "RamBlockAttributes" ?

> +    Object parent;
> +
> +    MemoryRegion *mr;


Should we link to the parent RAMBlock instead, and lookup the MR from there?


-- 
Cheers,

David / dhildenb



* Re: [PATCH v5 05/10] ram-block-attribute: Introduce a helper to notify shared/private state changes
  2025-05-20 10:28 ` [PATCH v5 05/10] ram-block-attribute: Introduce a helper to notify shared/private state changes Chenyi Qiang
@ 2025-05-26  9:02   ` David Hildenbrand
  2025-05-27  7:35   ` Alexey Kardashevskiy
  1 sibling, 0 replies; 51+ messages in thread
From: David Hildenbrand @ 2025-05-26  9:02 UTC (permalink / raw)
  To: Chenyi Qiang, Alexey Kardashevskiy, Peter Xu, Gupta Pankaj,
	Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth
  Cc: qemu-devel, kvm, Williams Dan J, Zhao Liu, Baolu Lu, Gao Chao,
	Xu Yilun, Li Xiaoyao

On 20.05.25 12:28, Chenyi Qiang wrote:
> A new state_change() helper is introduced for RamBlockAttribute
> to efficiently notify all registered RamDiscardListeners, including
> VFIO listeners, about memory conversion events in guest_memfd. The VFIO
> listener can dynamically DMA map/unmap shared pages based on conversion
> types:
> - For conversions from shared to private, the VFIO system ensures the
>    discarding of shared mapping from the IOMMU.
> - For conversions from private to shared, it triggers the population of
>    the shared mapping into the IOMMU.
> 
> Currently, memory conversion failures cause QEMU to quit instead of
> resuming the guest or retrying the operation. It would be future work
> to add more error handling or rollback mechanisms once conversion
> failures are allowed. For example, in-place conversion of guest_memfd
> could retry the unmap operation during the conversion from shared to
> private. However, for now, keep the complex error handling out of the
> picture as it is not required:
> 
> - If a conversion request is made for a page already in the desired
>    state, the helper simply returns success.
> - For requests involving a range partially in the desired state, there
>    is no such scenario in practice at present; simply return an error.
> - If a conversion request is declined by other systems, such as a
>    failure from VFIO during notify_to_populated(), the failure is
>    returned directly. As for notify_to_discard(), VFIO cannot fail
>    unmap/unpin, so no error is returned.
> 
> Note that the bitmap status is updated before callbacks, allowing
> listeners to handle memory based on the latest status.
> 
> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
> ---

I think this should be squashed into the previous patch: I fail to see 
why the split makes sense.

-- 
Cheers,

David / dhildenb



* Re: [PATCH v5 06/10] memory: Attach RamBlockAttribute to guest_memfd-backed RAMBlocks
  2025-05-20 10:28 ` [PATCH v5 06/10] memory: Attach RamBlockAttribute to guest_memfd-backed RAMBlocks Chenyi Qiang
@ 2025-05-26  9:06   ` David Hildenbrand
  2025-05-26  9:46     ` Chenyi Qiang
  0 siblings, 1 reply; 51+ messages in thread
From: David Hildenbrand @ 2025-05-26  9:06 UTC (permalink / raw)
  To: Chenyi Qiang, Alexey Kardashevskiy, Peter Xu, Gupta Pankaj,
	Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth
  Cc: qemu-devel, kvm, Williams Dan J, Zhao Liu, Baolu Lu, Gao Chao,
	Xu Yilun, Li Xiaoyao

On 20.05.25 12:28, Chenyi Qiang wrote:
> A new field, ram_shared, was introduced in RAMBlock to link to a
> RamBlockAttribute object, which centralizes all guest_memfd state
> information (such as fd and shared_bitmap) within a RAMBlock.
> 
> Create and initialize the RamBlockAttribute object upon ram_block_add().
> Meanwhile, register the object in the target RAMBlock's MemoryRegion.
> After that, a guest_memfd-backed RAMBlock is associated with the
> RamDiscardManager interface, and its users will execute
> RamDiscardManager-specific handling. For example, VFIO will register the
> RamDiscardListener as expected. The live migration path needs to be
> avoided since it is not yet supported in confidential VMs.
> 
> Additionally, use the ram_block_attribute_state_change() helper to
> notify the registered RamDiscardListener of these changes.
> 
> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
> ---
> Changes in v5:
>      - Revert to use RamDiscardManager interface.
>      - Move the object_new() into the ram_block_attribute_create()
>        helper.
>      - Add some check in migration path.
> 
> Changes in v4:
>      - Remove the replay operations for attribute changes which will be
>        handled in a listener in following patches.
>      - Add some comment in the error path of realize() to remind the
>        future development of the unified error path.
> 
> Changes in v3:
>      - Use ram_discard_manager_reply_populated/discarded() to set the
>        memory attribute and add the undo support if state_change()
>        failed.
>      - Didn't add Reviewed-by from Alexey due to the new changes in this
>        commit.
> 
> Changes in v2:
>      - Introduce a new field memory_attribute_manager in RAMBlock.
>      - Move the state_change() handling during page conversion in this patch.
>      - Undo what we did if it fails to set.
>      - Change the order of close(guest_memfd) and memory_attribute_manager cleanup.
> ---
>   accel/kvm/kvm-all.c |  9 +++++++++
>   migration/ram.c     | 28 ++++++++++++++++++++++++++++
>   system/physmem.c    | 14 ++++++++++++++
>   3 files changed, 51 insertions(+)
> 
> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
> index 51526d301b..2d7ecaeb6a 100644
> --- a/accel/kvm/kvm-all.c
> +++ b/accel/kvm/kvm-all.c
> @@ -3089,6 +3089,15 @@ int kvm_convert_memory(hwaddr start, hwaddr size, bool to_private)
>       addr = memory_region_get_ram_ptr(mr) + section.offset_within_region;
>       rb = qemu_ram_block_from_host(addr, false, &offset);
>   
> +    ret = ram_block_attribute_state_change(RAM_BLOCK_ATTRIBUTE(mr->rdm),
> +                                           offset, size, to_private);
> +    if (ret) {
> +        error_report("Failed to notify the listener the state change of "
> +                     "(0x%"HWADDR_PRIx" + 0x%"HWADDR_PRIx") to %s",
> +                     start, size, to_private ? "private" : "shared");
> +        goto out_unref;
> +    }
> +
>       if (to_private) {
>           if (rb->page_size != qemu_real_host_page_size()) {
>               /*
> diff --git a/migration/ram.c b/migration/ram.c
> index c004f37060..69c9a42f16 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -890,6 +890,13 @@ static uint64_t ramblock_dirty_bitmap_clear_discarded_pages(RAMBlock *rb)
>   
>       if (rb->mr && rb->bmap && memory_region_has_ram_discard_manager(rb->mr)) {
>           RamDiscardManager *rdm = memory_region_get_ram_discard_manager(rb->mr);
> +
> +        if (object_dynamic_cast(OBJECT(rdm), TYPE_RAM_BLOCK_ATTRIBUTE)) {
> +            error_report("%s: Live migration for confidential VM is not "
> +                         "supported yet.", __func__);
> +            exit(1);
> +        }
>

These checks seem conceptually wrong.

I think if we were to special-case specific implementations, we should 
do so using a different callback.

But why should we bother at all checking basic live migration handling 
("unsupported for confidential VMs") on this level, and even just 
exit()'ing the process?

All these object_dynamic_cast() checks in this patch should be dropped.

-- 
Cheers,

David / dhildenb



* Re: [PATCH v5 07/10] RAMBlock: Make guest_memfd require coordinated discard
  2025-05-20 10:28 ` [PATCH v5 07/10] RAMBlock: Make guest_memfd require coordinated discard Chenyi Qiang
@ 2025-05-26  9:08   ` David Hildenbrand
  2025-05-27  5:47     ` Chenyi Qiang
  0 siblings, 1 reply; 51+ messages in thread
From: David Hildenbrand @ 2025-05-26  9:08 UTC (permalink / raw)
  To: Chenyi Qiang, Alexey Kardashevskiy, Peter Xu, Gupta Pankaj,
	Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth
  Cc: qemu-devel, kvm, Williams Dan J, Zhao Liu, Baolu Lu, Gao Chao,
	Xu Yilun, Li Xiaoyao

On 20.05.25 12:28, Chenyi Qiang wrote:
> As guest_memfd is now managed by RamBlockAttribute with
> RamDiscardManager, only block uncoordinated discard.
> 
> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
> ---
> Changes in v5:
>      - Revert to use RamDiscardManager.
> 
> Changes in v4:
>      - Modify commit message (RamDiscardManager->PrivateSharedManager).
> 
> Changes in v3:
>      - No change.
> 
> Changes in v2:
>      - Change the ram_block_discard_require(false) to
>        ram_block_coordinated_discard_require(false).
> ---
>   system/physmem.c | 6 +++---
>   1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/system/physmem.c b/system/physmem.c
> index f05f7ff09a..58b7614660 100644
> --- a/system/physmem.c
> +++ b/system/physmem.c
> @@ -1916,7 +1916,7 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
>           }
>           assert(new_block->guest_memfd < 0);
>   
> -        ret = ram_block_discard_require(true);
> +        ret = ram_block_coordinated_discard_require(true);
>           if (ret < 0) {
>               error_setg_errno(errp, -ret,
>                                "cannot set up private guest memory: discard currently blocked");
> @@ -1939,7 +1939,7 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
>                * ever develops a need to check for errors.
>                */
>               close(new_block->guest_memfd);
> -            ram_block_discard_require(false);
> +            ram_block_coordinated_discard_require(false);
>               qemu_mutex_unlock_ramlist();
>               goto out_free;
>           }
> @@ -2302,7 +2302,7 @@ static void reclaim_ramblock(RAMBlock *block)
>       if (block->guest_memfd >= 0) {
>           ram_block_attribute_destroy(block->ram_shared);
>           close(block->guest_memfd);
> -        ram_block_discard_require(false);
> +        ram_block_coordinated_discard_require(false);
>       }
>   
>       g_free(block);


I think this patch should be squashed into the previous one, then the 
story in that single patch is consistent.

-- 
Cheers,

David / dhildenb



* Re: [PATCH v5 10/10] ram-block-attribute: Add more error handling during state changes
  2025-05-20 10:28 ` [PATCH v5 10/10] ram-block-attribute: Add more error handling during state changes Chenyi Qiang
@ 2025-05-26  9:17   ` David Hildenbrand
  2025-05-26 10:19     ` Chenyi Qiang
  2025-05-27  9:11   ` Alexey Kardashevskiy
  1 sibling, 1 reply; 51+ messages in thread
From: David Hildenbrand @ 2025-05-26  9:17 UTC (permalink / raw)
  To: Chenyi Qiang, Alexey Kardashevskiy, Peter Xu, Gupta Pankaj,
	Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth
  Cc: qemu-devel, kvm, Williams Dan J, Zhao Liu, Baolu Lu, Gao Chao,
	Xu Yilun, Li Xiaoyao

On 20.05.25 12:28, Chenyi Qiang wrote:
> The current error handling is simple, with the following assumptions:
> - QEMU will quit instead of resuming the guest if kvm_convert_memory()
>    fails, thus there is no need to roll back.
> - The range to convert is required to be entirely in the desired state;
>    handling a mixed range is not allowed.
> - The conversion from shared to private is a non-failing operation.
> 
> This is sufficient for now, as complex error handling is not required.
> For future extension, add some potential error handling.
> - For private to shared conversion, do the rollback operation if
>    ram_block_attribute_notify_to_populated() fails.
> - For shared to private conversion, still assert it as a non-failure
>    operation for now. It could be an easy fail path with in-place
>    conversion, which will likely have to retry the conversion until it
>    works in the future.
> - For mixture case, process individual blocks for ease of rollback.
> 
> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
> ---
>   system/ram-block-attribute.c | 116 +++++++++++++++++++++++++++--------
>   1 file changed, 90 insertions(+), 26 deletions(-)
> 
> diff --git a/system/ram-block-attribute.c b/system/ram-block-attribute.c
> index 387501b569..0af3396aa4 100644
> --- a/system/ram-block-attribute.c
> +++ b/system/ram-block-attribute.c
> @@ -289,7 +289,12 @@ static int ram_block_attribute_notify_to_discard(RamBlockAttribute *attr,
>           }
>           ret = rdl->notify_discard(rdl, &tmp);
>           if (ret) {
> -            break;
> +            /*
> +             * The current to_private listeners (VFIO dma_unmap and
> +             * KVM set_attribute_private) are non-failing operations.
> +             * TODO: add rollback operations if it is allowed to fail.
> +             */
> +            g_assert(ret);
>           }
>       }
>   

If it's not allowed to fail for now, then patch #8 does not make sense 
and should be dropped :)

The implementations (vfio) should likely exit() instead on unexpected 
errors when discarding.



Why not squash all the below into the corresponding patch? Looks mostly 
like handling partial conversions correctly (as discussed previously)?

> @@ -300,7 +305,7 @@ static int
>   ram_block_attribute_notify_to_populated(RamBlockAttribute *attr,
>                                           uint64_t offset, uint64_t size)
>   {
> -    RamDiscardListener *rdl;
> +    RamDiscardListener *rdl, *rdl2;
>       int ret = 0;
>   
>       QLIST_FOREACH(rdl, &attr->rdl_list, next) {
> @@ -315,6 +320,20 @@ ram_block_attribute_notify_to_populated(RamBlockAttribute *attr,
>           }
>       }
>   
> +    if (ret) {
> +        /* Notify all already-notified listeners. */
> +        QLIST_FOREACH(rdl2, &attr->rdl_list, next) {
> +            MemoryRegionSection tmp = *rdl2->section;
> +
> +            if (rdl == rdl2) {
> +                break;
> +            }
> +            if (!memory_region_section_intersect_range(&tmp, offset, size)) {
> +                continue;
> +            }
> +            rdl2->notify_discard(rdl2, &tmp);
> +        }
> +    }
>       return ret;
>   }
>   
> @@ -353,6 +372,9 @@ int ram_block_attribute_state_change(RamBlockAttribute *attr, uint64_t offset,
>       const int block_size = ram_block_attribute_get_block_size(attr);
>       const unsigned long first_bit = offset / block_size;
>       const unsigned long nbits = size / block_size;
> +    const uint64_t end = offset + size;
> +    unsigned long bit;
> +    uint64_t cur;
>       int ret = 0;
>   
>       if (!ram_block_attribute_is_valid_range(attr, offset, size)) {
> @@ -361,32 +383,74 @@ int ram_block_attribute_state_change(RamBlockAttribute *attr, uint64_t offset,
>           return -1;
>       }
>   
> -    /* Already discard/populated */
> -    if ((ram_block_attribute_is_range_discard(attr, offset, size) &&
> -         to_private) ||
> -        (ram_block_attribute_is_range_populated(attr, offset, size) &&
> -         !to_private)) {
> -        return 0;
> -    }
> -
> -    /* Unexpected mixture */
> -    if ((!ram_block_attribute_is_range_populated(attr, offset, size) &&
> -         to_private) ||
> -        (!ram_block_attribute_is_range_discard(attr, offset, size) &&
> -         !to_private)) {
> -        error_report("%s, the range is not all in the desired state: "
> -                     "(offset 0x%lx, size 0x%lx), %s",
> -                     __func__, offset, size,
> -                     to_private ? "private" : "shared");
> -        return -1;
> -    }
> -
>       if (to_private) {
> -        bitmap_clear(attr->bitmap, first_bit, nbits);
> -        ret = ram_block_attribute_notify_to_discard(attr, offset, size);
> +        if (ram_block_attribute_is_range_discard(attr, offset, size)) {
> +            /* Already private */
> +        } else if (!ram_block_attribute_is_range_populated(attr, offset,
> +                                                           size)) {
> +            /* Unexpected mixture: process individual blocks */
> +            for (cur = offset; cur < end; cur += block_size) {
> +                bit = cur / block_size;
> +                if (!test_bit(bit, attr->bitmap)) {
> +                    continue;
> +                }
> +                clear_bit(bit, attr->bitmap);
> +                ram_block_attribute_notify_to_discard(attr, cur, block_size);
> +            }
> +        } else {
> +            /* Completely shared */
> +            bitmap_clear(attr->bitmap, first_bit, nbits);
> +            ram_block_attribute_notify_to_discard(attr, offset, size);
> +        }
>       } else {
> -        bitmap_set(attr->bitmap, first_bit, nbits);
> -        ret = ram_block_attribute_notify_to_populated(attr, offset, size);
> +        if (ram_block_attribute_is_range_populated(attr, offset, size)) {
> +            /* Already shared */
> +        } else if (!ram_block_attribute_is_range_discard(attr, offset, size)) {
> +            /* Unexpected mixture: process individual blocks */
> +            unsigned long *modified_bitmap = bitmap_new(nbits);
> +
> +            for (cur = offset; cur < end; cur += block_size) {
> +                bit = cur / block_size;
> +                if (test_bit(bit, attr->bitmap)) {
> +                    continue;
> +                }
> +                set_bit(bit, attr->bitmap);
> +                ret = ram_block_attribute_notify_to_populated(attr, cur,
> +                                                           block_size);
> +                if (!ret) {
> +                    set_bit(bit - first_bit, modified_bitmap);
> +                    continue;
> +                }
> +                clear_bit(bit, attr->bitmap);
> +                break;
> +            }
> +
> +            if (ret) {
> +                /*
> +                 * Very unexpected: something went wrong. Revert to the old
> +                 * state, marking only the blocks as private that we converted
> +                 * to shared.
> +                 */
> +                for (cur = offset; cur < end; cur += block_size) {
> +                    bit = cur / block_size;
> +                    if (!test_bit(bit - first_bit, modified_bitmap)) {
> +                        continue;
> +                    }
> +                    assert(test_bit(bit, attr->bitmap));
> +                    clear_bit(bit, attr->bitmap);
> +                    ram_block_attribute_notify_to_discard(attr, cur,
> +                                                          block_size);
> +                }
> +            }
> +            g_free(modified_bitmap);
> +        } else {
> +            /* Complete private */
> +            bitmap_set(attr->bitmap, first_bit, nbits);
> +            ret = ram_block_attribute_notify_to_populated(attr, offset, size);
> +            if (ret) {
> +                bitmap_clear(attr->bitmap, first_bit, nbits);
> +            }
> +        }
>       }
>   
>       return ret;


-- 
Cheers,

David / dhildenb



* Re: [PATCH v5 09/10] KVM: Introduce RamDiscardListener for attribute changes during memory conversions
  2025-05-20 10:28 ` [PATCH v5 09/10] KVM: Introduce RamDiscardListener for attribute changes during memory conversions Chenyi Qiang
@ 2025-05-26  9:22   ` David Hildenbrand
  2025-05-27  8:01   ` Alexey Kardashevskiy
  1 sibling, 0 replies; 51+ messages in thread
From: David Hildenbrand @ 2025-05-26  9:22 UTC (permalink / raw)
  To: Chenyi Qiang, Alexey Kardashevskiy, Peter Xu, Gupta Pankaj,
	Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth
  Cc: qemu-devel, kvm, Williams Dan J, Zhao Liu, Baolu Lu, Gao Chao,
	Xu Yilun, Li Xiaoyao

On 20.05.25 12:28, Chenyi Qiang wrote:
> With the introduction of the RamBlockAttribute object to manage
> RAMBlocks with guest_memfd, it is more elegant to move KVM set attribute
> into a RamDiscardListener.
> 
> The KVM attribute change RamDiscardListener is registered/unregistered
> for each memory region section during kvm_region_add/del(). The listener
> handler performs attribute change upon receiving notifications from
> ram_block_attribute_state_change() calls. After this change, the
> operations in kvm_convert_memory() can be removed.
> 
> Note that errors can be returned in
> ram_block_attribute_notify_to_discard() by KVM attribute changes,
> although this is currently unlikely to happen. With in-place conversion
> of guest_memfd in the future, it will be more likely to encounter errors
> that require error handling. For now, simply return the result, and
> kvm_convert_memory() will cause QEMU to quit if any issue arises.
> 
> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
> ---

[...]

>   static void kvm_region_commit(MemoryListener *listener)
> @@ -3077,15 +3140,6 @@ int kvm_convert_memory(hwaddr start, hwaddr size, bool to_private)
>           goto out_unref;
>       }
>   
> -    if (to_private) {
> -        ret = kvm_set_memory_attributes_private(start, size);
> -    } else {
> -        ret = kvm_set_memory_attributes_shared(start, size);
> -    }
> -    if (ret) {
> -        goto out_unref;
> -    }
> -

I wonder if it's best to leave that out for now. With in-place 
conversion it will all get a bit more tricky, because we'd need to call 
in different orders ...

e.g., do private -> shared before mapping to vfio, but to shared 
->private after unmapping from vfio.

That can be easier handled when doing the calls from KVM code directly.

-- 
Cheers,

David / dhildenb



* Re: [PATCH v5 04/10] ram-block-attribute: Introduce RamBlockAttribute to manage RAMBlock with guest_memfd
  2025-05-26  9:01   ` David Hildenbrand
@ 2025-05-26  9:28     ` Chenyi Qiang
  2025-05-26 11:16       ` Alexey Kardashevskiy
  0 siblings, 1 reply; 51+ messages in thread
From: Chenyi Qiang @ 2025-05-26  9:28 UTC (permalink / raw)
  To: David Hildenbrand, Alexey Kardashevskiy, Peter Xu, Gupta Pankaj,
	Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth
  Cc: qemu-devel, kvm, Williams Dan J, Zhao Liu, Baolu Lu, Gao Chao,
	Xu Yilun, Li Xiaoyao



On 5/26/2025 5:01 PM, David Hildenbrand wrote:
> On 20.05.25 12:28, Chenyi Qiang wrote:
>> Commit 852f0048f3 ("RAMBlock: make guest_memfd require uncoordinated
>> discard") highlighted that subsystems like VFIO may disable RAM block
>> discard. However, guest_memfd relies on discard operations for page
>> conversion between private and shared memory, potentially leading to
>> stale IOMMU mappings when assigning hardware devices to
>> confidential VMs via shared memory. To address this and allow shared
>> device assignment, it is crucial to ensure that the VFIO system
>> refreshes its IOMMU mappings.
>>
>> RamDiscardManager is an existing interface (used by virtio-mem) to
>> adjust VFIO mappings in relation to VM page assignment. Effectively page
>> conversion is similar to hot-removing a page in one mode and adding it
>> back in the other. Therefore, similar actions are required for page
>> conversion events. Introduce the RamDiscardManager to guest_memfd to
>> facilitate this process.
>>
>> Since guest_memfd is not an object, it cannot directly implement the
>> RamDiscardManager interface. Implementing it in HostMemoryBackend is
>> not appropriate because guest_memfd is per RAMBlock, and some RAMBlocks
>> have a memory backend while others do not. Notably, virtual BIOS
>> RAMBlocks using memory_region_init_ram_guest_memfd() do not have a
>> backend.
>>
>> To manage RAMBlocks with guest_memfd, define a new object named
>> RamBlockAttribute to implement the RamDiscardManager interface. This
>> object can store guest_memfd information such as the bitmap for shared
>> memory, and handle page conversion notifications. In the context of
>> RamDiscardManager, shared state is analogous to populated, and private
>> state is treated as discarded. The memory state is tracked at host
>> page size granularity, as the minimum memory conversion size can be one
>> page per request. Additionally, VFIO expects the DMA mapping for a
>> specific iova to be mapped and unmapped with the same granularity.
>> Confidential VMs may perform partial conversions, such as conversions
>> on small regions within larger regions. To prevent such invalid cases,
>> and until cut_mapping operation support is available, all operations
>> are performed with 4K granularity.
>>
>> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
>> ---
>> Changes in v5:
>>      - Revert to use RamDiscardManager interface instead of introducing
>>        new hierarchy of class to manage private/shared state, and keep
>>        using the new name of RamBlockAttribute compared with the
>>        MemoryAttributeManager in v3.
>>      - Use *simple* version of object_define and object_declare since the
>>        state_change() function is changed as an exported function instead
>>        of a virtual function in later patch.
>>      - Move the introduction of RamBlockAttribute field to this patch and
>>        rename it to ram_shared. (Alexey)
>>      - call the exit() when register/unregister failed. (Zhao)
>>      - Add the ram-block-attribute.c to Memory API related part in
>>        MAINTAINERS.
>>
>> Changes in v4:
>>      - Change the name from memory-attribute-manager to
>>        ram-block-attribute.
>>      - Implement the newly-introduced PrivateSharedManager instead of
>>        RamDiscardManager and change related commit message.
>>      - Define the new object in ramblock.h instead of adding a new file.
>>
>> Changes in v3:
>>      - Some rename (bitmap_size->shared_bitmap_size,
>>        first_one/zero_bit->first_bit, etc.)
>>      - Change shared_bitmap_size from uint32_t to unsigned
>>      - Return mgr->mr->ram_block->page_size in get_block_size()
>>      - Move set_ram_discard_manager() up to avoid a g_free() in failure
>>        case.
>>      - Add const for the memory_attribute_manager_get_block_size()
>>      - Unify the ReplayRamPopulate and ReplayRamDiscard and related
>>        callback.
>>
>> Changes in v2:
>>      - Rename the object name to MemoryAttributeManager
>>      - Rename the bitmap to shared_bitmap to make it more clear.
>>      - Remove block_size field and get it from a helper. In future, we
>>        can get the page_size from RAMBlock if necessary.
>>      - Remove the unnecessary "struct" before GuestMemfdReplayData
>>      - Remove the unnecessary g_free() for the bitmap
>>      - Add some error reports when the callback fails for a
>>        populated/discarded section.
>>      - Move the realize()/unrealize() definition to this patch.
>> ---
>>   MAINTAINERS                  |   1 +
>>   include/system/ramblock.h    |  20 +++
>>   system/meson.build           |   1 +
>>   system/ram-block-attribute.c | 311 +++++++++++++++++++++++++++++++++++
>>   4 files changed, 333 insertions(+)
>>   create mode 100644 system/ram-block-attribute.c
>>
>> diff --git a/MAINTAINERS b/MAINTAINERS
>> index 6dacd6d004..3b4947dc74 100644
>> --- a/MAINTAINERS
>> +++ b/MAINTAINERS
>> @@ -3149,6 +3149,7 @@ F: system/memory.c
>>   F: system/memory_mapping.c
>>   F: system/physmem.c
>>   F: system/memory-internal.h
>> +F: system/ram-block-attribute.c
>>   F: scripts/coccinelle/memory-region-housekeeping.cocci
>>     Memory devices
>> diff --git a/include/system/ramblock.h b/include/system/ramblock.h
>> index d8a116ba99..09255e8495 100644
>> --- a/include/system/ramblock.h
>> +++ b/include/system/ramblock.h
>> @@ -22,6 +22,10 @@
>>   #include "exec/cpu-common.h"
>>   #include "qemu/rcu.h"
>>   #include "exec/ramlist.h"
>> +#include "system/hostmem.h"
>> +
>> +#define TYPE_RAM_BLOCK_ATTRIBUTE "ram-block-attribute"
>> +OBJECT_DECLARE_SIMPLE_TYPE(RamBlockAttribute, RAM_BLOCK_ATTRIBUTE)
>>     struct RAMBlock {
>>       struct rcu_head rcu;
>> @@ -42,6 +46,8 @@ struct RAMBlock {
>>       int fd;
>>       uint64_t fd_offset;
>>       int guest_memfd;
>> +    /* 1-setting of the bitmap in ram_shared represents ram is shared */
> 
> That comment looks misplaced, and the variable misnamed.
> 
> The commet should go into RamBlockAttribute and the variable should
> likely be named "attributes".
> 
> Also, "ram_shared" is not used at all in this patch, it should be moved
> into the corresponding patch.

I thought we would only manage the private and shared attributes, hence
the name ram_shared, and that it could be renamed to attributes in the
future if other attributes were managed. It seems I overcomplicated things.

> 
>> +    RamBlockAttribute *ram_shared;
>>       size_t page_size;
>>       /* dirty bitmap used during migration */
>>       unsigned long *bmap;
>> @@ -91,4 +97,18 @@ struct RAMBlock {
>>       ram_addr_t postcopy_length;
>>   };
>>   +struct RamBlockAttribute {
> 
> Should this actually be "RamBlockAttributes" ?

Yes. To match with variable name "attributes", it can be renamed as
RamBlockAttributes.

> 
>> +    Object parent;
>> +
>> +    MemoryRegion *mr;
> 
> 
> Should we link to the parent RAMBlock instead, and lookup the MR from
> there?

Good suggestion! It can also help to shorten the long chain of arrow
dereferences in ram_block_attribute_get_block_size().
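For illustration, the page-granularity shared/private tracking described
in the commit message can be sketched as follows (a toy model with
hypothetical helper names and a fixed 4K block size, not the actual
RamBlockAttribute implementation):

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 4096 /* assumed 4K granularity, per the commit message */
#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/*
 * Hypothetical stand-in for RamBlockAttribute: one bit per 4K block,
 * 1 = shared (populated), 0 = private (discarded).
 */
typedef struct {
    unsigned long bitmap[4]; /* tracks 4 * BITS_PER_LONG blocks */
} ToyRamBlockAttribute;

static void toy_set_shared(ToyRamBlockAttribute *attr, uint64_t offset,
                           uint64_t size, bool shared)
{
    /* Walk the range block by block and flip the corresponding bit. */
    for (uint64_t cur = offset; cur < offset + size; cur += BLOCK_SIZE) {
        uint64_t bit = cur / BLOCK_SIZE;
        if (shared) {
            attr->bitmap[bit / BITS_PER_LONG] |= 1UL << (bit % BITS_PER_LONG);
        } else {
            attr->bitmap[bit / BITS_PER_LONG] &= ~(1UL << (bit % BITS_PER_LONG));
        }
    }
}

static bool toy_is_shared(const ToyRamBlockAttribute *attr, uint64_t offset)
{
    uint64_t bit = offset / BLOCK_SIZE;
    return attr->bitmap[bit / BITS_PER_LONG] & (1UL << (bit % BITS_PER_LONG));
}
```

A 0x3000-byte conversion starting at 0x2000 thus flips exactly three
bits, which is why conversion requests must be multiples of the block
size.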

> 
> 


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v5 08/10] memory: Change NotifyRamDiscard() definition to return the result
  2025-05-20 10:28 ` [PATCH v5 08/10] memory: Change NotifyRamDiscard() definition to return the result Chenyi Qiang
@ 2025-05-26  9:31   ` Philippe Mathieu-Daudé
  2025-05-26 10:36   ` Cédric Le Goater
  1 sibling, 0 replies; 51+ messages in thread
From: Philippe Mathieu-Daudé @ 2025-05-26  9:31 UTC (permalink / raw)
  To: Chenyi Qiang, David Hildenbrand, Alexey Kardashevskiy, Peter Xu,
	Gupta Pankaj, Paolo Bonzini, Michael Roth
  Cc: qemu-devel, kvm, Williams Dan J, Zhao Liu, Baolu Lu, Gao Chao,
	Xu Yilun, Li Xiaoyao

On 20/5/25 12:28, Chenyi Qiang wrote:
> So that the caller can check the result of the NotifyRamDiscard() handler
> if the operation fails.
> 
> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
> ---
> Changes in v5:
>      - Revert to use of NotifyRamDiscard()
> 
> Changes in v4:
>      - Newly added.
> ---
>   hw/vfio/listener.c           | 6 ++++--
>   include/system/memory.h      | 4 ++--
>   system/ram-block-attribute.c | 3 +--
>   3 files changed, 7 insertions(+), 6 deletions(-)


> diff --git a/include/system/memory.h b/include/system/memory.h
> index 83b28551c4..e5155120d9 100644
> --- a/include/system/memory.h
> +++ b/include/system/memory.h
> @@ -518,8 +518,8 @@ struct IOMMUMemoryRegionClass {
>   typedef struct RamDiscardListener RamDiscardListener;
>   typedef int (*NotifyRamPopulate)(RamDiscardListener *rdl,
>                                    MemoryRegionSection *section);
> -typedef void (*NotifyRamDiscard)(RamDiscardListener *rdl,
> -                                 MemoryRegionSection *section);
> +typedef int (*NotifyRamDiscard)(RamDiscardListener *rdl,
> +                                MemoryRegionSection *section);

Please document the return value.
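A hedged sketch of what such documentation and error propagation could
look like (toy stand-in types and a hypothetical notify_all() caller,
not the real QEMU definitions):

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins; the real types live in include/system/memory.h. */
typedef struct MemoryRegionSection MemoryRegionSection;
typedef struct RamDiscardListener RamDiscardListener;

/*
 * NotifyRamDiscard: called when a range transitions to discarded.
 *
 * Returns 0 on success, or a negative errno-style value on failure,
 * in which case the caller may stop notifying further listeners.
 */
typedef int (*NotifyRamDiscard)(RamDiscardListener *rdl,
                                MemoryRegionSection *section);

struct RamDiscardListener {
    NotifyRamDiscard notify_discard;
};

/* With an int return, a caller can propagate the first failure. */
static int notify_all(RamDiscardListener *rdls[], size_t n,
                      MemoryRegionSection *section)
{
    for (size_t i = 0; i < n; i++) {
        int ret = rdls[i]->notify_discard(rdls[i], section);
        if (ret) {
            return ret;
        }
    }
    return 0;
}

/* Sample listeners for demonstration. */
static int ok_notify(RamDiscardListener *rdl, MemoryRegionSection *s)
{
    (void)rdl; (void)s;
    return 0;
}

static int fail_notify(RamDiscardListener *rdl, MemoryRegionSection *s)
{
    (void)rdl; (void)s;
    return -22; /* an errno-style failure */
}
```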




* Re: [PATCH v5 03/10] memory: Unify the definition of ReplayRamPopulate() and ReplayRamDiscard()
  2025-05-20 10:28 ` [PATCH v5 03/10] memory: Unify the definition of ReplayRamPopulate() and ReplayRamDiscard() Chenyi Qiang
  2025-05-26  8:42   ` David Hildenbrand
@ 2025-05-26  9:35   ` Philippe Mathieu-Daudé
  2025-05-26 10:21     ` Chenyi Qiang
  2025-05-27  6:56   ` Alexey Kardashevskiy
  2 siblings, 1 reply; 51+ messages in thread
From: Philippe Mathieu-Daudé @ 2025-05-26  9:35 UTC (permalink / raw)
  To: Chenyi Qiang, David Hildenbrand, Alexey Kardashevskiy, Peter Xu,
	Gupta Pankaj, Paolo Bonzini, Michael Roth
  Cc: qemu-devel, kvm, Williams Dan J, Zhao Liu, Baolu Lu, Gao Chao,
	Xu Yilun, Li Xiaoyao

Hi Chenyi Qiang,

On 20/5/25 12:28, Chenyi Qiang wrote:
> Update the ReplayRamDiscard() function to return the result, and unify
> ReplayRamPopulate() and ReplayRamDiscard() into ReplayRamDiscardState(),
> since their definitions become identical. This unification simplifies
> related structures, such as VirtIOMEMReplayData, making them cleaner.
> 
> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
> ---
> Changes in v5:
>      - Rename ReplayRamStateChange to ReplayRamDiscardState (David)
>      - return data->fn(s, data->opaque) instead of 0 in
>        virtio_mem_rdm_replay_discarded_cb(). (Alexey)
> 
> Changes in v4:
>      - Modify the commit message. We won't use Replay() operation when
>        doing the attribute change like v3.
> 
> Changes in v3:
>      - Newly added.
> ---
>   hw/virtio/virtio-mem.c  | 21 ++++++++++-----------
>   include/system/memory.h | 36 +++++++++++++++++++-----------------
>   migration/ram.c         |  5 +++--
>   system/memory.c         | 12 ++++++------
>   4 files changed, 38 insertions(+), 36 deletions(-)


> diff --git a/include/system/memory.h b/include/system/memory.h
> index 896948deb1..83b28551c4 100644
> --- a/include/system/memory.h
> +++ b/include/system/memory.h
> @@ -575,8 +575,8 @@ static inline void ram_discard_listener_init(RamDiscardListener *rdl,
>       rdl->double_discard_supported = double_discard_supported;
>   }
>   
> -typedef int (*ReplayRamPopulate)(MemoryRegionSection *section, void *opaque);
> -typedef void (*ReplayRamDiscard)(MemoryRegionSection *section, void *opaque);
> +typedef int (*ReplayRamDiscardState)(MemoryRegionSection *section,
> +                                     void *opaque);

While changing this prototype, please add a documentation comment.

>   /*
>    * RamDiscardManagerClass:
> @@ -650,36 +650,38 @@ struct RamDiscardManagerClass {
>       /**
>        * @replay_populated:
>        *
> -     * Call the #ReplayRamPopulate callback for all populated parts within the
> -     * #MemoryRegionSection via the #RamDiscardManager.
> +     * Call the #ReplayRamDiscardState callback for all populated parts within
> +     * the #MemoryRegionSection via the #RamDiscardManager.
>        *
>        * In case any call fails, no further calls are made.
>        *
>        * @rdm: the #RamDiscardManager
>        * @section: the #MemoryRegionSection
> -     * @replay_fn: the #ReplayRamPopulate callback
> +     * @replay_fn: the #ReplayRamDiscardState callback
>        * @opaque: pointer to forward to the callback
>        *
>        * Returns 0 on success, or a negative error if any notification failed.
>        */
>       int (*replay_populated)(const RamDiscardManager *rdm,
>                               MemoryRegionSection *section,
> -                            ReplayRamPopulate replay_fn, void *opaque);
> +                            ReplayRamDiscardState replay_fn, void *opaque);
>   
>       /**
>        * @replay_discarded:
>        *
> -     * Call the #ReplayRamDiscard callback for all discarded parts within the
> -     * #MemoryRegionSection via the #RamDiscardManager.
> +     * Call the #ReplayRamDiscardState callback for all discarded parts within
> +     * the #MemoryRegionSection via the #RamDiscardManager.
>        *
>        * @rdm: the #RamDiscardManager
>        * @section: the #MemoryRegionSection
> -     * @replay_fn: the #ReplayRamDiscard callback
> +     * @replay_fn: the #ReplayRamDiscardState callback
>        * @opaque: pointer to forward to the callback
> +     *
> +     * Returns 0 on success, or a negative error if any notification failed.
>        */
> -    void (*replay_discarded)(const RamDiscardManager *rdm,
> -                             MemoryRegionSection *section,
> -                             ReplayRamDiscard replay_fn, void *opaque);
> +    int (*replay_discarded)(const RamDiscardManager *rdm,
> +                            MemoryRegionSection *section,
> +                            ReplayRamDiscardState replay_fn, void *opaque);
>   
>       /**
>        * @register_listener:
> @@ -722,13 +724,13 @@ bool ram_discard_manager_is_populated(const RamDiscardManager *rdm,
>   
>   int ram_discard_manager_replay_populated(const RamDiscardManager *rdm,
>                                            MemoryRegionSection *section,
> -                                         ReplayRamPopulate replay_fn,
> +                                         ReplayRamDiscardState replay_fn,
>                                            void *opaque);
>   
> -void ram_discard_manager_replay_discarded(const RamDiscardManager *rdm,
> -                                          MemoryRegionSection *section,
> -                                          ReplayRamDiscard replay_fn,
> -                                          void *opaque);
> +int ram_discard_manager_replay_discarded(const RamDiscardManager *rdm,
> +                                         MemoryRegionSection *section,
> +                                         ReplayRamDiscardState replay_fn,
> +                                         void *opaque);

Similar for ram_discard_manager_replay_populated() and
ram_discard_manager_replay_discarded(), since you understood
what they do :)

Thanks!

Phil.
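The unification reviewed above can be illustrated with a toy model: a
single ReplayRamDiscardState callback type serves both the populated and
the discarded replay paths, and both paths can now report failure
(simplified stand-ins with hypothetical names, not the actual QEMU API):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* One shared callback type replaces the former
 * ReplayRamPopulate/ReplayRamDiscard pair. */
typedef int (*ReplayRamDiscardState)(uint64_t offset, uint64_t size,
                                     void *opaque);

#define NBLOCKS 8
#define BLOCK   4096

/* Replay the callback over every contiguous run of blocks whose state
 * matches 'populated'; stop and propagate the first failure. */
static int replay_state(const bool state[NBLOCKS], bool populated,
                        ReplayRamDiscardState replay_fn, void *opaque)
{
    for (int i = 0; i < NBLOCKS; i++) {
        if (state[i] != populated) {
            continue;
        }
        int j = i;
        while (j + 1 < NBLOCKS && state[j + 1] == populated) {
            j++;
        }
        int ret = replay_fn((uint64_t)i * BLOCK,
                            (uint64_t)(j - i + 1) * BLOCK, opaque);
        if (ret) {
            return ret; /* both replay paths can now report failure */
        }
        i = j;
    }
    return 0;
}

/* Sample callback: accumulate the replayed byte count. */
static int count_bytes(uint64_t offset, uint64_t size, void *opaque)
{
    (void)offset;
    *(uint64_t *)opaque += size;
    return 0;
}
```

Because both replays take the same callback signature, helpers such as
VirtIOMEMReplayData only need to carry a single function-pointer type.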



* Re: [PATCH v5 06/10] memory: Attach RamBlockAttribute to guest_memfd-backed RAMBlocks
  2025-05-26  9:06   ` David Hildenbrand
@ 2025-05-26  9:46     ` Chenyi Qiang
  0 siblings, 0 replies; 51+ messages in thread
From: Chenyi Qiang @ 2025-05-26  9:46 UTC (permalink / raw)
  To: David Hildenbrand, Alexey Kardashevskiy, Peter Xu, Gupta Pankaj,
	Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth
  Cc: qemu-devel, kvm, Williams Dan J, Zhao Liu, Baolu Lu, Gao Chao,
	Xu Yilun, Li Xiaoyao



On 5/26/2025 5:06 PM, David Hildenbrand wrote:
> On 20.05.25 12:28, Chenyi Qiang wrote:
>> A new field, ram_shared, was introduced in RAMBlock to link to a
>> RamBlockAttribute object, which centralizes all guest_memfd state
>> information (such as fd and shared_bitmap) within a RAMBlock.
>>
>> Create and initialize the RamBlockAttribute object upon ram_block_add().
>> Meanwhile, register the object in the target RAMBlock's MemoryRegion.
>> After that, a guest_memfd-backed RAMBlock is associated with the
>> RamDiscardManager interface, and its users will execute
>> RamDiscardManager-specific handling. For example, VFIO will register the
>> RamDiscardListener as expected. The live migration path needs to be
>> avoided, since it is not yet supported in confidential VMs.
>>
>> Additionally, use the ram_block_attribute_state_change() helper to
>> notify the registered RamDiscardListener of these changes.
>>
>> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
>> ---
>> Changes in v5:
>>      - Revert to use RamDiscardManager interface.
>>      - Move the object_new() into the ram_block_attribute_create()
>>        helper.
>>      - Add some check in migration path.
>>
>> Changes in v4:
>>      - Remove the replay operations for attribute changes which will be
>>        handled in a listener in following patches.
>>      - Add some comment in the error path of realize() to remind the
>>        future development of the unified error path.
>>
>> Changes in v3:
>>      - Use ram_discard_manager_reply_populated/discarded() to set the
>>        memory attribute and add the undo support if state_change()
>>        failed.
>>      - Didn't add Reviewed-by from Alexey due to the new changes in this
>>        commit.
>>
>> Changes in v2:
>>      - Introduce a new field memory_attribute_manager in RAMBlock.
>>      - Move the state_change() handling during page conversion in this
>> patch.
>>      - Undo what we did if it fails to set.
>>      - Change the order of close(guest_memfd) and
>> memory_attribute_manager cleanup.
>> ---
>>   accel/kvm/kvm-all.c |  9 +++++++++
>>   migration/ram.c     | 28 ++++++++++++++++++++++++++++
>>   system/physmem.c    | 14 ++++++++++++++
>>   3 files changed, 51 insertions(+)
>>
>> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
>> index 51526d301b..2d7ecaeb6a 100644
>> --- a/accel/kvm/kvm-all.c
>> +++ b/accel/kvm/kvm-all.c
>> @@ -3089,6 +3089,15 @@ int kvm_convert_memory(hwaddr start, hwaddr
>> size, bool to_private)
>>       addr = memory_region_get_ram_ptr(mr) +
>> section.offset_within_region;
>>       rb = qemu_ram_block_from_host(addr, false, &offset);
>>   +    ret = ram_block_attribute_state_change(RAM_BLOCK_ATTRIBUTE(mr-
>> >rdm),
>> +                                           offset, size, to_private);
>> +    if (ret) {
>> +        error_report("Failed to notify the listener the state change
>> of "
>> +                     "(0x%"HWADDR_PRIx" + 0x%"HWADDR_PRIx") to %s",
>> +                     start, size, to_private ? "private" : "shared");
>> +        goto out_unref;
>> +    }
>> +
>>       if (to_private) {
>>           if (rb->page_size != qemu_real_host_page_size()) {
>>               /*
>> diff --git a/migration/ram.c b/migration/ram.c
>> index c004f37060..69c9a42f16 100644
>> --- a/migration/ram.c
>> +++ b/migration/ram.c
>> @@ -890,6 +890,13 @@ static uint64_t
>> ramblock_dirty_bitmap_clear_discarded_pages(RAMBlock *rb)
>>         if (rb->mr && rb->bmap &&
>> memory_region_has_ram_discard_manager(rb->mr)) {
>>           RamDiscardManager *rdm =
>> memory_region_get_ram_discard_manager(rb->mr);
>> +
>> +        if (object_dynamic_cast(OBJECT(rdm),
>> TYPE_RAM_BLOCK_ATTRIBUTE)) {
>> +            error_report("%s: Live migration for confidential VM is
>> not "
>> +                         "supported yet.", __func__);
>> +            exit(1);
>> +        }
>>
> 
> These checks seem conceptually wrong.
> 
> I think if we were to special-case specific implementations, we should
> do so using a different callback.
> 
> But why should we bother at all checking basic live migration handling
> ("unsupported for confidential VMs") on this level, and even just
> exit()'ing the process?

I thought these checks could act as placeholders and provide a
notification when someone tries to implement live migration for CoCo VMs.
(This series caused some confusion when the related live migration
handling was developed internally.) It is perfectly OK to drop them.
Sorry for the confusion.

> 
> All these object_dynamic_cast() checks in this patch should be dropped.

Will do. Thanks!

> 



* Re: [PATCH v5 10/10] ram-block-attribute: Add more error handling during state changes
  2025-05-26  9:17   ` David Hildenbrand
@ 2025-05-26 10:19     ` Chenyi Qiang
  2025-05-26 12:10       ` David Hildenbrand
  0 siblings, 1 reply; 51+ messages in thread
From: Chenyi Qiang @ 2025-05-26 10:19 UTC (permalink / raw)
  To: David Hildenbrand, Alexey Kardashevskiy, Peter Xu, Gupta Pankaj,
	Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth
  Cc: qemu-devel, kvm, Williams Dan J, Zhao Liu, Baolu Lu, Gao Chao,
	Xu Yilun, Li Xiaoyao



On 5/26/2025 5:17 PM, David Hildenbrand wrote:
> On 20.05.25 12:28, Chenyi Qiang wrote:
>> The current error handling is simple, with the following assumptions:
>> - QEMU will quit instead of resuming the guest if kvm_convert_memory()
>>    fails, thus no need to do rollback.
>> - The range to convert is required to be entirely in the desired state.
>>    It is not allowed to handle a mixed-state case.
>> - The conversion from shared to private is a non-failure operation.
>>
>> This is sufficient for now, as complex error handling is not required.
>> For future extension, add some potential error handling.
>> - For private to shared conversion, do the rollback operation if
>>    ram_block_attribute_notify_to_populated() fails.
>> - For shared to private conversion, still assert it as a non-failure
>>    operation for now. With in-place conversion, it could easily become a
>>    failing path, which will likely have to retry the conversion until it
>>    succeeds in the future.
>> - For mixture case, process individual blocks for ease of rollback.
>>
>> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
>> ---
>>   system/ram-block-attribute.c | 116 +++++++++++++++++++++++++++--------
>>   1 file changed, 90 insertions(+), 26 deletions(-)
>>
>> diff --git a/system/ram-block-attribute.c b/system/ram-block-attribute.c
>> index 387501b569..0af3396aa4 100644
>> --- a/system/ram-block-attribute.c
>> +++ b/system/ram-block-attribute.c
>> @@ -289,7 +289,12 @@ static int
>> ram_block_attribute_notify_to_discard(RamBlockAttribute *attr,
>>           }
>>           ret = rdl->notify_discard(rdl, &tmp);
>>           if (ret) {
>> -            break;
>> +            /*
>> +             * The current to_private listeners (VFIO dma_unmap and
>> +             * KVM set_attribute_private) are non-failing operations.
>> +             * TODO: add rollback operations if it is allowed to fail.
>> +             */
>> +            g_assert(ret);
>>           }
>>       }
>>   
> 
> If it's not allowed to fail for now, then patch #8 does not make sense
> and should be dropped :)

It was intended for future extension, as in-place conversion to private
allows the operation to fail; that is why I added patch #8.

But as you mentioned, the conversion path is changing, and it may be
easier to handle this from the KVM code directly. Let me drop patch #8
and wait for in-place conversion to mature.

> 
> The implementations (vfio) should likely exit() instead on unexpected
> errors when discarding.

After dropping patch #8, maybe keep the VFIO discard handling as it was.
Adding an additional exit() is also OK to me since it is a non-failing case.

> 
> 
> 
> Why not squash all the below into the corresponding patch? Looks mostly
> like handling partial conversions correctly (as discussed previously)?

I extracted these two pieces of handling, 1) mixed-state conversion and
2) operation rollback, into this individual patch because they are not
practical cases and are untested.

For 1), I still don't see any real case that converts a range with mixed
attributes.

For 2), a failure of memory conversion (as seen in kvm_cpu_exec()
->kvm_convert_memory()) currently causes QEMU to quit instead of resuming
the guest, so doing the rollback seems useless at present.
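The rollback pattern under discussion can be sketched in isolation as
follows (a toy model of the modified_bitmap approach in the patch;
names, types, and the per-block notifier are hypothetical):

```c
#include <assert.h>
#include <stdbool.h>

#define NBLOCKS 8

/* Hypothetical per-block notifier: returns 0 on success. */
typedef int (*notify_fn)(int block, void *opaque);

/*
 * Convert blocks one by one to shared; on failure, revert only the
 * blocks this call actually flipped, leaving the original state intact.
 */
static int convert_to_shared(bool shared[NBLOCKS], notify_fn notify,
                             void *opaque)
{
    bool modified[NBLOCKS] = { false };
    int ret = 0;

    for (int i = 0; i < NBLOCKS; i++) {
        if (shared[i]) {
            continue; /* already shared */
        }
        shared[i] = true;
        ret = notify(i, opaque);
        if (ret) {
            shared[i] = false;
            break;
        }
        modified[i] = true; /* remember what we changed */
    }

    if (ret) {
        /* Roll back: flip back only the blocks converted above. */
        for (int i = 0; i < NBLOCKS; i++) {
            if (modified[i]) {
                shared[i] = false;
            }
        }
    }
    return ret;
}

/* Test helper: fail when asked to convert block '*(int *)opaque'. */
static int failing_notify(int block, void *opaque)
{
    return block == *(int *)opaque ? -1 : 0;
}
```

The key property is that a failed conversion leaves the bitmap exactly
as it was, which is what makes handling mixed-state ranges safe.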

> 
>> @@ -300,7 +305,7 @@ static int
>>   ram_block_attribute_notify_to_populated(RamBlockAttribute *attr,
>>                                           uint64_t offset, uint64_t size)
>>   {
>> -    RamDiscardListener *rdl;
>> +    RamDiscardListener *rdl, *rdl2;
>>       int ret = 0;
>>         QLIST_FOREACH(rdl, &attr->rdl_list, next) {
>> @@ -315,6 +320,20 @@
>> ram_block_attribute_notify_to_populated(RamBlockAttribute *attr,
>>           }
>>       }
>>   +    if (ret) {
>> +        /* Notify all already-notified listeners. */
>> +        QLIST_FOREACH(rdl2, &attr->rdl_list, next) {
>> +            MemoryRegionSection tmp = *rdl2->section;
>> +
>> +            if (rdl == rdl2) {
>> +                break;
>> +            }
>> +            if (!memory_region_section_intersect_range(&tmp, offset,
>> size)) {
>> +                continue;
>> +            }
>> +            rdl2->notify_discard(rdl2, &tmp);
>> +        }
>> +    }
>>       return ret;
>>   }
>>   @@ -353,6 +372,9 @@ int
>> ram_block_attribute_state_change(RamBlockAttribute *attr, uint64_t
>> offset,
>>       const int block_size = ram_block_attribute_get_block_size(attr);
>>       const unsigned long first_bit = offset / block_size;
>>       const unsigned long nbits = size / block_size;
>> +    const uint64_t end = offset + size;
>> +    unsigned long bit;
>> +    uint64_t cur;
>>       int ret = 0;
>>         if (!ram_block_attribute_is_valid_range(attr, offset, size)) {
>> @@ -361,32 +383,74 @@ int
>> ram_block_attribute_state_change(RamBlockAttribute *attr, uint64_t
>> offset,
>>           return -1;
>>       }
>>   -    /* Already discard/populated */
>> -    if ((ram_block_attribute_is_range_discard(attr, offset, size) &&
>> -         to_private) ||
>> -        (ram_block_attribute_is_range_populated(attr, offset, size) &&
>> -         !to_private)) {
>> -        return 0;
>> -    }
>> -
>> -    /* Unexpected mixture */
>> -    if ((!ram_block_attribute_is_range_populated(attr, offset, size) &&
>> -         to_private) ||
>> -        (!ram_block_attribute_is_range_discard(attr, offset, size) &&
>> -         !to_private)) {
>> -        error_report("%s, the range is not all in the desired state: "
>> -                     "(offset 0x%lx, size 0x%lx), %s",
>> -                     __func__, offset, size,
>> -                     to_private ? "private" : "shared");
>> -        return -1;
>> -    }
>> -
>>       if (to_private) {
>> -        bitmap_clear(attr->bitmap, first_bit, nbits);
>> -        ret = ram_block_attribute_notify_to_discard(attr, offset, size);
>> +        if (ram_block_attribute_is_range_discard(attr, offset, size)) {
>> +            /* Already private */
>> +        } else if (!ram_block_attribute_is_range_populated(attr, offset,
>> +                                                           size)) {
>> +            /* Unexpected mixture: process individual blocks */
>> +            for (cur = offset; cur < end; cur += block_size) {
>> +                bit = cur / block_size;
>> +                if (!test_bit(bit, attr->bitmap)) {
>> +                    continue;
>> +                }
>> +                clear_bit(bit, attr->bitmap);
>> +                ram_block_attribute_notify_to_discard(attr, cur,
>> block_size);
>> +            }
>> +        } else {
>> +            /* Completely shared */
>> +            bitmap_clear(attr->bitmap, first_bit, nbits);
>> +            ram_block_attribute_notify_to_discard(attr, offset, size);
>> +        }
>>       } else {
>> -        bitmap_set(attr->bitmap, first_bit, nbits);
>> -        ret = ram_block_attribute_notify_to_populated(attr, offset,
>> size);
>> +        if (ram_block_attribute_is_range_populated(attr, offset,
>> size)) {
>> +            /* Already shared */
>> +        } else if (!ram_block_attribute_is_range_discard(attr,
>> offset, size)) {
>> +            /* Unexpected mixture: process individual blocks */
>> +            unsigned long *modified_bitmap = bitmap_new(nbits);
>> +
>> +            for (cur = offset; cur < end; cur += block_size) {
>> +                bit = cur / block_size;
>> +                if (test_bit(bit, attr->bitmap)) {
>> +                    continue;
>> +                }
>> +                set_bit(bit, attr->bitmap);
>> +                ret = ram_block_attribute_notify_to_populated(attr, cur,
>> +                                                           block_size);
>> +                if (!ret) {
>> +                    set_bit(bit - first_bit, modified_bitmap);
>> +                    continue;
>> +                }
>> +                clear_bit(bit, attr->bitmap);
>> +                break;
>> +            }
>> +
>> +            if (ret) {
>> +                /*
>> +                 * Very unexpected: something went wrong. Revert to
>> the old
>> +                 * state, marking only the blocks as private that we
>> converted
>> +                 * to shared.
>> +                 */
>> +                for (cur = offset; cur < end; cur += block_size) {
>> +                    bit = cur / block_size;
>> +                    if (!test_bit(bit - first_bit, modified_bitmap)) {
>> +                        continue;
>> +                    }
>> +                    assert(test_bit(bit, attr->bitmap));
>> +                    clear_bit(bit, attr->bitmap);
>> +                    ram_block_attribute_notify_to_discard(attr, cur,
>> +                                                          block_size);
>> +                }
>> +            }
>> +            g_free(modified_bitmap);
>> +        } else {
>> +            /* Complete private */
>> +            bitmap_set(attr->bitmap, first_bit, nbits);
>> +            ret = ram_block_attribute_notify_to_populated(attr,
>> offset, size);
>> +            if (ret) {
>> +                bitmap_clear(attr->bitmap, first_bit, nbits);
>> +            }
>> +        }
>>       }
>>         return ret;
> 
> 



* Re: [PATCH v5 03/10] memory: Unify the definition of ReplayRamPopulate() and ReplayRamDiscard()
  2025-05-26  9:35   ` Philippe Mathieu-Daudé
@ 2025-05-26 10:21     ` Chenyi Qiang
  0 siblings, 0 replies; 51+ messages in thread
From: Chenyi Qiang @ 2025-05-26 10:21 UTC (permalink / raw)
  To: Philippe Mathieu-Daudé, David Hildenbrand,
	Alexey Kardashevskiy, Peter Xu, Gupta Pankaj, Paolo Bonzini,
	Michael Roth
  Cc: qemu-devel, kvm, Williams Dan J, Zhao Liu, Baolu Lu, Gao Chao,
	Xu Yilun, Li Xiaoyao



On 5/26/2025 5:35 PM, Philippe Mathieu-Daudé wrote:
> Hi Chenyi Qiang,
> 
> On 20/5/25 12:28, Chenyi Qiang wrote:
>> Update the ReplayRamDiscard() function to return the result, and unify
>> ReplayRamPopulate() and ReplayRamDiscard() into ReplayRamDiscardState(),
>> since their definitions become identical. This unification simplifies
>> related structures, such as VirtIOMEMReplayData, making them cleaner.
>>
>> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
>> ---
>> Changes in v5:
>>      - Rename ReplayRamStateChange to ReplayRamDiscardState (David)
>>      - return data->fn(s, data->opaque) instead of 0 in
>>        virtio_mem_rdm_replay_discarded_cb(). (Alexey)
>>
>> Changes in v4:
>>      - Modify the commit message. We won't use Replay() operation when
>>        doing the attribute change like v3.
>>
>> Changes in v3:
>>      - Newly added.
>> ---
>>   hw/virtio/virtio-mem.c  | 21 ++++++++++-----------
>>   include/system/memory.h | 36 +++++++++++++++++++-----------------
>>   migration/ram.c         |  5 +++--
>>   system/memory.c         | 12 ++++++------
>>   4 files changed, 38 insertions(+), 36 deletions(-)
> 
> 
>> diff --git a/include/system/memory.h b/include/system/memory.h
>> index 896948deb1..83b28551c4 100644
>> --- a/include/system/memory.h
>> +++ b/include/system/memory.h
>> @@ -575,8 +575,8 @@ static inline void
>> ram_discard_listener_init(RamDiscardListener *rdl,
>>       rdl->double_discard_supported = double_discard_supported;
>>   }
>>   -typedef int (*ReplayRamPopulate)(MemoryRegionSection *section, void
>> *opaque);
>> -typedef void (*ReplayRamDiscard)(MemoryRegionSection *section, void
>> *opaque);
>> +typedef int (*ReplayRamDiscardState)(MemoryRegionSection *section,
>> +                                     void *opaque);
> 
> While changing this prototype, please add a documentation comment.

[...]

> 
>>   /*
>>    * RamDiscardManagerClass:
>> @@ -650,36 +650,38 @@ struct RamDiscardManagerClass {
>>       /**
>>        * @replay_populated:
>>        *
>> -     * Call the #ReplayRamPopulate callback for all populated parts
>> within the
>> -     * #MemoryRegionSection via the #RamDiscardManager.
>> +     * Call the #ReplayRamDiscardState callback for all populated
>> parts within
>> +     * the #MemoryRegionSection via the #RamDiscardManager.
>>        *
>>        * In case any call fails, no further calls are made.
>>        *
>>        * @rdm: the #RamDiscardManager
>>        * @section: the #MemoryRegionSection
>> -     * @replay_fn: the #ReplayRamPopulate callback
>> +     * @replay_fn: the #ReplayRamDiscardState callback
>>        * @opaque: pointer to forward to the callback
>>        *
>>        * Returns 0 on success, or a negative error if any notification failed.
>>        */
>>       int (*replay_populated)(const RamDiscardManager *rdm,
>>                               MemoryRegionSection *section,
>> -                            ReplayRamPopulate replay_fn, void *opaque);
>> +                            ReplayRamDiscardState replay_fn, void *opaque);
>> 
>>       /**
>>        * @replay_discarded:
>>        *
>> -     * Call the #ReplayRamDiscard callback for all discarded parts within the
>> -     * #MemoryRegionSection via the #RamDiscardManager.
>> +     * Call the #ReplayRamDiscardState callback for all discarded parts within
>> +     * the #MemoryRegionSection via the #RamDiscardManager.
>>        *
>>        * @rdm: the #RamDiscardManager
>>        * @section: the #MemoryRegionSection
>> -     * @replay_fn: the #ReplayRamDiscard callback
>> +     * @replay_fn: the #ReplayRamDiscardState callback
>>        * @opaque: pointer to forward to the callback
>> +     *
>> +     * Returns 0 on success, or a negative error if any notification failed.
>>        */
>> -    void (*replay_discarded)(const RamDiscardManager *rdm,
>> -                             MemoryRegionSection *section,
>> -                             ReplayRamDiscard replay_fn, void *opaque);
>> +    int (*replay_discarded)(const RamDiscardManager *rdm,
>> +                            MemoryRegionSection *section,
>> +                            ReplayRamDiscardState replay_fn, void *opaque);
>> 
>>       /**
>>        * @register_listener:
>> @@ -722,13 +724,13 @@ bool ram_discard_manager_is_populated(const RamDiscardManager *rdm,
>>   int ram_discard_manager_replay_populated(const RamDiscardManager *rdm,
>>                                            MemoryRegionSection *section,
>> -                                         ReplayRamPopulate replay_fn,
>> +                                         ReplayRamDiscardState replay_fn,
>>                                            void *opaque);
>> 
>> -void ram_discard_manager_replay_discarded(const RamDiscardManager *rdm,
>> -                                          MemoryRegionSection *section,
>> -                                          ReplayRamDiscard replay_fn,
>> -                                          void *opaque);
>> +int ram_discard_manager_replay_discarded(const RamDiscardManager *rdm,
>> +                                         MemoryRegionSection *section,
>> +                                         ReplayRamDiscardState replay_fn,
>> +                                         void *opaque);
> 
> Similar for ram_discard_manager_replay_populated() and
> ram_discard_manager_replay_discarded(), since you understood
> what they do :)

Sure, will do. Thanks!
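For context, the int-returning replay contract being settled on here can be sketched in isolation roughly as follows. This is a schematic with hypothetical names (`Section`, `ReplayFn`, `replay_populated`), not the actual QEMU API: the walker invokes the callback for each contiguous populated range and propagates the first negative errno back to the caller.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for MemoryRegionSection and the unified
 * ReplayRamDiscardState callback; not the real QEMU types. */
typedef struct Section { size_t offset, size; } Section;
typedef int (*ReplayFn)(Section *section, void *opaque);

/* Walk a per-page populated map and invoke replay_fn for every
 * contiguous populated range; stop and propagate the first error. */
static int replay_populated(const unsigned char *populated, size_t npages,
                            size_t page_size, ReplayFn replay_fn, void *opaque)
{
    size_t i = 0;
    while (i < npages) {
        if (!populated[i]) { i++; continue; }
        size_t start = i;
        while (i < npages && populated[i]) i++;
        Section s = { start * page_size, (i - start) * page_size };
        int ret = replay_fn(&s, opaque);
        if (ret) {
            return ret; /* negative errno from the callback */
        }
    }
    return 0;
}

/* Tiny callbacks used only to exercise the walker. */
static int count_cb(Section *s, void *opaque)
{ (void)s; (*(int *)opaque)++; return 0; }
static int fail_cb(Section *s, void *opaque)
{ (void)s; (void)opaque; return -EINVAL; }
```

With a map `{1,1,0,1}` the walker sees two ranges, and a failing callback short-circuits the replay.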

> 
> Thanks!
> 
> Phil.
> 


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v5 08/10] memory: Change NotifyRamDiscard() definition to return the result
  2025-05-20 10:28 ` [PATCH v5 08/10] memory: Change NotifyRamDiscard() definition to return the result Chenyi Qiang
  2025-05-26  9:31   ` Philippe Mathieu-Daudé
@ 2025-05-26 10:36   ` Cédric Le Goater
  2025-05-26 12:44     ` Cédric Le Goater
  1 sibling, 1 reply; 51+ messages in thread
From: Cédric Le Goater @ 2025-05-26 10:36 UTC (permalink / raw)
  To: Chenyi Qiang, David Hildenbrand, Alexey Kardashevskiy, Peter Xu,
	Gupta Pankaj, Paolo Bonzini, Philippe Mathieu-Daudé,
	Michael Roth
  Cc: qemu-devel, kvm, Williams Dan J, Zhao Liu, Baolu Lu, Gao Chao,
	Xu Yilun, Li Xiaoyao

On 5/20/25 12:28, Chenyi Qiang wrote:
> So that the caller can check the result of the NotifyRamDiscard() handler
> when the operation fails.
> 
> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
> ---
> Changes in v5:
>      - Revert to use of NotifyRamDiscard()
> 
> Changes in v4:
>      - Newly added.
> ---
>   hw/vfio/listener.c           | 6 ++++--
>   include/system/memory.h      | 4 ++--
>   system/ram-block-attribute.c | 3 +--
>   3 files changed, 7 insertions(+), 6 deletions(-)
> 
> diff --git a/hw/vfio/listener.c b/hw/vfio/listener.c
> index bfacb3d8d9..06454e0584 100644
> --- a/hw/vfio/listener.c
> +++ b/hw/vfio/listener.c
> @@ -190,8 +190,8 @@ out:
>       rcu_read_unlock();
>   }
>   
> -static void vfio_ram_discard_notify_discard(RamDiscardListener *rdl,
> -                                            MemoryRegionSection *section)
> +static int vfio_ram_discard_notify_discard(RamDiscardListener *rdl,
> +                                           MemoryRegionSection *section)
>   {
>       VFIORamDiscardListener *vrdl = container_of(rdl, VFIORamDiscardListener,
>                                                   listener);
> @@ -206,6 +206,8 @@ static void vfio_ram_discard_notify_discard(RamDiscardListener *rdl,
>           error_report("%s: vfio_container_dma_unmap() failed: %s", __func__,
>                        strerror(-ret));
>       }
> +
> +    return ret;
>   }

vfio_ram_discard_notify_populate() should also be modified
to return this value.


Thanks,

C.



>   static int vfio_ram_discard_notify_populate(RamDiscardListener *rdl,
> diff --git a/include/system/memory.h b/include/system/memory.h
> index 83b28551c4..e5155120d9 100644
> --- a/include/system/memory.h
> +++ b/include/system/memory.h
> @@ -518,8 +518,8 @@ struct IOMMUMemoryRegionClass {
>   typedef struct RamDiscardListener RamDiscardListener;
>   typedef int (*NotifyRamPopulate)(RamDiscardListener *rdl,
>                                    MemoryRegionSection *section);
> -typedef void (*NotifyRamDiscard)(RamDiscardListener *rdl,
> -                                 MemoryRegionSection *section);
> +typedef int (*NotifyRamDiscard)(RamDiscardListener *rdl,
> +                                MemoryRegionSection *section);
>   
>   struct RamDiscardListener {
>       /*
> diff --git a/system/ram-block-attribute.c b/system/ram-block-attribute.c
> index f12dd4b881..896c3d7543 100644
> --- a/system/ram-block-attribute.c
> +++ b/system/ram-block-attribute.c
> @@ -66,8 +66,7 @@ static int ram_block_attribute_notify_discard_cb(MemoryRegionSection *section,
>   {
>       RamDiscardListener *rdl = arg;
>   
> -    rdl->notify_discard(rdl, section);
> -    return 0;
> +    return rdl->notify_discard(rdl, section);
>   }
>   
>   static int


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v5 04/10] ram-block-attribute: Introduce RamBlockAttribute to manage RAMBlock with guest_memfd
  2025-05-26  9:28     ` Chenyi Qiang
@ 2025-05-26 11:16       ` Alexey Kardashevskiy
  2025-05-27  1:15         ` Chenyi Qiang
  0 siblings, 1 reply; 51+ messages in thread
From: Alexey Kardashevskiy @ 2025-05-26 11:16 UTC (permalink / raw)
  To: Chenyi Qiang, David Hildenbrand, Peter Xu, Gupta Pankaj,
	Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth
  Cc: qemu-devel, kvm, Williams Dan J, Zhao Liu, Baolu Lu, Gao Chao,
	Xu Yilun, Li Xiaoyao



On 26/5/25 19:28, Chenyi Qiang wrote:
> 
> 
> On 5/26/2025 5:01 PM, David Hildenbrand wrote:
>> On 20.05.25 12:28, Chenyi Qiang wrote:
>>> Commit 852f0048f3 ("RAMBlock: make guest_memfd require uncoordinated
>>> discard") highlighted that subsystems like VFIO may disable RAM block
>>> discard. However, guest_memfd relies on discard operations for page
>>> conversion between private and shared memory, potentially leading to
>>> stale IOMMU mapping issue when assigning hardware devices to
>>> confidential VMs via shared memory. To address this and allow shared
>>> device assignment, it is crucial to ensure the VFIO subsystem refreshes its
>>> IOMMU mappings.
>>>
>>> RamDiscardManager is an existing interface (used by virtio-mem) to
>>> adjust VFIO mappings in relation to VM page assignment. Effectively page
>>> conversion is similar to hot-removing a page in one mode and adding it
>>> back in the other. Therefore, similar actions are required for page
>>> conversion events. Introduce the RamDiscardManager to guest_memfd to
>>> facilitate this process.
>>>
>>> Since guest_memfd is not an object, it cannot directly implement the
>>> RamDiscardManager interface. Implementing it in HostMemoryBackend is
>>> not appropriate because guest_memfd is per RAMBlock, and some RAMBlocks
>>> have a memory backend while others do not. Notably, virtual BIOS
>>> RAMBlocks using memory_region_init_ram_guest_memfd() do not have a
>>> backend.
>>>
>>> To manage RAMBlocks with guest_memfd, define a new object named
>>> RamBlockAttribute to implement the RamDiscardManager interface. This
>>> object can store the guest_memfd information such as bitmap for shared
>>> memory, and handles page conversion notification. In the context of
>>> RamDiscardManager, shared state is analogous to populated and private
>>> state is treated as discard. The memory state is tracked at the host
>>> page size granularity, as the minimum memory conversion size can be one page
>>> per request. Additionally, VFIO expects the DMA mapping for a specific
>>> iova to be mapped and unmapped with the same granularity. Confidential
>>> VMs may perform partial conversions, such as conversions on small
>>> regions within larger regions. To prevent such invalid cases and until
>>> cut_mapping operation support is available, all operations are performed
>>> with 4K granularity.
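The per-page bitmap described above can be sketched roughly as follows. This is a simplified model with hypothetical helper names (`RamBlockAttr`, `attr_set_shared`, `attr_is_shared`); the real object additionally carries the guest_memfd state and the listener list. A set bit means the page is shared (populated), a clear bit means private (discarded), tracked at host page size granularity.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Simplified model of the shared-state bitmap:
 * bit set => page shared, bit clear => page private. */
typedef struct {
    size_t page_size;
    size_t npages;
    unsigned long *bitmap;
} RamBlockAttr;

static RamBlockAttr *attr_new(size_t region_size, size_t page_size)
{
    RamBlockAttr *a = calloc(1, sizeof(*a));
    a->page_size = page_size;
    a->npages = region_size / page_size;
    a->bitmap = calloc((a->npages + BITS_PER_LONG - 1) / BITS_PER_LONG,
                       sizeof(unsigned long));
    return a;
}

/* Flip the state of every page covered by [offset, offset + size). */
static void attr_set_shared(RamBlockAttr *a, size_t offset, size_t size,
                            bool shared)
{
    size_t first = offset / a->page_size;
    size_t last = (offset + size) / a->page_size;
    for (size_t i = first; i < last; i++) {
        if (shared) {
            a->bitmap[i / BITS_PER_LONG] |= 1UL << (i % BITS_PER_LONG);
        } else {
            a->bitmap[i / BITS_PER_LONG] &= ~(1UL << (i % BITS_PER_LONG));
        }
    }
}

static bool attr_is_shared(const RamBlockAttr *a, size_t offset)
{
    size_t i = offset / a->page_size;
    return a->bitmap[i / BITS_PER_LONG] & (1UL << (i % BITS_PER_LONG));
}
```

A partial conversion only flips the bits of the pages actually covered, which is what lets the manager replay populated/discarded ranges precisely.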
>>>
>>> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
>>> ---
>>> Changes in v5:
>>>       - Revert to use RamDiscardManager interface instead of introducing
>>>         new hierarchy of class to manage private/shared state, and keep
>>>         using the new name of RamBlockAttribute compared with the
>>>         MemoryAttributeManager in v3.
>>>       - Use *simple* version of object_define and object_declare since the
>>>         state_change() function is changed as an exported function instead
>>>         of a virtual function in later patch.
>>>       - Move the introduction of RamBlockAttribute field to this patch and
>>>         rename it to ram_shared. (Alexey)
>>>       - call the exit() when register/unregister failed. (Zhao)
>>>       - Add the ram-block-attribute.c to Memory API related part in
>>>         MAINTAINERS.
>>>
>>> Changes in v4:
>>>       - Change the name from memory-attribute-manager to
>>>         ram-block-attribute.
>>>       - Implement the newly-introduced PrivateSharedManager instead of
>>>         RamDiscardManager and change related commit message.
>>>       - Define the new object in ramblock.h instead of adding a new file.
>>>
>>> Changes in v3:
>>>       - Some rename (bitmap_size->shared_bitmap_size,
>>>         first_one/zero_bit->first_bit, etc.)
>>>       - Change shared_bitmap_size from uint32_t to unsigned
>>>       - Return mgr->mr->ram_block->page_size in get_block_size()
>>>       - Move set_ram_discard_manager() up to avoid a g_free() in failure
>>>         case.
>>>       - Add const for the memory_attribute_manager_get_block_size()
>>>       - Unify the ReplayRamPopulate and ReplayRamDiscard and related
>>>         callback.
>>>
>>> Changes in v2:
>>>       - Rename the object name to MemoryAttributeManager
>>>       - Rename the bitmap to shared_bitmap to make it more clear.
>>>       - Remove block_size field and get it from a helper. In future, we
>>>         can get the page_size from RAMBlock if necessary.
>>>       - Remove the unnecessary "struct" before GuestMemfdReplayData
>>>       - Remove the unnecessary g_free() for the bitmap
>>>       - Add some error report when the callback failure for
>>>         populated/discarded section.
>>>       - Move the realize()/unrealize() definition to this patch.
>>> ---
>>>    MAINTAINERS                  |   1 +
>>>    include/system/ramblock.h    |  20 +++
>>>    system/meson.build           |   1 +
>>>    system/ram-block-attribute.c | 311 +++++++++++++++++++++++++++++++++++
>>>    4 files changed, 333 insertions(+)
>>>    create mode 100644 system/ram-block-attribute.c
>>>
>>> diff --git a/MAINTAINERS b/MAINTAINERS
>>> index 6dacd6d004..3b4947dc74 100644
>>> --- a/MAINTAINERS
>>> +++ b/MAINTAINERS
>>> @@ -3149,6 +3149,7 @@ F: system/memory.c
>>>    F: system/memory_mapping.c
>>>    F: system/physmem.c
>>>    F: system/memory-internal.h
>>> +F: system/ram-block-attribute.c
>>>    F: scripts/coccinelle/memory-region-housekeeping.cocci
>>>      Memory devices
>>> diff --git a/include/system/ramblock.h b/include/system/ramblock.h
>>> index d8a116ba99..09255e8495 100644
>>> --- a/include/system/ramblock.h
>>> +++ b/include/system/ramblock.h
>>> @@ -22,6 +22,10 @@
>>>    #include "exec/cpu-common.h"
>>>    #include "qemu/rcu.h"
>>>    #include "exec/ramlist.h"
>>> +#include "system/hostmem.h"
>>> +
>>> +#define TYPE_RAM_BLOCK_ATTRIBUTE "ram-block-attribute"
>>> +OBJECT_DECLARE_SIMPLE_TYPE(RamBlockAttribute, RAM_BLOCK_ATTRIBUTE)
>>>      struct RAMBlock {
>>>        struct rcu_head rcu;
>>> @@ -42,6 +46,8 @@ struct RAMBlock {
>>>        int fd;
>>>        uint64_t fd_offset;
>>>        int guest_memfd;
>>> +    /* 1-setting of the bitmap in ram_shared represents ram is shared */
>>
>> That comment looks misplaced, and the variable misnamed.
>>
>> The comment should go into RamBlockAttribute and the variable should
>> likely be named "attributes".
>>
>> Also, "ram_shared" is not used at all in this patch, it should be moved
>> into the corresponding patch.
> 
> I thought we would only manage the private and shared attribute, so I named
> it ram_shared, planning to rename it to "attributes" if we ever manage other
> attributes in the future. It seems I overcomplicated things.


We manage populated vs discarded. Right now populated==shared but the very next thing I will try doing is flipping this to populated==private. Thanks,

> 
>>
>>> +    RamBlockAttribute *ram_shared;
>>>        size_t page_size;
>>>        /* dirty bitmap used during migration */
>>>        unsigned long *bmap;
>>> @@ -91,4 +97,18 @@ struct RAMBlock {
>>>        ram_addr_t postcopy_length;
>>>    };
>>>    +struct RamBlockAttribute {
>>
>> Should this actually be "RamBlockAttributes" ?
> 
> Yes. To match with variable name "attributes", it can be renamed as
> RamBlockAttributes.
> 
>>
>>> +    Object parent;
>>> +
>>> +    MemoryRegion *mr;
>>
>>
>> Should we link to the parent RAMBlock instead, and lookup the MR from
>> there?
> 
> Good suggestion! It can also help to reduce the long chain of arrow
> dereferences in ram_block_attribute_get_block_size().
> 
>>
>>
> 

-- 
Alexey


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v5 00/10] Enable shared device assignment
  2025-05-20 10:28 [PATCH v5 00/10] Enable shared device assignment Chenyi Qiang
                   ` (9 preceding siblings ...)
  2025-05-20 10:28 ` [PATCH v5 10/10] ram-block-attribute: Add more error handling during state changes Chenyi Qiang
@ 2025-05-26 11:37 ` Cédric Le Goater
  2025-05-26 12:16   ` Chenyi Qiang
  10 siblings, 1 reply; 51+ messages in thread
From: Cédric Le Goater @ 2025-05-26 11:37 UTC (permalink / raw)
  To: Chenyi Qiang, David Hildenbrand, Alexey Kardashevskiy, Peter Xu,
	Gupta Pankaj, Paolo Bonzini, Philippe Mathieu-Daudé,
	Michael Roth
  Cc: qemu-devel, kvm, Williams Dan J, Zhao Liu, Baolu Lu, Gao Chao,
	Xu Yilun, Li Xiaoyao, 'Alex Williamson'

On 5/20/25 12:28, Chenyi Qiang wrote:
> This is the v5 series of the shared device assignment support.
> 
> As discussed in the v4 series [1], the GenericStateManager parent class
> and PrivateSharedManager child interface were deemed to be in the wrong
> direction. This series reverts back to the original single
> RamDiscardManager interface and puts it as future work to allow the
> co-existence of multiple pairs of state management. For example, if we
> want to have virtio-mem co-exist with guest_memfd, it will need a new
> framework to combine the private/shared/discard states [2].
> 
> Another change since the last version is the error handling of memory
> conversion. Currently, the failure of kvm_convert_memory() causes QEMU
> to quit instead of resuming the guest. The complex rollback operation
> doesn't add value and merely adds code that is difficult to test.
> Although in the future, it is more likely to encounter more errors on
> conversion paths like unmap failure on shared to private in-place
> conversion. This series keeps complex error handling out of the picture
> for now and attaches related handling at the end of the series for
> future extension.
> 
> Apart from the above two parts with future work, there's some
> optimization work in the future, i.e., using other more memory-efficient
> mechanism to track ranges of contiguous states instead of a bitmap [3].
> This series still uses a bitmap for simplicity.
>   
> The overview of this series:
> - Patch 1-3: Preparation patches. These include function exposure and
>    some definition changes to return values.
> - Patch 4-5: Introduce a new object to implement RamDiscardManager
>    interface and a helper to notify the shared/private state change.
> - Patch 6: Store the new object including guest_memfd information in
>    RAMBlock. Register the RamDiscardManager instance to the target
>    RAMBlock's MemoryRegion so that the RamDiscardManager users can run in
>    the specific path.
> - Patch 7: Unlock the coordinated discard so that the shared device
>    assignment (VFIO) can work with guest_memfd. After this patch, the
>    basic device assignment functionality can work properly.
> - Patch 8-9: Some cleanup work. Move the state change handling into a
>    RamDiscardListener so that it can be invoked together with the VFIO
>    listener by the state_change() call. This series dropped the priority
>    support in v4 which is required by in-place conversions, because the
>    conversion path will likely change.
> - Patch 10: More complex error handing including rollback and mixture
>    states conversion case.
> 
> More small changes or details can be found in the individual patches.
> 
> ---
> Original cover letter:
> 
> Background
> ==========
> Confidential VMs have two classes of memory: shared and private memory.
> Shared memory is accessible from the host/VMM while private memory is
> not. Confidential VMs can decide which memory is shared/private and
> convert memory between shared/private at runtime.
> 
> "guest_memfd" is a new kind of fd whose primary goal is to serve guest
> private memory. In current implementation, shared memory is allocated
> with normal methods (e.g. mmap or fallocate) while private memory is
> allocated from guest_memfd. When a VM performs memory conversions, QEMU
> frees pages via madvise or via PUNCH_HOLE on memfd or guest_memfd from
> one side, and allocates new pages from the other side. This will cause a
> stale IOMMU mapping issue mentioned in [4] when we try to enable shared
> device assignment in confidential VMs.
> 
> Solution
> ========
> The key to enable shared device assignment is to update the IOMMU mappings
> on page conversion. RamDiscardManager, an existing interface currently
> utilized by virtio-mem, offers a means to modify IOMMU mappings in
> accordance with VM page assignment. Although the required operations in
> VFIO for page conversion are similar to memory plug/unplug, the states of
> private/shared are different from discard/populated. We want a similar
> mechanism with RamDiscardManager but used to manage the state of private
> and shared.
> 
> This series introduces a new parent abstract class to manage a pair of
> opposite states with RamDiscardManager as its child to manage
> populate/discard states, and introduce a new child class,
> PrivateSharedManager, which can also utilize the same infrastructure to
> notify VFIO of page conversions.
> 
> Relationship with in-place page conversion
> ==========================================
> To support 1G page support for guest_memfd [5], the current direction is to
> allow mmap() of guest_memfd to userspace so that both private and shared
> memory can use the same physical pages as the backend. This in-place page
> conversion design eliminates the need to discard pages during shared/private
> conversions. However, device assignment will still be blocked because the
> in-place page conversion will reject the conversion when the page is pinned
> by VFIO.
> 
> To address this, the key difference lies in the sequence of VFIO map/unmap
> operations and the page conversion. It can be adjusted to achieve
> unmap-before-conversion-to-private and map-after-conversion-to-shared,
> ensuring compatibility with guest_memfd.
> 
> Limitation
> ==========
> One limitation is that VFIO expects the DMA mapping for a specific IOVA
> to be mapped and unmapped with the same granularity. The guest may
> perform partial conversions, such as converting a small region within a
> larger region. To prevent such invalid cases, all operations are
> performed with 4K granularity. This could be optimized after the
> cut_mapping operation[6] is introduced in the future. We can always perform a
> split-before-unmap if partial conversions happen. If the split succeeds,
> the unmap will succeed and be atomic. If the split fails, the unmap
> process fails.
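The granularity constraint can be illustrated with a toy model (hypothetical `dma_map`/`dma_unmap` helpers, not the real VFIO ioctls): an unmap must use exactly the granularity of the original map, so mapping everything as individual 4K pages is what keeps arbitrary partial conversions unmappable without a split operation.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MAX_MAPPINGS 64
typedef struct { size_t iova, size; int used; } Mapping;
static Mapping table[MAX_MAPPINGS];

static int dma_map(size_t iova, size_t size)
{
    for (int i = 0; i < MAX_MAPPINGS; i++) {
        if (!table[i].used) {
            table[i] = (Mapping){ iova, size, 1 };
            return 0;
        }
    }
    return -ENOSPC;
}

/* Model the VFIO rule: the unmap size must match the map size. */
static int dma_unmap(size_t iova, size_t size)
{
    for (int i = 0; i < MAX_MAPPINGS; i++) {
        if (table[i].used && table[i].iova == iova) {
            if (table[i].size != size) {
                return -EINVAL; /* granularity mismatch */
            }
            table[i].used = 0;
            return 0;
        }
    }
    return -ENOENT;
}

/* Map a range as individual 4K pages so any sub-range can be unmapped. */
static int map_range_4k(size_t iova, size_t size)
{
    for (size_t off = 0; off < size; off += 4096) {
        int ret = dma_map(iova + off, 4096);
        if (ret) {
            return ret;
        }
    }
    return 0;
}
```

A single 16K mapping cannot be partially unmapped, while the 4K-granular mapping of the same range can.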
> 
> Testing
> =======
> This patch series is tested based on TDX patches available at:
> KVM: https://github.com/intel/tdx/tree/kvm-coco-queue-snapshot/kvm-coco-queue-snapshot-20250408
> QEMU: https://github.com/intel-staging/qemu-tdx/tree/tdx-upstream-snapshot-2025-05-20
> 
> Because new features like the cut_mapping operation will only be supported
> in iommufd, it is recommended to use the iommufd-backed VFIO with the qemu command:

Is it recommended or required ? If the VFIO IOMMU type1 backend is not
supported for confidential VMs, QEMU should fail to start.

Please add Alex Williamson and I to the Cc: list.

Thanks,

C.

> qemu-system-x86_64 [...]
>      -object iommufd,id=iommufd0 \
>      -device vfio-pci,host=XX:XX.X,iommufd=iommufd0
> 
> Following the bootup of the TD guest, the guest's IP address becomes
> visible, and iperf is able to successfully send and receive data.
> 
> Related link
> ============
> [1] https://lore.kernel.org/qemu-devel/20250407074939.18657-1-chenyi.qiang@intel.com/
> [2] https://lore.kernel.org/qemu-devel/d1a71e00-243b-4751-ab73-c05a4e090d58@redhat.com/
> [3] https://lore.kernel.org/qemu-devel/96ab7fa9-bd7a-444d-aef8-8c9c30439044@redhat.com/
> [4] https://lore.kernel.org/qemu-devel/20240423150951.41600-54-pbonzini@redhat.com/
> [5] https://lore.kernel.org/kvm/cover.1747264138.git.ackerleytng@google.com/
> [6] https://lore.kernel.org/linux-iommu/0-v2-5c26bde5c22d+58b-iommu_pt_jgg@nvidia.com/
> 
> 
> Chenyi Qiang (10):
>    memory: Export a helper to get intersection of a MemoryRegionSection
>      with a given range
>    memory: Change memory_region_set_ram_discard_manager() to return the
>      result
>    memory: Unify the definition of ReplayRamPopulate() and
>      ReplayRamDiscard()
>    ram-block-attribute: Introduce RamBlockAttribute to manage RAMBlock
>      with guest_memfd
>    ram-block-attribute: Introduce a helper to notify shared/private state
>      changes
>    memory: Attach RamBlockAttribute to guest_memfd-backed RAMBlocks
>    RAMBlock: Make guest_memfd require coordinate discard
>    memory: Change NotifyRamDiscard() definition to return the result
>    KVM: Introduce RamDiscardListener for attribute changes during memory
>      conversions
>    ram-block-attribute: Add more error handling during state changes
> 
>   MAINTAINERS                                 |   1 +
>   accel/kvm/kvm-all.c                         |  79 ++-
>   hw/vfio/listener.c                          |   6 +-
>   hw/virtio/virtio-mem.c                      |  83 ++--
>   include/system/confidential-guest-support.h |   9 +
>   include/system/memory.h                     |  76 ++-
>   include/system/ramblock.h                   |  22 +
>   migration/ram.c                             |  33 +-
>   system/memory.c                             |  22 +-
>   system/meson.build                          |   1 +
>   system/physmem.c                            |  18 +-
>   system/ram-block-attribute.c                | 514 ++++++++++++++++++++
>   target/i386/kvm/tdx.c                       |   1 +
>   target/i386/sev.c                           |   1 +
>   14 files changed, 770 insertions(+), 96 deletions(-)
>   create mode 100644 system/ram-block-attribute.c
> 


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v5 10/10] ram-block-attribute: Add more error handling during state changes
  2025-05-26 10:19     ` Chenyi Qiang
@ 2025-05-26 12:10       ` David Hildenbrand
  2025-05-26 12:39         ` Chenyi Qiang
  0 siblings, 1 reply; 51+ messages in thread
From: David Hildenbrand @ 2025-05-26 12:10 UTC (permalink / raw)
  To: Chenyi Qiang, Alexey Kardashevskiy, Peter Xu, Gupta Pankaj,
	Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth
  Cc: qemu-devel, kvm, Williams Dan J, Zhao Liu, Baolu Lu, Gao Chao,
	Xu Yilun, Li Xiaoyao

On 26.05.25 12:19, Chenyi Qiang wrote:
> 
> 
> On 5/26/2025 5:17 PM, David Hildenbrand wrote:
>> On 20.05.25 12:28, Chenyi Qiang wrote:
>>> The current error handling is simple with the following assumption:
>>> - QEMU will quit instead of resuming the guest if kvm_convert_memory()
>>>     fails, thus no need to do rollback.
>>> - The convert range is required to be in the desired state. It is not
>>>     allowed to handle the mixture case.
>>> - The conversion from shared to private is a non-failure operation.
>>>
>>> This is sufficient for now as complex error handling is not required.
>>> For future extension, add some potential error handling.
>>> - For private to shared conversion, do the rollback operation if
>>>     ram_block_attribute_notify_to_populated() fails.
>>> - For shared to private conversion, still assert it as a non-failure
>>>     operation for now. It could be an easy fail path with in-place
>>>     conversion, which will likely have to retry the conversion until it
>>>     works in the future.
>>> - For mixture case, process individual blocks for ease of rollback.
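The rollback step described in the first bullet can be sketched per page as follows (hypothetical `convert_to_shared` and notifier names, not the patch's actual functions): if notifying listeners about page i fails, the pages already converted in [0, i) are re-discarded so the tracked state stays consistent.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

typedef int (*page_notify_fn)(size_t page);

/* Convert pages one by one; on failure, roll back what was done. */
static int convert_to_shared(size_t first, size_t npages,
                             page_notify_fn populate, page_notify_fn discard)
{
    for (size_t i = 0; i < npages; i++) {
        int ret = populate(first + i);
        if (ret) {
            /* Roll back the pages already converted. */
            while (i--) {
                discard(first + i);
            }
            return ret;
        }
    }
    return 0;
}

/* Toy notifiers used only to exercise the rollback path. */
static int populated[16];
static int fail_at = -1;
static int toy_populate(size_t p)
{
    if ((int)p == fail_at) {
        return -EIO;
    }
    populated[p] = 1;
    return 0;
}
static int toy_discard(size_t p) { populated[p] = 0; return 0; }
```

Processing individual blocks this way is what makes the mixture case tractable: each block either fully converts or is fully rolled back.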
>>>
>>> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
>>> ---
>>>    system/ram-block-attribute.c | 116 +++++++++++++++++++++++++++--------
>>>    1 file changed, 90 insertions(+), 26 deletions(-)
>>>
>>> diff --git a/system/ram-block-attribute.c b/system/ram-block-attribute.c
>>> index 387501b569..0af3396aa4 100644
>>> --- a/system/ram-block-attribute.c
>>> +++ b/system/ram-block-attribute.c
>>> @@ -289,7 +289,12 @@ static int
>>> ram_block_attribute_notify_to_discard(RamBlockAttribute *attr,
>>>            }
>>>            ret = rdl->notify_discard(rdl, &tmp);
>>>            if (ret) {
>>> -            break;
>>> +            /*
>>> +             * The current to_private listeners (VFIO dma_unmap and
>>> +             * KVM set_attribute_private) are non-failing operations.
>>> +             * TODO: add rollback operations if it is allowed to fail.
>>> +             */
>>> +            g_assert(!ret);
>>>            }
>>>        }
>>>    
>>
>> If it's not allowed to fail for now, then patch #8 does not make sense
>> and should be dropped :)
> 
> It was intended for future extension as in-place conversion to_private
> allows it to fail. So I add the patch #8.
> 
> But as you mentioned, since the conversion path is changing, and maybe
> it is easier to handle from KVM code directly. Let me drop patch #8 and
> wait for the in-place conversion to mature.

Makes sense. I'm afraid it might all be a bit complicated to handle: 
vfio can fail private -> shared conversion and KVM the shared -> private 
conversion.

So recovering ... will not be straightforward once multiple pages are 
converted.

> 
>>
>> The implementations (vfio) should likely exit() instead on unexpected
>> errors when discarding.
> 
> After drop patch #8, maybe keep vfio discard handling as it was. Adding
> an additional exit() is also OK to me since it's non-fail case.
> 
>>
>>
>>
>> Why not squash all the below into the corresponding patch? Looks mostly
>> like handling partial conversions correctly (as discussed previously)?
> 
> I extracted these two handlings, 1) mixture conversion and 2) operation
> rollback, into this individual patch because they are not practical
> cases and are untested.
> 
> For 1), I still don't see any real case which will convert a range with
> mixed attributes.

Okay. I thought we were not sure if the guest could trigger that?

I think it would be better to just include the "mixture" handling in the 
original patch.

> 
> For 2), current failure of memory conversion (as seen in kvm_cpu_exec()
> ->kvm_convert_memory()) will cause the QEMU to quit instead of resuming
> guest. Doing the rollback seems useless at present.

Makes sense.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v5 00/10] Enable shared device assignment
  2025-05-26 11:37 ` [PATCH v5 00/10] Enable shared device assignment Cédric Le Goater
@ 2025-05-26 12:16   ` Chenyi Qiang
  0 siblings, 0 replies; 51+ messages in thread
From: Chenyi Qiang @ 2025-05-26 12:16 UTC (permalink / raw)
  To: Cédric Le Goater, David Hildenbrand, Alexey Kardashevskiy,
	Peter Xu, Gupta Pankaj, Paolo Bonzini,
	Philippe Mathieu-Daudé, Michael Roth
  Cc: qemu-devel, kvm, Williams Dan J, Zhao Liu, Baolu Lu, Gao Chao,
	Xu Yilun, Li Xiaoyao, 'Alex Williamson'



On 5/26/2025 7:37 PM, Cédric Le Goater wrote:
> On 5/20/25 12:28, Chenyi Qiang wrote:
>> This is the v5 series of the shared device assignment support.
>>
>> As discussed in the v4 series [1], the GenericStateManager parent class
>> and PrivateSharedManager child interface were deemed to be in the wrong
>> direction. This series reverts back to the original single
>> RamDiscardManager interface and puts it as future work to allow the
>> co-existence of multiple pairs of state management. For example, if we
>> want to have virtio-mem co-exist with guest_memfd, it will need a new
>> framework to combine the private/shared/discard states [2].
>>
>> Another change since the last version is the error handling of memory
>> conversion. Currently, the failure of kvm_convert_memory() causes QEMU
>> to quit instead of resuming the guest. The complex rollback operation
>> doesn't add value and merely adds code that is difficult to test.
>> Although in the future, it is more likely to encounter more errors on
>> conversion paths like unmap failure on shared to private in-place
>> conversion. This series keeps complex error handling out of the picture
>> for now and attaches related handling at the end of the series for
>> future extension.
>>
>> Apart from the above two parts with future work, there's some
>> optimization work in the future, i.e., using other more memory-efficient
>> mechanism to track ranges of contiguous states instead of a bitmap [3].
>> This series still uses a bitmap for simplicity.
>>   The overview of this series:
>> - Patch 1-3: Preparation patches. These include function exposure and
>>    some definition changes to return values.
>> - Patch 4-5: Introduce a new object to implement RamDiscardManager
>>    interface and a helper to notify the shared/private state change.
>> - Patch 6: Store the new object including guest_memfd information in
>>    RAMBlock. Register the RamDiscardManager instance to the target
>>    RAMBlock's MemoryRegion so that the RamDiscardManager users can run in
>>    the specific path.
>> - Patch 7: Unlock the coordinate discard so that the shared device
>>    assignment (VFIO) can work with guest_memfd. After this patch, the
>>    basic device assignement functionality can work properly.
>> - Patch 8-9: Some cleanup work. Move the state change handling into a
>>    RamDiscardListener so that it can be invoked together with the VFIO
>>    listener by the state_change() call. This series dropped the priority
>>    support in v4 which is required by in-place conversions, because the
>>    conversion path will likely change.
>> - Patch 10: More complex error handing including rollback and mixture
>>    states conversion case.
>>
>> More small changes or details can be found in the individual patches.
>>
>> ---
>> Original cover letter:
>>
>> Background
>> ==========
>> Confidential VMs have two classes of memory: shared and private memory.
>> Shared memory is accessible from the host/VMM while private memory is
>> not. Confidential VMs can decide which memory is shared/private and
>> convert memory between shared/private at runtime.
>>
>> "guest_memfd" is a new kind of fd whose primary goal is to serve guest
>> private memory. In current implementation, shared memory is allocated
>> with normal methods (e.g. mmap or fallocate) while private memory is
>> allocated from guest_memfd. When a VM performs memory conversions, QEMU
>> frees pages via madvise or via PUNCH_HOLE on memfd or guest_memfd from
>> one side, and allocates new pages from the other side. This will cause a
>> stale IOMMU mapping issue mentioned in [4] when we try to enable shared
>> device assignment in confidential VMs.
>>
>> Solution
>> ========
>> The key to enable shared device assignment is to update the IOMMU
>> mappings
>> on page conversion. RamDiscardManager, an existing interface currently
>> utilized by virtio-mem, offers a means to modify IOMMU mappings in
>> accordance with VM page assignment. Although the required operations in
>> VFIO for page conversion are similar to memory plug/unplug, the states of
>> private/shared are different from discard/populated. We want a similar
>> mechanism with RamDiscardManager but used to manage the state of private
>> and shared.
>>
>> This series introduce a new parent abstract class to manage a pair of
>> opposite states with RamDiscardManager as its child to manage
>> populate/discard states, and introduce a new child class,
>> PrivateSharedManager, which can also utilize the same infrastructure to
>> notify VFIO of page conversions.
>>
>> Relationship with in-place page conversion
>> ==========================================
>> To support 1G pages for guest_memfd [5], the current direction is to
>> allow mmap() of guest_memfd to userspace so that both private and
>> shared memory can use the same physical pages as the backend. This
>> in-place page conversion design eliminates the need to discard pages
>> during shared/private conversions. However, device assignment will
>> still be blocked, because the in-place page conversion will reject the
>> conversion when the page is pinned by VFIO.
>>
>> To address this, the key difference lies in the sequence of the VFIO
>> map/unmap operations and the page conversion. It can be adjusted to
>> achieve unmap-before-conversion-to-private and
>> map-after-conversion-to-shared, ensuring compatibility with
>> guest_memfd.
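The ordering described above can be sketched as follows. This is an illustration only, with hypothetical hook names (`vfio_unmap`, `kvm_convert`, `vfio_map` stand in for the real VFIO DMA and KVM attribute-conversion calls), not the series' actual code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical operation hooks; stand-ins for the real VFIO DMA map/unmap
 * and KVM memory-attribute conversion calls. */
typedef int (*op_fn)(uint64_t iova, uint64_t size);

/*
 * Ordering sketch for in-place guest_memfd conversion:
 * - to private: unmap from VFIO first, so no pin blocks the conversion;
 * - to shared: convert first, then map into VFIO.
 */
static int convert_inplace(bool to_private, uint64_t iova, uint64_t size,
                           op_fn vfio_unmap, op_fn kvm_convert, op_fn vfio_map)
{
    if (to_private) {
        if (vfio_unmap(iova, size)) {
            return -1;
        }
        return kvm_convert(iova, size);
    }
    if (kvm_convert(iova, size)) {
        return -1;
    }
    return vfio_map(iova, size);
}

/* Tiny recorder used to demonstrate the call order. */
static char call_order[8];
static int call_n;
static int rec_unmap(uint64_t iova, uint64_t size)   { call_order[call_n++] = 'u'; return 0; }
static int rec_convert(uint64_t iova, uint64_t size) { call_order[call_n++] = 'c'; return 0; }
static int rec_map(uint64_t iova, uint64_t size)     { call_order[call_n++] = 'm'; return 0; }
```

The point is only the relative order of the two calls on each path; everything else (error propagation, retries) is out of scope for the sketch.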
>>
>> Limitation
>> ==========
>> One limitation is that VFIO expects the DMA mapping for a specific IOVA
>> to be mapped and unmapped with the same granularity. The guest may
>> perform partial conversions, such as converting a small region within a
>> larger region. To prevent such invalid cases, all operations are
>> performed with 4K granularity. This could be optimized once the
>> cut_mapping operation [6] is introduced in the future: we can always
>> perform a split-before-unmap when a partial conversion happens. If the
>> split succeeds, the unmap will succeed and be atomic; if the split
>> fails, the unmap fails.
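As a rough illustration of the 4K-granularity rule, a conversion request can be driven through a per-page loop so that every mapping is created and later destroyed at exactly the same size. This is a sketch with hypothetical names, not the series' code:

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_4K 4096ULL

/* Hypothetical per-page notifier; stands in for a VFIO dma_map/dma_unmap. */
typedef int (*page_fn)(uint64_t iova, void *opaque);

/*
 * Walk [iova, iova + size) in fixed 4K steps so every VFIO mapping is
 * created and later destroyed with the same granularity, which keeps
 * partial guest conversions (e.g. a 4K hole inside a 2M range) legal.
 */
static int notify_in_4k_chunks(uint64_t iova, uint64_t size,
                               page_fn notify, void *opaque)
{
    if (iova % PAGE_4K || size % PAGE_4K) {
        return -1; /* conversion requests must be 4K aligned */
    }
    for (uint64_t off = 0; off < size; off += PAGE_4K) {
        int ret = notify(iova + off, opaque);
        if (ret) {
            return ret; /* caller decides whether to roll back */
        }
    }
    return 0;
}

/* Demo notifier that just counts the pages it is called for. */
static int count_pages(uint64_t iova, void *opaque)
{
    (*(int *)opaque)++;
    return 0;
}
```

With the future cut_mapping operation, the same walk could instead split a larger mapping before unmapping only the converted part.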
>>
>> Testing
>> =======
>> This patch series is tested based on TDX patches available at:
>> KVM: https://github.com/intel/tdx/tree/kvm-coco-queue-snapshot/kvm-
>> coco-queue-snapshot-20250408
>> QEMU: https://github.com/intel-staging/qemu-tdx/tree/tdx-upstream-
>> snapshot-2025-05-20
>>
>> Because new features like the cut_mapping operation will only be
>> supported in iommufd, it is recommended to use iommufd-backed VFIO
>> with the QEMU command:
> 
> Is it recommended or required? If the VFIO IOMMU type1 backend is not
> supported for confidential VMs, QEMU should fail to start.

The VFIO IOMMU type1 backend is also supported, but it needs an
increased dma_entry_limit module parameter, as this series currently
does the map/unmap with 4K granularity.

> 
> Please add Alex Williamson and I to the Cc: list.

Sure, will do in next version.

> 
> Thanks,
> 
> C.
> 

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v5 10/10] ram-block-attribute: Add more error handling during state changes
  2025-05-26 12:10       ` David Hildenbrand
@ 2025-05-26 12:39         ` Chenyi Qiang
  0 siblings, 0 replies; 51+ messages in thread
From: Chenyi Qiang @ 2025-05-26 12:39 UTC (permalink / raw)
  To: David Hildenbrand, Alexey Kardashevskiy, Peter Xu, Gupta Pankaj,
	Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth
  Cc: qemu-devel, kvm, Williams Dan J, Zhao Liu, Baolu Lu, Gao Chao,
	Xu Yilun, Li Xiaoyao



On 5/26/2025 8:10 PM, David Hildenbrand wrote:
> On 26.05.25 12:19, Chenyi Qiang wrote:
>>
>>
>> On 5/26/2025 5:17 PM, David Hildenbrand wrote:
>>> On 20.05.25 12:28, Chenyi Qiang wrote:
>>>> The current error handling is simple with the following assumptions:
>>>> - QEMU will quit instead of resuming the guest if kvm_convert_memory()
>>>>     fails, thus no need to do rollback.
>>>> - The convert range is required to be in the desired state. It is not
>>>>     allowed to handle the mixture case.
>>>> - The conversion from shared to private is a non-failure operation.
>>>>
>>>> This is sufficient for now as complex error handling is not required.
>>>> For future extension, add some potential error handling.
>>>> - For private to shared conversion, do the rollback operation if
>>>>     ram_block_attribute_notify_to_populated() fails.
>>>> - For shared to private conversion, still assert it as a non-failure
>>>>     operation for now. It could become a failure path with in-place
>>>>     conversion, which will likely have to retry the conversion until
>>>>     it succeeds in the future.
>>>> - For mixture case, process individual blocks for ease of rollback.
>>>>
>>>> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
>>>> ---
>>>>    system/ram-block-attribute.c | 116 ++++++++++++++++++++++++++
>>>> +--------
>>>>    1 file changed, 90 insertions(+), 26 deletions(-)
>>>>
>>>> diff --git a/system/ram-block-attribute.c b/system/ram-block-
>>>> attribute.c
>>>> index 387501b569..0af3396aa4 100644
>>>> --- a/system/ram-block-attribute.c
>>>> +++ b/system/ram-block-attribute.c
>>>> @@ -289,7 +289,12 @@ static int
>>>> ram_block_attribute_notify_to_discard(RamBlockAttribute *attr,
>>>>            }
>>>>            ret = rdl->notify_discard(rdl, &tmp);
>>>>            if (ret) {
>>>> -            break;
>>>> +            /*
>>>> +             * The current to_private listeners (VFIO dma_unmap and
>>>> +             * KVM set_attribute_private) are non-failing operations.
>>>> +             * TODO: add rollback operations if it is allowed to fail.
>>>> +             */
>>>> +            g_assert(ret);
>>>>            }
>>>>        }
>>>>    
>>>
>>> If it's not allowed to fail for now, then patch #8 does not make sense
>>> and should be dropped :)
>>
>> It was intended for future extension, as in-place conversion to_private
>> is allowed to fail. That is why I added patch #8.
>>
>> But as you mentioned, the conversion path is changing, and it may be
>> easier to handle from the KVM code directly. Let me drop patch #8 and
>> wait for the in-place conversion to mature.
> 
> Makes sense. I'm afraid it might all be a bit complicated to handle:
> vfio can fail private -> shared conversion and KVM the shared -> private
> conversion.
> 
> So recovering ... will not be straight forward once multiple pages are
> converted.
> 
>>
>>>
>>> The implementations (vfio) should likely exit() instead on unexpected
>>> errors when discarding.
>>
>> After dropping patch #8, maybe keep the VFIO discard handling as it
>> was. Adding an additional exit() is also OK to me since it is a
>> non-failure case.
>>
>>>
>>>
>>>
>>> Why not squash all the below into the corresponding patch? Looks mostly
>>> like handling partial conversions correctly (as discussed previously)?
>>
>> I extracted these two pieces of handling, 1) mixture conversion and
>> 2) operation rollback, into this separate patch because they are not
>> practical cases and are untested.
>>
>> For 1), I still don't see any real case that would convert a range
>> with mixed attributes.
> 
> Okay. I thought we were not sure if the guest could trigger that?

Yes. At least I didn't see any statement in the TDX GHCI spec mentioning
such a mixture case. And the TDX KVM code checks that the start and end
GPAs have the same attribute before exiting to userspace (maybe with the
assumption that the whole range shares the same attribute?):

vt_is_tdx_private_gpa(vcpu->kvm, gpa) !=
vt_is_tdx_private_gpa(vcpu->kvm, gpa + size - 1)
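Note the kernel-side check above only compares the first and last page of the range. A userspace-side guard that rules out a mixture anywhere in the range would scan the whole bitmap instead. A minimal sketch, with a hypothetical helper name (QEMU's real code iterates find_next_bit() runs rather than testing bit by bit):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Return true iff bits [first, first + npages) of the shared bitmap all
 * have the same value, i.e. the range is not a private/shared mixture.
 * Hypothetical helper for illustration only.
 */
static bool range_is_uniform(const unsigned char *bitmap, size_t first,
                             size_t npages)
{
    bool expect = bitmap[first / 8] & (1u << (first % 8));

    for (size_t i = first + 1; i < first + npages; i++) {
        bool bit = bitmap[i / 8] & (1u << (i % 8));
        if (bit != expect) {
            return false;
        }
    }
    return true;
}
```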

> 
> I think it would be better to just include the "mixture" handling in the
> original patch.

[...]

> 
>>
>> For 2), current failure of memory conversion (as seen in kvm_cpu_exec()
>> ->kvm_convert_memory()) will cause the QEMU to quit instead of resuming
>> guest. Doing the rollback seems useless at present.
> 
> Makes sense.

OK, then I will move the mixture handling into the original patch and
still keep the rollback operation separate.
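For reference, the rollback kept in the separate patch would, on a failed private-to-shared notification, re-discard the already-populated prefix so listeners never see a half-converted range. A hedged sketch with stand-in notifiers (not the series' actual notify_populate/notify_discard code):

```c
#include <assert.h>
#include <stdint.h>

#define BLK 4096ULL

/* Hypothetical per-block notifiers, standing in for RamDiscardListener ops. */
typedef int (*blk_op)(uint64_t off, void *opaque);

/*
 * Convert [off, off + size) from private to shared block by block;
 * on failure, roll the already-converted prefix back to private.
 */
static int to_shared_with_rollback(uint64_t off, uint64_t size,
                                   blk_op populate, blk_op discard,
                                   void *opaque)
{
    for (uint64_t done = 0; done < size; done += BLK) {
        if (populate(off + done, opaque)) {
            /* roll back: re-discard what was already populated */
            for (uint64_t undo = 0; undo < done; undo += BLK) {
                discard(off + undo, opaque); /* assumed non-failing */
            }
            return -1;
        }
    }
    return 0;
}

/* Demo listener state: fail the populate once "fail_at" blocks succeeded. */
struct demo {
    int populated;
    int fail_at;
    int discards;
};

static int demo_populate(uint64_t off, void *opaque)
{
    struct demo *d = opaque;
    if (d->populated == d->fail_at) {
        return -1;
    }
    d->populated++;
    return 0;
}

static int demo_discard(uint64_t off, void *opaque)
{
    struct demo *d = opaque;
    d->discards++;
    return 0;
}
```

As the thread notes, this only covers one direction; once shared-to-private can also fail (in-place conversion), recovery across multiple pages gets considerably harder.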

> 


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v5 08/10] memory: Change NotifyRamDiscard() definition to return the result
  2025-05-26 10:36   ` Cédric Le Goater
@ 2025-05-26 12:44     ` Cédric Le Goater
  2025-05-27  5:29       ` Chenyi Qiang
  0 siblings, 1 reply; 51+ messages in thread
From: Cédric Le Goater @ 2025-05-26 12:44 UTC (permalink / raw)
  To: Chenyi Qiang, David Hildenbrand, Alexey Kardashevskiy, Peter Xu,
	Gupta Pankaj, Paolo Bonzini, Philippe Mathieu-Daudé,
	Michael Roth
  Cc: qemu-devel, kvm, Williams Dan J, Zhao Liu, Baolu Lu, Gao Chao,
	Xu Yilun, Li Xiaoyao

On 5/26/25 12:36, Cédric Le Goater wrote:
> On 5/20/25 12:28, Chenyi Qiang wrote:
>> So that the caller can check the result of the NotifyRamDiscard()
>> handler when the operation fails.
>>
>> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
>> ---
>> Changes in v5:
>>      - Revert to use of NotifyRamDiscard()
>>
>> Changes in v4:
>>      - Newly added.
>> ---
>>   hw/vfio/listener.c           | 6 ++++--
>>   include/system/memory.h      | 4 ++--
>>   system/ram-block-attribute.c | 3 +--
>>   3 files changed, 7 insertions(+), 6 deletions(-)
>>
>> diff --git a/hw/vfio/listener.c b/hw/vfio/listener.c
>> index bfacb3d8d9..06454e0584 100644
>> --- a/hw/vfio/listener.c
>> +++ b/hw/vfio/listener.c
>> @@ -190,8 +190,8 @@ out:
>>       rcu_read_unlock();
>>   }
>> -static void vfio_ram_discard_notify_discard(RamDiscardListener *rdl,
>> -                                            MemoryRegionSection *section)
>> +static int vfio_ram_discard_notify_discard(RamDiscardListener *rdl,
>> +                                           MemoryRegionSection *section)
>>   {
>>       VFIORamDiscardListener *vrdl = container_of(rdl, VFIORamDiscardListener,
>>                                                   listener);
>> @@ -206,6 +206,8 @@ static void vfio_ram_discard_notify_discard(RamDiscardListener *rdl,
>>           error_report("%s: vfio_container_dma_unmap() failed: %s", __func__,
>>                        strerror(-ret));
>>       }
>> +
>> +    return ret;
>>   }
> 
> vfio_ram_discard_notify_populate() should also be modified
> to return this value.

Nope. It should not. This is a rollback path in case of error. All good.

Thanks,

C.


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v5 04/10] ram-block-attribute: Introduce RamBlockAttribute to manage RAMBlock with guest_memfd
  2025-05-26 11:16       ` Alexey Kardashevskiy
@ 2025-05-27  1:15         ` Chenyi Qiang
  2025-05-27  1:20           ` Alexey Kardashevskiy
  0 siblings, 1 reply; 51+ messages in thread
From: Chenyi Qiang @ 2025-05-27  1:15 UTC (permalink / raw)
  To: Alexey Kardashevskiy, David Hildenbrand, Peter Xu, Gupta Pankaj,
	Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth
  Cc: qemu-devel, kvm, Williams Dan J, Zhao Liu, Baolu Lu, Gao Chao,
	Xu Yilun, Li Xiaoyao



On 5/26/2025 7:16 PM, Alexey Kardashevskiy wrote:
> 
> 
> On 26/5/25 19:28, Chenyi Qiang wrote:
>>
>>
>> On 5/26/2025 5:01 PM, David Hildenbrand wrote:
>>> On 20.05.25 12:28, Chenyi Qiang wrote:
>>>> Commit 852f0048f3 ("RAMBlock: make guest_memfd require uncoordinated
>>>> discard") highlighted that subsystems like VFIO may disable RAM block
>>>> discard. However, guest_memfd relies on discard operations for page
>>>> conversion between private and shared memory, potentially leading to
>>>> stale IOMMU mapping issue when assigning hardware devices to
>>>> confidential VMs via shared memory. To address this and allow shared
>>>> device assignment, it is crucial to ensure the VFIO subsystem
>>>> refreshes its IOMMU mappings.
>>>>
>>>> RamDiscardManager is an existing interface (used by virtio-mem) to
>>>> adjust VFIO mappings in relation to VM page assignment. Effectively
>>>> page
>>>> conversion is similar to hot-removing a page in one mode and adding it
>>>> back in the other. Therefore, similar actions are required for page
>>>> conversion events. Introduce the RamDiscardManager to guest_memfd to
>>>> facilitate this process.
>>>>
>>>> Since guest_memfd is not an object, it cannot directly implement the
>>>> RamDiscardManager interface. Implementing it in HostMemoryBackend is
>>>> not appropriate because guest_memfd is per RAMBlock, and some RAMBlocks
>>>> have a memory backend while others do not. Notably, virtual BIOS
>>>> RAMBlocks using memory_region_init_ram_guest_memfd() do not have a
>>>> backend.
>>>>
>>>> To manage RAMBlocks with guest_memfd, define a new object named
>>>> RamBlockAttribute to implement the RamDiscardManager interface. This
>>>> object can store the guest_memfd information such as bitmap for shared
>>>> memory, and handles page conversion notification. In the context of
>>>> RamDiscardManager, shared state is analogous to populated and private
>>>> state is treated as discard. The memory state is tracked at the host
>>>> page size granularity, as minimum memory conversion size can be one
>>>> page
>>>> per request. Additionally, VFIO expects the DMA mapping for a specific
>>>> iova to be mapped and unmapped with the same granularity. Confidential
>>>> VMs may perform partial conversions, such as conversions on small
>>>> regions within larger regions. To prevent such invalid cases and until
>>>> cut_mapping operation support is available, all operations are
>>>> performed
>>>> with 4K granularity.
>>>>
>>>> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
>>>> ---
>>>> Changes in v5:
>>>>       - Revert to use RamDiscardManager interface instead of
>>>> introducing
>>>>         new hierarchy of class to manage private/shared state, and keep
>>>>         using the new name of RamBlockAttribute compared with the
>>>>         MemoryAttributeManager in v3.
>>>>       - Use *simple* version of object_define and object_declare
>>>> since the
>>>>         state_change() function is changed as an exported function
>>>> instead
>>>>         of a virtual function in later patch.
>>>>       - Move the introduction of RamBlockAttribute field to this
>>>> patch and
>>>>         rename it to ram_shared. (Alexey)
>>>>       - call the exit() when register/unregister failed. (Zhao)
>>>>       - Add the ram-block-attribute.c to Memory API related part in
>>>>         MAINTAINERS.
>>>>
>>>> Changes in v4:
>>>>       - Change the name from memory-attribute-manager to
>>>>         ram-block-attribute.
>>>>       - Implement the newly-introduced PrivateSharedManager instead of
>>>>         RamDiscardManager and change related commit message.
>>>>       - Define the new object in ramblock.h instead of adding a new
>>>> file.
>>>>
>>>> Changes in v3:
>>>>       - Some rename (bitmap_size->shared_bitmap_size,
>>>>         first_one/zero_bit->first_bit, etc.)
>>>>       - Change shared_bitmap_size from uint32_t to unsigned
>>>>       - Return mgr->mr->ram_block->page_size in get_block_size()
>>>>       - Move set_ram_discard_manager() up to avoid a g_free() in
>>>> failure
>>>>         case.
>>>>       - Add const for the memory_attribute_manager_get_block_size()
>>>>       - Unify the ReplayRamPopulate and ReplayRamDiscard and related
>>>>         callback.
>>>>
>>>> Changes in v2:
>>>>       - Rename the object name to MemoryAttributeManager
>>>>       - Rename the bitmap to shared_bitmap to make it more clear.
>>>>       - Remove block_size field and get it from a helper. In future, we
>>>>         can get the page_size from RAMBlock if necessary.
>>>>       - Remove the unncessary "struct" before GuestMemfdReplayData
>>>>       - Remove the unncessary g_free() for the bitmap
>>>>       - Add some error report when the callback failure for
>>>>         populated/discarded section.
>>>>       - Move the realize()/unrealize() definition to this patch.
>>>> ---
>>>>    MAINTAINERS                  |   1 +
>>>>    include/system/ramblock.h    |  20 +++
>>>>    system/meson.build           |   1 +
>>>>    system/ram-block-attribute.c | 311 ++++++++++++++++++++++++++++++
>>>> +++++
>>>>    4 files changed, 333 insertions(+)
>>>>    create mode 100644 system/ram-block-attribute.c
>>>>
>>>> diff --git a/MAINTAINERS b/MAINTAINERS
>>>> index 6dacd6d004..3b4947dc74 100644
>>>> --- a/MAINTAINERS
>>>> +++ b/MAINTAINERS
>>>> @@ -3149,6 +3149,7 @@ F: system/memory.c
>>>>    F: system/memory_mapping.c
>>>>    F: system/physmem.c
>>>>    F: system/memory-internal.h
>>>> +F: system/ram-block-attribute.c
>>>>    F: scripts/coccinelle/memory-region-housekeeping.cocci
>>>>      Memory devices
>>>> diff --git a/include/system/ramblock.h b/include/system/ramblock.h
>>>> index d8a116ba99..09255e8495 100644
>>>> --- a/include/system/ramblock.h
>>>> +++ b/include/system/ramblock.h
>>>> @@ -22,6 +22,10 @@
>>>>    #include "exec/cpu-common.h"
>>>>    #include "qemu/rcu.h"
>>>>    #include "exec/ramlist.h"
>>>> +#include "system/hostmem.h"
>>>> +
>>>> +#define TYPE_RAM_BLOCK_ATTRIBUTE "ram-block-attribute"
>>>> +OBJECT_DECLARE_SIMPLE_TYPE(RamBlockAttribute, RAM_BLOCK_ATTRIBUTE)
>>>>      struct RAMBlock {
>>>>        struct rcu_head rcu;
>>>> @@ -42,6 +46,8 @@ struct RAMBlock {
>>>>        int fd;
>>>>        uint64_t fd_offset;
>>>>        int guest_memfd;
>>>> +    /* 1-setting of the bitmap in ram_shared represents ram is
>>>> shared */
>>>
>>> That comment looks misplaced, and the variable misnamed.
>>>
>>> The commet should go into RamBlockAttribute and the variable should
>>> likely be named "attributes".
>>>
>>> Also, "ram_shared" is not used at all in this patch, it should be moved
>>> into the corresponding patch.
>>
>> I thought we only manage the private and shared attributes, so I named
>> it ram_shared, planning to rename it to "attributes" if other
>> attributes are managed in the future. It seems I overcomplicated
>> things.
> 
> 
> We manage populated vs discarded. Right now populated==shared but the
> very next thing I will try doing is flipping this to populated==private.
> Thanks,

Can you elaborate on why your case needs this flip? Populated and
discarded are the two states represented in the bitmap; is it workable
to just call the related handler based on the bitmap?
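To make the question concrete: which bitmap value is treated as "populated" is a policy layered on top of the same two-state bitmap, so the flip would change only the dispatch, not the tracking. A sketch with hypothetical names (today: bit set == shared == populated; the TDISP case discussed below would treat private as populated instead):

```c
#include <assert.h>
#include <stdbool.h>

/* Which bitmap value maps to notify_populate() is a policy choice. */
struct state_policy {
    bool populated_bit; /* bitmap value that means "populated" */
};

/* Pick the notifier for one page: true => notify_populate(),
 * false => notify_discard(). Hypothetical helper, not QEMU code. */
static bool wants_populate(const struct state_policy *p, bool shared_bit)
{
    return shared_bit == p->populated_bit;
}
```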

> 
>>
>>>
>>>> +    RamBlockAttribute *ram_shared;
>>>>        size_t page_size;
>>>>        /* dirty bitmap used during migration */
>>>>        unsigned long *bmap;
>>>> @@ -91,4 +97,18 @@ struct RAMBlock {
>>>>        ram_addr_t postcopy_length;
>>>>    };
>>>>    +struct RamBlockAttribute {
>>>
>>> Should this actually be "RamBlockAttributes" ?
>>
>>> Yes. To match the variable name "attributes", it can be renamed to
>>> RamBlockAttributes.
>>
>>>
>>>> +    Object parent;
>>>> +
>>>> +    MemoryRegion *mr;
>>>
>>>
>>> Should we link to the parent RAMBlock instead, and lookup the MR from
>>> there?
>>
>>> Good suggestion! It can also help to shorten the long pointer chain in
>>> ram_block_attribute_get_block_size().
>>
>>>
>>>
>>
> 


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v5 04/10] ram-block-attribute: Introduce RamBlockAttribute to manage RAMBlock with guest_memfd
  2025-05-27  1:15         ` Chenyi Qiang
@ 2025-05-27  1:20           ` Alexey Kardashevskiy
  2025-05-27  3:14             ` Chenyi Qiang
  0 siblings, 1 reply; 51+ messages in thread
From: Alexey Kardashevskiy @ 2025-05-27  1:20 UTC (permalink / raw)
  To: Chenyi Qiang, David Hildenbrand, Peter Xu, Gupta Pankaj,
	Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth
  Cc: qemu-devel, kvm, Williams Dan J, Zhao Liu, Baolu Lu, Gao Chao,
	Xu Yilun, Li Xiaoyao



On 27/5/25 11:15, Chenyi Qiang wrote:
> 
> 
> On 5/26/2025 7:16 PM, Alexey Kardashevskiy wrote:
>>
>>
>> On 26/5/25 19:28, Chenyi Qiang wrote:
>>>
>>>
>>> On 5/26/2025 5:01 PM, David Hildenbrand wrote:
>>>> On 20.05.25 12:28, Chenyi Qiang wrote:
>>>>> [...]
>>>>> diff --git a/include/system/ramblock.h b/include/system/ramblock.h
>>>>> index d8a116ba99..09255e8495 100644
>>>>> --- a/include/system/ramblock.h
>>>>> +++ b/include/system/ramblock.h
>>>>> @@ -22,6 +22,10 @@
>>>>>     #include "exec/cpu-common.h"
>>>>>     #include "qemu/rcu.h"
>>>>>     #include "exec/ramlist.h"
>>>>> +#include "system/hostmem.h"
>>>>> +
>>>>> +#define TYPE_RAM_BLOCK_ATTRIBUTE "ram-block-attribute"
>>>>> +OBJECT_DECLARE_SIMPLE_TYPE(RamBlockAttribute, RAM_BLOCK_ATTRIBUTE)
>>>>>       struct RAMBlock {
>>>>>         struct rcu_head rcu;
>>>>> @@ -42,6 +46,8 @@ struct RAMBlock {
>>>>>         int fd;
>>>>>         uint64_t fd_offset;
>>>>>         int guest_memfd;
>>>>> +    /* 1-setting of the bitmap in ram_shared represents ram is
>>>>> shared */
>>>>
>>>> That comment looks misplaced, and the variable misnamed.
>>>>
>>>> The commet should go into RamBlockAttribute and the variable should
>>>> likely be named "attributes".
>>>>
>>>> Also, "ram_shared" is not used at all in this patch, it should be moved
>>>> into the corresponding patch.
>>>
>>> I thought we only manage the private and shared attribute, so name it as
>>> ram_shared. And in the future if managing other attributes, then rename
>>> it to attributes. It seems I overcomplicated things.
>>
>>
>> We manage populated vs discarded. Right now populated==shared but the
>> very next thing I will try doing is flipping this to populated==private.
>> Thanks,
> 
> Can you elaborate your case why need to do the flip? populated and
> discarded are two states represented in the bitmap, is it workable to
> just call the related handler based on the bitmap?


Due to the lack of in-place memory conversion in upstream Linux, this is
the way to allow DMA for TDISP devices. So I'll need to make
populated==private, the opposite of the current populated==shared (and
change the kernel too, of course). Not sure I'm going to push real hard
though; it depends on the in-place private/shared memory conversion
work. Thanks,


> 
>>
>>>
>>>>
>>>>> +    RamBlockAttribute *ram_shared;
>>>>>         size_t page_size;
>>>>>         /* dirty bitmap used during migration */
>>>>>         unsigned long *bmap;
>>>>> @@ -91,4 +97,18 @@ struct RAMBlock {
>>>>>         ram_addr_t postcopy_length;
>>>>>     };
>>>>>     +struct RamBlockAttribute {
>>>>
>>>> Should this actually be "RamBlockAttributes" ?
>>>
>>> Yes. To match with variable name "attributes", it can be renamed as
>>> RamBlockAttributes.
>>>
>>>>
>>>>> +    Object parent;
>>>>> +
>>>>> +    MemoryRegion *mr;
>>>>
>>>>
>>>> Should we link to the parent RAMBlock instead, and lookup the MR from
>>>> there?
>>>
>>> Good suggestion! It can also help to reduce the long arrow operation in
>>> ram_block_attribute_get_block_size().
>>>
>>>>
>>>>
>>>
>>
> 

-- 
Alexey


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v5 04/10] ram-block-attribute: Introduce RamBlockAttribute to manage RAMBlock with guest_memfd
  2025-05-27  1:20           ` Alexey Kardashevskiy
@ 2025-05-27  3:14             ` Chenyi Qiang
  2025-05-27  6:06               ` Alexey Kardashevskiy
  0 siblings, 1 reply; 51+ messages in thread
From: Chenyi Qiang @ 2025-05-27  3:14 UTC (permalink / raw)
  To: Alexey Kardashevskiy, David Hildenbrand, Peter Xu, Gupta Pankaj,
	Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth
  Cc: qemu-devel, kvm, Williams Dan J, Zhao Liu, Baolu Lu, Gao Chao,
	Xu Yilun, Li Xiaoyao



On 5/27/2025 9:20 AM, Alexey Kardashevskiy wrote:
> 
> 
> On 27/5/25 11:15, Chenyi Qiang wrote:
>>
>>
>> On 5/26/2025 7:16 PM, Alexey Kardashevskiy wrote:
>>>
>>>
>>> On 26/5/25 19:28, Chenyi Qiang wrote:
>>>>
>>>>
>>>> On 5/26/2025 5:01 PM, David Hildenbrand wrote:
>>>>> On 20.05.25 12:28, Chenyi Qiang wrote:
>>>>>> Commit 852f0048f3 ("RAMBlock: make guest_memfd require uncoordinated
>>>>>> discard") highlighted that subsystems like VFIO may disable RAM block
>>>>>> discard. However, guest_memfd relies on discard operations for page
>>>>>> conversion between private and shared memory, potentially leading to
>>>>>> stale IOMMU mapping issue when assigning hardware devices to
>>>>>> confidential VMs via shared memory. To address this and allow shared
>>>>>> device assignement, it is crucial to ensure VFIO system refresh its
>>>>>> IOMMU mappings.
>>>>>>
>>>>>> RamDiscardManager is an existing interface (used by virtio-mem) to
>>>>>> adjust VFIO mappings in relation to VM page assignment. Effectively
>>>>>> page
>>>>>> conversion is similar to hot-removing a page in one mode and
>>>>>> adding it
>>>>>> back in the other. Therefore, similar actions are required for page
>>>>>> conversion events. Introduce the RamDiscardManager to guest_memfd to
>>>>>> facilitate this process.
>>>>>>
>>>>>> Since guest_memfd is not an object, it cannot directly implement the
>>>>>> RamDiscardManager interface. Implementing it in HostMemoryBackend is
>>>>>> not appropriate because guest_memfd is per RAMBlock, and some
>>>>>> RAMBlocks
>>>>>> have a memory backend while others do not. Notably, virtual BIOS
>>>>>> RAMBlocks using memory_region_init_ram_guest_memfd() do not have a
>>>>>> backend.
>>>>>>
>>>>>> To manage RAMBlocks with guest_memfd, define a new object named
>>>>>> RamBlockAttribute to implement the RamDiscardManager interface. This
>>>>>> object can store the guest_memfd information such as bitmap for
>>>>>> shared
>>>>>> memory, and handles page conversion notification. In the context of
>>>>>> RamDiscardManager, shared state is analogous to populated and private
>>>>>> state is treated as discard. The memory state is tracked at the host
>>>>>> page size granularity, as minimum memory conversion size can be one
>>>>>> page
>>>>>> per request. Additionally, VFIO expects the DMA mapping for a
>>>>>> specific
>>>>>> iova to be mapped and unmapped with the same granularity.
>>>>>> Confidential
>>>>>> VMs may perform partial conversions, such as conversions on small
>>>>>> regions within larger regions. To prevent such invalid cases and
>>>>>> until
>>>>>> cut_mapping operation support is available, all operations are
>>>>>> performed
>>>>>> with 4K granularity.
>>>>>>
>>>>>> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
>>>>>> ---
>>>>>> Changes in v5:
>>>>>>        - Revert to use RamDiscardManager interface instead of
>>>>>> introducing
>>>>>>          new hierarchy of class to manage private/shared state,
>>>>>> and keep
>>>>>>          using the new name of RamBlockAttribute compared with the
>>>>>>          MemoryAttributeManager in v3.
>>>>>>        - Use *simple* version of object_define and object_declare
>>>>>> since the
>>>>>>          state_change() function is changed as an exported function
>>>>>> instead
>>>>>>          of a virtual function in later patch.
>>>>>>        - Move the introduction of RamBlockAttribute field to this
>>>>>> patch and
>>>>>>          rename it to ram_shared. (Alexey)
>>>>>>        - call the exit() when register/unregister failed. (Zhao)
>>>>>>        - Add the ram-block-attribute.c to Memory API related part in
>>>>>>          MAINTAINERS.
>>>>>>
>>>>>> Changes in v4:
>>>>>>        - Change the name from memory-attribute-manager to
>>>>>>          ram-block-attribute.
>>>>>>        - Implement the newly-introduced PrivateSharedManager
>>>>>> instead of
>>>>>>          RamDiscardManager and change related commit message.
>>>>>>        - Define the new object in ramblock.h instead of adding a new
>>>>>> file.
>>>>>>
>>>>>> Changes in v3:
>>>>>>        - Some rename (bitmap_size->shared_bitmap_size,
>>>>>>          first_one/zero_bit->first_bit, etc.)
>>>>>>        - Change shared_bitmap_size from uint32_t to unsigned
>>>>>>        - Return mgr->mr->ram_block->page_size in get_block_size()
>>>>>>        - Move set_ram_discard_manager() up to avoid a g_free() in the
>>>>>>          failure case.
>>>>>>        - Add const for the memory_attribute_manager_get_block_size()
>>>>>>        - Unify the ReplayRamPopulate and ReplayRamDiscard and related
>>>>>>          callback.
>>>>>>
>>>>>> Changes in v2:
>>>>>>        - Rename the object name to MemoryAttributeManager
>>>>>>        - Rename the bitmap to shared_bitmap to make it more clear.
>>>>>>        - Remove the block_size field and get it from a helper. In the
>>>>>>          future, we can get the page_size from RAMBlock if necessary.
>>>>>>        - Remove the unnecessary "struct" before GuestMemfdReplayData
>>>>>>        - Remove the unnecessary g_free() for the bitmap
>>>>>>        - Add error reports when the callback fails for
>>>>>>          populated/discarded sections.
>>>>>>        - Move the realize()/unrealize() definition to this patch.
>>>>>> ---
>>>>>>     MAINTAINERS                  |   1 +
>>>>>>     include/system/ramblock.h    |  20 +++
>>>>>>     system/meson.build           |   1 +
>>>>>>     system/ram-block-attribute.c | 311 +++++++++++++++++++++++++++++++++++
>>>>>>     4 files changed, 333 insertions(+)
>>>>>>     create mode 100644 system/ram-block-attribute.c
>>>>>>
>>>>>> diff --git a/MAINTAINERS b/MAINTAINERS
>>>>>> index 6dacd6d004..3b4947dc74 100644
>>>>>> --- a/MAINTAINERS
>>>>>> +++ b/MAINTAINERS
>>>>>> @@ -3149,6 +3149,7 @@ F: system/memory.c
>>>>>>     F: system/memory_mapping.c
>>>>>>     F: system/physmem.c
>>>>>>     F: system/memory-internal.h
>>>>>> +F: system/ram-block-attribute.c
>>>>>>     F: scripts/coccinelle/memory-region-housekeeping.cocci
>>>>>>       Memory devices
>>>>>> diff --git a/include/system/ramblock.h b/include/system/ramblock.h
>>>>>> index d8a116ba99..09255e8495 100644
>>>>>> --- a/include/system/ramblock.h
>>>>>> +++ b/include/system/ramblock.h
>>>>>> @@ -22,6 +22,10 @@
>>>>>>     #include "exec/cpu-common.h"
>>>>>>     #include "qemu/rcu.h"
>>>>>>     #include "exec/ramlist.h"
>>>>>> +#include "system/hostmem.h"
>>>>>> +
>>>>>> +#define TYPE_RAM_BLOCK_ATTRIBUTE "ram-block-attribute"
>>>>>> +OBJECT_DECLARE_SIMPLE_TYPE(RamBlockAttribute, RAM_BLOCK_ATTRIBUTE)
>>>>>>       struct RAMBlock {
>>>>>>         struct rcu_head rcu;
>>>>>> @@ -42,6 +46,8 @@ struct RAMBlock {
>>>>>>         int fd;
>>>>>>         uint64_t fd_offset;
>>>>>>         int guest_memfd;
>>>>>> +    /* 1-setting of the bitmap in ram_shared represents ram is shared */
>>>>>
>>>>> That comment looks misplaced, and the variable misnamed.
>>>>>
>>>>> The comment should go into RamBlockAttribute and the variable should
>>>>> likely be named "attributes".
>>>>>
>>>>> Also, "ram_shared" is not used at all in this patch, it should be
>>>>> moved
>>>>> into the corresponding patch.
>>>>
>>>> I thought we only managed the private and shared attributes, so I
>>>> named it ram_shared, intending to rename it to "attributes" if other
>>>> attributes were ever managed. It seems I overcomplicated things.
>>>
>>>
>>> We manage populated vs discarded. Right now populated==shared but the
>>> very next thing I will try doing is flipping this to populated==private.
>>> Thanks,
>>
>> Can you elaborate on the case where you need to do the flip? Populated
>> and discarded are two states represented in the bitmap; is it workable
>> to just call the related handler based on the bitmap?
> 
> 
> Due to the lack of in-place memory conversion in upstream Linux, this is
> the way to allow DMA for TDISP devices. So I'll need to make
> populated==private, the opposite of the current populated==shared (and
> change the kernel too, of course). Not sure I'm going to push real hard
> though, depending on the in-place private/shared memory conversion work.
> Thanks,

Do you mean to operate only on the private mapping? This is workable if
you don't want to manipulate the shared mapping. But if you want both
(for example, a to_private conversion needs to discard the shared mapping
and populate the private mapping in the IOMMU), it may be possible to
pass in a parameter indicating the current operation, allowing the
listener callback to decide how to proceed, or to extend it via other
mechanisms.
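The "pass in a parameter" idea from this exchange can be sketched as follows. The `Toy*` names are hypothetical (this is not the QEMU listener API): an extra argument tells the listener which conversion is happening, so one callback can serve both directions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical conversion kinds a listener could be told about. */
typedef enum {
    TOY_TO_SHARED,   /* discard the private mapping, populate the shared one */
    TOY_TO_PRIVATE,  /* discard the shared mapping, populate the private one */
} ToyConversion;

typedef struct ToyListener ToyListener;
struct ToyListener {
    /* Returns 0 on success, a negative value on failure. */
    int (*notify_convert)(ToyListener *l, uint64_t offset, uint64_t size,
                          ToyConversion kind);
    int shared_mappings;   /* demo bookkeeping only */
    int private_mappings;
};

/* A VFIO-like listener: the 'kind' parameter lets one callback decide
 * which mapping to drop and which to install. */
static int toy_vfio_notify(ToyListener *l, uint64_t offset, uint64_t size,
                           ToyConversion kind)
{
    (void)offset;
    (void)size;
    if (kind == TOY_TO_PRIVATE) {
        l->shared_mappings--;    /* unmap the stale shared DMA mapping */
        l->private_mappings++;   /* map the private view (if supported) */
    } else {
        l->private_mappings--;   /* unmap the stale private DMA mapping */
        l->shared_mappings++;    /* map the shared view */
    }
    return 0;
}
```

A listener that only cares about one direction (as in the TDISP case discussed above) could simply ignore the other `kind`.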

> 
> 
>>
>>>
>>>>
>>>>>
>>>>>> +    RamBlockAttribute *ram_shared;
>>>>>>         size_t page_size;
>>>>>>         /* dirty bitmap used during migration */
>>>>>>         unsigned long *bmap;
>>>>>> @@ -91,4 +97,18 @@ struct RAMBlock {
>>>>>>         ram_addr_t postcopy_length;
>>>>>>     };
>>>>>>     +struct RamBlockAttribute {
>>>>>
>>>>> Should this actually be "RamBlockAttributes" ?
>>>>
>>>> Yes. To match with variable name "attributes", it can be renamed as
>>>> RamBlockAttributes.
>>>>
>>>>>
>>>>>> +    Object parent;
>>>>>> +
>>>>>> +    MemoryRegion *mr;
>>>>>
>>>>>
>>>>> Should we link to the parent RAMBlock instead, and lookup the MR from
>>>>> there?
>>>>
>>>> Good suggestion! It can also help to reduce the long arrow operation in
>>>> ram_block_attribute_get_block_size().
>>>>
>>>>>
>>>>>
>>>>
>>>
>>
> 


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v5 08/10] memory: Change NotifyRamDiscard() definition to return the result
  2025-05-26 12:44     ` Cédric Le Goater
@ 2025-05-27  5:29       ` Chenyi Qiang
  0 siblings, 0 replies; 51+ messages in thread
From: Chenyi Qiang @ 2025-05-27  5:29 UTC (permalink / raw)
  To: Cédric Le Goater, David Hildenbrand, Alexey Kardashevskiy,
	Peter Xu, Gupta Pankaj, Paolo Bonzini,
	Philippe Mathieu-Daudé, Michael Roth
  Cc: qemu-devel, kvm, Williams Dan J, Zhao Liu, Baolu Lu, Gao Chao,
	Xu Yilun, Li Xiaoyao



On 5/26/2025 8:44 PM, Cédric Le Goater wrote:
> On 5/26/25 12:36, Cédric Le Goater wrote:
>> On 5/20/25 12:28, Chenyi Qiang wrote:
>>> So that the caller can check the result of the NotifyRamDiscard()
>>> handler if the operation fails.
>>>
>>> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
>>> ---
>>> Changes in v5:
>>>      - Revert to use of NotifyRamDiscard()
>>>
>>> Changes in v4:
>>>      - Newly added.
>>> ---
>>>   hw/vfio/listener.c           | 6 ++++--
>>>   include/system/memory.h      | 4 ++--
>>>   system/ram-block-attribute.c | 3 +--
>>>   3 files changed, 7 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/hw/vfio/listener.c b/hw/vfio/listener.c
>>> index bfacb3d8d9..06454e0584 100644
>>> --- a/hw/vfio/listener.c
>>> +++ b/hw/vfio/listener.c
>>> @@ -190,8 +190,8 @@ out:
>>>       rcu_read_unlock();
>>>   }
>>> -static void vfio_ram_discard_notify_discard(RamDiscardListener *rdl,
>>> -                                            MemoryRegionSection *section)
>>> +static int vfio_ram_discard_notify_discard(RamDiscardListener *rdl,
>>> +                                           MemoryRegionSection *section)
>>>   {
>>>       VFIORamDiscardListener *vrdl = container_of(rdl,
>>> VFIORamDiscardListener,
>>>                                                   listener);
>>> @@ -206,6 +206,8 @@ static void vfio_ram_discard_notify_discard(RamDiscardListener *rdl,
>>>           error_report("%s: vfio_container_dma_unmap() failed: %s", __func__,
>>>                        strerror(-ret));
>>>       }
>>> +
>>> +    return ret;
>>>   }
>>
>> vfio_ram_discard_notify_populate() should also be modified
>> to return this value.
> 
> Nope. It should not. This is a rollback path in case of error. All good.

Thanks for your review! Anyway, according to the discussion in patch
#10, I'll revert this patch in the next version, since handling the
failure case of notifying discard is future work.
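Independent of whether the patch stays, the void-to-int change it makes can be illustrated with a minimal model (the `Toy*` names are made up, not the QEMU types): once notifiers return a result, the caller can stop, or later roll back, on the first failure instead of merely logging it.

```c
#include <assert.h>

/* Before the change, discard notifiers returned void and errors were
 * only logged; afterwards they return an int. Toy model only. */
typedef struct {
    int fail_with;   /* 0 = succeed, otherwise (positive) error code */
} ToyDiscardListener;

static int toy_notify_discard(ToyDiscardListener *l)
{
    /* e.g. a failed vfio_container_dma_unmap() would surface here */
    return l->fail_with ? -l->fail_with : 0;
}

/* The caller can now stop on the first failure instead of continuing
 * blindly through the remaining listeners. */
static int toy_notify_all(ToyDiscardListener *listeners, int n)
{
    for (int i = 0; i < n; i++) {
        int ret = toy_notify_discard(&listeners[i]);
        if (ret) {
            return ret;
        }
    }
    return 0;
}
```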

> 
> Thanks,
> 
> C.
> 



* Re: [PATCH v5 07/10] RAMBlock: Make guest_memfd require coordinated discard
  2025-05-26  9:08   ` David Hildenbrand
@ 2025-05-27  5:47     ` Chenyi Qiang
  2025-05-27  7:42       ` Alexey Kardashevskiy
  2025-05-27 11:20       ` David Hildenbrand
  0 siblings, 2 replies; 51+ messages in thread
From: Chenyi Qiang @ 2025-05-27  5:47 UTC (permalink / raw)
  To: David Hildenbrand, Alexey Kardashevskiy, Peter Xu, Gupta Pankaj,
	Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth
  Cc: qemu-devel, kvm, Williams Dan J, Zhao Liu, Baolu Lu, Gao Chao,
	Xu Yilun, Li Xiaoyao



On 5/26/2025 5:08 PM, David Hildenbrand wrote:
> On 20.05.25 12:28, Chenyi Qiang wrote:
>> As guest_memfd is now managed by RamBlockAttribute with
>> RamDiscardManager, only block uncoordinated discard.
>>
>> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
>> ---
>> Changes in v5:
>>      - Revert to use RamDiscardManager.
>>
>> Changes in v4:
>>      - Modify commit message (RamDiscardManager->PrivateSharedManager).
>>
>> Changes in v3:
>>      - No change.
>>
>> Changes in v2:
>>      - Change the ram_block_discard_require(false) to
>>        ram_block_coordinated_discard_require(false).
>> ---
>>   system/physmem.c | 6 +++---
>>   1 file changed, 3 insertions(+), 3 deletions(-)
>>
>> diff --git a/system/physmem.c b/system/physmem.c
>> index f05f7ff09a..58b7614660 100644
>> --- a/system/physmem.c
>> +++ b/system/physmem.c
>> @@ -1916,7 +1916,7 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
>>           }
>>           assert(new_block->guest_memfd < 0);
>>   -        ret = ram_block_discard_require(true);
>> +        ret = ram_block_coordinated_discard_require(true);
>>           if (ret < 0) {
>>               error_setg_errno(errp, -ret,
>>                                "cannot set up private guest memory: discard currently blocked");
>> @@ -1939,7 +1939,7 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
>>                * ever develops a need to check for errors.
>>                */
>>               close(new_block->guest_memfd);
>> -            ram_block_discard_require(false);
>> +            ram_block_coordinated_discard_require(false);
>>               qemu_mutex_unlock_ramlist();
>>               goto out_free;
>>           }
>> @@ -2302,7 +2302,7 @@ static void reclaim_ramblock(RAMBlock *block)
>>       if (block->guest_memfd >= 0) {
>>           ram_block_attribute_destroy(block->ram_shared);
>>           close(block->guest_memfd);
>> -        ram_block_discard_require(false);
>> +        ram_block_coordinated_discard_require(false);
>>       }
>>         g_free(block);
> 
> 
> I think this patch should be squashed into the previous one, then the
> story in that single patch is consistent.

I think this patch is the gate that allows device assignment with
guest_memfd, so I'd like to keep it separate. Can we instead add a note
to the commit message of the previous one? Like:

"Using guest_memfd with vfio is still blocked via
ram_block_discard_disable()/ram_block_discard_require()."
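The require/unrequire pairing these hunks preserve can be sketched with a toy counter model. The names below are invented; the real ram_block_discard_disable()/ram_block_coordinated_discard_require() track more global state. The point the patch relies on is that every error path must drop the requirement it took, keeping the count balanced.

```c
#include <assert.h>

/* Toy model: a user that disabled all discard (in the style of
 * ram_block_discard_disable()) blocks new requirers. Illustrative. */
static int toy_coordinated_required;
static int toy_discard_disabled;

static int toy_coordinated_discard_require(int state)
{
    if (state) {
        if (toy_discard_disabled) {
            return -16; /* EBUSY-style conflict in this toy model */
        }
        toy_coordinated_required++;
    } else {
        toy_coordinated_required--;
    }
    return 0;
}

/* Mimics the shape of ram_block_add(): every error path after a
 * successful require(1) must issue the matching require(0). */
static int toy_ram_block_add(int fail_after_require)
{
    int ret = toy_coordinated_discard_require(1);
    if (ret < 0) {
        return ret;
    }
    if (fail_after_require) {
        toy_coordinated_discard_require(0); /* rollback keeps the balance */
        return -1;
    }
    return 0; /* teardown calls require(0) later, as in reclaim_ramblock() */
}
```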

> 


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v5 04/10] ram-block-attribute: Introduce RamBlockAttribute to manage RAMBlock with guest_memfd
  2025-05-27  3:14             ` Chenyi Qiang
@ 2025-05-27  6:06               ` Alexey Kardashevskiy
  0 siblings, 0 replies; 51+ messages in thread
From: Alexey Kardashevskiy @ 2025-05-27  6:06 UTC (permalink / raw)
  To: Chenyi Qiang, David Hildenbrand, Peter Xu, Gupta Pankaj,
	Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth
  Cc: qemu-devel, kvm, Williams Dan J, Zhao Liu, Baolu Lu, Gao Chao,
	Xu Yilun, Li Xiaoyao



On 27/5/25 13:14, Chenyi Qiang wrote:
> 
> 
> On 5/27/2025 9:20 AM, Alexey Kardashevskiy wrote:
>>
>>
>> On 27/5/25 11:15, Chenyi Qiang wrote:
>>>
>>>
>>> On 5/26/2025 7:16 PM, Alexey Kardashevskiy wrote:
>>>>
>>>>
>>>> On 26/5/25 19:28, Chenyi Qiang wrote:
>>>>>
>>>>>
>>>>> On 5/26/2025 5:01 PM, David Hildenbrand wrote:
>>>>>> On 20.05.25 12:28, Chenyi Qiang wrote:
>>>>>>> Commit 852f0048f3 ("RAMBlock: make guest_memfd require uncoordinated
>>>>>>> discard") highlighted that subsystems like VFIO may disable RAM block
>>>>>>> discard. However, guest_memfd relies on discard operations for page
>>>>>>> conversion between private and shared memory, potentially leading to
>>>>>>> stale IOMMU mapping issues when assigning hardware devices to
>>>>>>> confidential VMs via shared memory. To address this and allow shared
>>>>>>> device assignment, it is crucial to ensure the VFIO subsystem
>>>>>>> refreshes its IOMMU mappings.
>>>>>>>
>>>>>>> RamDiscardManager is an existing interface (used by virtio-mem) to
>>>>>>> adjust VFIO mappings in relation to VM page assignment. Effectively
>>>>>>> page
>>>>>>> conversion is similar to hot-removing a page in one mode and
>>>>>>> adding it
>>>>>>> back in the other. Therefore, similar actions are required for page
>>>>>>> conversion events. Introduce the RamDiscardManager to guest_memfd to
>>>>>>> facilitate this process.
>>>>>>>
>>>>>>> Since guest_memfd is not an object, it cannot directly implement the
>>>>>>> RamDiscardManager interface. Implementing it in HostMemoryBackend is
>>>>>>> not appropriate because guest_memfd is per RAMBlock, and some
>>>>>>> RAMBlocks
>>>>>>> have a memory backend while others do not. Notably, virtual BIOS
>>>>>>> RAMBlocks using memory_region_init_ram_guest_memfd() do not have a
>>>>>>> backend.
>>>>>>>
>>>>>>> To manage RAMBlocks with guest_memfd, define a new object named
>>>>>>> RamBlockAttribute to implement the RamDiscardManager interface. This
>>>>>>> object can store the guest_memfd information such as bitmap for
>>>>>>> shared
>>>>>>> memory, and handles page conversion notification. In the context of
>>>>>>> RamDiscardManager, shared state is analogous to populated and private
>>>>>>> state is treated as discard. The memory state is tracked at the host
>>>>>>> page size granularity, as the minimum memory conversion size can be
>>>>>>> one page per request. Additionally, VFIO expects the DMA mapping for
>>>>>>> a specific iova to be mapped and unmapped with the same granularity.
>>>>>>> Confidential VMs may perform partial conversions, such as conversions
>>>>>>> on small regions within larger regions. To prevent such invalid cases
>>>>>>> and until cut_mapping operation support is available, all operations
>>>>>>> are performed with 4K granularity.
>>>>>>>
>>>>>>> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
>>>>>>> ---
>>>>>>> [... changelog and diff hunks trimmed; identical to the copy quoted
>>>>>>> earlier in the thread ...]
>>>>>>>        struct RAMBlock {
>>>>>>>          struct rcu_head rcu;
>>>>>>> @@ -42,6 +46,8 @@ struct RAMBlock {
>>>>>>>          int fd;
>>>>>>>          uint64_t fd_offset;
>>>>>>>          int guest_memfd;
>>>>>>> +    /* 1-setting of the bitmap in ram_shared represents ram is shared */
>>>>>>
>>>>>> That comment looks misplaced, and the variable misnamed.
>>>>>>
>>>>>> The comment should go into RamBlockAttribute and the variable should
>>>>>> likely be named "attributes".
>>>>>>
>>>>>> Also, "ram_shared" is not used at all in this patch, it should be
>>>>>> moved
>>>>>> into the corresponding patch.
>>>>>
>>>>> I thought we only managed the private and shared attributes, so I
>>>>> named it ram_shared, intending to rename it to "attributes" if other
>>>>> attributes were ever managed. It seems I overcomplicated things.
>>>>
>>>>
>>>> We manage populated vs discarded. Right now populated==shared but the
>>>> very next thing I will try doing is flipping this to populated==private.
>>>> Thanks,
>>>
>>> Can you elaborate on the case where you need to do the flip? Populated
>>> and discarded are two states represented in the bitmap; is it workable
>>> to just call the related handler based on the bitmap?
>>
>>
>> Due to the lack of in-place memory conversion in upstream Linux, this is
>> the way to allow DMA for TDISP devices. So I'll need to make
>> populated==private, the opposite of the current populated==shared (and
>> change the kernel too, of course). Not sure I'm going to push real hard
>> though, depending on the in-place private/shared memory conversion work.
>> Thanks,
> 
> Do you mean to operate only on the private mapping? This is workable if
> you don't want to manipulate the shared mapping. But if you want both,

But I do not want both at the moment, as I only have a big knob to make
all DMA traffic either private or shared but not both (well, I could
split the guest RAM in two halves at some BAR address, but that's it).

> for
> example, a to_private conversion needs to discard the shared mapping and
> populate the private mapping in the IOMMU; it may be possible to pass in
> a parameter to indicate the current operation, allowing the listener
> callback to decide how to proceed, or to extend it via other mechanisms.

True. Thanks,


-- 
Alexey



* Re: [PATCH v5 03/10] memory: Unify the definition of ReplayRamPopulate() and ReplayRamDiscard()
  2025-05-20 10:28 ` [PATCH v5 03/10] memory: Unify the definition of ReplayRamPopulate() and ReplayRamDiscard() Chenyi Qiang
  2025-05-26  8:42   ` David Hildenbrand
  2025-05-26  9:35   ` Philippe Mathieu-Daudé
@ 2025-05-27  6:56   ` Alexey Kardashevskiy
  2 siblings, 0 replies; 51+ messages in thread
From: Alexey Kardashevskiy @ 2025-05-27  6:56 UTC (permalink / raw)
  To: Chenyi Qiang, David Hildenbrand, Peter Xu, Gupta Pankaj,
	Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth
  Cc: qemu-devel, kvm, Williams Dan J, Zhao Liu, Baolu Lu, Gao Chao,
	Xu Yilun, Li Xiaoyao



On 20/5/25 20:28, Chenyi Qiang wrote:
> Update the ReplayRamDiscard() function to return the result, and unify
> ReplayRamPopulate() and ReplayRamDiscard() into ReplayRamDiscardState(),
> since their definitions are identical. This unification simplifies
> related structures, such as VirtIOMEMReplayData, making them cleaner.
> 
> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>

Reviewed-by: Alexey Kardashevskiy <aik@amd.com>

> ---
> Changes in v5:
>      - Rename ReplayRamStateChange to ReplayRamDiscardState (David)
>      - return data->fn(s, data->opaque) instead of 0 in
>        virtio_mem_rdm_replay_discarded_cb(). (Alexey)
> 
> Changes in v4:
>      - Modify the commit message. We won't use Replay() operation when
>        doing the attribute change like v3.
> 
> Changes in v3:
>      - Newly added.
> ---
>   hw/virtio/virtio-mem.c  | 21 ++++++++++-----------
>   include/system/memory.h | 36 +++++++++++++++++++-----------------
>   migration/ram.c         |  5 +++--
>   system/memory.c         | 12 ++++++------
>   4 files changed, 38 insertions(+), 36 deletions(-)
> 
> diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c
> index 2e491e8c44..c46f6f9c3e 100644
> --- a/hw/virtio/virtio-mem.c
> +++ b/hw/virtio/virtio-mem.c
> @@ -1732,7 +1732,7 @@ static bool virtio_mem_rdm_is_populated(const RamDiscardManager *rdm,
>   }
>   
>   struct VirtIOMEMReplayData {
> -    void *fn;
> +    ReplayRamDiscardState fn;
>       void *opaque;
>   };
>   
> @@ -1740,12 +1740,12 @@ static int virtio_mem_rdm_replay_populated_cb(MemoryRegionSection *s, void *arg)
>   {
>       struct VirtIOMEMReplayData *data = arg;
>   
> -    return ((ReplayRamPopulate)data->fn)(s, data->opaque);
> +    return data->fn(s, data->opaque);
>   }
>   
>   static int virtio_mem_rdm_replay_populated(const RamDiscardManager *rdm,
>                                              MemoryRegionSection *s,
> -                                           ReplayRamPopulate replay_fn,
> +                                           ReplayRamDiscardState replay_fn,
>                                              void *opaque)
>   {
>       const VirtIOMEM *vmem = VIRTIO_MEM(rdm);
> @@ -1764,14 +1764,13 @@ static int virtio_mem_rdm_replay_discarded_cb(MemoryRegionSection *s,
>   {
>       struct VirtIOMEMReplayData *data = arg;
>   
> -    ((ReplayRamDiscard)data->fn)(s, data->opaque);
> -    return 0;
> +    return data->fn(s, data->opaque);
>   }
>   
> -static void virtio_mem_rdm_replay_discarded(const RamDiscardManager *rdm,
> -                                            MemoryRegionSection *s,
> -                                            ReplayRamDiscard replay_fn,
> -                                            void *opaque)
> +static int virtio_mem_rdm_replay_discarded(const RamDiscardManager *rdm,
> +                                           MemoryRegionSection *s,
> +                                           ReplayRamDiscardState replay_fn,
> +                                           void *opaque)
>   {
>       const VirtIOMEM *vmem = VIRTIO_MEM(rdm);
>       struct VirtIOMEMReplayData data = {
> @@ -1780,8 +1779,8 @@ static void virtio_mem_rdm_replay_discarded(const RamDiscardManager *rdm,
>       };
>   
>       g_assert(s->mr == &vmem->memdev->mr);
> -    virtio_mem_for_each_unplugged_section(vmem, s, &data,
> -                                          virtio_mem_rdm_replay_discarded_cb);
> +    return virtio_mem_for_each_unplugged_section(vmem, s, &data,
> +                                                 virtio_mem_rdm_replay_discarded_cb);
>   }
>   
>   static void virtio_mem_rdm_register_listener(RamDiscardManager *rdm,
> diff --git a/include/system/memory.h b/include/system/memory.h
> index 896948deb1..83b28551c4 100644
> --- a/include/system/memory.h
> +++ b/include/system/memory.h
> @@ -575,8 +575,8 @@ static inline void ram_discard_listener_init(RamDiscardListener *rdl,
>       rdl->double_discard_supported = double_discard_supported;
>   }
>   
> -typedef int (*ReplayRamPopulate)(MemoryRegionSection *section, void *opaque);
> -typedef void (*ReplayRamDiscard)(MemoryRegionSection *section, void *opaque);
> +typedef int (*ReplayRamDiscardState)(MemoryRegionSection *section,
> +                                     void *opaque);
>   
>   /*
>    * RamDiscardManagerClass:
> @@ -650,36 +650,38 @@ struct RamDiscardManagerClass {
>       /**
>        * @replay_populated:
>        *
> -     * Call the #ReplayRamPopulate callback for all populated parts within the
> -     * #MemoryRegionSection via the #RamDiscardManager.
> +     * Call the #ReplayRamDiscardState callback for all populated parts within
> +     * the #MemoryRegionSection via the #RamDiscardManager.
>        *
>        * In case any call fails, no further calls are made.
>        *
>        * @rdm: the #RamDiscardManager
>        * @section: the #MemoryRegionSection
> -     * @replay_fn: the #ReplayRamPopulate callback
> +     * @replay_fn: the #ReplayRamDiscardState callback
>        * @opaque: pointer to forward to the callback
>        *
>        * Returns 0 on success, or a negative error if any notification failed.
>        */
>       int (*replay_populated)(const RamDiscardManager *rdm,
>                               MemoryRegionSection *section,
> -                            ReplayRamPopulate replay_fn, void *opaque);
> +                            ReplayRamDiscardState replay_fn, void *opaque);
>   
>       /**
>        * @replay_discarded:
>        *
> -     * Call the #ReplayRamDiscard callback for all discarded parts within the
> -     * #MemoryRegionSection via the #RamDiscardManager.
> +     * Call the #ReplayRamDiscardState callback for all discarded parts within
> +     * the #MemoryRegionSection via the #RamDiscardManager.
>        *
>        * @rdm: the #RamDiscardManager
>        * @section: the #MemoryRegionSection
> -     * @replay_fn: the #ReplayRamDiscard callback
> +     * @replay_fn: the #ReplayRamDiscardState callback
>        * @opaque: pointer to forward to the callback
> +     *
> +     * Returns 0 on success, or a negative error if any notification failed.
>        */
> -    void (*replay_discarded)(const RamDiscardManager *rdm,
> -                             MemoryRegionSection *section,
> -                             ReplayRamDiscard replay_fn, void *opaque);
> +    int (*replay_discarded)(const RamDiscardManager *rdm,
> +                            MemoryRegionSection *section,
> +                            ReplayRamDiscardState replay_fn, void *opaque);
>   
>       /**
>        * @register_listener:
> @@ -722,13 +724,13 @@ bool ram_discard_manager_is_populated(const RamDiscardManager *rdm,
>   
>   int ram_discard_manager_replay_populated(const RamDiscardManager *rdm,
>                                            MemoryRegionSection *section,
> -                                         ReplayRamPopulate replay_fn,
> +                                         ReplayRamDiscardState replay_fn,
>                                            void *opaque);
>   
> -void ram_discard_manager_replay_discarded(const RamDiscardManager *rdm,
> -                                          MemoryRegionSection *section,
> -                                          ReplayRamDiscard replay_fn,
> -                                          void *opaque);
> +int ram_discard_manager_replay_discarded(const RamDiscardManager *rdm,
> +                                         MemoryRegionSection *section,
> +                                         ReplayRamDiscardState replay_fn,
> +                                         void *opaque);
>   
>   void ram_discard_manager_register_listener(RamDiscardManager *rdm,
>                                              RamDiscardListener *rdl,
> diff --git a/migration/ram.c b/migration/ram.c
> index e12913b43e..c004f37060 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -848,8 +848,8 @@ static inline bool migration_bitmap_clear_dirty(RAMState *rs,
>       return ret;
>   }
>   
> -static void dirty_bitmap_clear_section(MemoryRegionSection *section,
> -                                       void *opaque)
> +static int dirty_bitmap_clear_section(MemoryRegionSection *section,
> +                                      void *opaque)
>   {
>       const hwaddr offset = section->offset_within_region;
>       const hwaddr size = int128_get64(section->size);
> @@ -868,6 +868,7 @@ static void dirty_bitmap_clear_section(MemoryRegionSection *section,
>       }
>       *cleared_bits += bitmap_count_one_with_offset(rb->bmap, start, npages);
>       bitmap_clear(rb->bmap, start, npages);
> +    return 0;
>   }
>   
>   /*
> diff --git a/system/memory.c b/system/memory.c
> index b45b508dce..de45fbdd3f 100644
> --- a/system/memory.c
> +++ b/system/memory.c
> @@ -2138,7 +2138,7 @@ bool ram_discard_manager_is_populated(const RamDiscardManager *rdm,
>   
>   int ram_discard_manager_replay_populated(const RamDiscardManager *rdm,
>                                            MemoryRegionSection *section,
> -                                         ReplayRamPopulate replay_fn,
> +                                         ReplayRamDiscardState replay_fn,
>                                            void *opaque)
>   {
>       RamDiscardManagerClass *rdmc = RAM_DISCARD_MANAGER_GET_CLASS(rdm);
> @@ -2147,15 +2147,15 @@ int ram_discard_manager_replay_populated(const RamDiscardManager *rdm,
>       return rdmc->replay_populated(rdm, section, replay_fn, opaque);
>   }
>   
> -void ram_discard_manager_replay_discarded(const RamDiscardManager *rdm,
> -                                          MemoryRegionSection *section,
> -                                          ReplayRamDiscard replay_fn,
> -                                          void *opaque)
> +int ram_discard_manager_replay_discarded(const RamDiscardManager *rdm,
> +                                         MemoryRegionSection *section,
> +                                         ReplayRamDiscardState replay_fn,
> +                                         void *opaque)
>   {
>       RamDiscardManagerClass *rdmc = RAM_DISCARD_MANAGER_GET_CLASS(rdm);
>   
>       g_assert(rdmc->replay_discarded);
> -    rdmc->replay_discarded(rdm, section, replay_fn, opaque);
> +    return rdmc->replay_discarded(rdm, section, replay_fn, opaque);
>   }
>   
>   void ram_discard_manager_register_listener(RamDiscardManager *rdm,

-- 
Alexey


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v5 02/10] memory: Change memory_region_set_ram_discard_manager() to return the result
  2025-05-20 10:28 ` [PATCH v5 02/10] memory: Change memory_region_set_ram_discard_manager() to return the result Chenyi Qiang
  2025-05-26  8:40   ` David Hildenbrand
@ 2025-05-27  6:56   ` Alexey Kardashevskiy
  1 sibling, 0 replies; 51+ messages in thread
From: Alexey Kardashevskiy @ 2025-05-27  6:56 UTC (permalink / raw)
  To: Chenyi Qiang, David Hildenbrand, Peter Xu, Gupta Pankaj,
	Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth
  Cc: qemu-devel, kvm, Williams Dan J, Zhao Liu, Baolu Lu, Gao Chao,
	Xu Yilun, Li Xiaoyao



On 20/5/25 20:28, Chenyi Qiang wrote:
> Modify memory_region_set_ram_discard_manager() to return -EBUSY if a
> RamDiscardManager is already set in the MemoryRegion. The caller must
> handle this failure, such as having virtio-mem undo its actions and fail
> the realize() process. Opportunistically move the call earlier to avoid
> complex error handling.
> 
> This change is beneficial when introducing a new RamDiscardManager
> instance besides virtio-mem. After
> ram_block_coordinated_discard_require(true) unlocks all
> RamDiscardManager instances, only one instance is allowed to be set for
> one MemoryRegion at present.
> 
> Suggested-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
> ---
> Changes in v5:
>      - Nit in commit message (return false -> -EBUSY)
>      - Add set_ram_discard_manager(NULL) when ram_block_discard_range()
>        fails.

Reviewed-by: Alexey Kardashevskiy <aik@amd.com>


> 
> Changes in v4:
>      - No change.
> 
> Changes in v3:
>      - Move set_ram_discard_manager() up to avoid a g_free()
>      - Clean up set_ram_discard_manager() definition
> 
> Changes in v2:
>      - newly added.
> ---
>   hw/virtio/virtio-mem.c  | 30 +++++++++++++++++-------------
>   include/system/memory.h |  6 +++---
>   system/memory.c         | 10 +++++++---
>   3 files changed, 27 insertions(+), 19 deletions(-)
> 
> diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c
> index b3c126ea1e..2e491e8c44 100644
> --- a/hw/virtio/virtio-mem.c
> +++ b/hw/virtio/virtio-mem.c
> @@ -1047,6 +1047,17 @@ static void virtio_mem_device_realize(DeviceState *dev, Error **errp)
>           return;
>       }
>   
> +    /*
> +     * Set ourselves as RamDiscardManager before the plug handler maps the
> +     * memory region and exposes it via an address space.
> +     */
> +    if (memory_region_set_ram_discard_manager(&vmem->memdev->mr,
> +                                              RAM_DISCARD_MANAGER(vmem))) {
> +        error_setg(errp, "Failed to set RamDiscardManager");
> +        ram_block_coordinated_discard_require(false);
> +        return;
> +    }
> +
>       /*
>        * We don't know at this point whether shared RAM is migrated using
>        * QEMU or migrated using the file content. "x-ignore-shared" will be
> @@ -1061,6 +1072,7 @@ static void virtio_mem_device_realize(DeviceState *dev, Error **errp)
>           ret = ram_block_discard_range(rb, 0, qemu_ram_get_used_length(rb));
>           if (ret) {
>               error_setg_errno(errp, -ret, "Unexpected error discarding RAM");
> +            memory_region_set_ram_discard_manager(&vmem->memdev->mr, NULL);
>               ram_block_coordinated_discard_require(false);
>               return;
>           }
> @@ -1122,13 +1134,6 @@ static void virtio_mem_device_realize(DeviceState *dev, Error **errp)
>       vmem->system_reset = VIRTIO_MEM_SYSTEM_RESET(obj);
>       vmem->system_reset->vmem = vmem;
>       qemu_register_resettable(obj);
> -
> -    /*
> -     * Set ourselves as RamDiscardManager before the plug handler maps the
> -     * memory region and exposes it via an address space.
> -     */
> -    memory_region_set_ram_discard_manager(&vmem->memdev->mr,
> -                                          RAM_DISCARD_MANAGER(vmem));
>   }
>   
>   static void virtio_mem_device_unrealize(DeviceState *dev)
> @@ -1136,12 +1141,6 @@ static void virtio_mem_device_unrealize(DeviceState *dev)
>       VirtIODevice *vdev = VIRTIO_DEVICE(dev);
>       VirtIOMEM *vmem = VIRTIO_MEM(dev);
>   
> -    /*
> -     * The unplug handler unmapped the memory region, it cannot be
> -     * found via an address space anymore. Unset ourselves.
> -     */
> -    memory_region_set_ram_discard_manager(&vmem->memdev->mr, NULL);
> -
>       qemu_unregister_resettable(OBJECT(vmem->system_reset));
>       object_unref(OBJECT(vmem->system_reset));
>   
> @@ -1154,6 +1153,11 @@ static void virtio_mem_device_unrealize(DeviceState *dev)
>       virtio_del_queue(vdev, 0);
>       virtio_cleanup(vdev);
>       g_free(vmem->bitmap);
> +    /*
> +     * The unplug handler unmapped the memory region, it cannot be
> +     * found via an address space anymore. Unset ourselves.
> +     */
> +    memory_region_set_ram_discard_manager(&vmem->memdev->mr, NULL);
>       ram_block_coordinated_discard_require(false);
>   }
>   
> diff --git a/include/system/memory.h b/include/system/memory.h
> index b961c4076a..896948deb1 100644
> --- a/include/system/memory.h
> +++ b/include/system/memory.h
> @@ -2499,13 +2499,13 @@ static inline bool memory_region_has_ram_discard_manager(MemoryRegion *mr)
>    *
>    * This function must not be called for a mapped #MemoryRegion, a #MemoryRegion
>    * that does not cover RAM, or a #MemoryRegion that already has a
> - * #RamDiscardManager assigned.
> + * #RamDiscardManager assigned. Return 0 if the rdm is set successfully.
>    *
>    * @mr: the #MemoryRegion
>    * @rdm: #RamDiscardManager to set
>    */
> -void memory_region_set_ram_discard_manager(MemoryRegion *mr,
> -                                           RamDiscardManager *rdm);
> +int memory_region_set_ram_discard_manager(MemoryRegion *mr,
> +                                          RamDiscardManager *rdm);
>   
>   /**
>    * memory_region_find: translate an address/size relative to a
> diff --git a/system/memory.c b/system/memory.c
> index 63b983efcd..b45b508dce 100644
> --- a/system/memory.c
> +++ b/system/memory.c
> @@ -2106,12 +2106,16 @@ RamDiscardManager *memory_region_get_ram_discard_manager(MemoryRegion *mr)
>       return mr->rdm;
>   }
>   
> -void memory_region_set_ram_discard_manager(MemoryRegion *mr,
> -                                           RamDiscardManager *rdm)
> +int memory_region_set_ram_discard_manager(MemoryRegion *mr,
> +                                          RamDiscardManager *rdm)
>   {
>       g_assert(memory_region_is_ram(mr));
> -    g_assert(!rdm || !mr->rdm);
> +    if (mr->rdm && rdm) {
> +        return -EBUSY;
> +    }
> +
>       mr->rdm = rdm;
> +    return 0;
>   }
>   
>   uint64_t ram_discard_manager_get_min_granularity(const RamDiscardManager *rdm,

-- 
Alexey



* Re: [PATCH v5 05/10] ram-block-attribute: Introduce a helper to notify shared/private state changes
  2025-05-20 10:28 ` [PATCH v5 05/10] ram-block-attribute: Introduce a helper to notify shared/private state changes Chenyi Qiang
  2025-05-26  9:02   ` David Hildenbrand
@ 2025-05-27  7:35   ` Alexey Kardashevskiy
  2025-05-27  9:06     ` Chenyi Qiang
  1 sibling, 1 reply; 51+ messages in thread
From: Alexey Kardashevskiy @ 2025-05-27  7:35 UTC (permalink / raw)
  To: Chenyi Qiang, David Hildenbrand, Peter Xu, Gupta Pankaj,
	Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth
  Cc: qemu-devel, kvm, Williams Dan J, Zhao Liu, Baolu Lu, Gao Chao,
	Xu Yilun, Li Xiaoyao



On 20/5/25 20:28, Chenyi Qiang wrote:
> A new state_change() helper is introduced for RamBlockAttribute
> to efficiently notify all registered RamDiscardListeners, including
> VFIO listeners, about memory conversion events in guest_memfd. The VFIO
> listener can dynamically DMA map/unmap shared pages based on conversion
> types:
> - For conversions from shared to private, the VFIO system ensures the
>    discarding of shared mapping from the IOMMU.
> - For conversions from private to shared, it triggers the population of
>    the shared mapping into the IOMMU.
> 
> Currently, memory conversion failures cause QEMU to quit instead of
> resuming the guest or retrying the operation. It would be a future work
> to add more error handling or rollback mechanisms once conversion
> failures are allowed. For example, in-place conversion of guest_memfd
> could retry the unmap operation during the conversion from shared to
> private. However, for now, keep the complex error handling out of the
> picture as it is not required:
> 
> - If a conversion request is made for a page already in the desired
>    state, the helper simply returns success.
> - For requests involving a range partially in the desired state, there
>    is no such scenario in practice at present. Simply return an error.
> - If a conversion request is declined by other systems, such as a
>    failure from VFIO during notify_to_populated(), the failure is
>    returned directly. As for notify_to_discard(), VFIO cannot fail
>    unmap/unpin, so no error is returned.
> 
> Note that the bitmap status is updated before callbacks, allowing
> listeners to handle memory based on the latest status.
> 
> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
> ---
> Change in v5:
>      - Move the state_change() back to a helper instead of a callback of
>        the class since there's no child for the RamBlockAttributeClass.
>      - Remove the error handling and move them to an individual patch for
>        simple management.
> 
> Changes in v4:
>      - Add the state_change() callback in PrivateSharedManagerClass
>        instead of the RamBlockAttribute.
> 
> Changes in v3:
>      - Move the bitmap update before notifier callbacks.
>      - Call the notifier callbacks directly in notify_discard/populate()
>        with the expectation that the request memory range is in the
>        desired attribute.
>      - For the case that only partial range in the desire status, handle
>        the range with block_size granularity for ease of rollback
>        (https://lore.kernel.org/qemu-devel/812768d7-a02d-4b29-95f3-fb7a125cf54e@redhat.com/)
> 
> Changes in v2:
>      - Do the alignment changes due to the rename to MemoryAttributeManager
>      - Move the state_change() helper definition in this patch.
> ---
>   include/system/ramblock.h    |   2 +
>   system/ram-block-attribute.c | 134 +++++++++++++++++++++++++++++++++++
>   2 files changed, 136 insertions(+)
> 
> diff --git a/include/system/ramblock.h b/include/system/ramblock.h
> index 09255e8495..270dffb2f3 100644
> --- a/include/system/ramblock.h
> +++ b/include/system/ramblock.h
> @@ -108,6 +108,8 @@ struct RamBlockAttribute {
>       QLIST_HEAD(, RamDiscardListener) rdl_list;
>   };
>   
> +int ram_block_attribute_state_change(RamBlockAttribute *attr, uint64_t offset,
> +                                     uint64_t size, bool to_private);

Not sure about the "to_private" name. I'd think private/shared is something KVM operates with and here, in RamBlock, it is discarded/populated.

>   RamBlockAttribute *ram_block_attribute_create(MemoryRegion *mr);
>   void ram_block_attribute_destroy(RamBlockAttribute *attr);
>   
> diff --git a/system/ram-block-attribute.c b/system/ram-block-attribute.c
> index 8d4a24738c..f12dd4b881 100644
> --- a/system/ram-block-attribute.c
> +++ b/system/ram-block-attribute.c
> @@ -253,6 +253,140 @@ ram_block_attribute_rdm_replay_discard(const RamDiscardManager *rdm,
>                                               ram_block_attribute_rdm_replay_cb);
>   }
>   
> +static bool ram_block_attribute_is_valid_range(RamBlockAttribute *attr,
> +                                               uint64_t offset, uint64_t size)
> +{
> +    MemoryRegion *mr = attr->mr;
> +
> +    g_assert(mr);
> +
> +    uint64_t region_size = memory_region_size(mr);
> +    int block_size = ram_block_attribute_get_block_size(attr);

It is size_t, not int.

> +
> +    if (!QEMU_IS_ALIGNED(offset, block_size)) {

Does not the @size have to be aligned too?

> +        return false;
> +    }
> +    if (offset + size < offset || !size) {

This could be just (offset + size <= offset).
(these overflow checks always blow up my little brain)

> +        return false;
> +    }
> +    if (offset >= region_size || offset + size > region_size) {

Just (offset + size > region_size) should do.

> +        return false;
> +    }
> +    return true;
> +}
> +
> +static void ram_block_attribute_notify_to_discard(RamBlockAttribute *attr,
> +                                                  uint64_t offset,
> +                                                  uint64_t size)
> +{
> +    RamDiscardListener *rdl;
> +
> +    QLIST_FOREACH(rdl, &attr->rdl_list, next) {
> +        MemoryRegionSection tmp = *rdl->section;
> +
> +        if (!memory_region_section_intersect_range(&tmp, offset, size)) {
> +            continue;
> +        }
> +        rdl->notify_discard(rdl, &tmp);
> +    }
> +}
> +
> +static int
> +ram_block_attribute_notify_to_populated(RamBlockAttribute *attr,
> +                                        uint64_t offset, uint64_t size)
> +{
> +    RamDiscardListener *rdl;
> +    int ret = 0;
> +
> +    QLIST_FOREACH(rdl, &attr->rdl_list, next) {
> +        MemoryRegionSection tmp = *rdl->section;
> +
> +        if (!memory_region_section_intersect_range(&tmp, offset, size)) {
> +            continue;
> +        }
> +        ret = rdl->notify_populate(rdl, &tmp);
> +        if (ret) {
> +            break;
> +        }
> +    }
> +
> +    return ret;
> +}
> +
> +static bool ram_block_attribute_is_range_populated(RamBlockAttribute *attr,
> +                                                   uint64_t offset,
> +                                                   uint64_t size)
> +{
> +    const int block_size = ram_block_attribute_get_block_size(attr);

size_t.

> +    const unsigned long first_bit = offset / block_size;
> +    const unsigned long last_bit = first_bit + (size / block_size) - 1;
> +    unsigned long found_bit;
> +
> +    /* We fake a shorter bitmap to avoid searching too far. */

What is "fake" about it? We truthfully check here that every bit in [first_bit, last_bit] is set.

> +    found_bit = find_next_zero_bit(attr->bitmap, last_bit + 1,
> +                                   first_bit);
> +    return found_bit > last_bit;
> +}
> +
> +static bool
> +ram_block_attribute_is_range_discard(RamBlockAttribute *attr,
> +                                     uint64_t offset, uint64_t size)
> +{
> +    const int block_size = ram_block_attribute_get_block_size(attr);

size_t.

> +    const unsigned long first_bit = offset / block_size;
> +    const unsigned long last_bit = first_bit + (size / block_size) - 1;
> +    unsigned long found_bit;
> +
> +    /* We fake a shorter bitmap to avoid searching too far. */
> +    found_bit = find_next_bit(attr->bitmap, last_bit + 1, first_bit);
> +    return found_bit > last_bit;
> +}
> +
> +int ram_block_attribute_state_change(RamBlockAttribute *attr, uint64_t offset,
> +                                     uint64_t size, bool to_private)
> +{
> +    const int block_size = ram_block_attribute_get_block_size(attr);

size_t.

> +    const unsigned long first_bit = offset / block_size;
> +    const unsigned long nbits = size / block_size;
> +    int ret = 0;
> +
> +    if (!ram_block_attribute_is_valid_range(attr, offset, size)) {
> +        error_report("%s, invalid range: offset 0x%lx, size 0x%lx",
> +                     __func__, offset, size);
> +        return -1;

May be -EINVAL?

> +    }
> +
> +    /* Already discard/populated */
> +    if ((ram_block_attribute_is_range_discard(attr, offset, size) &&
> +         to_private) ||
> +        (ram_block_attribute_is_range_populated(attr, offset, size) &&
> +         !to_private)) {

A tracepoint would be useful here imho.

> +        return 0;
> +    }
> +
> +    /* Unexpected mixture */
> +    if ((!ram_block_attribute_is_range_populated(attr, offset, size) &&
> +         to_private) ||
> +        (!ram_block_attribute_is_range_discard(attr, offset, size) &&
> +         !to_private)) {
> +        error_report("%s, the range is not all in the desired state: "
> +                     "(offset 0x%lx, size 0x%lx), %s",
> +                     __func__, offset, size,
> +                     to_private ? "private" : "shared");
> +        return -1;

-EBUSY?

> +    }
> +
> +    if (to_private) {
> +        bitmap_clear(attr->bitmap, first_bit, nbits);
> +        ram_block_attribute_notify_to_discard(attr, offset, size);
> +    } else {
> +        bitmap_set(attr->bitmap, first_bit, nbits);
> +        ret = ram_block_attribute_notify_to_populated(attr, offset, size);
> +    }

and a successful tracepoint here may be?

> +
> +    return ret;
> +}
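Tracepoints at the two spots the reviewer marks would follow QEMU's usual pattern: an entry in the component's trace-events file plus a trace_*() call at the site. A hedged sketch, with hypothetical names and file placement:

```
# system/trace-events (hypothetical entries)
ram_block_attribute_state_change_noop(uint64_t offset, uint64_t size, int to_private) "already in desired state: offset 0x%"PRIx64" size 0x%"PRIx64" to_private %d"
ram_block_attribute_state_change(uint64_t offset, uint64_t size, int to_private) "offset 0x%"PRIx64" size 0x%"PRIx64" to_private %d"
```

The corresponding calls in ram_block_attribute_state_change() would be `trace_ram_block_attribute_state_change_noop(offset, size, to_private);` in the already-converted early-return path, and `trace_ram_block_attribute_state_change(offset, size, to_private);` once the bitmap update and notification succeed.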
> +
>   RamBlockAttribute *ram_block_attribute_create(MemoryRegion *mr)
>   {
>       uint64_t bitmap_size;

-- 
Alexey



* Re: [PATCH v5 07/10] RAMBlock: Make guest_memfd require coordinate discard
  2025-05-27  5:47     ` Chenyi Qiang
@ 2025-05-27  7:42       ` Alexey Kardashevskiy
  2025-05-27  8:12         ` Chenyi Qiang
  2025-05-27 11:20       ` David Hildenbrand
  1 sibling, 1 reply; 51+ messages in thread
From: Alexey Kardashevskiy @ 2025-05-27  7:42 UTC (permalink / raw)
  To: Chenyi Qiang, David Hildenbrand, Peter Xu, Gupta Pankaj,
	Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth
  Cc: qemu-devel, kvm, Williams Dan J, Zhao Liu, Baolu Lu, Gao Chao,
	Xu Yilun, Li Xiaoyao



On 27/5/25 15:47, Chenyi Qiang wrote:
> 
> 
> On 5/26/2025 5:08 PM, David Hildenbrand wrote:
>> On 20.05.25 12:28, Chenyi Qiang wrote:
>>> As guest_memfd is now managed by RamBlockAttribute with
>>> RamDiscardManager, only block uncoordinated discard.
>>>
>>> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
>>> ---
>>> Changes in v5:
>>>       - Revert to use RamDiscardManager.
>>>
>>> Changes in v4:
>>>       - Modify commit message (RamDiscardManager->PrivateSharedManager).
>>>
>>> Changes in v3:
>>>       - No change.
>>>
>>> Changes in v2:
>>>       - Change the ram_block_discard_require(false) to
>>>         ram_block_coordinated_discard_require(false).
>>> ---
>>>    system/physmem.c | 6 +++---
>>>    1 file changed, 3 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/system/physmem.c b/system/physmem.c
>>> index f05f7ff09a..58b7614660 100644
>>> --- a/system/physmem.c
>>> +++ b/system/physmem.c
>>> @@ -1916,7 +1916,7 @@ static void ram_block_add(RAMBlock *new_block,
>>> Error **errp)
>>>            }
>>>            assert(new_block->guest_memfd < 0);
>>>    -        ret = ram_block_discard_require(true);
>>> +        ret = ram_block_coordinated_discard_require(true);
>>>            if (ret < 0) {
>>>                error_setg_errno(errp, -ret,
>>>                                 "cannot set up private guest memory:
>>> discard currently blocked");
>>> @@ -1939,7 +1939,7 @@ static void ram_block_add(RAMBlock *new_block,
>>> Error **errp)
>>>                 * ever develops a need to check for errors.
>>>                 */
>>>                close(new_block->guest_memfd);
>>> -            ram_block_discard_require(false);
>>> +            ram_block_coordinated_discard_require(false);
>>>                qemu_mutex_unlock_ramlist();
>>>                goto out_free;
>>>            }
>>> @@ -2302,7 +2302,7 @@ static void reclaim_ramblock(RAMBlock *block)
>>>        if (block->guest_memfd >= 0) {
>>>            ram_block_attribute_destroy(block->ram_shared);
>>>            close(block->guest_memfd);
>>> -        ram_block_discard_require(false);
>>> +        ram_block_coordinated_discard_require(false);
>>>        }
>>>          g_free(block);
>>
>>
>> I think this patch should be squashed into the previous one, then the
>> story in that single patch is consistent.
> 
> I think this patch is the gate that allows device assignment with
> guest_memfd, and I want to keep it separate.

It is not good for bisectability: whatever problem 06/10 may have, git bisect will point to this one.
And it is confusing when, within the same patchset, lines are added and then removed.
And 06/10 (especially after removing the LiveMigration checks) and 07/10 are too small and too closely related to separate. Thanks,

> Can we instead add a note to the commit message
> of the previous one? Like:
> 
> "Using guest_memfd with vfio is still blocked via
> ram_block_discard_disable()/ram_block_discard_require()."
> 
>>
> 

-- 
Alexey



* Re: [PATCH v5 09/10] KVM: Introduce RamDiscardListener for attribute changes during memory conversions
  2025-05-20 10:28 ` [PATCH v5 09/10] KVM: Introduce RamDiscardListener for attribute changes during memory conversions Chenyi Qiang
  2025-05-26  9:22   ` David Hildenbrand
@ 2025-05-27  8:01   ` Alexey Kardashevskiy
  1 sibling, 0 replies; 51+ messages in thread
From: Alexey Kardashevskiy @ 2025-05-27  8:01 UTC (permalink / raw)
  To: Chenyi Qiang, David Hildenbrand, Peter Xu, Gupta Pankaj,
	Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth
  Cc: qemu-devel, kvm, Williams Dan J, Zhao Liu, Baolu Lu, Gao Chao,
	Xu Yilun, Li Xiaoyao



On 20/5/25 20:28, Chenyi Qiang wrote:
> With the introduction of the RamBlockAttribute object to manage
> RAMBlocks with guest_memfd, it is more elegant to move KVM set attribute
> into a RamDiscardListener.
> 
> The KVM attribute change RamDiscardListener is registered/unregistered
> for each memory region section during kvm_region_add/del(). The listener
> handler performs attribute change upon receiving notifications from
> ram_block_attribute_state_change() calls. After this change, the
> operations in kvm_convert_memory() can be removed.
> 
> Note that errors can be returned in
> ram_block_attribute_notify_to_discard() by KVM attribute changes,
> although it is currently unlikely to happen. With in-place conversion
> guest_memfd in the future, it would be more likely to encounter errors
> and require error handling. For now, simply return the result, and
> kvm_convert_memory() will cause QEMU to quit if any issue arises.
> 
> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
> ---
> Changes in v5:
>      - Revert to use RamDiscardListener
> 
> Changes in v4:
>      - Newly added.
> ---
>   accel/kvm/kvm-all.c                         | 72 ++++++++++++++++++---
>   include/system/confidential-guest-support.h |  9 +++
>   system/ram-block-attribute.c                | 16 +++--
>   target/i386/kvm/tdx.c                       |  1 +
>   target/i386/sev.c                           |  1 +

imho this diffstat disagrees with the "more elegant" :)
+1 for ditching it from this patchset. Thanks,


>   5 files changed, 85 insertions(+), 14 deletions(-)
> 
> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
> index 2d7ecaeb6a..ca4ef8062b 100644
> --- a/accel/kvm/kvm-all.c
> +++ b/accel/kvm/kvm-all.c
> @@ -49,6 +49,7 @@
>   #include "kvm-cpus.h"
>   #include "system/dirtylimit.h"
>   #include "qemu/range.h"
> +#include "system/confidential-guest-support.h"
>   
>   #include "hw/boards.h"
>   #include "system/stats.h"
> @@ -1689,28 +1690,90 @@ static int kvm_dirty_ring_init(KVMState *s)
>       return 0;
>   }
>   
> +static int kvm_private_shared_notify(RamDiscardListener *rdl,
> +                                     MemoryRegionSection *section,
> +                                     bool to_private)
> +{
> +    hwaddr start = section->offset_within_address_space;
> +    hwaddr size = section->size;
> +
> +    if (to_private) {
> +        return kvm_set_memory_attributes_private(start, size);
> +    } else {
> +        return kvm_set_memory_attributes_shared(start, size);
> +    }
> +}
> +
> +static int kvm_ram_discard_notify_to_shared(RamDiscardListener *rdl,
> +                                            MemoryRegionSection *section)
> +{
> +    return kvm_private_shared_notify(rdl, section, false);
> +}
> +
> +static int kvm_ram_discard_notify_to_private(RamDiscardListener *rdl,
> +                                             MemoryRegionSection *section)
> +{
> +    return kvm_private_shared_notify(rdl, section, true);
> +}
> +
>   static void kvm_region_add(MemoryListener *listener,
>                              MemoryRegionSection *section)
>   {
>       KVMMemoryListener *kml = container_of(listener, KVMMemoryListener, listener);
> +    ConfidentialGuestSupport *cgs = MACHINE(qdev_get_machine())->cgs;
> +    RamDiscardManager *rdm = memory_region_get_ram_discard_manager(section->mr);
>       KVMMemoryUpdate *update;
> +    CGSRamDiscardListener *crdl;
> +    RamDiscardListener *rdl;
> +
>   
>       update = g_new0(KVMMemoryUpdate, 1);
>       update->section = *section;
>   
>       QSIMPLEQ_INSERT_TAIL(&kml->transaction_add, update, next);
> +
> +    if (!memory_region_has_guest_memfd(section->mr) || !rdm) {
> +        return;
> +    }
> +
> +    crdl = g_new0(CGSRamDiscardListener, 1);
> +    crdl->mr = section->mr;
> +    crdl->offset_within_address_space = section->offset_within_address_space;
> +    rdl = &crdl->listener;
> +    QLIST_INSERT_HEAD(&cgs->cgs_rdl_list, crdl, next);
> +    ram_discard_listener_init(rdl, kvm_ram_discard_notify_to_shared,
> +                              kvm_ram_discard_notify_to_private, true);
> +    ram_discard_manager_register_listener(rdm, rdl, section);
>   }
>   
>   static void kvm_region_del(MemoryListener *listener,
>                              MemoryRegionSection *section)
>   {
>       KVMMemoryListener *kml = container_of(listener, KVMMemoryListener, listener);
> +    ConfidentialGuestSupport *cgs = MACHINE(qdev_get_machine())->cgs;
> +    RamDiscardManager *rdm = memory_region_get_ram_discard_manager(section->mr);
>       KVMMemoryUpdate *update;
> +    CGSRamDiscardListener *crdl;
> +    RamDiscardListener *rdl;
>   
>       update = g_new0(KVMMemoryUpdate, 1);
>       update->section = *section;
>   
>       QSIMPLEQ_INSERT_TAIL(&kml->transaction_del, update, next);
> +    if (!memory_region_has_guest_memfd(section->mr) || !rdm) {
> +        return;
> +    }
> +
> +    QLIST_FOREACH(crdl, &cgs->cgs_rdl_list, next) {
> +        if (crdl->mr == section->mr &&
> +            crdl->offset_within_address_space == section->offset_within_address_space) {
> +            rdl = &crdl->listener;
> +            ram_discard_manager_unregister_listener(rdm, rdl);
> +            QLIST_REMOVE(crdl, next);
> +            g_free(crdl);
> +            break;
> +        }
> +    }
>   }
>   
>   static void kvm_region_commit(MemoryListener *listener)
> @@ -3077,15 +3140,6 @@ int kvm_convert_memory(hwaddr start, hwaddr size, bool to_private)
>           goto out_unref;
>       }
>   
> -    if (to_private) {
> -        ret = kvm_set_memory_attributes_private(start, size);
> -    } else {
> -        ret = kvm_set_memory_attributes_shared(start, size);
> -    }
> -    if (ret) {
> -        goto out_unref;
> -    }
> -
>       addr = memory_region_get_ram_ptr(mr) + section.offset_within_region;
>       rb = qemu_ram_block_from_host(addr, false, &offset);
>   
> diff --git a/include/system/confidential-guest-support.h b/include/system/confidential-guest-support.h
> index ea46b50c56..974abdbf6b 100644
> --- a/include/system/confidential-guest-support.h
> +++ b/include/system/confidential-guest-support.h
> @@ -19,12 +19,19 @@
>   #define QEMU_CONFIDENTIAL_GUEST_SUPPORT_H
>   
>   #include "qom/object.h"
> +#include "system/memory.h"
>   
>   #define TYPE_CONFIDENTIAL_GUEST_SUPPORT "confidential-guest-support"
>   OBJECT_DECLARE_TYPE(ConfidentialGuestSupport,
>                       ConfidentialGuestSupportClass,
>                       CONFIDENTIAL_GUEST_SUPPORT)
>   
> +typedef struct CGSRamDiscardListener {
> +    MemoryRegion *mr;
> +    hwaddr offset_within_address_space;
> +    RamDiscardListener listener;
> +    QLIST_ENTRY(CGSRamDiscardListener) next;
> +} CGSRamDiscardListener;
>   
>   struct ConfidentialGuestSupport {
>       Object parent;
> @@ -34,6 +41,8 @@ struct ConfidentialGuestSupport {
>        */
>       bool require_guest_memfd;
>   
> +    QLIST_HEAD(, CGSRamDiscardListener) cgs_rdl_list;
> +
>       /*
>        * ready: flag set by CGS initialization code once it's ready to
>        *        start executing instructions in a potentially-secure
> diff --git a/system/ram-block-attribute.c b/system/ram-block-attribute.c
> index 896c3d7543..387501b569 100644
> --- a/system/ram-block-attribute.c
> +++ b/system/ram-block-attribute.c
> @@ -274,11 +274,12 @@ static bool ram_block_attribute_is_valid_range(RamBlockAttribute *attr,
>       return true;
>   }
>   
> -static void ram_block_attribute_notify_to_discard(RamBlockAttribute *attr,
> -                                                  uint64_t offset,
> -                                                  uint64_t size)
> +static int ram_block_attribute_notify_to_discard(RamBlockAttribute *attr,
> +                                                 uint64_t offset,
> +                                                 uint64_t size)
>   {
>       RamDiscardListener *rdl;
> +    int ret = 0;
>   
>       QLIST_FOREACH(rdl, &attr->rdl_list, next) {
>           MemoryRegionSection tmp = *rdl->section;
> @@ -286,8 +287,13 @@ static void ram_block_attribute_notify_to_discard(RamBlockAttribute *attr,
>           if (!memory_region_section_intersect_range(&tmp, offset, size)) {
>               continue;
>           }
> -        rdl->notify_discard(rdl, &tmp);
> +        ret = rdl->notify_discard(rdl, &tmp);
> +        if (ret) {
> +            break;
> +        }
>       }
> +
> +    return ret;
>   }
>   
>   static int
> @@ -377,7 +383,7 @@ int ram_block_attribute_state_change(RamBlockAttribute *attr, uint64_t offset,
>   
>       if (to_private) {
>           bitmap_clear(attr->bitmap, first_bit, nbits);
> -        ram_block_attribute_notify_to_discard(attr, offset, size);
> +        ret = ram_block_attribute_notify_to_discard(attr, offset, size);
>       } else {
>           bitmap_set(attr->bitmap, first_bit, nbits);
>           ret = ram_block_attribute_notify_to_populated(attr, offset, size);
> diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c
> index 7ef49690bd..17b360059c 100644
> --- a/target/i386/kvm/tdx.c
> +++ b/target/i386/kvm/tdx.c
> @@ -1492,6 +1492,7 @@ static void tdx_guest_init(Object *obj)
>       qemu_mutex_init(&tdx->lock);
>   
>       cgs->require_guest_memfd = true;
> +    QLIST_INIT(&cgs->cgs_rdl_list);
>       tdx->attributes = TDX_TD_ATTRIBUTES_SEPT_VE_DISABLE;
>   
>       object_property_add_uint64_ptr(obj, "attributes", &tdx->attributes,
> diff --git a/target/i386/sev.c b/target/i386/sev.c
> index adf787797e..f1b9c35fc3 100644
> --- a/target/i386/sev.c
> +++ b/target/i386/sev.c
> @@ -2430,6 +2430,7 @@ sev_snp_guest_instance_init(Object *obj)
>       SevSnpGuestState *sev_snp_guest = SEV_SNP_GUEST(obj);
>   
>       cgs->require_guest_memfd = true;
> +    QLIST_INIT(&cgs->cgs_rdl_list);
>   
>       /* default init/start/finish params for kvm */
>       sev_snp_guest->kvm_start_conf.policy = DEFAULT_SEV_SNP_POLICY;

-- 
Alexey



* Re: [PATCH v5 07/10] RAMBlock: Make guest_memfd require coordinate discard
  2025-05-27  7:42       ` Alexey Kardashevskiy
@ 2025-05-27  8:12         ` Chenyi Qiang
  0 siblings, 0 replies; 51+ messages in thread
From: Chenyi Qiang @ 2025-05-27  8:12 UTC (permalink / raw)
  To: Alexey Kardashevskiy, David Hildenbrand, Peter Xu, Gupta Pankaj,
	Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth
  Cc: qemu-devel, kvm, Williams Dan J, Zhao Liu, Baolu Lu, Gao Chao,
	Xu Yilun, Li Xiaoyao



On 5/27/2025 3:42 PM, Alexey Kardashevskiy wrote:
> 
> 
> On 27/5/25 15:47, Chenyi Qiang wrote:
>>
>>
>> On 5/26/2025 5:08 PM, David Hildenbrand wrote:
>>> On 20.05.25 12:28, Chenyi Qiang wrote:
>>>> As guest_memfd is now managed by RamBlockAttribute with
>>>> RamDiscardManager, only block uncoordinated discard.
>>>>
>>>> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
>>>> ---
>>>> Changes in v5:
>>>>       - Revert to use RamDiscardManager.
>>>>
>>>> Changes in v4:
>>>>       - Modify commit message (RamDiscardManager-
>>>> >PrivateSharedManager).
>>>>
>>>> Changes in v3:
>>>>       - No change.
>>>>
>>>> Changes in v2:
>>>>       - Change the ram_block_discard_require(false) to
>>>>         ram_block_coordinated_discard_require(false).
>>>> ---
>>>>    system/physmem.c | 6 +++---
>>>>    1 file changed, 3 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/system/physmem.c b/system/physmem.c
>>>> index f05f7ff09a..58b7614660 100644
>>>> --- a/system/physmem.c
>>>> +++ b/system/physmem.c
>>>> @@ -1916,7 +1916,7 @@ static void ram_block_add(RAMBlock *new_block,
>>>> Error **errp)
>>>>            }
>>>>            assert(new_block->guest_memfd < 0);
>>>>    -        ret = ram_block_discard_require(true);
>>>> +        ret = ram_block_coordinated_discard_require(true);
>>>>            if (ret < 0) {
>>>>                error_setg_errno(errp, -ret,
>>>>                                 "cannot set up private guest memory:
>>>> discard currently blocked");
>>>> @@ -1939,7 +1939,7 @@ static void ram_block_add(RAMBlock *new_block,
>>>> Error **errp)
>>>>                 * ever develops a need to check for errors.
>>>>                 */
>>>>                close(new_block->guest_memfd);
>>>> -            ram_block_discard_require(false);
>>>> +            ram_block_coordinated_discard_require(false);
>>>>                qemu_mutex_unlock_ramlist();
>>>>                goto out_free;
>>>>            }
>>>> @@ -2302,7 +2302,7 @@ static void reclaim_ramblock(RAMBlock *block)
>>>>        if (block->guest_memfd >= 0) {
>>>>            ram_block_attribute_destroy(block->ram_shared);
>>>>            close(block->guest_memfd);
>>>> -        ram_block_discard_require(false);
>>>> +        ram_block_coordinated_discard_require(false);
>>>>        }
>>>>          g_free(block);
>>>
>>>
>>> I think this patch should be squashed into the previous one, then the
>>> story in that single patch is consistent.
>>
>> I think this patch is a gate to allow device assignment with guest_memfd
>> and want to keep it separate. 
> 
> It is not good for bisecability - whatever problem 06/10 may have - git
> bisect will point to this one.

Bisectability doesn't seem like a strong reason: whatever problem patches
04, 05 or 06 may have, git bisect will point to this one anyway, since
they don't take effect until coordinated discard is allowed.

> And it is confusing when within the same patchset lines are added and
> then removed.
> And 06/10 (especially after removing LiveMigration checks) and 07/10 are
> too small and too related to separate. Thanks,

Fair enough. I'll squash it. Thanks for the elaboration.

> 
>> Can we instead add some commit message
>> to the previous one? Like:
>>
>> "Using guest_memfd with vfio is still blocked via
>> ram_block_discard_disable()/ram_block_discard_require()."
>>
>>>
>>
> 


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v5 05/10] ram-block-attribute: Introduce a helper to notify shared/private state changes
  2025-05-27  7:35   ` Alexey Kardashevskiy
@ 2025-05-27  9:06     ` Chenyi Qiang
  2025-05-27  9:19       ` Alexey Kardashevskiy
  0 siblings, 1 reply; 51+ messages in thread
From: Chenyi Qiang @ 2025-05-27  9:06 UTC (permalink / raw)
  To: Alexey Kardashevskiy, David Hildenbrand, Peter Xu, Gupta Pankaj,
	Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth
  Cc: qemu-devel, kvm, Williams Dan J, Zhao Liu, Baolu Lu, Gao Chao,
	Xu Yilun, Li Xiaoyao



On 5/27/2025 3:35 PM, Alexey Kardashevskiy wrote:
> 
> 
> On 20/5/25 20:28, Chenyi Qiang wrote:
>> A new state_change() helper is introduced for RamBlockAttribute
>> to efficiently notify all registered RamDiscardListeners, including
>> VFIO listeners, about memory conversion events in guest_memfd. The VFIO
>> listener can dynamically DMA map/unmap shared pages based on conversion
>> types:
>> - For conversions from shared to private, the VFIO system ensures the
>>    discarding of shared mapping from the IOMMU.
>> - For conversions from private to shared, it triggers the population of
>>    the shared mapping into the IOMMU.
>>
>> Currently, memory conversion failures cause QEMU to quit instead of
>> resuming the guest or retrying the operation. It would be future work
>> to add more error handling or rollback mechanisms once conversion
>> failures are allowed. For example, in-place conversion of guest_memfd
>> could retry the unmap operation during the conversion from shared to
>> private. However, for now, keep the complex error handling out of the
>> picture as it is not required:
>>
>> - If a conversion request is made for a page already in the desired
>>    state, the helper simply returns success.
>> - For requests involving a range partially in the desired state, there
>>    is no such scenario in practice at present. Simply return an error.
>> - If a conversion request is declined by other systems, such as a
>>    failure from VFIO during notify_to_populated(), the failure is
>>    returned directly. As for notify_to_discard(), VFIO cannot fail
>>    unmap/unpin, so no error is returned.
>>
>> Note that the bitmap status is updated before callbacks, allowing
>> listeners to handle memory based on the latest status.
>>
>> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
>> ---
>> Change in v5:
>>      - Move the state_change() back to a helper instead of a callback of
>>        the class since there's no child for the RamBlockAttributeClass.
>>      - Remove the error handling and move them to an individual patch for
>>        simple management.
>>
>> Changes in v4:
>>      - Add the state_change() callback in PrivateSharedManagerClass
>>        instead of the RamBlockAttribute.
>>
>> Changes in v3:
>>      - Move the bitmap update before notifier callbacks.
>>      - Call the notifier callbacks directly in notify_discard/populate()
>>        with the expectation that the request memory range is in the
>>        desired attribute.
>>      - For the case that only partial range in the desire status, handle
>>        the range with block_size granularity for ease of rollback
>>        (https://lore.kernel.org/qemu-devel/812768d7-a02d-4b29-95f3-
>> fb7a125cf54e@redhat.com/)
>>
>> Changes in v2:
>>      - Do the alignment changes due to the rename to
>> MemoryAttributeManager
>>      - Move the state_change() helper definition in this patch.
>> ---
>>   include/system/ramblock.h    |   2 +
>>   system/ram-block-attribute.c | 134 +++++++++++++++++++++++++++++++++++
>>   2 files changed, 136 insertions(+)
>>
>> diff --git a/include/system/ramblock.h b/include/system/ramblock.h
>> index 09255e8495..270dffb2f3 100644
>> --- a/include/system/ramblock.h
>> +++ b/include/system/ramblock.h
>> @@ -108,6 +108,8 @@ struct RamBlockAttribute {
>>       QLIST_HEAD(, RamDiscardListener) rdl_list;
>>   };
>>   +int ram_block_attribute_state_change(RamBlockAttribute *attr,
>> uint64_t offset,
>> +                                     uint64_t size, bool to_private);
> 
> Not sure about the "to_private" name. I'd think private/shared is
> something KVM operates with and here, in RamBlock, it is discarded/
> populated.

Makes sense. To keep it consistent, I will rename it to to_discard.

> 
>>   RamBlockAttribute *ram_block_attribute_create(MemoryRegion *mr);
>>   void ram_block_attribute_destroy(RamBlockAttribute *attr);
>>   diff --git a/system/ram-block-attribute.c b/system/ram-block-
>> attribute.c
>> index 8d4a24738c..f12dd4b881 100644
>> --- a/system/ram-block-attribute.c
>> +++ b/system/ram-block-attribute.c
>> @@ -253,6 +253,140 @@ ram_block_attribute_rdm_replay_discard(const
>> RamDiscardManager *rdm,
>>                                              
>> ram_block_attribute_rdm_replay_cb);
>>   }
>>   +static bool ram_block_attribute_is_valid_range(RamBlockAttribute
>> *attr,
>> +                                               uint64_t offset,
>> uint64_t size)
>> +{
>> +    MemoryRegion *mr = attr->mr;
>> +
>> +    g_assert(mr);
>> +
>> +    uint64_t region_size = memory_region_size(mr);
>> +    int block_size = ram_block_attribute_get_block_size(attr);
> 
> It is size_t, not int.

Fixed this and all below. Thanks!

> 
>> +
>> +    if (!QEMU_IS_ALIGNED(offset, block_size)) {
> 
> Does not the @size have to be aligned too?

Yes. Actually, "start" and "size" already go through the alignment check
in kvm_convert_memory(), so I doubt we still need it here. Anyway, in
case of other users in the future, I'll add it.

> 
>> +        return false;
>> +    }
>> +    if (offset + size < offset || !size) {
> 
> This could be just (offset + size <= offset).
> (these overflow checks always blow up my little brain)

Modified.

> 
>> +        return false;
>> +    }
>> +    if (offset >= region_size || offset + size > region_size) {
> 
> Just (offset + size > region_size) should do.

Ditto.

> 
>> +        return false;
>> +    }
>> +    return true;
>> +}
>> +
>> +static void ram_block_attribute_notify_to_discard(RamBlockAttribute
>> *attr,
>> +                                                  uint64_t offset,
>> +                                                  uint64_t size)
>> +{
>> +    RamDiscardListener *rdl;
>> +
>> +    QLIST_FOREACH(rdl, &attr->rdl_list, next) {
>> +        MemoryRegionSection tmp = *rdl->section;
>> +
>> +        if (!memory_region_section_intersect_range(&tmp, offset,
>> size)) {
>> +            continue;
>> +        }
>> +        rdl->notify_discard(rdl, &tmp);
>> +    }
>> +}
>> +
>> +static int
>> +ram_block_attribute_notify_to_populated(RamBlockAttribute *attr,
>> +                                        uint64_t offset, uint64_t size)
>> +{
>> +    RamDiscardListener *rdl;
>> +    int ret = 0;
>> +
>> +    QLIST_FOREACH(rdl, &attr->rdl_list, next) {
>> +        MemoryRegionSection tmp = *rdl->section;
>> +
>> +        if (!memory_region_section_intersect_range(&tmp, offset,
>> size)) {
>> +            continue;
>> +        }
>> +        ret = rdl->notify_populate(rdl, &tmp);
>> +        if (ret) {
>> +            break;
>> +        }
>> +    }
>> +
>> +    return ret;
>> +}
>> +
>> +static bool ram_block_attribute_is_range_populated(RamBlockAttribute
>> *attr,
>> +                                                   uint64_t offset,
>> +                                                   uint64_t size)
>> +{
>> +    const int block_size = ram_block_attribute_get_block_size(attr);
> 
> size_t.
> 
>> +    const unsigned long first_bit = offset / block_size;
>> +    const unsigned long last_bit = first_bit + (size / block_size) - 1;
>> +    unsigned long found_bit;
>> +
>> +    /* We fake a shorter bitmap to avoid searching too far. */
> 
> What is "fake" about it? We truthfully check here that every bit in
> [first_bit, last_bit] is set.

Aha, you ask this question again :)
(https://lore.kernel.org/qemu-devel/7131b4a3-a836-4efd-bcfc-982a0112ef05@intel.com/)

If it is really confusing, let me remove this comment in the next version.

> 
>> +    found_bit = find_next_zero_bit(attr->bitmap, last_bit + 1,
>> +                                   first_bit);
>> +    return found_bit > last_bit;
>> +}
>> +
>> +static bool
>> +ram_block_attribute_is_range_discard(RamBlockAttribute *attr,
>> +                                     uint64_t offset, uint64_t size)
>> +{
>> +    const int block_size = ram_block_attribute_get_block_size(attr);
> 
> size_t.
> 
>> +    const unsigned long first_bit = offset / block_size;
>> +    const unsigned long last_bit = first_bit + (size / block_size) - 1;
>> +    unsigned long found_bit;
>> +
>> +    /* We fake a shorter bitmap to avoid searching too far. */
>> +    found_bit = find_next_bit(attr->bitmap, last_bit + 1, first_bit);
>> +    return found_bit > last_bit;
>> +}
>> +
>> +int ram_block_attribute_state_change(RamBlockAttribute *attr,
>> uint64_t offset,
>> +                                     uint64_t size, bool to_private)
>> +{
>> +    const int block_size = ram_block_attribute_get_block_size(attr);
> 
> size_t.
> 
>> +    const unsigned long first_bit = offset / block_size;
>> +    const unsigned long nbits = size / block_size;
>> +    int ret = 0;
>> +
>> +    if (!ram_block_attribute_is_valid_range(attr, offset, size)) {
>> +        error_report("%s, invalid range: offset 0x%lx, size 0x%lx",
>> +                     __func__, offset, size);
>> +        return -1;
> 
> May be -EINVAL?

Modified.

> 
>> +    }
>> +
>> +    /* Already discard/populated */
>> +    if ((ram_block_attribute_is_range_discard(attr, offset, size) &&
>> +         to_private) ||
>> +        (ram_block_attribute_is_range_populated(attr, offset, size) &&
>> +         !to_private)) {
> 
> A tracepoint would be useful here imho.

[...]

> 
>> +        return 0;
>> +    }
>> +
>> +    /* Unexpected mixture */
>> +    if ((!ram_block_attribute_is_range_populated(attr, offset, size) &&
>> +         to_private) ||
>> +        (!ram_block_attribute_is_range_discard(attr, offset, size) &&
>> +         !to_private)) {
>> +        error_report("%s, the range is not all in the desired state: "
>> +                     "(offset 0x%lx, size 0x%lx), %s",
>> +                     __func__, offset, size,
>> +                     to_private ? "private" : "shared");
>> +        return -1;
> 
> -EBUSY?

Maybe also -EINVAL, since it is caused by an invalid mixed range being
provided? Anyway, according to the discussion in patch #10, I'll add
support for this mixture scenario, so there will be no need to return an
error.

> 
>> +    }
>> +
>> +    if (to_private) {
>> +        bitmap_clear(attr->bitmap, first_bit, nbits);
>> +        ram_block_attribute_notify_to_discard(attr, offset, size);
>> +    } else {
>> +        bitmap_set(attr->bitmap, first_bit, nbits);
>> +        ret = ram_block_attribute_notify_to_populated(attr, offset,
>> size);
>> +    }
> 
> and a successful tracepoint here may be?

Good suggestion! I'll add tracepoints in the next version.
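As a rough idea of what such tracepoints could look like (the names and format strings below are made up for illustration, not part of the series):

```
# system/trace-events (hypothetical entries)
ram_block_attribute_state_change_noop(uint64_t offset, uint64_t size) "already in state: offset 0x%"PRIx64" size 0x%"PRIx64
ram_block_attribute_state_change(uint64_t offset, uint64_t size, const char *state) "offset 0x%"PRIx64" size 0x%"PRIx64" %s"
```

with, e.g., trace_ram_block_attribute_state_change(offset, size, to_private ? "discard" : "populate") called after the bitmap update, and the _noop variant in the early-return path for ranges already in the desired state.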

> 
>> +
>> +    return ret;
>> +}
>> +
>>   RamBlockAttribute *ram_block_attribute_create(MemoryRegion *mr)
>>   {
>>       uint64_t bitmap_size;
> 


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v5 10/10] ram-block-attribute: Add more error handling during state changes
  2025-05-20 10:28 ` [PATCH v5 10/10] ram-block-attribute: Add more error handling during state changes Chenyi Qiang
  2025-05-26  9:17   ` David Hildenbrand
@ 2025-05-27  9:11   ` Alexey Kardashevskiy
  2025-05-27 10:18     ` Chenyi Qiang
  1 sibling, 1 reply; 51+ messages in thread
From: Alexey Kardashevskiy @ 2025-05-27  9:11 UTC (permalink / raw)
  To: Chenyi Qiang, David Hildenbrand, Peter Xu, Gupta Pankaj,
	Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth
  Cc: qemu-devel, kvm, Williams Dan J, Zhao Liu, Baolu Lu, Gao Chao,
	Xu Yilun, Li Xiaoyao



On 20/5/25 20:28, Chenyi Qiang wrote:
> The current error handling is simple with the following assumption:
> - QEMU will quit instead of resuming the guest if kvm_convert_memory()
>    fails, thus no need to do rollback.
> - The convert range is required to be in the desired state. It is not
>    allowed to handle the mixture case.
> - The conversion from shared to private is a non-failure operation.
> 
> This is sufficient for now as complex error handling is not required.
> For future extension, add some potential error handling.
> - For private to shared conversion, do the rollback operation if
>    ram_block_attribute_notify_to_populated() fails.
> - For shared to private conversion, still assert it as a non-failure
>    operation for now. It could be an easy fail path with in-place
>    conversion, which will likely have to retry the conversion until it
>    works in the future.
> - For mixture case, process individual blocks for ease of rollback.
> 
> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
> ---
>   system/ram-block-attribute.c | 116 +++++++++++++++++++++++++++--------
>   1 file changed, 90 insertions(+), 26 deletions(-)
> 
> diff --git a/system/ram-block-attribute.c b/system/ram-block-attribute.c
> index 387501b569..0af3396aa4 100644
> --- a/system/ram-block-attribute.c
> +++ b/system/ram-block-attribute.c
> @@ -289,7 +289,12 @@ static int ram_block_attribute_notify_to_discard(RamBlockAttribute *attr,
>           }
>           ret = rdl->notify_discard(rdl, &tmp);
>           if (ret) {
> -            break;
> +            /*
> +             * The current to_private listeners (VFIO dma_unmap and
> +             * KVM set_attribute_private) are non-failing operations.
> +             * TODO: add rollback operations if it is allowed to fail.
> +             */
> +            g_assert(ret);
>           }
>       }
>   
> @@ -300,7 +305,7 @@ static int
>   ram_block_attribute_notify_to_populated(RamBlockAttribute *attr,
>                                           uint64_t offset, uint64_t size)
>   {
> -    RamDiscardListener *rdl;
> +    RamDiscardListener *rdl, *rdl2;
>       int ret = 0;
>   
>       QLIST_FOREACH(rdl, &attr->rdl_list, next) {
> @@ -315,6 +320,20 @@ ram_block_attribute_notify_to_populated(RamBlockAttribute *attr,
>           }
>       }
>   
> +    if (ret) {
> +        /* Notify all already-notified listeners. */
> +        QLIST_FOREACH(rdl2, &attr->rdl_list, next) {
> +            MemoryRegionSection tmp = *rdl2->section;
> +
> +            if (rdl == rdl2) {
> +                break;
> +            }
> +            if (!memory_region_section_intersect_range(&tmp, offset, size)) {
> +                continue;
> +            }
> +            rdl2->notify_discard(rdl2, &tmp);
> +        }
> +    }
>       return ret;
>   }
>   
> @@ -353,6 +372,9 @@ int ram_block_attribute_state_change(RamBlockAttribute *attr, uint64_t offset,
>       const int block_size = ram_block_attribute_get_block_size(attr);
>       const unsigned long first_bit = offset / block_size;
>       const unsigned long nbits = size / block_size;
> +    const uint64_t end = offset + size;
> +    unsigned long bit;
> +    uint64_t cur;
>       int ret = 0;
>   
>       if (!ram_block_attribute_is_valid_range(attr, offset, size)) {
> @@ -361,32 +383,74 @@ int ram_block_attribute_state_change(RamBlockAttribute *attr, uint64_t offset,
>           return -1;
>       }
>   
> -    /* Already discard/populated */
> -    if ((ram_block_attribute_is_range_discard(attr, offset, size) &&
> -         to_private) ||
> -        (ram_block_attribute_is_range_populated(attr, offset, size) &&
> -         !to_private)) {
> -        return 0;
> -    }
> -
> -    /* Unexpected mixture */
> -    if ((!ram_block_attribute_is_range_populated(attr, offset, size) &&
> -         to_private) ||
> -        (!ram_block_attribute_is_range_discard(attr, offset, size) &&
> -         !to_private)) {
> -        error_report("%s, the range is not all in the desired state: "
> -                     "(offset 0x%lx, size 0x%lx), %s",
> -                     __func__, offset, size,
> -                     to_private ? "private" : "shared");
> -        return -1;
> -    }

David is right, this needs to be squashed where you added the above hunk.

> -
>       if (to_private) {
> -        bitmap_clear(attr->bitmap, first_bit, nbits);
> -        ret = ram_block_attribute_notify_to_discard(attr, offset, size);
> +        if (ram_block_attribute_is_range_discard(attr, offset, size)) {
> +            /* Already private */
> +        } else if (!ram_block_attribute_is_range_populated(attr, offset,
> +                                                           size)) {
> +            /* Unexpected mixture: process individual blocks */


Is an "expected mix" situation possible?
Maybe just always run the code for the "unexpected mix" case, or refuse mixing and let the VM deal with it?


> +            for (cur = offset; cur < end; cur += block_size) {
> +                bit = cur / block_size;
> +                if (!test_bit(bit, attr->bitmap)) {
> +                    continue;
> +                }
> +                clear_bit(bit, attr->bitmap);
> +                ram_block_attribute_notify_to_discard(attr, cur, block_size);
> +            }
> +        } else {
> +            /* Completely shared */
> +            bitmap_clear(attr->bitmap, first_bit, nbits);
> +            ram_block_attribute_notify_to_discard(attr, offset, size);
> +        }
>       } else {
> -        bitmap_set(attr->bitmap, first_bit, nbits);
> -        ret = ram_block_attribute_notify_to_populated(attr, offset, size);
> +        if (ram_block_attribute_is_range_populated(attr, offset, size)) {
> +            /* Already shared */
> +        } else if (!ram_block_attribute_is_range_discard(attr, offset, size)) {
> +            /* Unexpected mixture: process individual blocks */
> +            unsigned long *modified_bitmap = bitmap_new(nbits);
> +
> +            for (cur = offset; cur < end; cur += block_size) {
> +                bit = cur / block_size;
> +                if (test_bit(bit, attr->bitmap)) {
> +                    continue;
> +                }
> +                set_bit(bit, attr->bitmap);
> +                ret = ram_block_attribute_notify_to_populated(attr, cur,
> +                                                           block_size);
> +                if (!ret) {
> +                    set_bit(bit - first_bit, modified_bitmap);
> +                    continue;
> +                }
> +                clear_bit(bit, attr->bitmap);
> +                break;
> +            }
> +
> +            if (ret) {
> +                /*
> +                 * Very unexpected: something went wrong. Revert to the old
> +                 * state, marking only the blocks as private that we converted
> +                 * to shared.


If something went wrong... well, on my AMD machine this usually means the fw is really unhappy, recovery is hardly possible, and the machine needs a reboot. Probably stopping the VM would make more sense for now (or stop the device so the user could save work from the VM, dunno).


> +                 */
> +                for (cur = offset; cur < end; cur += block_size) {
> +                    bit = cur / block_size;
> +                    if (!test_bit(bit - first_bit, modified_bitmap)) {
> +                        continue;
> +                    }
> +                    assert(test_bit(bit, attr->bitmap));
> +                    clear_bit(bit, attr->bitmap);
> +                    ram_block_attribute_notify_to_discard(attr, cur,
> +                                                          block_size);
> +                }
> +            }
> +            g_free(modified_bitmap);
> +        } else {
> +            /* Complete private */

I'd swap this hunk with the previous one. Thanks,

> +            bitmap_set(attr->bitmap, first_bit, nbits);
> +            ret = ram_block_attribute_notify_to_populated(attr, offset, size);
> +            if (ret) {
> +                bitmap_clear(attr->bitmap, first_bit, nbits);
> +            }
> +        }
>       }
>   
>       return ret;

-- 
Alexey


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v5 05/10] ram-block-attribute: Introduce a helper to notify shared/private state changes
  2025-05-27  9:06     ` Chenyi Qiang
@ 2025-05-27  9:19       ` Alexey Kardashevskiy
  0 siblings, 0 replies; 51+ messages in thread
From: Alexey Kardashevskiy @ 2025-05-27  9:19 UTC (permalink / raw)
  To: Chenyi Qiang, David Hildenbrand, Peter Xu, Gupta Pankaj,
	Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth
  Cc: qemu-devel, kvm, Williams Dan J, Zhao Liu, Baolu Lu, Gao Chao,
	Xu Yilun, Li Xiaoyao



On 27/5/25 19:06, Chenyi Qiang wrote:
> 
> 
> On 5/27/2025 3:35 PM, Alexey Kardashevskiy wrote:
>>
>>
>> On 20/5/25 20:28, Chenyi Qiang wrote:
>>> A new state_change() helper is introduced for RamBlockAttribute
>>> to efficiently notify all registered RamDiscardListeners, including
>>> VFIO listeners, about memory conversion events in guest_memfd. The VFIO
>>> listener can dynamically DMA map/unmap shared pages based on conversion
>>> types:
>>> - For conversions from shared to private, the VFIO system ensures the
>>>     discarding of shared mapping from the IOMMU.
>>> - For conversions from private to shared, it triggers the population of
>>>     the shared mapping into the IOMMU.
>>>
>>> Currently, memory conversion failures cause QEMU to quit instead of
>>> resuming the guest or retrying the operation. It would be future work
>>> to add more error handling or rollback mechanisms once conversion
>>> failures are allowed. For example, in-place conversion of guest_memfd
>>> could retry the unmap operation during the conversion from shared to
>>> private. However, for now, keep the complex error handling out of the
>>> picture as it is not required:
>>>
>>> - If a conversion request is made for a page already in the desired
>>>     state, the helper simply returns success.
>>> - For requests involving a range partially in the desired state, there
>>>     is no such scenario in practice at present. Simply return an error.
>>> - If a conversion request is declined by other systems, such as a
>>>     failure from VFIO during notify_to_populated(), the failure is
>>>     returned directly. As for notify_to_discard(), VFIO cannot fail
>>>     unmap/unpin, so no error is returned.
>>>
>>> Note that the bitmap status is updated before callbacks, allowing
>>> listeners to handle memory based on the latest status.
>>>
>>> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
>>> ---
>>> Change in v5:
>>>       - Move the state_change() back to a helper instead of a callback of
>>>         the class since there's no child for the RamBlockAttributeClass.
>>>       - Remove the error handling and move them to an individual patch for
>>>         simple management.
>>>
>>> Changes in v4:
>>>       - Add the state_change() callback in PrivateSharedManagerClass
>>>         instead of the RamBlockAttribute.
>>>
>>> Changes in v3:
>>>       - Move the bitmap update before notifier callbacks.
>>>       - Call the notifier callbacks directly in notify_discard/populate()
>>>         with the expectation that the request memory range is in the
>>>         desired attribute.
>>>       - For the case that only partial range in the desire status, handle
>>>         the range with block_size granularity for ease of rollback
>>>         (https://lore.kernel.org/qemu-devel/812768d7-a02d-4b29-95f3-
>>> fb7a125cf54e@redhat.com/)
>>>
>>> Changes in v2:
>>>       - Do the alignment changes due to the rename to
>>> MemoryAttributeManager
>>>       - Move the state_change() helper definition in this patch.
>>> ---
>>>    include/system/ramblock.h    |   2 +
>>>    system/ram-block-attribute.c | 134 +++++++++++++++++++++++++++++++++++
>>>    2 files changed, 136 insertions(+)
>>>
>>> diff --git a/include/system/ramblock.h b/include/system/ramblock.h
>>> index 09255e8495..270dffb2f3 100644
>>> --- a/include/system/ramblock.h
>>> +++ b/include/system/ramblock.h
>>> @@ -108,6 +108,8 @@ struct RamBlockAttribute {
>>>        QLIST_HEAD(, RamDiscardListener) rdl_list;
>>>    };
>>>    +int ram_block_attribute_state_change(RamBlockAttribute *attr,
>>> uint64_t offset,
>>> +                                     uint64_t size, bool to_private);
>>
>> Not sure about the "to_private" name. I'd think private/shared is
>> something KVM operates with and here, in RamBlock, it is discarded/
>> populated.
> 
> Makes sense. To keep it consistent, I will rename it to to_discard.
> 
>>
>>>    RamBlockAttribute *ram_block_attribute_create(MemoryRegion *mr);
>>>    void ram_block_attribute_destroy(RamBlockAttribute *attr);
>>>    diff --git a/system/ram-block-attribute.c b/system/ram-block-
>>> attribute.c
>>> index 8d4a24738c..f12dd4b881 100644
>>> --- a/system/ram-block-attribute.c
>>> +++ b/system/ram-block-attribute.c
>>> @@ -253,6 +253,140 @@ ram_block_attribute_rdm_replay_discard(const
>>> RamDiscardManager *rdm,
>>>                                               
>>> ram_block_attribute_rdm_replay_cb);
>>>    }
>>>    +static bool ram_block_attribute_is_valid_range(RamBlockAttribute
>>> *attr,
>>> +                                               uint64_t offset,
>>> uint64_t size)
>>> +{
>>> +    MemoryRegion *mr = attr->mr;
>>> +
>>> +    g_assert(mr);
>>> +
>>> +    uint64_t region_size = memory_region_size(mr);
>>> +    int block_size = ram_block_attribute_get_block_size(attr);
>>
>> It is size_t, not int.
> 
> Fixed this and all below. Thanks!
> 
>>
>>> +
>>> +    if (!QEMU_IS_ALIGNED(offset, block_size)) {
>>
>> Does not the @size have to be aligned too?
> 
> Yes. Actually, "start" and "size" already go through the alignment check
> in kvm_convert_memory(), so I doubt we still need it here.

Sure. My point is either check them both or neither.

> Anyway, in
> case of other users in the future, I'll add it.

Ok.

>>
>>> +        return false;
>>> +    }
>>> +    if (offset + size < offset || !size) {
>>
>> This could be just (offset + size <= offset).
>> (these overflow checks always blow up my little brain)
> 
> Modified.
> 
>>
>>> +        return false;
>>> +    }
>>> +    if (offset >= region_size || offset + size > region_size) {
>>
>> Just (offset + size > region_size) should do.
> 
> Ditto.
> 
>>
>>> +        return false;
>>> +    }
>>> +    return true;
>>> +}
>>> +
>>> +static void ram_block_attribute_notify_to_discard(RamBlockAttribute
>>> *attr,
>>> +                                                  uint64_t offset,
>>> +                                                  uint64_t size)
>>> +{
>>> +    RamDiscardListener *rdl;
>>> +
>>> +    QLIST_FOREACH(rdl, &attr->rdl_list, next) {
>>> +        MemoryRegionSection tmp = *rdl->section;
>>> +
>>> +        if (!memory_region_section_intersect_range(&tmp, offset,
>>> size)) {
>>> +            continue;
>>> +        }
>>> +        rdl->notify_discard(rdl, &tmp);
>>> +    }
>>> +}
>>> +
>>> +static int
>>> +ram_block_attribute_notify_to_populated(RamBlockAttribute *attr,
>>> +                                        uint64_t offset, uint64_t size)
>>> +{
>>> +    RamDiscardListener *rdl;
>>> +    int ret = 0;
>>> +
>>> +    QLIST_FOREACH(rdl, &attr->rdl_list, next) {
>>> +        MemoryRegionSection tmp = *rdl->section;
>>> +
>>> +        if (!memory_region_section_intersect_range(&tmp, offset,
>>> size)) {
>>> +            continue;
>>> +        }
>>> +        ret = rdl->notify_populate(rdl, &tmp);
>>> +        if (ret) {
>>> +            break;
>>> +        }
>>> +    }
>>> +
>>> +    return ret;
>>> +}
>>> +
>>> +static bool ram_block_attribute_is_range_populated(RamBlockAttribute
>>> *attr,
>>> +                                                   uint64_t offset,
>>> +                                                   uint64_t size)
>>> +{
>>> +    const int block_size = ram_block_attribute_get_block_size(attr);
>>
>> size_t.
>>
>>> +    const unsigned long first_bit = offset / block_size;
>>> +    const unsigned long last_bit = first_bit + (size / block_size) - 1;
>>> +    unsigned long found_bit;
>>> +
>>> +    /* We fake a shorter bitmap to avoid searching too far. */
>>
>> What is "fake" about it? We truthfully check here that every bit in
>> [first_bit, last_bit] is set.
> 
> Aha, you ask this question again :)
> (https://lore.kernel.org/qemu-devel/7131b4a3-a836-4efd-bcfc-982a0112ef05@intel.com/)

ah sorry :)

> If it is really confusing, let me remove this comment in next version.

yes please. Quite obvious if the helper takes the size, then this is what the caller wants to search within.

> 
>>
>>> +    found_bit = find_next_zero_bit(attr->bitmap, last_bit + 1,
>>> +                                   first_bit);
>>> +    return found_bit > last_bit;
>>> +}
>>> +
>>> +static bool
>>> +ram_block_attribute_is_range_discard(RamBlockAttribute *attr,
>>> +                                     uint64_t offset, uint64_t size)
>>> +{
>>> +    const int block_size = ram_block_attribute_get_block_size(attr);
>>
>> size_t.
>>
>>> +    const unsigned long first_bit = offset / block_size;
>>> +    const unsigned long last_bit = first_bit + (size / block_size) - 1;
>>> +    unsigned long found_bit;
>>> +
>>> +    /* We fake a shorter bitmap to avoid searching too far. */
>>> +    found_bit = find_next_bit(attr->bitmap, last_bit + 1, first_bit);
>>> +    return found_bit > last_bit;
>>> +}
>>> +
>>> +int ram_block_attribute_state_change(RamBlockAttribute *attr, uint64_t offset,
>>> +                                     uint64_t size, bool to_private)
>>> +{
>>> +    const int block_size = ram_block_attribute_get_block_size(attr);
>>
>> size_t.
>>
>>> +    const unsigned long first_bit = offset / block_size;
>>> +    const unsigned long nbits = size / block_size;
>>> +    int ret = 0;
>>> +
>>> +    if (!ram_block_attribute_is_valid_range(attr, offset, size)) {
>>> +        error_report("%s, invalid range: offset 0x%lx, size 0x%lx",
>>> +                     __func__, offset, size);
>>> +        return -1;
>>
>> May be -EINVAL?
> 
> Modified.
> 
>>
>>> +    }
>>> +
>>> +    /* Already discard/populated */
>>> +    if ((ram_block_attribute_is_range_discard(attr, offset, size) &&
>>> +         to_private) ||
>>> +        (ram_block_attribute_is_range_populated(attr, offset, size) &&
>>> +         !to_private)) {
>>
>> A tracepoint would be useful here imho.
> 
> [...]
> 
>>
>>> +        return 0;
>>> +    }
>>> +
>>> +    /* Unexpected mixture */
>>> +    if ((!ram_block_attribute_is_range_populated(attr, offset, size) &&
>>> +         to_private) ||
>>> +        (!ram_block_attribute_is_range_discard(attr, offset, size) &&
>>> +         !to_private)) {
>>> +        error_report("%s, the range is not all in the desired state: "
>>> +                     "(offset 0x%lx, size 0x%lx), %s",
>>> +                     __func__, offset, size,
>>> +                     to_private ? "private" : "shared");
>>> +        return -1;
>>
>> -EBUSY?
> 
> Maybe also -EINVAL, since it is caused by an invalid mixture in the
> provided range?

Maybe; I just prefer them different - might save some time on gdb-ing or adding printf's. Thanks,

> But anyway, according to the discussion in patch #10, I'll add
> the support for this mixture scenario. No need to return the error.
Yeah, chunk from 10/10 should be here really.

>>
>>> +    }
>>> +
>>> +    if (to_private) {
>>> +        bitmap_clear(attr->bitmap, first_bit, nbits);
>>> +        ram_block_attribute_notify_to_discard(attr, offset, size);
>>> +    } else {
>>> +        bitmap_set(attr->bitmap, first_bit, nbits);
>>> +        ret = ram_block_attribute_notify_to_populated(attr, offset, size);
>>> +    }
>>
>> and a successful tracepoint here may be?
> 
> Good suggestion! I'll add tracepoint in next version.
> 
>>
>>> +
>>> +    return ret;
>>> +}
>>> +
>>>    RamBlockAttribute *ram_block_attribute_create(MemoryRegion *mr)
>>>    {
>>>        uint64_t bitmap_size;
>>
> 

-- 
Alexey


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v5 10/10] ram-block-attribute: Add more error handling during state changes
  2025-05-27  9:11   ` Alexey Kardashevskiy
@ 2025-05-27 10:18     ` Chenyi Qiang
  2025-05-27 11:21       ` David Hildenbrand
  0 siblings, 1 reply; 51+ messages in thread
From: Chenyi Qiang @ 2025-05-27 10:18 UTC (permalink / raw)
  To: Alexey Kardashevskiy, David Hildenbrand, Peter Xu, Gupta Pankaj,
	Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth
  Cc: qemu-devel, kvm, Williams Dan J, Zhao Liu, Baolu Lu, Gao Chao,
	Xu Yilun, Li Xiaoyao



On 5/27/2025 5:11 PM, Alexey Kardashevskiy wrote:
> 
> 
> On 20/5/25 20:28, Chenyi Qiang wrote:
>> The current error handling is simple with the following assumptions:
>> - QEMU will quit instead of resuming the guest if kvm_convert_memory()
>>    fails, thus no need to do rollback.
>> - The convert range is required to be in the desired state. It is not
>>    allowed to handle the mixture case.
>> - The conversion from shared to private is a non-failure operation.
>>
>> This is sufficient for now as complex error handling is not required.
>> For future extension, add some potential error handling.
>> - For private to shared conversion, do the rollback operation if
>>    ram_block_attribute_notify_to_populated() fails.
>> - For shared to private conversion, still assert it as a non-failure
>>    operation for now. It could be an easy fail path with in-place
>>    conversion, which will likely have to retry the conversion until it
>>    works in the future.
>> - For mixture case, process individual blocks for ease of rollback.
>>
>> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
>> ---
>>   system/ram-block-attribute.c | 116 +++++++++++++++++++++++++++--------
>>   1 file changed, 90 insertions(+), 26 deletions(-)
>>
>> diff --git a/system/ram-block-attribute.c b/system/ram-block-attribute.c
>> index 387501b569..0af3396aa4 100644
>> --- a/system/ram-block-attribute.c
>> +++ b/system/ram-block-attribute.c
>> @@ -289,7 +289,12 @@ static int
>> ram_block_attribute_notify_to_discard(RamBlockAttribute *attr,
>>           }
>>           ret = rdl->notify_discard(rdl, &tmp);
>>           if (ret) {
>> -            break;
>> +            /*
>> +             * The current to_private listeners (VFIO dma_unmap and
>> +             * KVM set_attribute_private) are non-failing operations.
>> +             * TODO: add rollback operations if it is allowed to fail.
>> +             */
>> +            g_assert_not_reached();
>>           }
>>       }
>>   @@ -300,7 +305,7 @@ static int
>>   ram_block_attribute_notify_to_populated(RamBlockAttribute *attr,
>>                                           uint64_t offset, uint64_t size)
>>   {
>> -    RamDiscardListener *rdl;
>> +    RamDiscardListener *rdl, *rdl2;
>>       int ret = 0;
>>         QLIST_FOREACH(rdl, &attr->rdl_list, next) {
>> @@ -315,6 +320,20 @@ ram_block_attribute_notify_to_populated(RamBlockAttribute *attr,
>>           }
>>       }
>>   +    if (ret) {
>> +        /* Notify all already-notified listeners. */
>> +        QLIST_FOREACH(rdl2, &attr->rdl_list, next) {
>> +            MemoryRegionSection tmp = *rdl2->section;
>> +
>> +            if (rdl == rdl2) {
>> +                break;
>> +            }
>> +            if (!memory_region_section_intersect_range(&tmp, offset, size)) {
>> +                continue;
>> +            }
>> +            rdl2->notify_discard(rdl2, &tmp);
>> +        }
>> +    }
>>       return ret;
>>   }
>>   @@ -353,6 +372,9 @@ int ram_block_attribute_state_change(RamBlockAttribute *attr, uint64_t offset,
>>       const int block_size = ram_block_attribute_get_block_size(attr);
>>       const unsigned long first_bit = offset / block_size;
>>       const unsigned long nbits = size / block_size;
>> +    const uint64_t end = offset + size;
>> +    unsigned long bit;
>> +    uint64_t cur;
>>       int ret = 0;
>>         if (!ram_block_attribute_is_valid_range(attr, offset, size)) {
>> @@ -361,32 +383,74 @@ int ram_block_attribute_state_change(RamBlockAttribute *attr, uint64_t offset,
>>           return -1;
>>       }
>>   -    /* Already discard/populated */
>> -    if ((ram_block_attribute_is_range_discard(attr, offset, size) &&
>> -         to_private) ||
>> -        (ram_block_attribute_is_range_populated(attr, offset, size) &&
>> -         !to_private)) {
>> -        return 0;
>> -    }
>> -
>> -    /* Unexpected mixture */
>> -    if ((!ram_block_attribute_is_range_populated(attr, offset, size) &&
>> -         to_private) ||
>> -        (!ram_block_attribute_is_range_discard(attr, offset, size) &&
>> -         !to_private)) {
>> -        error_report("%s, the range is not all in the desired state: "
>> -                     "(offset 0x%lx, size 0x%lx), %s",
>> -                     __func__, offset, size,
>> -                     to_private ? "private" : "shared");
>> -        return -1;
>> -    }
> 
> David is right, this needs to be squashed where you added the above hunk.
> 
>> -
>>       if (to_private) {
>> -        bitmap_clear(attr->bitmap, first_bit, nbits);
>> -        ret = ram_block_attribute_notify_to_discard(attr, offset, size);
>> +        if (ram_block_attribute_is_range_discard(attr, offset, size)) {
>> +            /* Already private */
>> +        } else if (!ram_block_attribute_is_range_populated(attr, offset,
>> +                                                           size)) {
>> +            /* Unexpected mixture: process individual blocks */
> 
> 
> Is an "expected mix" situation possible?

I didn't see such a real case, and the TDX GHCI spec also doesn't mention
how to handle such a situation.

> May be just always run the code for "unexpected mix", or refuse mixing
> and let the VM deal with it?

[...]

> 
> 
>> +            for (cur = offset; cur < end; cur += block_size) {
>> +                bit = cur / block_size;
>> +                if (!test_bit(bit, attr->bitmap)) {
>> +                    continue;
>> +                }
>> +                clear_bit(bit, attr->bitmap);
>> +                ram_block_attribute_notify_to_discard(attr, cur, block_size);
>> +            }
>> +        } else {
>> +            /* Completely shared */
>> +            bitmap_clear(attr->bitmap, first_bit, nbits);
>> +            ram_block_attribute_notify_to_discard(attr, offset, size);
>> +        }
>>       } else {
>> -        bitmap_set(attr->bitmap, first_bit, nbits);
>> -        ret = ram_block_attribute_notify_to_populated(attr, offset, size);
>> +        if (ram_block_attribute_is_range_populated(attr, offset, size)) {
>> +            /* Already shared */
>> +        } else if (!ram_block_attribute_is_range_discard(attr, offset, size)) {
>> +            /* Unexpected mixture: process individual blocks */
>> +            unsigned long *modified_bitmap = bitmap_new(nbits);
>> +
>> +            for (cur = offset; cur < end; cur += block_size) {
>> +                bit = cur / block_size;
>> +                if (test_bit(bit, attr->bitmap)) {
>> +                    continue;
>> +                }
>> +                set_bit(bit, attr->bitmap);
>> +                ret = ram_block_attribute_notify_to_populated(attr, cur,
>> +                                                           block_size);
>> +                if (!ret) {
>> +                    set_bit(bit - first_bit, modified_bitmap);
>> +                    continue;
>> +                }
>> +                clear_bit(bit, attr->bitmap);
>> +                break;
>> +            }
>> +
>> +            if (ret) {
>> +                /*
>> +                 * Very unexpected: something went wrong. Revert to the old
>> +                 * state, marking only the blocks as private that we converted
>> +                 * to shared.
> 
> 
> If something went wrong... well, on my AMD machine this usually means
> the fw is really unhappy and recovery is hardly possible and the machine
> needs reboot. Probably stopping the VM would make more sense for now (or
> stop the device so the user could save work from the VM, dunno).

My current plan (in next version) is to squash the mixture handling in
previous patch to always run the code for "unexpected mix", and return
error without rollback if it fails in kvm_convert_memory(), which will
cause QEMU to quit. I think it can meet what you want.

As for the rollback handling, maybe keep it as an attached patch for
future reference or just remove it.

> 
> 
>> +                 */
>> +                for (cur = offset; cur < end; cur += block_size) {
>> +                    bit = cur / block_size;
>> +                    if (!test_bit(bit - first_bit, modified_bitmap)) {
>> +                        continue;
>> +                    }
>> +                    assert(test_bit(bit, attr->bitmap));
>> +                    clear_bit(bit, attr->bitmap);
>> +                    ram_block_attribute_notify_to_discard(attr, cur,
>> +                                                          block_size);
>> +                }
>> +            }
>> +            g_free(modified_bitmap);
>> +        } else {
>> +            /* Completely private */
> 
> I'd swap this hunk with the previous one. Thanks,

Fine to change. Thanks.

> 
>> +            bitmap_set(attr->bitmap, first_bit, nbits);
>> +            ret = ram_block_attribute_notify_to_populated(attr, offset, size);
>> +            if (ret) {
>> +                bitmap_clear(attr->bitmap, first_bit, nbits);
>> +            }
>> +        }
>>       }
>>         return ret;
> 



* Re: [PATCH v5 07/10] RAMBlock: Make guest_memfd require coordinate discard
  2025-05-27  5:47     ` Chenyi Qiang
  2025-05-27  7:42       ` Alexey Kardashevskiy
@ 2025-05-27 11:20       ` David Hildenbrand
  2025-05-28  1:57         ` Chenyi Qiang
  1 sibling, 1 reply; 51+ messages in thread
From: David Hildenbrand @ 2025-05-27 11:20 UTC (permalink / raw)
  To: Chenyi Qiang, Alexey Kardashevskiy, Peter Xu, Gupta Pankaj,
	Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth
  Cc: qemu-devel, kvm, Williams Dan J, Zhao Liu, Baolu Lu, Gao Chao,
	Xu Yilun, Li Xiaoyao

On 27.05.25 07:47, Chenyi Qiang wrote:
> 
> 
> On 5/26/2025 5:08 PM, David Hildenbrand wrote:
>> On 20.05.25 12:28, Chenyi Qiang wrote:
>>> As guest_memfd is now managed by RamBlockAttribute with
>>> RamDiscardManager, only block uncoordinated discard.
>>>
>>> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
>>> ---
>>> Changes in v5:
>>>       - Revert to use RamDiscardManager.
>>>
>>> Changes in v4:
>>>       - Modify commit message (RamDiscardManager->PrivateSharedManager).
>>>
>>> Changes in v3:
>>>       - No change.
>>>
>>> Changes in v2:
>>>       - Change the ram_block_discard_require(false) to
>>>         ram_block_coordinated_discard_require(false).
>>> ---
>>>    system/physmem.c | 6 +++---
>>>    1 file changed, 3 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/system/physmem.c b/system/physmem.c
>>> index f05f7ff09a..58b7614660 100644
>>> --- a/system/physmem.c
>>> +++ b/system/physmem.c
>>> @@ -1916,7 +1916,7 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
>>>            }
>>>            assert(new_block->guest_memfd < 0);
>>>    -        ret = ram_block_discard_require(true);
>>> +        ret = ram_block_coordinated_discard_require(true);
>>>            if (ret < 0) {
>>>                error_setg_errno(errp, -ret,
>>>                                 "cannot set up private guest memory: discard currently blocked");
>>> @@ -1939,7 +1939,7 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
>>>                 * ever develops a need to check for errors.
>>>                 */
>>>                close(new_block->guest_memfd);
>>> -            ram_block_discard_require(false);
>>> +            ram_block_coordinated_discard_require(false);
>>>                qemu_mutex_unlock_ramlist();
>>>                goto out_free;
>>>            }
>>> @@ -2302,7 +2302,7 @@ static void reclaim_ramblock(RAMBlock *block)
>>>        if (block->guest_memfd >= 0) {
>>>            ram_block_attribute_destroy(block->ram_shared);
>>>            close(block->guest_memfd);
>>> -        ram_block_discard_require(false);
>>> +        ram_block_coordinated_discard_require(false);
>>>        }
>>>          g_free(block);
>>
>>
>> I think this patch should be squashed into the previous one, then the
>> story in that single patch is consistent.
> 
> I think this patch is a gate to allow device assignment with guest_memfd
> and want to make it separately. Can we instead add some commit message
> in previous one? like:
> 
> "Using guest_memfd with vfio is still blocked via
> ram_block_discard_disable()/ram_block_discard_require()."

For the title it should probably be something like:

"physmem: support coordinated discarding of RAM with guest_memdfd"

Then explain how we install the RAMDiscardManager that will notify 
listeners (esp. vfio).

-- 
Cheers,

David / dhildenb



* Re: [PATCH v5 10/10] ram-block-attribute: Add more error handling during state changes
  2025-05-27 10:18     ` Chenyi Qiang
@ 2025-05-27 11:21       ` David Hildenbrand
  0 siblings, 0 replies; 51+ messages in thread
From: David Hildenbrand @ 2025-05-27 11:21 UTC (permalink / raw)
  To: Chenyi Qiang, Alexey Kardashevskiy, Peter Xu, Gupta Pankaj,
	Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth
  Cc: qemu-devel, kvm, Williams Dan J, Zhao Liu, Baolu Lu, Gao Chao,
	Xu Yilun, Li Xiaoyao


>>
>> If something went wrong... well, on my AMD machine this usually means
>> the fw is really unhappy and recovery is hardly possible and the machine
>> needs reboot. Probably stopping the VM would make more sense for now (or
>> stop the device so the user could save work from the VM, dunno).
> 
> My current plan (in next version) is to squash the mixture handling in
> previous patch to always run the code for "unexpected mix", and return
> error without rollback if it fails in kvm_convert_memory(), which will
> cause QEMU to quit. I think it can meet what you want.
> 
> As for the rollback handling, maybe keep it as an attached patch for
> future reference or just remove it.

probably best to remove it for now. The patch is in the mailing list 
archives for future reference :)

-- 
Cheers,

David / dhildenb



* Re: [PATCH v5 07/10] RAMBlock: Make guest_memfd require coordinate discard
  2025-05-27 11:20       ` David Hildenbrand
@ 2025-05-28  1:57         ` Chenyi Qiang
  0 siblings, 0 replies; 51+ messages in thread
From: Chenyi Qiang @ 2025-05-28  1:57 UTC (permalink / raw)
  To: David Hildenbrand, Alexey Kardashevskiy, Peter Xu, Gupta Pankaj,
	Paolo Bonzini, Philippe Mathieu-Daudé, Michael Roth
  Cc: qemu-devel, kvm, Williams Dan J, Zhao Liu, Baolu Lu, Gao Chao,
	Xu Yilun, Li Xiaoyao



On 5/27/2025 7:20 PM, David Hildenbrand wrote:
> On 27.05.25 07:47, Chenyi Qiang wrote:
>>
>>
>> On 5/26/2025 5:08 PM, David Hildenbrand wrote:
>>> On 20.05.25 12:28, Chenyi Qiang wrote:
>>>> As guest_memfd is now managed by RamBlockAttribute with
>>>> RamDiscardManager, only block uncoordinated discard.
>>>>
>>>> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
>>>> ---
>>>> Changes in v5:
>>>>       - Revert to use RamDiscardManager.
>>>>
>>>> Changes in v4:
>>>>       - Modify commit message (RamDiscardManager->PrivateSharedManager).
>>>>
>>>> Changes in v3:
>>>>       - No change.
>>>>
>>>> Changes in v2:
>>>>       - Change the ram_block_discard_require(false) to
>>>>         ram_block_coordinated_discard_require(false).
>>>> ---
>>>>    system/physmem.c | 6 +++---
>>>>    1 file changed, 3 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/system/physmem.c b/system/physmem.c
>>>> index f05f7ff09a..58b7614660 100644
>>>> --- a/system/physmem.c
>>>> +++ b/system/physmem.c
>>>> @@ -1916,7 +1916,7 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
>>>>            }
>>>>            assert(new_block->guest_memfd < 0);
>>>>    -        ret = ram_block_discard_require(true);
>>>> +        ret = ram_block_coordinated_discard_require(true);
>>>>            if (ret < 0) {
>>>>                error_setg_errno(errp, -ret,
>>>>                                 "cannot set up private guest memory: discard currently blocked");
>>>> @@ -1939,7 +1939,7 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
>>>>                 * ever develops a need to check for errors.
>>>>                 */
>>>>                close(new_block->guest_memfd);
>>>> -            ram_block_discard_require(false);
>>>> +            ram_block_coordinated_discard_require(false);
>>>>                qemu_mutex_unlock_ramlist();
>>>>                goto out_free;
>>>>            }
>>>> @@ -2302,7 +2302,7 @@ static void reclaim_ramblock(RAMBlock *block)
>>>>        if (block->guest_memfd >= 0) {
>>>>            ram_block_attribute_destroy(block->ram_shared);
>>>>            close(block->guest_memfd);
>>>> -        ram_block_discard_require(false);
>>>> +        ram_block_coordinated_discard_require(false);
>>>>        }
>>>>          g_free(block);
>>>
>>>
>>> I think this patch should be squashed into the previous one, then the
>>> story in that single patch is consistent.
>>
>> I think this patch is a gate to allow device assignment with guest_memfd
>> and want to make it separately. Can we instead add some commit message
>> in previous one? like:
>>
>> "Using guest_memfd with vfio is still blocked via
>> ram_block_discard_disable()/ram_block_discard_require()."
> 
> For the title it should probably be something like:
> 
> "physmem: support coordinated discarding of RAM with guest_memfd"
> 
> Then explain how we install the RAMDiscardManager that will notify
> listeners (esp. vfio).

Make sense. Will do the squash and change the title. Thanks!

> 



end of thread, other threads:[~2025-05-28  1:58 UTC | newest]

Thread overview: 51+ messages
2025-05-20 10:28 [PATCH v5 00/10] Enable shared device assignment Chenyi Qiang
2025-05-20 10:28 ` [PATCH v5 01/10] memory: Export a helper to get intersection of a MemoryRegionSection with a given range Chenyi Qiang
2025-05-20 10:28 ` [PATCH v5 02/10] memory: Change memory_region_set_ram_discard_manager() to return the result Chenyi Qiang
2025-05-26  8:40   ` David Hildenbrand
2025-05-27  6:56   ` Alexey Kardashevskiy
2025-05-20 10:28 ` [PATCH v5 03/10] memory: Unify the definiton of ReplayRamPopulate() and ReplayRamDiscard() Chenyi Qiang
2025-05-26  8:42   ` David Hildenbrand
2025-05-26  9:35   ` Philippe Mathieu-Daudé
2025-05-26 10:21     ` Chenyi Qiang
2025-05-27  6:56   ` Alexey Kardashevskiy
2025-05-20 10:28 ` [PATCH v5 04/10] ram-block-attribute: Introduce RamBlockAttribute to manage RAMBlock with guest_memfd Chenyi Qiang
2025-05-26  9:01   ` David Hildenbrand
2025-05-26  9:28     ` Chenyi Qiang
2025-05-26 11:16       ` Alexey Kardashevskiy
2025-05-27  1:15         ` Chenyi Qiang
2025-05-27  1:20           ` Alexey Kardashevskiy
2025-05-27  3:14             ` Chenyi Qiang
2025-05-27  6:06               ` Alexey Kardashevskiy
2025-05-20 10:28 ` [PATCH v5 05/10] ram-block-attribute: Introduce a helper to notify shared/private state changes Chenyi Qiang
2025-05-26  9:02   ` David Hildenbrand
2025-05-27  7:35   ` Alexey Kardashevskiy
2025-05-27  9:06     ` Chenyi Qiang
2025-05-27  9:19       ` Alexey Kardashevskiy
2025-05-20 10:28 ` [PATCH v5 06/10] memory: Attach RamBlockAttribute to guest_memfd-backed RAMBlocks Chenyi Qiang
2025-05-26  9:06   ` David Hildenbrand
2025-05-26  9:46     ` Chenyi Qiang
2025-05-20 10:28 ` [PATCH v5 07/10] RAMBlock: Make guest_memfd require coordinate discard Chenyi Qiang
2025-05-26  9:08   ` David Hildenbrand
2025-05-27  5:47     ` Chenyi Qiang
2025-05-27  7:42       ` Alexey Kardashevskiy
2025-05-27  8:12         ` Chenyi Qiang
2025-05-27 11:20       ` David Hildenbrand
2025-05-28  1:57         ` Chenyi Qiang
2025-05-20 10:28 ` [PATCH v5 08/10] memory: Change NotifyRamDiscard() definition to return the result Chenyi Qiang
2025-05-26  9:31   ` Philippe Mathieu-Daudé
2025-05-26 10:36   ` Cédric Le Goater
2025-05-26 12:44     ` Cédric Le Goater
2025-05-27  5:29       ` Chenyi Qiang
2025-05-20 10:28 ` [PATCH v5 09/10] KVM: Introduce RamDiscardListener for attribute changes during memory conversions Chenyi Qiang
2025-05-26  9:22   ` David Hildenbrand
2025-05-27  8:01   ` Alexey Kardashevskiy
2025-05-20 10:28 ` [PATCH v5 10/10] ram-block-attribute: Add more error handling during state changes Chenyi Qiang
2025-05-26  9:17   ` David Hildenbrand
2025-05-26 10:19     ` Chenyi Qiang
2025-05-26 12:10       ` David Hildenbrand
2025-05-26 12:39         ` Chenyi Qiang
2025-05-27  9:11   ` Alexey Kardashevskiy
2025-05-27 10:18     ` Chenyi Qiang
2025-05-27 11:21       ` David Hildenbrand
2025-05-26 11:37 ` [PATCH v5 00/10] Enable shared device assignment Cédric Le Goater
2025-05-26 12:16   ` Chenyi Qiang
