qemu-devel.nongnu.org archive mirror
* [PATCH v2 0/4] virtio-mem: Support "x-ignore-shared" migration
@ 2023-07-06  7:56 David Hildenbrand
  2023-07-06  7:56 ` [PATCH v2 1/4] softmmu/physmem: Warn with ram_block_discard_range() on MAP_PRIVATE file mapping David Hildenbrand
                   ` (4 more replies)
  0 siblings, 5 replies; 27+ messages in thread
From: David Hildenbrand @ 2023-07-06  7:56 UTC (permalink / raw)
  To: qemu-devel
  Cc: David Hildenbrand, Michael S. Tsirkin, Juan Quintela, Peter Xu,
	Leonardo Bras, Paolo Bonzini, Philippe Mathieu-Daudé,
	Peng Tao, Mario Casquero

If there is no further feedback, I'll queue this myself shortly.

Stumbling over "x-ignore-shared" migration support for virtio-mem on
my todo list, I remember talking to Dave G. a while ago about how
ram_block_discard_range() in MAP_PRIVATE file mappings is possibly
harmful when the file is used somewhere else -- for example, with VM
templating in multiple VMs.

This series adds a warning to ram_block_discard_range() in that problematic
case and adds "x-ignore-shared" migration support for virtio-mem, which
is pretty straightforward. The last patch also documents how VM templating
interacts with virtio-mem.

v1 -> v2:
- Pick up tags
- "virtio-mem: Support "x-ignore-shared" migration"
 -> Fix spelling mistake

Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Juan Quintela <quintela@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Leonardo Bras <leobras@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Philippe Mathieu-Daudé" <philmd@linaro.org>
Cc: Peng Tao <tao.peng@linux.alibaba.com>
Cc: Mario Casquero <mcasquer@redhat.com>

David Hildenbrand (4):
  softmmu/physmem: Warn with ram_block_discard_range() on MAP_PRIVATE
    file mapping
  virtio-mem: Skip most of virtio_mem_unplug_all() without plugged
    memory
  migration/ram: Expose ramblock_is_ignored() as
    migrate_ram_is_ignored()
  virtio-mem: Support "x-ignore-shared" migration

 hw/virtio/virtio-mem.c   | 67 ++++++++++++++++++++++++++++------------
 include/migration/misc.h |  1 +
 migration/postcopy-ram.c |  2 +-
 migration/ram.c          | 14 ++++-----
 migration/ram.h          |  3 +-
 softmmu/physmem.c        | 18 +++++++++++
 6 files changed, 76 insertions(+), 29 deletions(-)

-- 
2.41.0



^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH v2 1/4] softmmu/physmem: Warn with ram_block_discard_range() on MAP_PRIVATE file mapping
  2023-07-06  7:56 [PATCH v2 0/4] virtio-mem: Support "x-ignore-shared" migration David Hildenbrand
@ 2023-07-06  7:56 ` David Hildenbrand
  2023-07-06  8:10   ` Juan Quintela
                     ` (2 more replies)
  2023-07-06  7:56 ` [PATCH v2 2/4] virtio-mem: Skip most of virtio_mem_unplug_all() without plugged memory David Hildenbrand
                   ` (3 subsequent siblings)
  4 siblings, 3 replies; 27+ messages in thread
From: David Hildenbrand @ 2023-07-06  7:56 UTC (permalink / raw)
  To: qemu-devel
  Cc: David Hildenbrand, Michael S. Tsirkin, Juan Quintela, Peter Xu,
	Leonardo Bras, Paolo Bonzini, Philippe Mathieu-Daudé,
	Peng Tao, Mario Casquero

ram_block_discard_range() cannot possibly do the right thing in
MAP_PRIVATE file mappings in the general case.

To achieve the documented semantics, we also have to punch a hole into
the file, possibly messing with other MAP_PRIVATE/MAP_SHARED mappings
of such a file.

For example, using VM templating -- see commit b17fbbe55cba ("migration:
allow private destination ram with x-ignore-shared") -- in combination with
any mechanism that relies on discarding of RAM is problematic. This
includes:
* Postcopy live migration
* virtio-balloon inflation/deflation or free-page-reporting
* virtio-mem

So at least warn that something possibly dangerous is going on
when using ram_block_discard_range() in these cases.

Acked-by: Peter Xu <peterx@redhat.com>
Tested-by: Mario Casquero <mcasquer@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 softmmu/physmem.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/softmmu/physmem.c b/softmmu/physmem.c
index bda475a719..4ee157bda4 100644
--- a/softmmu/physmem.c
+++ b/softmmu/physmem.c
@@ -3456,6 +3456,24 @@ int ram_block_discard_range(RAMBlock *rb, uint64_t start, size_t length)
              * so a userfault will trigger.
              */
 #ifdef CONFIG_FALLOCATE_PUNCH_HOLE
+            /*
+             * We'll discard data from the actual file, even though we only
+             * have a MAP_PRIVATE mapping, possibly messing with other
+             * MAP_PRIVATE/MAP_SHARED mappings. There is no easy way to
+             * change that behavior whithout violating the promised
+             * semantics of ram_block_discard_range().
+             *
+             * Only warn, because it work as long as nobody else uses that
+             * file.
+             */
+            if (!qemu_ram_is_shared(rb)) {
+                warn_report_once("ram_block_discard_range: Discarding RAM"
+                                 " in private file mappings is possibly"
+                                 " dangerous, because it will modify the"
+                                 " underlying file and will affect other"
+                                 " users of the file");
+            }
+
             ret = fallocate(rb->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                             start, length);
             if (ret) {
-- 
2.41.0




* [PATCH v2 2/4] virtio-mem: Skip most of virtio_mem_unplug_all() without plugged memory
  2023-07-06  7:56 [PATCH v2 0/4] virtio-mem: Support "x-ignore-shared" migration David Hildenbrand
  2023-07-06  7:56 ` [PATCH v2 1/4] softmmu/physmem: Warn with ram_block_discard_range() on MAP_PRIVATE file mapping David Hildenbrand
@ 2023-07-06  7:56 ` David Hildenbrand
  2023-07-06  8:15   ` Juan Quintela
  2023-07-06  7:56 ` [PATCH v2 3/4] migration/ram: Expose ramblock_is_ignored() as migrate_ram_is_ignored() David Hildenbrand
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 27+ messages in thread
From: David Hildenbrand @ 2023-07-06  7:56 UTC (permalink / raw)
  To: qemu-devel
  Cc: David Hildenbrand, Michael S. Tsirkin, Juan Quintela, Peter Xu,
	Leonardo Bras, Paolo Bonzini, Philippe Mathieu-Daudé,
	Peng Tao, Mario Casquero

When starting QEMU, we already perform one system reset that ends up
triggering virtio_mem_unplug_all() with no actual memory plugged yet.
That, in turn, will trigger ram_block_discard_range() and perform some
other actions that are not required in that case.

Let's optimize virtio_mem_unplug_all() for the case that no memory is
plugged. This will be beneficial for x-ignore-shared support as well.

Tested-by: Mario Casquero <mcasquer@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 hw/virtio/virtio-mem.c | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c
index ec0ae32589..a922c21380 100644
--- a/hw/virtio/virtio-mem.c
+++ b/hw/virtio/virtio-mem.c
@@ -621,20 +621,20 @@ static int virtio_mem_unplug_all(VirtIOMEM *vmem)
 {
     RAMBlock *rb = vmem->memdev->mr.ram_block;
 
-    if (virtio_mem_is_busy()) {
-        return -EBUSY;
-    }
-
-    if (ram_block_discard_range(rb, 0, qemu_ram_get_used_length(rb))) {
-        return -EBUSY;
-    }
-    virtio_mem_notify_unplug_all(vmem);
-
-    bitmap_clear(vmem->bitmap, 0, vmem->bitmap_size);
     if (vmem->size) {
+        if (virtio_mem_is_busy()) {
+            return -EBUSY;
+        }
+        if (ram_block_discard_range(rb, 0, qemu_ram_get_used_length(rb))) {
+            return -EBUSY;
+        }
+        virtio_mem_notify_unplug_all(vmem);
+
+        bitmap_clear(vmem->bitmap, 0, vmem->bitmap_size);
         vmem->size = 0;
         notifier_list_notify(&vmem->size_change_notifiers, &vmem->size);
     }
+
     trace_virtio_mem_unplugged_all();
     virtio_mem_resize_usable_region(vmem, vmem->requested_size, true);
     return 0;
-- 
2.41.0




* [PATCH v2 3/4] migration/ram: Expose ramblock_is_ignored() as migrate_ram_is_ignored()
  2023-07-06  7:56 [PATCH v2 0/4] virtio-mem: Support "x-ignore-shared" migration David Hildenbrand
  2023-07-06  7:56 ` [PATCH v2 1/4] softmmu/physmem: Warn with ram_block_discard_range() on MAP_PRIVATE file mapping David Hildenbrand
  2023-07-06  7:56 ` [PATCH v2 2/4] virtio-mem: Skip most of virtio_mem_unplug_all() without plugged memory David Hildenbrand
@ 2023-07-06  7:56 ` David Hildenbrand
  2023-07-06  8:16   ` Juan Quintela
  2023-07-06  7:56 ` [PATCH v2 4/4] virtio-mem: Support "x-ignore-shared" migration David Hildenbrand
  2023-07-06 14:03 ` [PATCH v2 0/4] " Michael S. Tsirkin
  4 siblings, 1 reply; 27+ messages in thread
From: David Hildenbrand @ 2023-07-06  7:56 UTC (permalink / raw)
  To: qemu-devel
  Cc: David Hildenbrand, Michael S. Tsirkin, Juan Quintela, Peter Xu,
	Leonardo Bras, Paolo Bonzini, Philippe Mathieu-Daudé,
	Peng Tao, Mario Casquero

virtio-mem wants to know whether it should not mess with the RAMBlock
content (e.g., discard RAM, preallocate memory) on incoming migration.

So let's expose that function as migrate_ram_is_ignored() in
migration/misc.h.

Acked-by: Peter Xu <peterx@redhat.com>
Tested-by: Mario Casquero <mcasquer@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 include/migration/misc.h |  1 +
 migration/postcopy-ram.c |  2 +-
 migration/ram.c          | 14 +++++++-------
 migration/ram.h          |  3 +--
 4 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/include/migration/misc.h b/include/migration/misc.h
index 5ebe13b4b9..7dcc0b5c2c 100644
--- a/include/migration/misc.h
+++ b/include/migration/misc.h
@@ -40,6 +40,7 @@ int precopy_notify(PrecopyNotifyReason reason, Error **errp);
 
 void ram_mig_init(void);
 void qemu_guest_free_page_hint(void *addr, size_t len);
+bool migrate_ram_is_ignored(RAMBlock *block);
 
 /* migration/block.c */
 
diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index 5615ec29eb..29aea9456d 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -408,7 +408,7 @@ bool postcopy_ram_supported_by_host(MigrationIncomingState *mis, Error **errp)
     /*
      * We don't support postcopy with some type of ramblocks.
      *
-     * NOTE: we explicitly ignored ramblock_is_ignored() instead we checked
+     * NOTE: we explicitly ignored migrate_ram_is_ignored() instead we checked
      * all possible ramblocks.  This is because this function can be called
      * when creating the migration object, during the phase RAM_MIGRATABLE
      * is not even properly set for all the ramblocks.
diff --git a/migration/ram.c b/migration/ram.c
index 5283a75f02..0ada6477e8 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -194,7 +194,7 @@ static bool postcopy_preempt_active(void)
     return migrate_postcopy_preempt() && migration_in_postcopy();
 }
 
-bool ramblock_is_ignored(RAMBlock *block)
+bool migrate_ram_is_ignored(RAMBlock *block)
 {
     return !qemu_ram_is_migratable(block) ||
            (migrate_ignore_shared() && qemu_ram_is_shared(block)
@@ -696,7 +696,7 @@ static void pss_find_next_dirty(PageSearchStatus *pss)
     unsigned long size = rb->used_length >> TARGET_PAGE_BITS;
     unsigned long *bitmap = rb->bmap;
 
-    if (ramblock_is_ignored(rb)) {
+    if (migrate_ram_is_ignored(rb)) {
         /* Points directly to the end, so we know no dirty page */
         pss->page = size;
         return;
@@ -780,7 +780,7 @@ unsigned long colo_bitmap_find_dirty(RAMState *rs, RAMBlock *rb,
 
     *num = 0;
 
-    if (ramblock_is_ignored(rb)) {
+    if (migrate_ram_is_ignored(rb)) {
         return size;
     }
 
@@ -2260,7 +2260,7 @@ static int ram_save_host_page(RAMState *rs, PageSearchStatus *pss)
     unsigned long start_page = pss->page;
     int res;
 
-    if (ramblock_is_ignored(pss->block)) {
+    if (migrate_ram_is_ignored(pss->block)) {
         error_report("block %s should not be migrated !", pss->block->idstr);
         return 0;
     }
@@ -3347,7 +3347,7 @@ static inline RAMBlock *ram_block_from_stream(MigrationIncomingState *mis,
         return NULL;
     }
 
-    if (ramblock_is_ignored(block)) {
+    if (migrate_ram_is_ignored(block)) {
         error_report("block %s should not be migrated !", id);
         return NULL;
     }
@@ -3958,7 +3958,7 @@ static int ram_load_precopy(QEMUFile *f)
                     }
                     if (migrate_ignore_shared()) {
                         hwaddr addr = qemu_get_be64(f);
-                        if (ramblock_is_ignored(block) &&
+                        if (migrate_ram_is_ignored(block) &&
                             block->mr->addr != addr) {
                             error_report("Mismatched GPAs for block %s "
                                          "%" PRId64 "!= %" PRId64,
@@ -4254,7 +4254,7 @@ static void ram_mig_ram_block_resized(RAMBlockNotifier *n, void *host,
     RAMBlock *rb = qemu_ram_block_from_host(host, false, &offset);
     Error *err = NULL;
 
-    if (ramblock_is_ignored(rb)) {
+    if (migrate_ram_is_ignored(rb)) {
         return;
     }
 
diff --git a/migration/ram.h b/migration/ram.h
index ea1f3c25b5..145c915ca7 100644
--- a/migration/ram.h
+++ b/migration/ram.h
@@ -36,11 +36,10 @@
 extern XBZRLECacheStats xbzrle_counters;
 extern CompressionStats compression_counters;
 
-bool ramblock_is_ignored(RAMBlock *block);
 /* Should be holding either ram_list.mutex, or the RCU lock. */
 #define RAMBLOCK_FOREACH_NOT_IGNORED(block)            \
     INTERNAL_RAMBLOCK_FOREACH(block)                   \
-        if (ramblock_is_ignored(block)) {} else
+        if (migrate_ram_is_ignored(block)) {} else
 
 #define RAMBLOCK_FOREACH_MIGRATABLE(block)             \
     INTERNAL_RAMBLOCK_FOREACH(block)                   \
-- 
2.41.0




* [PATCH v2 4/4] virtio-mem: Support "x-ignore-shared" migration
  2023-07-06  7:56 [PATCH v2 0/4] virtio-mem: Support "x-ignore-shared" migration David Hildenbrand
                   ` (2 preceding siblings ...)
  2023-07-06  7:56 ` [PATCH v2 3/4] migration/ram: Expose ramblock_is_ignored() as migrate_ram_is_ignored() David Hildenbrand
@ 2023-07-06  7:56 ` David Hildenbrand
  2023-07-06 11:06   ` Juan Quintela
  2023-07-06 11:59   ` Juan Quintela
  2023-07-06 14:03 ` [PATCH v2 0/4] " Michael S. Tsirkin
  4 siblings, 2 replies; 27+ messages in thread
From: David Hildenbrand @ 2023-07-06  7:56 UTC (permalink / raw)
  To: qemu-devel
  Cc: David Hildenbrand, Michael S. Tsirkin, Juan Quintela, Peter Xu,
	Leonardo Bras, Paolo Bonzini, Philippe Mathieu-Daudé,
	Peng Tao, Mario Casquero

To achieve the desired "x-ignore-shared" functionality, we should not
discard all RAM when realizing the device, and we should not mess with
preallocation/postcopy when loading device state. In essence, we should
not touch RAM content.

As "x-ignore-shared" gets set after realizing the device, we cannot
rely on that. Let's simply skip discarding of RAM on incoming migration.
Note that virtio_mem_post_load() will call
virtio_mem_restore_unplugged() -- unless "x-ignore-shared" is set. So
once migration finished we'll have a consistent state.

The initial system reset will also not discard any RAM, because
virtio_mem_unplug_all() will not call ram_block_discard_range() when no
memory is plugged (which is the case before loading the device state).

Note that something like VM templating -- see commit b17fbbe55cba
("migration: allow private destination ram with x-ignore-shared") -- is
currently incompatible with virtio-mem and ram_block_discard_range() will
warn in case a private file mapping is supplied by virtio-mem.

For VM templating with virtio-mem, it makes more sense to either
(a) Create the template without the virtio-mem device and hotplug a
    virtio-mem device to the new VM instances using a proper, dedicated
    memory backend.
(b) Use a virtio-mem device that doesn't provide any memory in the
    template (requested-size=0) and use private anonymous memory.

Tested-by: Mario Casquero <mcasquer@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 hw/virtio/virtio-mem.c | 47 ++++++++++++++++++++++++++++++++++--------
 1 file changed, 38 insertions(+), 9 deletions(-)

diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c
index a922c21380..3f41e00e74 100644
--- a/hw/virtio/virtio-mem.c
+++ b/hw/virtio/virtio-mem.c
@@ -18,6 +18,7 @@
 #include "sysemu/numa.h"
 #include "sysemu/sysemu.h"
 #include "sysemu/reset.h"
+#include "sysemu/runstate.h"
 #include "hw/virtio/virtio.h"
 #include "hw/virtio/virtio-bus.h"
 #include "hw/virtio/virtio-mem.h"
@@ -901,11 +902,23 @@ static void virtio_mem_device_realize(DeviceState *dev, Error **errp)
         return;
     }
 
-    ret = ram_block_discard_range(rb, 0, qemu_ram_get_used_length(rb));
-    if (ret) {
-        error_setg_errno(errp, -ret, "Unexpected error discarding RAM");
-        ram_block_coordinated_discard_require(false);
-        return;
+    /*
+     * We don't know at this point whether shared RAM is migrated using
+     * QEMU or migrated using the file content. "x-ignore-shared" will be
+     * configured after realizing the device. So in case we have an
+     * incoming migration, simply always skip the discard step.
+     *
+     * Otherwise, make sure that we start with a clean slate: either the
+     * memory backend might get reused or the shared file might still have
+     * memory allocated.
+     */
+    if (!runstate_check(RUN_STATE_INMIGRATE)) {
+        ret = ram_block_discard_range(rb, 0, qemu_ram_get_used_length(rb));
+        if (ret) {
+            error_setg_errno(errp, -ret, "Unexpected error discarding RAM");
+            ram_block_coordinated_discard_require(false);
+            return;
+        }
     }
 
     virtio_mem_resize_usable_region(vmem, vmem->requested_size, true);
@@ -977,10 +990,6 @@ static int virtio_mem_post_load(void *opaque, int version_id)
     RamDiscardListener *rdl;
     int ret;
 
-    if (vmem->prealloc && !vmem->early_migration) {
-        warn_report("Proper preallocation with migration requires a newer QEMU machine");
-    }
-
     /*
      * We started out with all memory discarded and our memory region is mapped
      * into an address space. Replay, now that we updated the bitmap.
@@ -993,6 +1002,18 @@ static int virtio_mem_post_load(void *opaque, int version_id)
         }
     }
 
+    /*
+     * If shared RAM is migrated using the file content and not using QEMU,
+     * don't mess with preallocation and postcopy.
+     */
+    if (migrate_ram_is_ignored(vmem->memdev->mr.ram_block)) {
+        return 0;
+    }
+
+    if (vmem->prealloc && !vmem->early_migration) {
+        warn_report("Proper preallocation with migration requires a newer QEMU machine");
+    }
+
     if (migration_in_incoming_postcopy()) {
         return 0;
     }
@@ -1025,6 +1046,14 @@ static int virtio_mem_post_load_early(void *opaque, int version_id)
         return 0;
     }
 
+    /*
+     * If shared RAM is migrated using the file content and not using QEMU,
+     * don't mess with preallocation and postcopy.
+     */
+    if (migrate_ram_is_ignored(rb)) {
+        return 0;
+    }
+
     /*
      * We restored the bitmap and verified that the basic properties
      * match on source and destination, so we can go ahead and preallocate
-- 
2.41.0




* Re: [PATCH v2 1/4] softmmu/physmem: Warn with ram_block_discard_range() on MAP_PRIVATE file mapping
  2023-07-06  7:56 ` [PATCH v2 1/4] softmmu/physmem: Warn with ram_block_discard_range() on MAP_PRIVATE file mapping David Hildenbrand
@ 2023-07-06  8:10   ` Juan Quintela
  2023-07-06  8:31     ` David Hildenbrand
  2023-07-06  8:49   ` David Hildenbrand
  2023-10-18  3:02   ` Xiaoyao Li
  2 siblings, 1 reply; 27+ messages in thread
From: Juan Quintela @ 2023-07-06  8:10 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: qemu-devel, Michael S. Tsirkin, Peter Xu, Leonardo Bras,
	Paolo Bonzini, Philippe Mathieu-Daudé, Peng Tao,
	Mario Casquero

David Hildenbrand <david@redhat.com> wrote:
> ram_block_discard_range() cannot possibly do the right thing in
> MAP_PRIVATE file mappings in the general case.
>
> To achieve the documented semantics, we also have to punch a hole into
> the file, possibly messing with other MAP_PRIVATE/MAP_SHARED mappings
> of such a file.
>
> For example, using VM templating -- see commit b17fbbe55cba ("migration:
> allow private destination ram with x-ignore-shared") -- in combination with
> any mechanism that relies on discarding of RAM is problematic. This
> includes:
> * Postcopy live migration
> * virtio-balloon inflation/deflation or free-page-reporting
> * virtio-mem
>
> So at least warn that something possibly dangerous is going on
> when using ram_block_discard_range() in these cases.
>
> Acked-by: Peter Xu <peterx@redhat.com>
> Tested-by: Mario Casquero <mcasquer@redhat.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>

Reviewed-by: Juan Quintela <quintela@redhat.com>

(at least we give a warning)

But I wonder if we can do better and test that:
 * Postcopy live migration

   We can check if we are on postcopy, or put a marker so we know that
   postcopy can have problems when started.

 * virtio-balloon inflation/deflation or free-page-reporting

   We can check if we have ever used virtio-balloon.

 * virtio-mem

   We can check if we have used virtio-mem


I am just wondering if that is even possible?

Thanks, Juan.




* Re: [PATCH v2 2/4] virtio-mem: Skip most of virtio_mem_unplug_all() without plugged memory
  2023-07-06  7:56 ` [PATCH v2 2/4] virtio-mem: Skip most of virtio_mem_unplug_all() without plugged memory David Hildenbrand
@ 2023-07-06  8:15   ` Juan Quintela
  2023-07-06  8:38     ` David Hildenbrand
  0 siblings, 1 reply; 27+ messages in thread
From: Juan Quintela @ 2023-07-06  8:15 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: qemu-devel, Michael S. Tsirkin, Peter Xu, Leonardo Bras,
	Paolo Bonzini, Philippe Mathieu-Daudé, Peng Tao,
	Mario Casquero

David Hildenbrand <david@redhat.com> wrote:
> When starting QEMU, we already perform one system reset that ends up
> triggering virtio_mem_unplug_all() with no actual memory plugged yet.
> That, in turn, will trigger ram_block_discard_range() and perform some
> other actions that are not required in that case.
>
> Let's optimize virtio_mem_unplug_all() for the case that no memory is
> plugged. This will be beneficial for x-ignore-shared support as well.
>
> Tested-by: Mario Casquero <mcasquer@redhat.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>

It works, so ...

Reviewed-by: Juan Quintela <quintela@redhat.com>

>      RAMBlock *rb = vmem->memdev->mr.ram_block;
>  
> -    if (virtio_mem_is_busy()) {
> -        return -EBUSY;
> -    }
> -
> -    if (ram_block_discard_range(rb, 0, qemu_ram_get_used_length(rb))) {
> -        return -EBUSY;
> -    }
> -    virtio_mem_notify_unplug_all(vmem);
> -
> -    bitmap_clear(vmem->bitmap, 0, vmem->bitmap_size);
>      if (vmem->size) {
> +        if (virtio_mem_is_busy()) {
> +            return -EBUSY;

I see that the only way for virtio_mem_is_busy() to be true is if we are
in the middle of a migration.  In the case that vmem->size is 0, we don't
care.  So we are good.



> +        }
> +        if (ram_block_discard_range(rb, 0, qemu_ram_get_used_length(rb))) {
> +            return -EBUSY;
> +        }

Nothing to discard, so also good.

> +        virtio_mem_notify_unplug_all(vmem);

Nothing to notify, so also good.

> +        bitmap_clear(vmem->bitmap, 0, vmem->bitmap_size);
>          vmem->size = 0;
>          notifier_list_notify(&vmem->size_change_notifiers, &vmem->size);
>      }
> +
>      trace_virtio_mem_unplugged_all();
>      virtio_mem_resize_usable_region(vmem, vmem->requested_size, true);
>      return 0;

Now that we are here: do you remember _why_ we allow virtio-mem
plug/unplug in the middle of a migration?

We forbid plugging/unplugging everything else.  Why do we need to
plug/unplug virtio-mem during migration?

Thanks, Juan.




* Re: [PATCH v2 3/4] migration/ram: Expose ramblock_is_ignored() as migrate_ram_is_ignored()
  2023-07-06  7:56 ` [PATCH v2 3/4] migration/ram: Expose ramblock_is_ignored() as migrate_ram_is_ignored() David Hildenbrand
@ 2023-07-06  8:16   ` Juan Quintela
  0 siblings, 0 replies; 27+ messages in thread
From: Juan Quintela @ 2023-07-06  8:16 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: qemu-devel, Michael S. Tsirkin, Peter Xu, Leonardo Bras,
	Paolo Bonzini, Philippe Mathieu-Daudé, Peng Tao,
	Mario Casquero

David Hildenbrand <david@redhat.com> wrote:
> virtio-mem wants to know whether it should not mess with the RAMBlock
> content (e.g., discard RAM, preallocate memory) on incoming migration.
>
> So let's expose that function as migrate_ram_is_ignored() in
> migration/misc.h

Reviewed-by: Juan Quintela <quintela@redhat.com>




* Re: [PATCH v2 1/4] softmmu/physmem: Warn with ram_block_discard_range() on MAP_PRIVATE file mapping
  2023-07-06  8:10   ` Juan Quintela
@ 2023-07-06  8:31     ` David Hildenbrand
  2023-07-06 13:20       ` Juan Quintela
  0 siblings, 1 reply; 27+ messages in thread
From: David Hildenbrand @ 2023-07-06  8:31 UTC (permalink / raw)
  To: quintela
  Cc: qemu-devel, Michael S. Tsirkin, Peter Xu, Leonardo Bras,
	Paolo Bonzini, Philippe Mathieu-Daudé, Peng Tao,
	Mario Casquero

On 06.07.23 10:10, Juan Quintela wrote:
> David Hildenbrand <david@redhat.com> wrote:
>> ram_block_discard_range() cannot possibly do the right thing in
>> MAP_PRIVATE file mappings in the general case.
>>
>> To achieve the documented semantics, we also have to punch a hole into
>> the file, possibly messing with other MAP_PRIVATE/MAP_SHARED mappings
>> of such a file.
>>
>> For example, using VM templating -- see commit b17fbbe55cba ("migration:
>> allow private destination ram with x-ignore-shared") -- in combination with
>> any mechanism that relies on discarding of RAM is problematic. This
>> includes:
>> * Postcopy live migration
>> * virtio-balloon inflation/deflation or free-page-reporting
>> * virtio-mem
>>
>> So at least warn that something possibly dangerous is going on
>> when using ram_block_discard_range() in these cases.
>>
>> Acked-by: Peter Xu <peterx@redhat.com>
>> Tested-by: Mario Casquero <mcasquer@redhat.com>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
> 
> Reviewed-by: Juan Quintela <quintela@redhat.com>
> 
> (at least we give a warning)
> 
> But I wonder if we can do better and test that:
>   * Postcopy live migration
> 
>     We can check if we are on postcopy, or put a marker so we know that
>     postcopy can have problems when started.
> 
>   * virtio-balloon inflation/deflation or free-page-reporting
> 
>     We can check if we have ever used virtio-balloon.
> 
>   * virtio-mem
> 
>     We can check if we have used virtio-men
> 
> 
> I am just wondering if that is even possible?

Now we warn when any of these features actually tries discarding RAM 
(calling ram_block_discard_range()).

As these features trigger discarding of RAM, once we reach this point we 
know that they are actually being used (in contrast to, e.g., default 
libvirt attaching a virtio-balloon device that nobody ever uses).


The alternative would be checking the RAM for compatibility when 
configuring each feature. I decided to warn at a central place for now, 
which covers all users.

Thanks!

-- 
Cheers,

David / dhildenb




* Re: [PATCH v2 2/4] virtio-mem: Skip most of virtio_mem_unplug_all() without plugged memory
  2023-07-06  8:15   ` Juan Quintela
@ 2023-07-06  8:38     ` David Hildenbrand
  2023-07-06 13:27       ` Juan Quintela
  0 siblings, 1 reply; 27+ messages in thread
From: David Hildenbrand @ 2023-07-06  8:38 UTC (permalink / raw)
  To: quintela
  Cc: qemu-devel, Michael S. Tsirkin, Peter Xu, Leonardo Bras,
	Paolo Bonzini, Philippe Mathieu-Daudé, Peng Tao,
	Mario Casquero

On 06.07.23 10:15, Juan Quintela wrote:
> David Hildenbrand <david@redhat.com> wrote:
>> When starting QEMU, we already perform one system reset that ends up
>> triggering virtio_mem_unplug_all() with no actual memory plugged yet.
>> That, in turn, will trigger ram_block_discard_range() and perform some
>> other actions that are not required in that case.
>>
>> Let's optimize virtio_mem_unplug_all() for the case that no memory is
>> plugged. This will be beneficial for x-ignore-shared support as well.
>>
>> Tested-by: Mario Casquero <mcasquer@redhat.com>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
> 
> It works, so ...
> 
> Reviewed-by: Juan Quintela <quintela@redhat.com>

Thanks!

[...]

>> +        bitmap_clear(vmem->bitmap, 0, vmem->bitmap_size);
>>           vmem->size = 0;
>>           notifier_list_notify(&vmem->size_change_notifiers, &vmem->size);
>>       }
>> +
>>       trace_virtio_mem_unplugged_all();
>>       virtio_mem_resize_usable_region(vmem, vmem->requested_size, true);
>>       return 0;
> 
> Once that we are here.  Do you remember _why_ do we allow virtio-mem
> plug/unplug in the middle of a migration.
> 
> We forbid to plug/unplug everything else.  Why do we need to plug/unplug
> virtio-mem during migration?

With virtio-mem you tell the VM the desired size for the device 
(requested-size), and the VM will select blocks to (un)plug and send 
(un)plug requests to the hypervisor in order to reach the requested size.

So changing the requested size in the hypervisor (by the QEMU user) and 
the VM processing that resize request is asynchronous -- similar to 
memory ballooning.

As the VM can send these (un)plug requests at any time, and we don't 
want to allow (un)plug during migration, we have virtio_mem_is_busy() 
to reject any such requests and tell the VM "please try again later".

-- 
Cheers,

David / dhildenb




* Re: [PATCH v2 1/4] softmmu/physmem: Warn with ram_block_discard_range() on MAP_PRIVATE file mapping
  2023-07-06  7:56 ` [PATCH v2 1/4] softmmu/physmem: Warn with ram_block_discard_range() on MAP_PRIVATE file mapping David Hildenbrand
  2023-07-06  8:10   ` Juan Quintela
@ 2023-07-06  8:49   ` David Hildenbrand
  2023-10-18  3:02   ` Xiaoyao Li
  2 siblings, 0 replies; 27+ messages in thread
From: David Hildenbrand @ 2023-07-06  8:49 UTC (permalink / raw)
  To: qemu-devel
  Cc: Michael S. Tsirkin, Juan Quintela, Peter Xu, Leonardo Bras,
	Paolo Bonzini, Philippe Mathieu-Daudé, Peng Tao,
	Mario Casquero

>   #ifdef CONFIG_FALLOCATE_PUNCH_HOLE
> +            /*
> +             * We'll discard data from the actual file, even though we only
> +             * have a MAP_PRIVATE mapping, possibly messing with other
> +             * MAP_PRIVATE/MAP_SHARED mappings. There is no easy way to
> +             * change that behavior whithout violating the promised
> +             * semantics of ram_block_discard_range().
> +             *
> +             * Only warn, because it work as long as nobody else uses that
>

I'll fixup

s/work/works/

-- 
Cheers,

David / dhildenb




* Re: [PATCH v2 4/4] virtio-mem: Support "x-ignore-shared" migration
  2023-07-06  7:56 ` [PATCH v2 4/4] virtio-mem: Support "x-ignore-shared" migration David Hildenbrand
@ 2023-07-06 11:06   ` Juan Quintela
  2023-07-06 11:27     ` David Hildenbrand
  2023-07-06 11:59   ` Juan Quintela
  1 sibling, 1 reply; 27+ messages in thread
From: Juan Quintela @ 2023-07-06 11:06 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: qemu-devel, Michael S. Tsirkin, Peter Xu, Leonardo Bras,
	Paolo Bonzini, Philippe Mathieu-Daudé, Peng Tao,
	Mario Casquero

David Hildenbrand <david@redhat.com> wrote:
> To achieve the desired "x-ignore-shared" functionality, we should not
> discard all RAM when realizing the device and not mess with
> preallocation/postcopy when loading device state. In essence, we should
> not touch RAM content.
>
> As "x-ignore-shared" gets set after realizing the device, we cannot
> rely on that. Let's simply skip discarding of RAM on incoming migration.
> Note that virtio_mem_post_load() will call
> virtio_mem_restore_unplugged() -- unless "x-ignore-shared" is set. So
> once migration finished we'll have a consistent state.
>
> The initial system reset will also not discard any RAM, because
> virtio_mem_unplug_all() skips the discard step when no memory is
> plugged (which is the case before loading the device state).
>
> Note that something like VM templating -- see commit b17fbbe55cba
> ("migration: allow private destination ram with x-ignore-shared")

And here I am, I reviewed the patch, and 4 years later I don't remember
anything about it O:-)

> -- is
> currently incompatible with virtio-mem and ram_block_discard_range() will
> warn in case a private file mapping is supplied by virtio-mem.

If it is incompatible, only a warning is not enough.

>
> For VM templating with virtio-mem, it makes more sense to either
> (a) Create the template without the virtio-mem device and hotplug a
>     virtio-mem device to the new VM instances using a proper memory
>     backend of its own.
> (b) Use a virtio-mem device that doesn't provide any memory in the
>     template (requested-size=0) and use private anonymous memory.
>
> Tested-by: Mario Casquero <mcasquer@redhat.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  hw/virtio/virtio-mem.c | 47 ++++++++++++++++++++++++++++++++++--------
>  1 file changed, 38 insertions(+), 9 deletions(-)
>
> diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c
> index a922c21380..3f41e00e74 100644
> --- a/hw/virtio/virtio-mem.c
> +++ b/hw/virtio/virtio-mem.c
> @@ -18,6 +18,7 @@
>  #include "sysemu/numa.h"
>  #include "sysemu/sysemu.h"
>  #include "sysemu/reset.h"
> +#include "sysemu/runstate.h"
>  #include "hw/virtio/virtio.h"
>  #include "hw/virtio/virtio-bus.h"
>  #include "hw/virtio/virtio-mem.h"
> @@ -901,11 +902,23 @@ static void virtio_mem_device_realize(DeviceState *dev, Error **errp)
>          return;
>      }
>  
> -    ret = ram_block_discard_range(rb, 0, qemu_ram_get_used_length(rb));
> -    if (ret) {
> -        error_setg_errno(errp, -ret, "Unexpected error discarding RAM");
> -        ram_block_coordinated_discard_require(false);
> -        return;
> +    /*
> +     * We don't know at this point whether shared RAM is migrated using
> +     * QEMU or migrated using the file content. "x-ignore-shared" will be
> +     * configured after realizing the device. So in case we have an
> +     * incoming migration, simply always skip the discard step.
> +     *
> +     * Otherwise, make sure that we start with a clean slate: either the
> +     * memory backend might get reused or the shared file might still have
> +     * memory allocated.
> +     */
> +    if (!runstate_check(RUN_STATE_INMIGRATE)) {
> +        ret = ram_block_discard_range(rb, 0, qemu_ram_get_used_length(rb));
> +        if (ret) {
> +            error_setg_errno(errp, -ret, "Unexpected error discarding RAM");
> +            ram_block_coordinated_discard_require(false);
> +            return;
> +        }
>      }

Makes sense.

>  
>      virtio_mem_resize_usable_region(vmem, vmem->requested_size, true);
> @@ -977,10 +990,6 @@ static int virtio_mem_post_load(void *opaque, int version_id)
>      RamDiscardListener *rdl;
>      int ret;
>  
> -    if (vmem->prealloc && !vmem->early_migration) {
> -        warn_report("Proper preallocation with migration requires a newer QEMU machine");
> -    }
> -
>      /*
>       * We started out with all memory discarded and our memory region is mapped
>       * into an address space. Replay, now that we updated the bitmap.
> @@ -993,6 +1002,18 @@ static int virtio_mem_post_load(void *opaque, int version_id)
>          }
>      }
>  
> +    /*
> +     * If shared RAM is migrated using the file content and not using QEMU,
> +     * don't mess with preallocation and postcopy.
> +     */
> +    if (migrate_ram_is_ignored(vmem->memdev->mr.ram_block)) {
> +        return 0;
> +    }
> +
> +    if (vmem->prealloc && !vmem->early_migration) {
> +        warn_report("Proper preallocation with migration requires a newer QEMU machine");
> +    }
> +

Could you explain why you are putting the check after calling
virtio_mem_notify_populate_cb()?

What is it expected to do for file-memory-backed RAM?  I got lost when I
saw that it just calls:

static int virtio_mem_notify_populate_cb(MemoryRegionSection *s, void *arg)
{
    RamDiscardListener *rdl = arg;

    return rdl->notify_populate(rdl, s);
}


I end in vfio, and got completely confused about what is going on there.


>      if (migration_in_incoming_postcopy()) {
>          return 0;
>      }
> @@ -1025,6 +1046,14 @@ static int virtio_mem_post_load_early(void *opaque, int version_id)
>          return 0;
>      }
>  
> +    /*
> +     * If shared RAM is migrated using the file content and not using QEMU,
> +     * don't mess with preallocation and postcopy.
> +     */
> +    if (migrate_ram_is_ignored(rb)) {
> +        return 0;
> +    }
> +
>      /*
>       * We restored the bitmap and verified that the basic properties
>       * match on source and destination, so we can go ahead and preallocate

OK.

Thanks, Juan.




* Re: [PATCH v2 4/4] virtio-mem: Support "x-ignore-shared" migration
  2023-07-06 11:06   ` Juan Quintela
@ 2023-07-06 11:27     ` David Hildenbrand
  0 siblings, 0 replies; 27+ messages in thread
From: David Hildenbrand @ 2023-07-06 11:27 UTC (permalink / raw)
  To: quintela
  Cc: qemu-devel, Michael S. Tsirkin, Peter Xu, Leonardo Bras,
	Paolo Bonzini, Philippe Mathieu-Daudé, Peng Tao,
	Mario Casquero

On 06.07.23 13:06, Juan Quintela wrote:
> David Hildenbrand <david@redhat.com> wrote:
>> To achieve the desired "x-ignore-shared" functionality, we should not
>> discard all RAM when realizing the device and not mess with
>> preallocation/postcopy when loading device state. In essence, we should
>> not touch RAM content.
>>
>> As "x-ignore-shared" gets set after realizing the device, we cannot
>> rely on that. Let's simply skip discarding of RAM on incoming migration.
>> Note that virtio_mem_post_load() will call
>> virtio_mem_restore_unplugged() -- unless "x-ignore-shared" is set. So
>> once migration finished we'll have a consistent state.
>>
>> The initial system reset will also not discard any RAM, because
>> virtio_mem_unplug_all() skips the discard step when no memory is
>> plugged (which is the case before loading the device state).
>>
>> Note that something like VM templating -- see commit b17fbbe55cba
>> ("migration: allow private destination ram with x-ignore-shared")
> 
> And here I am, I reviewed the patch, and 4 years later I don't remember
> anything about it O:-)

:)

[...]

>> +    /*
>> +     * If shared RAM is migrated using the file content and not using QEMU,
>> +     * don't mess with preallocation and postcopy.
>> +     */
>> +    if (migrate_ram_is_ignored(vmem->memdev->mr.ram_block)) {
>> +        return 0;
>> +    }
>> +
>> +    if (vmem->prealloc && !vmem->early_migration) {
>> +        warn_report("Proper preallocation with migration requires a newer QEMU machine");
>> +    }
>> +
> 
> Could you explain why you are putting the check after calling
> virtio_mem_notify_populate_cb()?
> 
> What is it expected to do for file-memory-backed RAM?  I got lost when I
> saw that it just calls:
> 
> static int virtio_mem_notify_populate_cb(MemoryRegionSection *s, void *arg)
> {
>      RamDiscardListener *rdl = arg;
> 
>      return rdl->notify_populate(rdl, s);
> }
> 
> 
> I end in vfio, and got completely confused about what is going on there.


:)


Once we reached virtio_mem_post_load(), we restored the bitmap that 
contains the state of all device blocks (plugged vs. unplugged).

Whenever we modify the bitmap (plug / unplug), we have to notify 
(RamDiscardManager) listeners, such that they are aware of the state 
change and can perform according action.

For example, vfio will go ahead and register the newly plugged blocks 
with the kernel (DMA map it into the vfio), where the kernel will end up 
long-term pinning these pages. Effectively, we only end up DMA-mapping 
plugged memory blocks, so only these get pinned by the kernel (and we 
can actually release the memory of unplugged blocks).


So here (virtio_mem_post_load()), we just restored the bitmap from the 
migration stream and effectively went from 0 plugged blocks (bitmap 
empty) before migration to "maybe some plugged blocks in the bitmap".

So we go over the bitmap and tell the world (vfio) to go ahead and 
DMA-map these blocks that are suddenly plugged.


And that part is independent of the actual RAM migration / 
x-ignore-shared, so we have to do it unconditionally.
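The replay step described above can be sketched in miniature. The names below are illustrative only (a byte-per-block bitmap and a plain callback); the real code walks the restored bitmap through the RamDiscardManager listener API, and vfio's populate notification is what ends up DMA-mapping the range.

```c
#include <assert.h>
#include <stddef.h>

#define NBLOCKS 8   /* toy device: 8 virtio-mem blocks */

typedef void (*notify_populate_fn)(size_t first, size_t nblocks);

/* Stand-in for the work a vfio listener does: DMA-map the range. */
static size_t mapped_blocks;

static void vfio_like_notify_populate(size_t first, size_t nblocks)
{
    (void)first;
    mapped_blocks += nblocks;   /* pretend we DMA-mapped the range */
}

/* After post_load restored the bitmap, notify listeners once per
 * contiguous run of plugged blocks. */
static void replay_plugged(const unsigned char *bitmap,
                           notify_populate_fn notify)
{
    size_t i = 0;
    while (i < NBLOCKS) {
        if (!bitmap[i]) {
            i++;
            continue;
        }
        size_t first = i;
        while (i < NBLOCKS && bitmap[i]) {
            i++;
        }
        notify(first, i - first);   /* one callback per plugged range */
    }
}
```

So going from "bitmap empty" before migration to "some blocks plugged" after restoring the bitmap triggers exactly one populate notification per plugged range, regardless of how the RAM content itself was migrated.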



Thanks for the thorough review!

-- 
Cheers,

David / dhildenb




* Re: [PATCH v2 4/4] virtio-mem: Support "x-ignore-shared" migration
  2023-07-06  7:56 ` [PATCH v2 4/4] virtio-mem: Support "x-ignore-shared" migration David Hildenbrand
  2023-07-06 11:06   ` Juan Quintela
@ 2023-07-06 11:59   ` Juan Quintela
  1 sibling, 0 replies; 27+ messages in thread
From: Juan Quintela @ 2023-07-06 11:59 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: qemu-devel, Michael S. Tsirkin, Peter Xu, Leonardo Bras,
	Paolo Bonzini, Philippe Mathieu-Daudé, Peng Tao,
	Mario Casquero

David Hildenbrand <david@redhat.com> wrote:
> To achieve the desired "x-ignore-shared" functionality, we should not
> discard all RAM when realizing the device and not mess with
> preallocation/postcopy when loading device state. In essence, we should
> not touch RAM content.
>
> As "x-ignore-shared" gets set after realizing the device, we cannot
> rely on that. Let's simply skip discarding of RAM on incoming migration.
> Note that virtio_mem_post_load() will call
> virtio_mem_restore_unplugged() -- unless "x-ignore-shared" is set. So
> once migration finished we'll have a consistent state.
>
> The initial system reset will also not discard any RAM, because
> virtio_mem_unplug_all() skips the discard step when no memory is
> plugged (which is the case before loading the device state).
>
> Note that something like VM templating -- see commit b17fbbe55cba
> ("migration: allow private destination ram with x-ignore-shared") -- is
> currently incompatible with virtio-mem and ram_block_discard_range() will
> warn in case a private file mapping is supplied by virtio-mem.
>
> For VM templating with virtio-mem, it makes more sense to either
> (a) Create the template without the virtio-mem device and hotplug a
>     virtio-mem device to the new VM instances using a proper memory
>     backend of its own.
> (b) Use a virtio-mem device that doesn't provide any memory in the
>     template (requested-size=0) and use private anonymous memory.
>
> Tested-by: Mario Casquero <mcasquer@redhat.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>

After a very nice explanation:

Reviewed-by: Juan Quintela <quintela@redhat.com>




* Re: [PATCH v2 1/4] softmmu/physmem: Warn with ram_block_discard_range() on MAP_PRIVATE file mapping
  2023-07-06  8:31     ` David Hildenbrand
@ 2023-07-06 13:20       ` Juan Quintela
  2023-07-06 13:23         ` David Hildenbrand
  0 siblings, 1 reply; 27+ messages in thread
From: Juan Quintela @ 2023-07-06 13:20 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: qemu-devel, Michael S. Tsirkin, Peter Xu, Leonardo Bras,
	Paolo Bonzini, Philippe Mathieu-Daudé, Peng Tao,
	Mario Casquero

David Hildenbrand <david@redhat.com> wrote:
> On 06.07.23 10:10, Juan Quintela wrote:
>> David Hildenbrand <david@redhat.com> wrote:
>>> ram_block_discard_range() cannot possibly do the right thing in
>>> MAP_PRIVATE file mappings in the general case.
>>>
>>> To achieve the documented semantics, we also have to punch a hole into
>>> the file, possibly messing with other MAP_PRIVATE/MAP_SHARED mappings
>>> of such a file.
>>>
>>> For example, using VM templating -- see commit b17fbbe55cba ("migration:
>>> allow private destination ram with x-ignore-shared") -- in combination with
>>> any mechanism that relies on discarding of RAM is problematic. This
>>> includes:
>>> * Postcopy live migration
>>> * virtio-balloon inflation/deflation or free-page-reporting
>>> * virtio-mem
>>>
>>> So at least warn that something possibly dangerous is going on
>>> when using ram_block_discard_range() in these cases.
>>>
>>> Acked-by: Peter Xu <peterx@redhat.com>
>>> Tested-by: Mario Casquero <mcasquer@redhat.com>
>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> Reviewed-by: Juan Quintela <quintela@redhat.com>
>> (at least we give a warning)
>> But I wonder if we can do better and test that:
>>   * Postcopy live migration
>>     We can check if we are on postcopy, or put a marker so we know
>> that
>>     postcopy can have problems when started.
>>   * virtio-balloon inflation/deflation or free-page-reporting
>>     We can check if we have ever used virtio-balloon.
>>   * virtio-mem
>>     We can check if we have used virtio-mem
>> I am just wondering if that is even possible?
>
> Now we warn when any of these features actually tries discarding RAM
> (calling ram_block_discard_range()).
>
> As these features trigger discarding of RAM, once we reach this point
> we know that they are getting used. (in comparison to default libvirt
> attaching a virtio-balloon device without anybody ever using it)
>
>
> The alternative would be checking the RAM for compatibility when
> configuring each features. I decided to warn at a central place for
> now, which covers any users.

I think this is the right thing to do.

Patient: It hurts when I do this.
Doctor: Don't do that.

O:-)

Now more seriously, at this point we are very late to do anything
sensible.  I think the normal thing, when we are configuring
incompatible things, is just to flag it.

We have been following that approach with migration for some time now, and
everybody is happier.

Later, Juan.




* Re: [PATCH v2 1/4] softmmu/physmem: Warn with ram_block_discard_range() on MAP_PRIVATE file mapping
  2023-07-06 13:20       ` Juan Quintela
@ 2023-07-06 13:23         ` David Hildenbrand
  0 siblings, 0 replies; 27+ messages in thread
From: David Hildenbrand @ 2023-07-06 13:23 UTC (permalink / raw)
  To: quintela
  Cc: qemu-devel, Michael S. Tsirkin, Peter Xu, Leonardo Bras,
	Paolo Bonzini, Philippe Mathieu-Daudé, Peng Tao,
	Mario Casquero

On 06.07.23 15:20, Juan Quintela wrote:
> David Hildenbrand <david@redhat.com> wrote:
>> On 06.07.23 10:10, Juan Quintela wrote:
>>> David Hildenbrand <david@redhat.com> wrote:
>>>> ram_block_discard_range() cannot possibly do the right thing in
>>>> MAP_PRIVATE file mappings in the general case.
>>>>
>>>> To achieve the documented semantics, we also have to punch a hole into
>>>> the file, possibly messing with other MAP_PRIVATE/MAP_SHARED mappings
>>>> of such a file.
>>>>
>>>> For example, using VM templating -- see commit b17fbbe55cba ("migration:
>>>> allow private destination ram with x-ignore-shared") -- in combination with
>>>> any mechanism that relies on discarding of RAM is problematic. This
>>>> includes:
>>>> * Postcopy live migration
>>>> * virtio-balloon inflation/deflation or free-page-reporting
>>>> * virtio-mem
>>>>
>>>> So at least warn that something possibly dangerous is going on
>>>> when using ram_block_discard_range() in these cases.
>>>>
>>>> Acked-by: Peter Xu <peterx@redhat.com>
>>>> Tested-by: Mario Casquero <mcasquer@redhat.com>
>>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>> Reviewed-by: Juan Quintela <quintela@redhat.com>
>>> (at least we give a warning)
>>> But I wonder if we can do better and test that:
>>>    * Postcopy live migration
>>>      We can check if we are on postcopy, or put a marker so we know
>>> that
>>>      postcopy can have problems when started.
>>>    * virtio-balloon inflation/deflation or free-page-reporting
>>>      We can check if we have ever used virtio-balloon.
>>>    * virtio-mem
>>>      We can check if we have used virtio-mem
>>> I am just wondering if that is even possible?
>>
>> Now we warn when any of these features actually tries discarding RAM
>> (calling ram_block_discard_range()).
>>
>> As these features trigger discarding of RAM, once we reach this point
>> we know that they are getting used. (in comparison to default libvirt
>> attaching a virtio-balloon device without anybody ever using it)
>>
>>
>> The alternative would be checking the RAM for compatibility when
>> configuring each features. I decided to warn at a central place for
>> now, which covers any users.
> 
> I think this is the right thing to do.
> 
> Patient: It hurts when I do this.
> Doctor: Don't do that.
> 
> O:-)

:)

> 
> Now more seriously, at this point we are very late to do anything
> sensible.  I think that the normal thing when we are configuring
> incompatible things just flag it.
> 
> We are following that approach with migration for some time now, and
> everybody is happier.

For the time being I'll move forward with this patch.

I agree that warning early is nicer (but warning early for 
virtio-balloon, for example, doesn't make too much sense: libvirt adds it 
blindly to each VM just to query guest statistics and never inflates the 
balloon).

In any case we'll want to warn here as well, because we know that new 
callers will easily overlook that limitation / those checks.
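The hazard being warned about can be demonstrated in a few lines. This is a self-contained sketch of the effect (the temp-file name and helper name are made up for the demo): punching a hole acts on the file itself, so discarding through one mapping also empties the data under an independent MAP_PRIVATE mapping of the same file.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Returns the first byte seen through a MAP_PRIVATE mapping after a
 * hole was punched through the underlying file. */
static int private_read_after_punch(void)
{
    char path[] = "/tmp/discard-demo-XXXXXX";
    int fd = mkstemp(path);
    assert(fd >= 0);
    unlink(path);                        /* file lives until fd closes */
    assert(ftruncate(fd, 4096) == 0);
    assert(pwrite(fd, "data", 4, 0) == 4);

    /* An independent private mapping, like a VM-templating consumer. */
    char *priv = mmap(NULL, 4096, PROT_READ, MAP_PRIVATE, fd, 0);
    assert(priv != MAP_FAILED);

    /* The discard that ram_block_discard_range() performs for
     * file-backed RAM: punch a hole in the underlying file. */
    assert(fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                     0, 4096) == 0);

    int byte = priv[0];                  /* the data is gone: reads 0 */
    munmap(priv, 4096);
    close(fd);
    return byte;
}
```

Which is exactly why a discard in a MAP_PRIVATE file mapping deserves a warning: the "private" mapping does not shield other users of the file from the hole punch.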

-- 
Cheers,

David / dhildenb




* Re: [PATCH v2 2/4] virtio-mem: Skip most of virtio_mem_unplug_all() without plugged memory
  2023-07-06  8:38     ` David Hildenbrand
@ 2023-07-06 13:27       ` Juan Quintela
  0 siblings, 0 replies; 27+ messages in thread
From: Juan Quintela @ 2023-07-06 13:27 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: qemu-devel, Michael S. Tsirkin, Peter Xu, Leonardo Bras,
	Paolo Bonzini, Philippe Mathieu-Daudé, Peng Tao,
	Mario Casquero

David Hildenbrand <david@redhat.com> wrote:
> On 06.07.23 10:15, Juan Quintela wrote:
>> David Hildenbrand <david@redhat.com> wrote:
>>> Already when starting QEMU we perform one system reset that ends up
>>> triggering virtio_mem_unplug_all() with no actual memory plugged yet.
>>> That, in turn will trigger ram_block_discard_range() and perform some
>>> other actions that are not required in that case.
>>>
>>> Let's optimize virtio_mem_unplug_all() for the case that no memory is
>>> plugged. This will be beneficial for x-ignore-shared support as well.
>>>
>>> Tested-by: Mario Casquero <mcasquer@redhat.com>
>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> It works, so ...
>> Reviewed-by: Juan Quintela <quintela@redhat.com>
>
> Thanks!
>
> [...]
>
>>> +        bitmap_clear(vmem->bitmap, 0, vmem->bitmap_size);
>>>           vmem->size = 0;
>>>           notifier_list_notify(&vmem->size_change_notifiers, &vmem->size);
>>>       }
>>> +
>>>       trace_virtio_mem_unplugged_all();
>>>       virtio_mem_resize_usable_region(vmem, vmem->requested_size, true);
>>>       return 0;
>> Once that we are here.  Do you remember _why_ do we allow virtio-mem
>> plug/unplug in the middle of a migration.
>> We forbid to plug/unplug everything else.  Why do we need to
>> plug/unplug
>> virtio-mem during migration?
>
> With virtio-mem you tell the VM the desired size for the device
> (requested-size), and the VM will select blocks to (un)plug and send
> (un)plug requests to the hypervisor in order to reach the requested
> size.
>
> So changing the requested size in the hypervisor (by the QEMU user)
> and the VM processing that resize request is asynchronous -- similar
> to memory ballooning.
>
> As the VM can send these (un)plug requests at any time, and we
> specifically don't want to allow (un)plug during migration, we have
> virtio_mem_is_busy() to reject any such requests and tell the VM
> "please try again later".

Ahh.

I see it now.

Thanks, Juan.




* Re: [PATCH v2 0/4] virtio-mem: Support "x-ignore-shared" migration
  2023-07-06  7:56 [PATCH v2 0/4] virtio-mem: Support "x-ignore-shared" migration David Hildenbrand
                   ` (3 preceding siblings ...)
  2023-07-06  7:56 ` [PATCH v2 4/4] virtio-mem: Support "x-ignore-shared" migration David Hildenbrand
@ 2023-07-06 14:03 ` Michael S. Tsirkin
  2023-07-07 12:21   ` David Hildenbrand
  4 siblings, 1 reply; 27+ messages in thread
From: Michael S. Tsirkin @ 2023-07-06 14:03 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: qemu-devel, Juan Quintela, Peter Xu, Leonardo Bras, Paolo Bonzini,
	Philippe Mathieu-Daudé, Peng Tao, Mario Casquero

On Thu, Jul 06, 2023 at 09:56:05AM +0200, David Hildenbrand wrote:
> If there is no further feedback, I'll queue this myself shortly.
> 
> Stumbling over "x-ignore-shared" migration support for virtio-mem on
> my todo list, I remember talking to Dave G. a while ago about how
> ram_block_discard_range() in MAP_PRIVATE file mappings is possibly
> harmful when the file is used somewhere else -- for example, with VM
> templating in multiple VMs.
> 
> This series adds a warning to ram_block_discard_range() in that problematic
> case and adds "x-ignore-shared" migration support for virtio-mem, which
> is pretty straight-forward. The last patch also documents how VM templating
> interacts with virtio-mem.
> 
> v1 -> v2:
> - Pick up tags
> - "virtio-mem: Support "x-ignore-shared" migration"
>  -> Fix spelling mistake

Reviewed-by: Michael S. Tsirkin <mst@redhat.com>


> Cc: "Michael S. Tsirkin" <mst@redhat.com>
> Cc: Juan Quintela <quintela@redhat.com>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Leonardo Bras <leobras@redhat.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: "Philippe Mathieu-Daudé" <philmd@linaro.org>
> Cc: Peng Tao <tao.peng@linux.alibaba.com>
> Cc: Mario Casquero <mcasquer@redhat.com>
> 
> David Hildenbrand (4):
>   softmmu/physmem: Warn with ram_block_discard_range() on MAP_PRIVATE
>     file mapping
>   virtio-mem: Skip most of virtio_mem_unplug_all() without plugged
>     memory
>   migration/ram: Expose ramblock_is_ignored() as
>     migrate_ram_is_ignored()
>   virtio-mem: Support "x-ignore-shared" migration
> 
>  hw/virtio/virtio-mem.c   | 67 ++++++++++++++++++++++++++++------------
>  include/migration/misc.h |  1 +
>  migration/postcopy-ram.c |  2 +-
>  migration/ram.c          | 14 ++++-----
>  migration/ram.h          |  3 +-
>  softmmu/physmem.c        | 18 +++++++++++
>  6 files changed, 76 insertions(+), 29 deletions(-)
> 
> -- 
> 2.41.0




* Re: [PATCH v2 0/4] virtio-mem: Support "x-ignore-shared" migration
  2023-07-06 14:03 ` [PATCH v2 0/4] " Michael S. Tsirkin
@ 2023-07-07 12:21   ` David Hildenbrand
  0 siblings, 0 replies; 27+ messages in thread
From: David Hildenbrand @ 2023-07-07 12:21 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: qemu-devel, Juan Quintela, Peter Xu, Leonardo Bras, Paolo Bonzini,
	Philippe Mathieu-Daudé, Peng Tao, Mario Casquero

On 06.07.23 16:03, Michael S. Tsirkin wrote:
> On Thu, Jul 06, 2023 at 09:56:05AM +0200, David Hildenbrand wrote:
>> If there is no further feedback, I'll queue this myself shortly.
>>
>> Stumbling over "x-ignore-shared" migration support for virtio-mem on
>> my todo list, I remember talking to Dave G. a while ago about how
>> ram_block_discard_range() in MAP_PRIVATE file mappings is possibly
>> harmful when the file is used somewhere else -- for example, with VM
>> templating in multiple VMs.
>>
>> This series adds a warning to ram_block_discard_range() in that problematic
>> case and adds "x-ignore-shared" migration support for virtio-mem, which
>> is pretty straight-forward. The last patch also documents how VM templating
>> interacts with virtio-mem.
>>
>> v1 -> v2:
>> - Pick up tags
>> - "virtio-mem: Support "x-ignore-shared" migration"
>>   -> Fix spelling mistake
> 
> Reviewed-by: Michael S. Tsirkin <mst@redhat.com>

Thanks, I queued this to

https://github.com/davidhildenbrand/qemu.git mem-next

-- 
Cheers,

David / dhildenb




* Re: [PATCH v2 1/4] softmmu/physmem: Warn with ram_block_discard_range() on MAP_PRIVATE file mapping
  2023-07-06  7:56 ` [PATCH v2 1/4] softmmu/physmem: Warn with ram_block_discard_range() on MAP_PRIVATE file mapping David Hildenbrand
  2023-07-06  8:10   ` Juan Quintela
  2023-07-06  8:49   ` David Hildenbrand
@ 2023-10-18  3:02   ` Xiaoyao Li
  2023-10-18  7:42     ` David Hildenbrand
  2 siblings, 1 reply; 27+ messages in thread
From: Xiaoyao Li @ 2023-10-18  3:02 UTC (permalink / raw)
  To: David Hildenbrand, qemu-devel
  Cc: Michael S. Tsirkin, Juan Quintela, Peter Xu, Leonardo Bras,
	Paolo Bonzini, Philippe Mathieu-Daudé, Peng Tao,
	Mario Casquero

David,

On 7/6/2023 3:56 PM, David Hildenbrand wrote:
> ram_block_discard_range() cannot possibly do the right thing in
> MAP_PRIVATE file mappings in the general case.
> 
> To achieve the documented semantics, we also have to punch a hole into
> the file, possibly messing with other MAP_PRIVATE/MAP_SHARED mappings
> of such a file.
> 
> For example, using VM templating -- see commit b17fbbe55cba ("migration:
> allow private destination ram with x-ignore-shared") -- in combination with
> any mechanism that relies on discarding of RAM is problematic. This
> includes:
> * Postcopy live migration
> * virtio-balloon inflation/deflation or free-page-reporting
> * virtio-mem
> 
> So at least warn that something possibly dangerous is going on
> when using ram_block_discard_range() in these cases.
> 
> Acked-by: Peter Xu <peterx@redhat.com>
> Tested-by: Mario Casquero <mcasquer@redhat.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>   softmmu/physmem.c | 18 ++++++++++++++++++
>   1 file changed, 18 insertions(+)
> 
> diff --git a/softmmu/physmem.c b/softmmu/physmem.c
> index bda475a719..4ee157bda4 100644
> --- a/softmmu/physmem.c
> +++ b/softmmu/physmem.c
> @@ -3456,6 +3456,24 @@ int ram_block_discard_range(RAMBlock *rb, uint64_t start, size_t length)
>                * so a userfault will trigger.
>                */
>   #ifdef CONFIG_FALLOCATE_PUNCH_HOLE
> +            /*
> +             * We'll discard data from the actual file, even though we only
> +             * have a MAP_PRIVATE mapping, possibly messing with other
> +             * MAP_PRIVATE/MAP_SHARED mappings. There is no easy way to
>> +             * change that behavior without violating the promised
> +             * semantics of ram_block_discard_range().
> +             *
> +             * Only warn, because it work as long as nobody else uses that
> +             * file.
> +             */
> +            if (!qemu_ram_is_shared(rb)) {
> +                warn_report_once("ram_block_discard_range: Discarding RAM"
> +                                 " in private file mappings is possibly"
> +                                 " dangerous, because it will modify the"
> +                                 " underlying file and will affect other"
> +                                 " users of the file");
> +            }
> +

TDX has two types of memory backend for each RAM block: shared memory and 
private memory. Private memory is serviced by guest memfd, and shared 
memory can also be backed by an fd.

At any time, only one type needs to be valid, which means the other type 
can be discarded. We do implement that memory discard when TDX converts 
the memory[1]. It triggers this warning 100% of the time, because by 
default the guest memfd is not mapped as shared (MAP_SHARED).

Simply removing the warning would defeat the purpose of this patch. The 
other option is to skip the warning for the TDX case, which looks very 
hacky. Do you have any idea?

[1] 
https://lore.kernel.org/all/20230914035117.3285885-18-xiaoyao.li@intel.com/

>               ret = fallocate(rb->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
>                               start, length);
>               if (ret) {




* Re: [PATCH v2 1/4] softmmu/physmem: Warn with ram_block_discard_range() on MAP_PRIVATE file mapping
  2023-10-18  3:02   ` Xiaoyao Li
@ 2023-10-18  7:42     ` David Hildenbrand
  2023-10-18  9:02       ` Xiaoyao Li
  0 siblings, 1 reply; 27+ messages in thread
From: David Hildenbrand @ 2023-10-18  7:42 UTC (permalink / raw)
  To: Xiaoyao Li, qemu-devel
  Cc: Michael S. Tsirkin, Juan Quintela, Peter Xu, Leonardo Bras,
	Paolo Bonzini, Philippe Mathieu-Daudé, Peng Tao,
	Mario Casquero

On 18.10.23 05:02, Xiaoyao Li wrote:
> David,
> 
> On 7/6/2023 3:56 PM, David Hildenbrand wrote:
>> ram_block_discard_range() cannot possibly do the right thing in
>> MAP_PRIVATE file mappings in the general case.
>>
>> To achieve the documented semantics, we also have to punch a hole into
>> the file, possibly messing with other MAP_PRIVATE/MAP_SHARED mappings
>> of such a file.
>>
>> For example, using VM templating -- see commit b17fbbe55cba ("migration:
>> allow private destination ram with x-ignore-shared") -- in combination with
>> any mechanism that relies on discarding of RAM is problematic. This
>> includes:
>> * Postcopy live migration
>> * virtio-balloon inflation/deflation or free-page-reporting
>> * virtio-mem
>>
>> So at least warn that something possibly dangerous is going on
>> when using ram_block_discard_range() in these cases.
>>
>> Acked-by: Peter Xu <peterx@redhat.com>
>> Tested-by: Mario Casquero <mcasquer@redhat.com>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>>    softmmu/physmem.c | 18 ++++++++++++++++++
>>    1 file changed, 18 insertions(+)
>>
>> diff --git a/softmmu/physmem.c b/softmmu/physmem.c
>> index bda475a719..4ee157bda4 100644
>> --- a/softmmu/physmem.c
>> +++ b/softmmu/physmem.c
>> @@ -3456,6 +3456,24 @@ int ram_block_discard_range(RAMBlock *rb, uint64_t start, size_t length)
>>                 * so a userfault will trigger.
>>                 */
>>    #ifdef CONFIG_FALLOCATE_PUNCH_HOLE
>> +            /*
>> +             * We'll discard data from the actual file, even though we only
>> +             * have a MAP_PRIVATE mapping, possibly messing with other
>> +             * MAP_PRIVATE/MAP_SHARED mappings. There is no easy way to
>> +             * change that behavior without violating the promised
>> +             * semantics of ram_block_discard_range().
>> +             *
>> +             * Only warn, because it work as long as nobody else uses that
>> +             * file.
>> +             */
>> +            if (!qemu_ram_is_shared(rb)) {
>> +                warn_report_once("ram_block_discard_range: Discarding RAM"
>> +                                 " in private file mappings is possibly"
>> +                                 " dangerous, because it will modify the"
>> +                                 " underlying file and will affect other"
>> +                                 " users of the file");
>> +            }
>> +
> 
> TDX has two types of memory backend for each RAM block: shared memory and
> private memory. Private memory is serviced by guest memfd, and shared
> memory can also be backed by an fd.
> 
> At any time, only one type needs to be valid, which means the other type
> can be discarded. We do implement that memory discard when TDX converts
> the memory[1]. It triggers this warning 100% of the time, because by
> default the guest memfd is not mapped as shared (MAP_SHARED).

If MAP_PRIVATE is not involved and you are taking the pages directly out 
of the memfd, you should mark that thing as shared. Anonymous memory is 
never involved.

"Private memory" is only private from the guest POV, not from a mmap() 
point of view.

Two different concepts of "private".

> 
> Simply removing the warning would defeat the purpose of this patch. The
> other option is to skip the warning for the TDX case, which looks very
> hacky. Do you have any idea?

For TDX, all memory backends / RAMBlocks should be marked as "shared", 
and you should fail if that is not provided by the user.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 1/4] softmmu/physmem: Warn with ram_block_discard_range() on MAP_PRIVATE file mapping
  2023-10-18  7:42     ` David Hildenbrand
@ 2023-10-18  9:02       ` Xiaoyao Li
  2023-10-18  9:26         ` David Hildenbrand
  0 siblings, 1 reply; 27+ messages in thread
From: Xiaoyao Li @ 2023-10-18  9:02 UTC (permalink / raw)
  To: David Hildenbrand, qemu-devel
  Cc: Michael S. Tsirkin, Juan Quintela, Peter Xu, Leonardo Bras,
	Paolo Bonzini, Philippe Mathieu-Daudé, Peng Tao,
	Mario Casquero

On 10/18/2023 3:42 PM, David Hildenbrand wrote:
> On 18.10.23 05:02, Xiaoyao Li wrote:
>> David,
>>
>> On 7/6/2023 3:56 PM, David Hildenbrand wrote:
>>> ram_block_discard_range() cannot possibly do the right thing in
>>> MAP_PRIVATE file mappings in the general case.
>>>
>>> To achieve the documented semantics, we also have to punch a hole into
>>> the file, possibly messing with other MAP_PRIVATE/MAP_SHARED mappings
>>> of such a file.
>>>
>>> For example, using VM templating -- see commit b17fbbe55cba ("migration:
>>> allow private destination ram with x-ignore-shared") -- in 
>>> combination with
>>> any mechanism that relies on discarding of RAM is problematic. This
>>> includes:
>>> * Postcopy live migration
>>> * virtio-balloon inflation/deflation or free-page-reporting
>>> * virtio-mem
>>>
>>> So at least warn that there is something possibly dangerous is going on
>>> when using ram_block_discard_range() in these cases.
>>>
>>> Acked-by: Peter Xu <peterx@redhat.com>
>>> Tested-by: Mario Casquero <mcasquer@redhat.com>
>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>> ---
>>>    softmmu/physmem.c | 18 ++++++++++++++++++
>>>    1 file changed, 18 insertions(+)
>>>
>>> diff --git a/softmmu/physmem.c b/softmmu/physmem.c
>>> index bda475a719..4ee157bda4 100644
>>> --- a/softmmu/physmem.c
>>> +++ b/softmmu/physmem.c
>>> @@ -3456,6 +3456,24 @@ int ram_block_discard_range(RAMBlock *rb, 
>>> uint64_t start, size_t length)
>>>                 * so a userfault will trigger.
>>>                 */
>>>    #ifdef CONFIG_FALLOCATE_PUNCH_HOLE
>>> +            /*
>>> +             * We'll discard data from the actual file, even though 
>>> we only
>>> +             * have a MAP_PRIVATE mapping, possibly messing with other
>>> +             * MAP_PRIVATE/MAP_SHARED mappings. There is no easy way to
>>> +             * change that behavior whithout violating the promised
>>> +             * semantics of ram_block_discard_range().
>>> +             *
>>> +             * Only warn, because it work as long as nobody else 
>>> uses that
>>> +             * file.
>>> +             */
>>> +            if (!qemu_ram_is_shared(rb)) {
>>> +                warn_report_once("ram_block_discard_range: 
>>> Discarding RAM"
>>> +                                 " in private file mappings is 
>>> possibly"
>>> +                                 " dangerous, because it will modify 
>>> the"
>>> +                                 " underlying file and will affect 
>>> other"
>>> +                                 " users of the file");
>>> +            }
>>> +
>>
>> TDX has two types of memory backend for each RAM, shared memory and
>> private memory. Private memory is serviced by guest memfd and shared
>> memory can also be backed with a fd.
>>
>> At any time, only one type needs to be valid, which means the opposite
>> can be discarded. We do implement the memory discard when TDX converts
>> the memory[1]. It will trigger this warning 100% because by default the
>> guest memfd is not mapped as shared (MAP_SHARED).
> 
> If MAP_PRIVATE is not involved and you are taking the pages directly out 
> of the memfd, you should mark that thing as shared. 

Is it a general rule of Linux? Or just a rule of QEMU's memory discard?

> Anonymous memory is never involved.

Could you please elaborate more on this? What do you want to express 
here regarding anonymous memory? (Sorry, I'm a newbie at mmap stuff.)

> 
> "Private memory" is only private from the guest POV, not from a mmap() 
> point of view.
> 
> Two different concepts of "private".
> 
>>
>> Simply remove the warning will fail the purpose of this patch. The other
>> option is to skip the warning for TDX case, which looks vary hacky. Do
>> you have any idea?
> 
> For TDX, all memory backends / RAMBlocks should be marked as "shared", 
> and you should fail if that is not provided by the user.

As I asked above, I want to understand the logic clearly. Is mapping the 
memory as shared a must to support memory discard? I.e., if we want to 
support memory discard after a memory type change, must the memory be 
mapped with MAP_SHARED?




^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 1/4] softmmu/physmem: Warn with ram_block_discard_range() on MAP_PRIVATE file mapping
  2023-10-18  9:02       ` Xiaoyao Li
@ 2023-10-18  9:26         ` David Hildenbrand
  2023-10-18 16:27           ` Xiaoyao Li
  0 siblings, 1 reply; 27+ messages in thread
From: David Hildenbrand @ 2023-10-18  9:26 UTC (permalink / raw)
  To: Xiaoyao Li, qemu-devel
  Cc: Michael S. Tsirkin, Juan Quintela, Peter Xu, Leonardo Bras,
	Paolo Bonzini, Philippe Mathieu-Daudé, Peng Tao,
	Mario Casquero

On 18.10.23 11:02, Xiaoyao Li wrote:
> On 10/18/2023 3:42 PM, David Hildenbrand wrote:
>> On 18.10.23 05:02, Xiaoyao Li wrote:
>>> David,
>>>
>>> On 7/6/2023 3:56 PM, David Hildenbrand wrote:
>>>> ram_block_discard_range() cannot possibly do the right thing in
>>>> MAP_PRIVATE file mappings in the general case.
>>>>
>>>> To achieve the documented semantics, we also have to punch a hole into
>>>> the file, possibly messing with other MAP_PRIVATE/MAP_SHARED mappings
>>>> of such a file.
>>>>
>>>> For example, using VM templating -- see commit b17fbbe55cba ("migration:
>>>> allow private destination ram with x-ignore-shared") -- in
>>>> combination with
>>>> any mechanism that relies on discarding of RAM is problematic. This
>>>> includes:
>>>> * Postcopy live migration
>>>> * virtio-balloon inflation/deflation or free-page-reporting
>>>> * virtio-mem
>>>>
>>>> So at least warn that there is something possibly dangerous is going on
>>>> when using ram_block_discard_range() in these cases.
>>>>
>>>> Acked-by: Peter Xu <peterx@redhat.com>
>>>> Tested-by: Mario Casquero <mcasquer@redhat.com>
>>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>>> ---
>>>>     softmmu/physmem.c | 18 ++++++++++++++++++
>>>>     1 file changed, 18 insertions(+)
>>>>
>>>> diff --git a/softmmu/physmem.c b/softmmu/physmem.c
>>>> index bda475a719..4ee157bda4 100644
>>>> --- a/softmmu/physmem.c
>>>> +++ b/softmmu/physmem.c
>>>> @@ -3456,6 +3456,24 @@ int ram_block_discard_range(RAMBlock *rb,
>>>> uint64_t start, size_t length)
>>>>                  * so a userfault will trigger.
>>>>                  */
>>>>     #ifdef CONFIG_FALLOCATE_PUNCH_HOLE
>>>> +            /*
>>>> +             * We'll discard data from the actual file, even though
>>>> we only
>>>> +             * have a MAP_PRIVATE mapping, possibly messing with other
>>>> +             * MAP_PRIVATE/MAP_SHARED mappings. There is no easy way to
>>>> +             * change that behavior whithout violating the promised
>>>> +             * semantics of ram_block_discard_range().
>>>> +             *
>>>> +             * Only warn, because it work as long as nobody else
>>>> uses that
>>>> +             * file.
>>>> +             */
>>>> +            if (!qemu_ram_is_shared(rb)) {
>>>> +                warn_report_once("ram_block_discard_range:
>>>> Discarding RAM"
>>>> +                                 " in private file mappings is
>>>> possibly"
>>>> +                                 " dangerous, because it will modify
>>>> the"
>>>> +                                 " underlying file and will affect
>>>> other"
>>>> +                                 " users of the file");
>>>> +            }
>>>> +
>>>
>>> TDX has two types of memory backend for each RAM, shared memory and
>>> private memory. Private memory is serviced by guest memfd and shared
>>> memory can also be backed with a fd.
>>>
>>> At any time, only one type needs to be valid, which means the opposite
>>> can be discarded. We do implement the memory discard when TDX converts
>>> the memory[1]. It will trigger this warning 100% because by default the
>>> guest memfd is not mapped as shared (MAP_SHARED).
>>
>> If MAP_PRIVATE is not involved and you are taking the pages directly out
>> of the memfd, you should mark that thing as shared.
> 
> Is it the general rule of Linux? Of just the rule of QEMU memory discard?
> 

MAP_SHARED vs. MAP_PRIVATE is a common UNIX principle, and that's what 
this flag and the check is about.

From mmap(2):

MAP_SHARED: Share this mapping.  Updates to the mapping are visible to 
other processes mapping the same region, and (in the case of file-backed 
mappings) are carried through to the underlying file.

MAP_PRIVATE: Create a private copy-on-write mapping.  Updates to the 
mapping are not visible to other processes mapping the same file, and 
are not carried through to the underlying file.  It is unspecified 
whether changes made  to the file after the mmap() call are visible in 
the mapped region.

For your purpose (no mmap() at all), we behave like MAP_SHARED -- as if 
nothing special is done. No Copy-on-write, no anonymous memory.

>> Anonymous memory is never involved.
> 
> Could you please elaborate more on this? What do you want to express
> here regrading anonymous memory? (Sorry that I'm newbie for mmap stuff)

Anonymous memory is memory that is private to a specific process, and 
(see MAP_PRIVATE) modifications remain private to the process and are 
not reflected to the file.

If you have a MAP_PRIVATE file mapping and write to a virtual memory 
location, you'll get a process-private copy of the underlying pagecache 
page. That's what we call anonymous memory, because it does not belong 
to a specific file. fallocate(punch) would not free up that anonymous 
memory.

> 
>>
>> "Private memory" is only private from the guest POV, not from a mmap()
>> point of view.
>>
>> Two different concepts of "private".
>>
>>>
>>> Simply remove the warning will fail the purpose of this patch. The other
>>> option is to skip the warning for TDX case, which looks vary hacky. Do
>>> you have any idea?
>>
>> For TDX, all memory backends / RAMBlocks should be marked as "shared",
>> and you should fail if that is not provided by the user.
> 
> As I asked above, I want to understand the logic clearly. Is mapped as
> shared is a must to support the memory discard? i.e., if we want to
> support memory discard after memory type change, then the memory must be
> mapped with MAP_SHARED?

MAP_PRIVATE means that it's not sufficient to only fallocate(punch) the 
fd to free up all memory for a virtual address, because there might be 
anonymous memory in a private mapping that has to be freed up using 
MADV_DONTNEED. That's why this function handles both cases differently, 
and warns if something possibly nasty is being done.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 1/4] softmmu/physmem: Warn with ram_block_discard_range() on MAP_PRIVATE file mapping
  2023-10-18  9:26         ` David Hildenbrand
@ 2023-10-18 16:27           ` Xiaoyao Li
  2023-10-19  8:26             ` David Hildenbrand
  0 siblings, 1 reply; 27+ messages in thread
From: Xiaoyao Li @ 2023-10-18 16:27 UTC (permalink / raw)
  To: David Hildenbrand, qemu-devel
  Cc: Michael S. Tsirkin, Juan Quintela, Peter Xu, Leonardo Bras,
	Paolo Bonzini, Philippe Mathieu-Daudé, Peng Tao,
	Mario Casquero

On 10/18/2023 5:26 PM, David Hildenbrand wrote:
> On 18.10.23 11:02, Xiaoyao Li wrote:
>> On 10/18/2023 3:42 PM, David Hildenbrand wrote:
>>> On 18.10.23 05:02, Xiaoyao Li wrote:
>>>> David,
>>>>
>>>> On 7/6/2023 3:56 PM, David Hildenbrand wrote:
>>>>> ram_block_discard_range() cannot possibly do the right thing in
>>>>> MAP_PRIVATE file mappings in the general case.
>>>>>
>>>>> To achieve the documented semantics, we also have to punch a hole into
>>>>> the file, possibly messing with other MAP_PRIVATE/MAP_SHARED mappings
>>>>> of such a file.
>>>>>
>>>>> For example, using VM templating -- see commit b17fbbe55cba 
>>>>> ("migration:
>>>>> allow private destination ram with x-ignore-shared") -- in
>>>>> combination with
>>>>> any mechanism that relies on discarding of RAM is problematic. This
>>>>> includes:
>>>>> * Postcopy live migration
>>>>> * virtio-balloon inflation/deflation or free-page-reporting
>>>>> * virtio-mem
>>>>>
>>>>> So at least warn that there is something possibly dangerous is 
>>>>> going on
>>>>> when using ram_block_discard_range() in these cases.
>>>>>
>>>>> Acked-by: Peter Xu <peterx@redhat.com>
>>>>> Tested-by: Mario Casquero <mcasquer@redhat.com>
>>>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>>>> ---
>>>>>     softmmu/physmem.c | 18 ++++++++++++++++++
>>>>>     1 file changed, 18 insertions(+)
>>>>>
>>>>> diff --git a/softmmu/physmem.c b/softmmu/physmem.c
>>>>> index bda475a719..4ee157bda4 100644
>>>>> --- a/softmmu/physmem.c
>>>>> +++ b/softmmu/physmem.c
>>>>> @@ -3456,6 +3456,24 @@ int ram_block_discard_range(RAMBlock *rb,
>>>>> uint64_t start, size_t length)
>>>>>                  * so a userfault will trigger.
>>>>>                  */
>>>>>     #ifdef CONFIG_FALLOCATE_PUNCH_HOLE
>>>>> +            /*
>>>>> +             * We'll discard data from the actual file, even though
>>>>> we only
>>>>> +             * have a MAP_PRIVATE mapping, possibly messing with 
>>>>> other
>>>>> +             * MAP_PRIVATE/MAP_SHARED mappings. There is no easy 
>>>>> way to
>>>>> +             * change that behavior whithout violating the promised
>>>>> +             * semantics of ram_block_discard_range().
>>>>> +             *
>>>>> +             * Only warn, because it work as long as nobody else
>>>>> uses that
>>>>> +             * file.
>>>>> +             */
>>>>> +            if (!qemu_ram_is_shared(rb)) {
>>>>> +                warn_report_once("ram_block_discard_range:
>>>>> Discarding RAM"
>>>>> +                                 " in private file mappings is
>>>>> possibly"
>>>>> +                                 " dangerous, because it will modify
>>>>> the"
>>>>> +                                 " underlying file and will affect
>>>>> other"
>>>>> +                                 " users of the file");
>>>>> +            }
>>>>> +
>>>>
>>>> TDX has two types of memory backend for each RAM, shared memory and
>>>> private memory. Private memory is serviced by guest memfd and shared
>>>> memory can also be backed with a fd.
>>>>
>>>> At any time, only one type needs to be valid, which means the opposite
>>>> can be discarded. We do implement the memory discard when TDX converts
>>>> the memory[1]. It will trigger this warning 100% because by default the
>>>> guest memfd is not mapped as shared (MAP_SHARED).
>>>
>>> If MAP_PRIVATE is not involved and you are taking the pages directly out
>>> of the memfd, you should mark that thing as shared.
>>
>> Is it the general rule of Linux? Of just the rule of QEMU memory discard?
>>
> 
> MAP_SHARED vs. MAP_PRIVATE is a common UNIX principle, and that's what 
> this flag and the check is about.
> 
>  From mmap(2)
> 
> MAP_SHARED: Share this mapping.  Updates to the mapping are visible to 
> other processes mapping the same region, and (in the case of file-backed 
> mappings) are carried through to the underlying file.
> 
> MAP_PRIVATE: Create a private copy-on-write mapping.  Updates to the 
> mapping are not visible to other processes mapping the same file, and 
> are not carried through to the underlying file.  It is unspecified 
> whether changes made  to the file after the mmap() call are visible in 
> the mapped region.
> 
> For your purpose (no mmap() at all), we behave like MAP_SHARED -- as if 
> nothing special is done. No Copy-on-write, no anonymous memory.
> 
>>> Anonymous memory is never involved.
>>
>> Could you please elaborate more on this? What do you want to express
>> here regrading anonymous memory? (Sorry that I'm newbie for mmap stuff)
> 
> Anonymous memory is memory that is private to a specific process, and 
> (see MAP_PRIVATE) modifications remain private to the process and are 
> not reflected to the file.
> 
> If you have a MAP_PRIVATE file mapping and write to a virtual memory 
> location, you'll get a process-private copy of the underlying pagecache 
> page. that's what we call anonymous memory, because it does not belong 
> to a specific file. fallocate(punch) would not free up that anonymous 
> memory.

For guest memfd, it does implement kvm_gmem_fallocate as the .fallocate() 
callback, which calls truncate_inode_pages_range() [*].

I'm not sure if it frees up the memory; I need to look into it.

[*] 
https://github.com/kvm-x86/linux/blob/911b515af3ec5f53992b9cc162cf7d3893c2fbe2/virt/kvm/guest_memfd.c#L147C73-L147C73

>>
>>>
>>> "Private memory" is only private from the guest POV, not from a mmap()
>>> point of view.
>>>
>>> Two different concepts of "private".
>>>
>>>>
>>>> Simply remove the warning will fail the purpose of this patch. The 
>>>> other
>>>> option is to skip the warning for TDX case, which looks vary hacky. Do
>>>> you have any idea?
>>>
>>> For TDX, all memory backends / RAMBlocks should be marked as "shared",
>>> and you should fail if that is not provided by the user.
>>
>> As I asked above, I want to understand the logic clearly. Is mapped as
>> shared is a must to support the memory discard? i.e., if we want to
>> support memory discard after memory type change, then the memory must be
>> mapped with MAP_SHARED?
> 
> MAP_PIRVATE means that it's not sufficient to only fallocate(punch) the 
> fd to free up all memory for a virtual address, because there might be 
> anonymous memory in a private mapping that has to be freed up using 
> MADV_DONTNEED. 

I can understand this. But it seems unrelated to my question: is mapping 
the memory as shared a must to support memory discard?

e.g., if use below parameters to specify the RAM for a VM

	-object memory-backend-memfd,id=mem0,size=2G	\
	-machine memory-backend=mem0

Since the "share" property is not specified, the RAMBlock doesn't have 
RAM_SHARED set. If we want to discard some range of this memfd, it 
triggers the warning. Is this warning expected?

I know it is not a possible case for current QEMU, but it's true for 
future TDX VMs. A TDX VM can have memory-backend-memfd as the backend for 
shared memory and a KVM gmem memfd for private memory. At any time, for 
any memory range, only one type is valid, so the range in the opposite 
memfd can be fallocated.

I read your message as "the RAMBlock needs to have the RAM_SHARED flag 
to allow the fallocate of the memfd". This is what I don't understand.

> That's why this function handles both cases differently,
> and warns if something possibly nasty is being done.
> 



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 1/4] softmmu/physmem: Warn with ram_block_discard_range() on MAP_PRIVATE file mapping
  2023-10-18 16:27           ` Xiaoyao Li
@ 2023-10-19  8:26             ` David Hildenbrand
  2023-10-19  9:26               ` Xiaoyao Li
  0 siblings, 1 reply; 27+ messages in thread
From: David Hildenbrand @ 2023-10-19  8:26 UTC (permalink / raw)
  To: Xiaoyao Li, qemu-devel
  Cc: Michael S. Tsirkin, Juan Quintela, Peter Xu, Leonardo Bras,
	Paolo Bonzini, Philippe Mathieu-Daudé, Peng Tao,
	Mario Casquero

On 18.10.23 18:27, Xiaoyao Li wrote:
> On 10/18/2023 5:26 PM, David Hildenbrand wrote:
>> On 18.10.23 11:02, Xiaoyao Li wrote:
>>> On 10/18/2023 3:42 PM, David Hildenbrand wrote:
>>>> On 18.10.23 05:02, Xiaoyao Li wrote:
>>>>> David,
>>>>>
>>>>> On 7/6/2023 3:56 PM, David Hildenbrand wrote:
>>>>>> ram_block_discard_range() cannot possibly do the right thing in
>>>>>> MAP_PRIVATE file mappings in the general case.
>>>>>>
>>>>>> To achieve the documented semantics, we also have to punch a hole into
>>>>>> the file, possibly messing with other MAP_PRIVATE/MAP_SHARED mappings
>>>>>> of such a file.
>>>>>>
>>>>>> For example, using VM templating -- see commit b17fbbe55cba
>>>>>> ("migration:
>>>>>> allow private destination ram with x-ignore-shared") -- in
>>>>>> combination with
>>>>>> any mechanism that relies on discarding of RAM is problematic. This
>>>>>> includes:
>>>>>> * Postcopy live migration
>>>>>> * virtio-balloon inflation/deflation or free-page-reporting
>>>>>> * virtio-mem
>>>>>>
>>>>>> So at least warn that there is something possibly dangerous is
>>>>>> going on
>>>>>> when using ram_block_discard_range() in these cases.
>>>>>>
>>>>>> Acked-by: Peter Xu <peterx@redhat.com>
>>>>>> Tested-by: Mario Casquero <mcasquer@redhat.com>
>>>>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>>>>> ---
>>>>>>      softmmu/physmem.c | 18 ++++++++++++++++++
>>>>>>      1 file changed, 18 insertions(+)
>>>>>>
>>>>>> diff --git a/softmmu/physmem.c b/softmmu/physmem.c
>>>>>> index bda475a719..4ee157bda4 100644
>>>>>> --- a/softmmu/physmem.c
>>>>>> +++ b/softmmu/physmem.c
>>>>>> @@ -3456,6 +3456,24 @@ int ram_block_discard_range(RAMBlock *rb,
>>>>>> uint64_t start, size_t length)
>>>>>>                   * so a userfault will trigger.
>>>>>>                   */
>>>>>>      #ifdef CONFIG_FALLOCATE_PUNCH_HOLE
>>>>>> +            /*
>>>>>> +             * We'll discard data from the actual file, even though
>>>>>> we only
>>>>>> +             * have a MAP_PRIVATE mapping, possibly messing with
>>>>>> other
>>>>>> +             * MAP_PRIVATE/MAP_SHARED mappings. There is no easy
>>>>>> way to
>>>>>> +             * change that behavior whithout violating the promised
>>>>>> +             * semantics of ram_block_discard_range().
>>>>>> +             *
>>>>>> +             * Only warn, because it work as long as nobody else
>>>>>> uses that
>>>>>> +             * file.
>>>>>> +             */
>>>>>> +            if (!qemu_ram_is_shared(rb)) {
>>>>>> +                warn_report_once("ram_block_discard_range:
>>>>>> Discarding RAM"
>>>>>> +                                 " in private file mappings is
>>>>>> possibly"
>>>>>> +                                 " dangerous, because it will modify
>>>>>> the"
>>>>>> +                                 " underlying file and will affect
>>>>>> other"
>>>>>> +                                 " users of the file");
>>>>>> +            }
>>>>>> +
>>>>>
>>>>> TDX has two types of memory backend for each RAM, shared memory and
>>>>> private memory. Private memory is serviced by guest memfd and shared
>>>>> memory can also be backed with a fd.
>>>>>
>>>>> At any time, only one type needs to be valid, which means the opposite
>>>>> can be discarded. We do implement the memory discard when TDX converts
>>>>> the memory[1]. It will trigger this warning 100% because by default the
>>>>> guest memfd is not mapped as shared (MAP_SHARED).
>>>>
>>>> If MAP_PRIVATE is not involved and you are taking the pages directly out
>>>> of the memfd, you should mark that thing as shared.
>>>
>>> Is it the general rule of Linux? Of just the rule of QEMU memory discard?
>>>
>>
>> MAP_SHARED vs. MAP_PRIVATE is a common UNIX principle, and that's what
>> this flag and the check is about.
>>
>>   From mmap(2)
>>
>> MAP_SHARED: Share this mapping.  Updates to the mapping are visible to
>> other processes mapping the same region, and (in the case of file-backed
>> mappings) are carried through to the underlying file.
>>
>> MAP_PRIVATE: Create a private copy-on-write mapping.  Updates to the
>> mapping are not visible to other processes mapping the same file, and
>> are not carried through to the underlying file.  It is unspecified
>> whether changes made  to the file after the mmap() call are visible in
>> the mapped region.
>>
>> For your purpose (no mmap() at all), we behave like MAP_SHARED -- as if
>> nothing special is done. No Copy-on-write, no anonymous memory.
>>
>>>> Anonymous memory is never involved.
>>>
>>> Could you please elaborate more on this? What do you want to express
>>> here regrading anonymous memory? (Sorry that I'm newbie for mmap stuff)
>>
>> Anonymous memory is memory that is private to a specific process, and
>> (see MAP_PRIVATE) modifications remain private to the process and are
>> not reflected to the file.
>>
>> If you have a MAP_PRIVATE file mapping and write to a virtual memory
>> location, you'll get a process-private copy of the underlying pagecache
>> page. that's what we call anonymous memory, because it does not belong
>> to a specific file. fallocate(punch) would not free up that anonymous
>> memory.
> 
> For guest memfd, it does implement kvm_gmem_fallocate as .fallocate()
> callback, which calls truncate_inode_pages_range() [*].
> 
> I'm not sure if it frees up the memory. I need to learn it.
> 
> [*]
> https://github.com/kvm-x86/linux/blob/911b515af3ec5f53992b9cc162cf7d3893c2fbe2/virt/kvm/guest_memfd.c#L147C73-L147C73
> 
>>>
>>>>
>>>> "Private memory" is only private from the guest POV, not from a mmap()
>>>> point of view.
>>>>
>>>> Two different concepts of "private".
>>>>
>>>>>
>>>>> Simply remove the warning will fail the purpose of this patch. The
>>>>> other
>>>>> option is to skip the warning for TDX case, which looks vary hacky. Do
>>>>> you have any idea?
>>>>
>>>> For TDX, all memory backends / RAMBlocks should be marked as "shared",
>>>> and you should fail if that is not provided by the user.
>>>
>>> As I asked above, I want to understand the logic clearly. Is mapped as
>>> shared is a must to support the memory discard? i.e., if we want to
>>> support memory discard after memory type change, then the memory must be
>>> mapped with MAP_SHARED?
>>
>> MAP_PIRVATE means that it's not sufficient to only fallocate(punch) the
>> fd to free up all memory for a virtual address, because there might be
>> anonymous memory in a private mapping that has to be freed up using
>> MADV_DONTNEED.
> 
> I can understand this. But it seems unrelated to my question: Is mapped
> as shared is a must to support the memory discard?

Sorry, I don't quite get what you are asking that I haven't answered 
yet. Let's talk about the issue you are seeing below.

> 
> e.g., if use below parameters to specify the RAM for a VM
> 
> 	-object memory-backend-memfd,id=mem0,size=2G	\
> 	-machine memory-backend=mem0
> 
> since not specifying "share" property, the ram_block doesn't have
> RAM_SHARED set. If want to discard some range of this memfd, it triggers
> the warning. Is this warning expected?

That should not be the case. See "memfd_backend_instance_init" where we 
set share=true. In memfd_backend_memory_alloc(), we set RAM_SHARED.

We only default to share=off for memory-backend-file (well, and 
memory-backend-ram).

So are you sure you get this error message in the configuration you are 
describing here?

> 
> I know it is not a possible case for current QEMU, but it's true for
> future TDX VM. TDX VM can have memory-backend-memfd as the backend for
> shared memory and kvm gmem memfd for private memory. At any time, for
> any memory range, only one type is valid, thus the range in opposite
> memfd can be fallocated.

Right.

> 
> Here I get your message as "the ramblock needs to have RAM_SHARED flag
> to allow the fallocate of the memfd". This is what I don't understand.

The problem I am seeing is that either

(a) Someone explicitly sets share=off for some reason for 
memory-backend-memfd, triggering the warning.

(b) We somehow lose RAM_SHARED in above configuration, which would be 
bad and trigger the warning.

Can you make sure that (a) is not the case?

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 1/4] softmmu/physmem: Warn with ram_block_discard_range() on MAP_PRIVATE file mapping
  2023-10-19  8:26             ` David Hildenbrand
@ 2023-10-19  9:26               ` Xiaoyao Li
  2023-10-19  9:43                 ` David Hildenbrand
  0 siblings, 1 reply; 27+ messages in thread
From: Xiaoyao Li @ 2023-10-19  9:26 UTC (permalink / raw)
  To: David Hildenbrand, qemu-devel
  Cc: Michael S. Tsirkin, Juan Quintela, Peter Xu, Leonardo Bras,
	Paolo Bonzini, Philippe Mathieu-Daudé, Peng Tao,
	Mario Casquero

On 10/19/2023 4:26 PM, David Hildenbrand wrote:
> On 18.10.23 18:27, Xiaoyao Li wrote:
>> On 10/18/2023 5:26 PM, David Hildenbrand wrote:
>>> On 18.10.23 11:02, Xiaoyao Li wrote:
>>>> On 10/18/2023 3:42 PM, David Hildenbrand wrote:
>>>>> On 18.10.23 05:02, Xiaoyao Li wrote:
>>>>>> David,
>>>>>>
>>>>>> On 7/6/2023 3:56 PM, David Hildenbrand wrote:
>>>>>>> ram_block_discard_range() cannot possibly do the right thing in
>>>>>>> MAP_PRIVATE file mappings in the general case.
>>>>>>>
>>>>>>> To achieve the documented semantics, we also have to punch a hole 
>>>>>>> into
>>>>>>> the file, possibly messing with other MAP_PRIVATE/MAP_SHARED 
>>>>>>> mappings
>>>>>>> of such a file.
>>>>>>>
>>>>>>> For example, using VM templating -- see commit b17fbbe55cba
>>>>>>> ("migration:
>>>>>>> allow private destination ram with x-ignore-shared") -- in
>>>>>>> combination with
>>>>>>> any mechanism that relies on discarding of RAM is problematic. This
>>>>>>> includes:
>>>>>>> * Postcopy live migration
>>>>>>> * virtio-balloon inflation/deflation or free-page-reporting
>>>>>>> * virtio-mem
>>>>>>>
>>>>>>> So at least warn that there is something possibly dangerous is
>>>>>>> going on
>>>>>>> when using ram_block_discard_range() in these cases.
>>>>>>>
>>>>>>> Acked-by: Peter Xu <peterx@redhat.com>
>>>>>>> Tested-by: Mario Casquero <mcasquer@redhat.com>
>>>>>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>>>>>> ---
>>>>>>>      softmmu/physmem.c | 18 ++++++++++++++++++
>>>>>>>      1 file changed, 18 insertions(+)
>>>>>>>
>>>>>>> diff --git a/softmmu/physmem.c b/softmmu/physmem.c
>>>>>>> index bda475a719..4ee157bda4 100644
>>>>>>> --- a/softmmu/physmem.c
>>>>>>> +++ b/softmmu/physmem.c
>>>>>>> @@ -3456,6 +3456,24 @@ int ram_block_discard_range(RAMBlock *rb, uint64_t start, size_t length)
>>>>>>>                   * so a userfault will trigger.
>>>>>>>                   */
>>>>>>>      #ifdef CONFIG_FALLOCATE_PUNCH_HOLE
>>>>>>> +            /*
>>>>>>> +             * We'll discard data from the actual file, even though we
>>>>>>> +             * only have a MAP_PRIVATE mapping, possibly messing with
>>>>>>> +             * other MAP_PRIVATE/MAP_SHARED mappings. There is no easy
>>>>>>> +             * way to change that behavior without violating the
>>>>>>> +             * promised semantics of ram_block_discard_range().
>>>>>>> +             *
>>>>>>> +             * Only warn, because it works as long as nobody else uses
>>>>>>> +             * that file.
>>>>>>> +             */
>>>>>>> +            if (!qemu_ram_is_shared(rb)) {
>>>>>>> +                warn_report_once("ram_block_discard_range: Discarding RAM"
>>>>>>> +                                 " in private file mappings is possibly"
>>>>>>> +                                 " dangerous, because it will modify the"
>>>>>>> +                                 " underlying file and will affect other"
>>>>>>> +                                 " users of the file");
>>>>>>> +            }
>>>>>>> +
>>>>>>
>>>>>> TDX has two types of memory backend for each RAM, shared memory and
>>>>>> private memory. Private memory is serviced by guest memfd and shared
>>>>>> memory can also be backed with a fd.
>>>>>>
>>>>>> At any time, only one type needs to be valid, which means the opposite
>>>>>> can be discarded. We do implement the memory discard when TDX converts
>>>>>> the memory[1]. It will trigger this warning 100% because by default the
>>>>>> guest memfd is not mapped as shared (MAP_SHARED).
>>>>>
>>>>> If MAP_PRIVATE is not involved and you are taking the pages directly out
>>>>> of the memfd, you should mark that thing as shared.
>>>>
>>>> Is it a general rule of Linux? Or just the rule of QEMU memory
>>>> discard?
>>>>
>>>
>>> MAP_SHARED vs. MAP_PRIVATE is a common UNIX principle, and that's what
>>> this flag and the check is about.
>>>
>>> From mmap(2)
>>>
>>> MAP_SHARED: Share this mapping.  Updates to the mapping are visible to
>>> other processes mapping the same region, and (in the case of file-backed
>>> mappings) are carried through to the underlying file.
>>>
>>> MAP_PRIVATE: Create a private copy-on-write mapping.  Updates to the
>>> mapping are not visible to other processes mapping the same file, and
>>> are not carried through to the underlying file.  It is unspecified
>>> whether changes made  to the file after the mmap() call are visible in
>>> the mapped region.
>>>
>>> For your purpose (no mmap() at all), we behave like MAP_SHARED -- as if
>>> nothing special is done. No Copy-on-write, no anonymous memory.
>>>
>>>>> Anonymous memory is never involved.
>>>>
>>>> Could you please elaborate more on this? What do you want to express
>>>> here regarding anonymous memory? (Sorry, I'm a newbie to mmap stuff.)
>>>
>>> Anonymous memory is memory that is private to a specific process, and
>>> (see MAP_PRIVATE) modifications remain private to the process and are
>>> not reflected to the file.
>>>
>>> If you have a MAP_PRIVATE file mapping and write to a virtual memory
>>> location, you'll get a process-private copy of the underlying pagecache
>>> page. That's what we call anonymous memory, because it does not belong
>>> to a specific file. fallocate(punch) would not free up that anonymous
>>> memory.
>>
>> For guest memfd, it does implement kvm_gmem_fallocate as .fallocate()
>> callback, which calls truncate_inode_pages_range() [*].
>>
>> I'm not sure if it frees up the memory. I need to learn it.
>>
>> [*]
>> https://github.com/kvm-x86/linux/blob/911b515af3ec5f53992b9cc162cf7d3893c2fbe2/virt/kvm/guest_memfd.c#L147C73-L147C73
>>
>>>>
>>>>>
>>>>> "Private memory" is only private from the guest POV, not from a mmap()
>>>>> point of view.
>>>>>
>>>>> Two different concepts of "private".
>>>>>
>>>>>>
>>>>>> Simply removing the warning would defeat the purpose of this patch.
>>>>>> The other option is to skip the warning for the TDX case, which looks
>>>>>> very hacky. Do you have any idea?
>>>>>
>>>>> For TDX, all memory backends / RAMBlocks should be marked as "shared",
>>>>> and you should fail if that is not provided by the user.
>>>>
>>>> As I asked above, I want to understand the logic clearly. Is being
>>>> mapped as shared a must to support memory discard? I.e., if we want to
>>>> support memory discard after a memory type change, must the memory be
>>>> mapped with MAP_SHARED?
>>>
>>> MAP_PRIVATE means that it's not sufficient to only fallocate(punch) the
>>> fd to free up all memory for a virtual address, because there might be
>>> anonymous memory in a private mapping that has to be freed up using
>>> MADV_DONTNEED.
>>
>> I can understand this. But it seems unrelated to my question: is being
>> mapped as shared a must to support memory discard?
> 
> Sorry, I don't quite get what you are asking that I haven't answered 
> yet. Let's talk about the issue you are seeing below.
> 
>>
>> e.g., if I use the below parameters to specify the RAM for a VM
>>
>>     -object memory-backend-memfd,id=mem0,size=2G    \
>>     -machine memory-backend=mem0
>>
>> since the "share" property is not specified, the ram_block doesn't have
>> RAM_SHARED set. If we want to discard some range of this memfd, it
>> triggers the warning. Is this warning expected?
> 
> That should not be the case. See "memfd_backend_instance_init" where we 
> set share=true. In memfd_backend_memory_alloc(), we set RAM_SHARED.
> 
> We only default to share=off for memory-backend-file (well, and 
> memory-backend-ram).
> 
> So are you sure you get this error message in the configuration you are 
> describing here?

Sorry, I made a mistake. I was using "-object
memory-backend-ram,id=mem0,size=2G" instead of "memory-backend-memfd".

Yes, when using "memory-backend-memfd" as the backend for TDX shared
memory, it doesn't trigger the warning, because
memfd_backend_instance_init() sets "share" to true.

When using "memory-backend-ram" as the backend for TDX shared memory,
the warning is triggered when converting memory from private (kvm gmem)
to shared (backend-ram). In this case, there is a valid fd (the kvm gmem
fd), so it goes down the need_fallocate path. However,
qemu_ram_is_shared(rb) returns false, because "memory-backend-ram"
doesn't default "share" to on. Thus, the warning is triggered.

It seems I need to figure out a more proper solution: refactor
ram_block_discard_range() to make it applicable to the kvm gmem discard
case.

>>
>> I know it is not a possible case for current QEMU, but it's true for
>> future TDX VMs. A TDX VM can have memory-backend-memfd as the backend for
>> shared memory and a kvm gmem memfd for private memory. At any time, for
>> any memory range, only one type is valid, thus the range in the opposite
>> memfd can be fallocated.
> 
> Right.
> 
>>
>> Here I read your message as "the ramblock needs to have the RAM_SHARED
>> flag to allow the fallocate of the memfd". This is what I don't understand.
> 
> The problem I am seeing is that either
> 
> (a) Someone explicitly sets share=off for some reason for 
> memory-backend-memfd, triggering the warning.
> 
> (b) We somehow lose RAM_SHARED in above configuration, which would be 
> bad and trigger the warning.
> 
> Can you make sure that (a) is not the case?

My reply above answers it. Sorry about that.




* Re: [PATCH v2 1/4] softmmu/physmem: Warn with ram_block_discard_range() on MAP_PRIVATE file mapping
  2023-10-19  9:26               ` Xiaoyao Li
@ 2023-10-19  9:43                 ` David Hildenbrand
  0 siblings, 0 replies; 27+ messages in thread
From: David Hildenbrand @ 2023-10-19  9:43 UTC (permalink / raw)
  To: Xiaoyao Li, qemu-devel
  Cc: Michael S. Tsirkin, Juan Quintela, Peter Xu, Leonardo Bras,
	Paolo Bonzini, Philippe Mathieu-Daudé, Peng Tao,
	Mario Casquero

On 19.10.23 11:26, Xiaoyao Li wrote:
> On 10/19/2023 4:26 PM, David Hildenbrand wrote:
>> [...]
>> That should not be the case. See "memfd_backend_instance_init" where we
>> set share=true. In memfd_backend_memory_alloc(), we set RAM_SHARED.
>>
>> We only default to share=off for memory-backend-file (well, and
>> memory-backend-ram).
>>
>> So are you sure you get this error message in the configuration you are
>> describing here?
> 
> Sorry, I made a mistake. I was using "-object
> memory-backend-ram,id=mem0,size=2G" instead of "memory-backend-memfd".
> 
> Yes, when using "memory-backend-memfd" as the backend for TDX shared
> memory, it doesn't trigger the warning, because
> memfd_backend_instance_init() sets "share" to true.
> 
> When using "memory-backend-ram" as the backend for TDX shared memory,
> the warning is triggered when converting memory from private (kvm gmem)
> to shared (backend-ram). In this case, there is a valid fd (the kvm gmem
> fd), so it goes down the need_fallocate path. However,
> qemu_ram_is_shared(rb) returns false, because "memory-backend-ram"
> doesn't default "share" to on. Thus, the warning is triggered.
> 
> It seems I need to figure out a more proper solution: refactor
> ram_block_discard_range() to make it applicable to the kvm gmem discard case.

You should probably completely ignore any RAMBlock flags when
fallocate(punch)ing the kvm_gmem_fd. The kvm_gmem_fd is a rather special
"secondary storage that's never mapped", independent of the ordinary
RAMBlock memory.

-- 
Cheers,

David / dhildenb




end of thread, other threads:[~2023-10-19  9:44 UTC | newest]

Thread overview: 27+ messages
2023-07-06  7:56 [PATCH v2 0/4] virtio-mem: Support "x-ignore-shared" migration David Hildenbrand
2023-07-06  7:56 ` [PATCH v2 1/4] softmmu/physmem: Warn with ram_block_discard_range() on MAP_PRIVATE file mapping David Hildenbrand
2023-07-06  8:10   ` Juan Quintela
2023-07-06  8:31     ` David Hildenbrand
2023-07-06 13:20       ` Juan Quintela
2023-07-06 13:23         ` David Hildenbrand
2023-07-06  8:49   ` David Hildenbrand
2023-10-18  3:02   ` Xiaoyao Li
2023-10-18  7:42     ` David Hildenbrand
2023-10-18  9:02       ` Xiaoyao Li
2023-10-18  9:26         ` David Hildenbrand
2023-10-18 16:27           ` Xiaoyao Li
2023-10-19  8:26             ` David Hildenbrand
2023-10-19  9:26               ` Xiaoyao Li
2023-10-19  9:43                 ` David Hildenbrand
2023-07-06  7:56 ` [PATCH v2 2/4] virtio-mem: Skip most of virtio_mem_unplug_all() without plugged memory David Hildenbrand
2023-07-06  8:15   ` Juan Quintela
2023-07-06  8:38     ` David Hildenbrand
2023-07-06 13:27       ` Juan Quintela
2023-07-06  7:56 ` [PATCH v2 3/4] migration/ram: Expose ramblock_is_ignored() as migrate_ram_is_ignored() David Hildenbrand
2023-07-06  8:16   ` Juan Quintela
2023-07-06  7:56 ` [PATCH v2 4/4] virtio-mem: Support "x-ignore-shared" migration David Hildenbrand
2023-07-06 11:06   ` Juan Quintela
2023-07-06 11:27     ` David Hildenbrand
2023-07-06 11:59   ` Juan Quintela
2023-07-06 14:03 ` [PATCH v2 0/4] " Michael S. Tsirkin
2023-07-07 12:21   ` David Hildenbrand
