* [PATCH v2 0/7] migration/multifd: Some VFIO / postcopy preparations on flush
@ 2024-12-06 0:58 Peter Xu
2024-12-06 0:58 ` [PATCH v2 1/7] migration/multifd: Further remove the SYNC on complete Peter Xu
` (6 more replies)
0 siblings, 7 replies; 25+ messages in thread
From: Peter Xu @ 2024-12-06 0:58 UTC (permalink / raw)
To: qemu-devel
Cc: Maciej S . Szmigiero, peterx, Cédric Le Goater, Avihai Horon,
Alex Williamson, Fabiano Rosas, Prasad Pandit
CI: https://gitlab.com/peterx/qemu/-/pipelines/1575970314
Compared to v1, this v2 series now contains some patches that may be
helpful for either VFIO or postcopy integration on top of multifd.
For VFIO, only patches 1 & 2 are relevant. This is the only part that can
be compared against v1; following Fabiano's comment, I renamed the sync
flags to LOCAL and ALL.
For postcopy, patches 3-7 are relevant, but they need to be based on patches 1-2.
I put them together because both changes touch the multifd sync operation,
so one must depend on the other in some form; hence I decided to send them
together here. All these patches can be seen as cleanups / slight
optimizations on top of the master branch.
This time I did more testing. Besides CI, qtests, and some real-world multifd
tests just to monitor that the sync events all happen correctly, I made sure
to cover the 7.2 machine type (which uses the legacy sync) so it still works
as before - basically the syncs will be more frequent, but everything keeps
working smoothly so far.
Fabiano, let me know what you think compared to the other patch [1] on
simplifying the flush checks. I'm open to any comments.
Thanks,
[1] https://lore.kernel.org/r/875xo8n4ue.fsf@suse.de
Peter Xu (7):
migration/multifd: Further remove the SYNC on complete
migration/multifd: Allow to sync with sender threads only
migration/ram: Move RAM_SAVE_FLAG* into ram.h
migration/multifd: Unify RAM_SAVE_FLAG_MULTIFD_FLUSH messages
migration/multifd: Remove sync processing on postcopy
migration/multifd: Cleanup src flushes on condition check
migration/multifd: Document the reason to sync for save_setup()
migration/multifd.h | 23 ++++++++--
migration/ram.h | 25 +++++++++++
migration/rdma.h | 7 ---
migration/multifd-nocomp.c | 74 ++++++++++++++++++++++++++++++-
migration/multifd.c | 15 ++++---
migration/ram.c | 91 +++++++++++++++++---------------------
6 files changed, 166 insertions(+), 69 deletions(-)
--
2.47.0
^ permalink raw reply [flat|nested] 25+ messages in thread
* [PATCH v2 1/7] migration/multifd: Further remove the SYNC on complete
2024-12-06 0:58 [PATCH v2 0/7] migration/multifd: Some VFIO / postcopy preparations on flush Peter Xu
@ 2024-12-06 0:58 ` Peter Xu
2024-12-06 13:17 ` Fabiano Rosas
2024-12-06 0:58 ` [PATCH v2 2/7] migration/multifd: Allow to sync with sender threads only Peter Xu
` (5 subsequent siblings)
6 siblings, 1 reply; 25+ messages in thread
From: Peter Xu @ 2024-12-06 0:58 UTC (permalink / raw)
To: qemu-devel
Cc: Maciej S . Szmigiero, peterx, Cédric Le Goater, Avihai Horon,
Alex Williamson, Fabiano Rosas, Prasad Pandit
Commit 637280aeb2 ("migration/multifd: Avoid the final FLUSH in
complete()") removed the FLUSH operation on complete() which should avoid
one global sync on destination side, because it's not needed.
However, that commit overlooked the multifd_ram_flush_and_sync() part of
things, as that is always used together with the FLUSH message on the main
channel.
For very old binaries (multifd_flush_after_each_section==true), that's
still needed because each EOS received on destination will enforce
all-channel sync once.
For new binaries (multifd_flush_after_each_section==false), the flush and
sync shouldn't be needed just like the FLUSH, with the same reasoning.
Currently, on new binaries we'll still send SYNC messages on multifd
channels, even if FLUSH is omitted at the end. This makes all recv threads
hang at the SYNC message.
Multifd is still all working fine because luckily recv side cleanup
code (mostly multifd_recv_sync_main()) is smart enough to make sure even if
recv threads are stuck at SYNC it'll get kicked out. And since this is the
completion phase of migration, nothing else will be sent after the SYNCs.
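To make the hang easier to picture, below is a tiny standalone model (plain
pthreads / POSIX semaphores, not QEMU code; the channel count and names are
made up): each recv thread parks on its per-channel semaphore when it sees
SYNC, and nothing releases it until the cleanup path does.
#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>
#include <unistd.h>
#define N_CHANNELS 2
static sem_t sem_sync[N_CHANNELS];      /* models each channel's p->sem_sync */
static void *recv_thread(void *opaque)
{
    int id = (int)(long)opaque;
    printf("channel %d: got SYNC, waiting to be released\n", id);
    sem_wait(&sem_sync[id]);            /* parked here when FLUSH is omitted */
    printf("channel %d: released\n", id);
    return NULL;
}
int main(void)
{
    pthread_t th[N_CHANNELS];
    for (int i = 0; i < N_CHANNELS; i++) {
        sem_init(&sem_sync[i], 0, 0);
        pthread_create(&th[i], NULL, recv_thread, (void *)(long)i);
    }
    sleep(1);                           /* no FLUSH arrives on the main channel... */
    for (int i = 0; i < N_CHANNELS; i++) {
        sem_post(&sem_sync[i]);         /* ...so cleanup has to kick each channel */
    }
    for (int i = 0; i < N_CHANNELS; i++) {
        pthread_join(th[i], NULL);
    }
    return 0;
}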
This may be needed for VFIO in the future because VFIO can have data to
push after ram_save_complete(), hence we don't want the recv thread to be
stuck in the SYNC message. Removing this limitation also makes the src
slightly faster by avoiding some extra code.
Stable branches do not need this patch, as I cannot think of a real bug
that would go wrong there, so no Fixes tag is attached, to make clear that
a backport is not needed.
Signed-off-by: Peter Xu <peterx@redhat.com>
---
migration/ram.c | 13 ++++++++++---
1 file changed, 10 insertions(+), 3 deletions(-)
diff --git a/migration/ram.c b/migration/ram.c
index 05ff9eb328..7284c34bd8 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -3283,9 +3283,16 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
}
}
- ret = multifd_ram_flush_and_sync();
- if (ret < 0) {
- return ret;
+ if (migrate_multifd() &&
+ migrate_multifd_flush_after_each_section()) {
+ /*
+ * Only the old dest QEMU will need this sync, because each EOS
+ * will require one SYNC message on each channel.
+ */
+ ret = multifd_ram_flush_and_sync();
+ if (ret < 0) {
+ return ret;
+ }
}
if (migrate_mapped_ram()) {
--
2.47.0
^ permalink raw reply related [flat|nested] 25+ messages in thread
* [PATCH v2 2/7] migration/multifd: Allow to sync with sender threads only
2024-12-06 0:58 [PATCH v2 0/7] migration/multifd: Some VFIO / postcopy preparations on flush Peter Xu
2024-12-06 0:58 ` [PATCH v2 1/7] migration/multifd: Further remove the SYNC on complete Peter Xu
@ 2024-12-06 0:58 ` Peter Xu
2024-12-06 13:26 ` Fabiano Rosas
2024-12-06 0:58 ` [PATCH v2 3/7] migration/ram: Move RAM_SAVE_FLAG* into ram.h Peter Xu
` (4 subsequent siblings)
6 siblings, 1 reply; 25+ messages in thread
From: Peter Xu @ 2024-12-06 0:58 UTC (permalink / raw)
To: qemu-devel
Cc: Maciej S . Szmigiero, peterx, Cédric Le Goater, Avihai Horon,
Alex Williamson, Fabiano Rosas, Prasad Pandit
Teach multifd_send_sync_main() to sync with threads only.
We already have such requests, which is when mapped-ram is enabled with
multifd. In that case, no SYNC messages will be pushed to the stream when
multifd syncs the sender threads because there are no destination threads
waiting for that. The whole point of the sync is to make sure all threads
flushed their jobs.
So fundamentally we have requests to do the sync in two different ways:
- Either to sync the threads only,
- Or to sync the threads but also with the destination side.
Mapped-ram did it already because of the use_packet check in the sync
handler of the sender thread. It works.
However, it may stop working when e.g. VFIO starts to reuse multifd
channels to push device states. In that case VFIO has a similar request
for a "thread-only sync", but we can't just check a flag, because such a
sync request can still come from RAM, which needs the on-wire
notifications.
Pave the way for that by allowing multifd_send_sync_main() to specify
what kind of sync the caller needs. We can use it for mapped-ram already.
No functional change intended.
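For illustration, here is how callers might pick the request kind on top of
this patch (a sketch; the VFIO helper name below is made up, only the RAM
one mirrors what multifd_ram_flush_and_sync() does):
static int ram_flush_and_sync_example(void)
{
    /* RAM over a live channel also needs the remote side to sync */
    MultiFDSyncReq req = migrate_mapped_ram() ? MULTIFD_SYNC_LOCAL
                                              : MULTIFD_SYNC_ALL;
    return multifd_send_sync_main(req);
}
static int vfio_device_state_flush_example(void)
{
    /*
     * A (hypothetical) device-state user only needs to know its payloads
     * left the sender threads; no recv thread is waiting for a SYNC
     * message, so a local sync is enough.
     */
    return multifd_send_sync_main(MULTIFD_SYNC_LOCAL);
}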
Signed-off-by: Peter Xu <peterx@redhat.com>
---
migration/multifd.h | 19 ++++++++++++++++---
migration/multifd-nocomp.c | 7 ++++++-
migration/multifd.c | 15 +++++++++------
3 files changed, 31 insertions(+), 10 deletions(-)
diff --git a/migration/multifd.h b/migration/multifd.h
index 50d58c0c9c..bd337631ec 100644
--- a/migration/multifd.h
+++ b/migration/multifd.h
@@ -19,6 +19,18 @@
typedef struct MultiFDRecvData MultiFDRecvData;
typedef struct MultiFDSendData MultiFDSendData;
+typedef enum {
+ /* No sync request */
+ MULTIFD_SYNC_NONE = 0,
+ /* Sync locally on the sender threads without pushing messages */
+ MULTIFD_SYNC_LOCAL,
+ /*
+ * Sync not only on the sender threads, but also push "SYNC" message to
+ * the wire (which is for a remote sync).
+ */
+ MULTIFD_SYNC_ALL,
+} MultiFDSyncReq;
+
bool multifd_send_setup(void);
void multifd_send_shutdown(void);
void multifd_send_channel_created(void);
@@ -28,7 +40,7 @@ void multifd_recv_shutdown(void);
bool multifd_recv_all_channels_created(void);
void multifd_recv_new_channel(QIOChannel *ioc, Error **errp);
void multifd_recv_sync_main(void);
-int multifd_send_sync_main(void);
+int multifd_send_sync_main(MultiFDSyncReq req);
bool multifd_queue_page(RAMBlock *block, ram_addr_t offset);
bool multifd_recv(void);
MultiFDRecvData *multifd_get_recv_data(void);
@@ -143,7 +155,7 @@ typedef struct {
/* multifd flags for each packet */
uint32_t flags;
/*
- * The sender thread has work to do if either of below boolean is set.
+ * The sender thread has work to do if either of below field is set.
*
* @pending_job: a job is pending
* @pending_sync: a sync request is pending
@@ -152,7 +164,8 @@ typedef struct {
* cleared by the multifd sender threads.
*/
bool pending_job;
- bool pending_sync;
+ MultiFDSyncReq pending_sync;
+
MultiFDSendData *data;
/* thread local variables. No locking required */
diff --git a/migration/multifd-nocomp.c b/migration/multifd-nocomp.c
index 55191152f9..219f9e58ef 100644
--- a/migration/multifd-nocomp.c
+++ b/migration/multifd-nocomp.c
@@ -345,6 +345,8 @@ retry:
int multifd_ram_flush_and_sync(void)
{
+ MultiFDSyncReq req;
+
if (!migrate_multifd()) {
return 0;
}
@@ -356,7 +358,10 @@ int multifd_ram_flush_and_sync(void)
}
}
- return multifd_send_sync_main();
+ /* File migrations only need to sync with threads */
+ req = migrate_mapped_ram() ? MULTIFD_SYNC_LOCAL : MULTIFD_SYNC_ALL;
+
+ return multifd_send_sync_main(req);
}
bool multifd_send_prepare_common(MultiFDSendParams *p)
diff --git a/migration/multifd.c b/migration/multifd.c
index 498e71fd10..2248bd2d46 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -523,7 +523,7 @@ static int multifd_zero_copy_flush(QIOChannel *c)
return ret;
}
-int multifd_send_sync_main(void)
+int multifd_send_sync_main(MultiFDSyncReq req)
{
int i;
bool flush_zero_copy;
@@ -543,8 +543,8 @@ int multifd_send_sync_main(void)
* We should be the only user so far, so not possible to be set by
* others concurrently.
*/
- assert(qatomic_read(&p->pending_sync) == false);
- qatomic_set(&p->pending_sync, true);
+ assert(qatomic_read(&p->pending_sync) == MULTIFD_SYNC_NONE);
+ qatomic_set(&p->pending_sync, req);
qemu_sem_post(&p->sem);
}
for (i = 0; i < migrate_multifd_channels(); i++) {
@@ -635,14 +635,17 @@ static void *multifd_send_thread(void *opaque)
*/
qatomic_store_release(&p->pending_job, false);
} else {
+ MultiFDSyncReq req = qatomic_read(&p->pending_sync);
+
/*
* If not a normal job, must be a sync request. Note that
* pending_sync is a standalone flag (unlike pending_job), so
* it doesn't require explicit memory barriers.
*/
- assert(qatomic_read(&p->pending_sync));
+ assert(req != MULTIFD_SYNC_NONE);
- if (use_packets) {
+ /* Only push the SYNC message if it involves a remote sync */
+ if (req == MULTIFD_SYNC_ALL) {
p->flags = MULTIFD_FLAG_SYNC;
multifd_send_fill_packet(p);
ret = qio_channel_write_all(p->c, (void *)p->packet,
@@ -654,7 +657,7 @@ static void *multifd_send_thread(void *opaque)
stat64_add(&mig_stats.multifd_bytes, p->packet_len);
}
- qatomic_set(&p->pending_sync, false);
+ qatomic_set(&p->pending_sync, MULTIFD_SYNC_NONE);
qemu_sem_post(&p->sem_sync);
}
}
--
2.47.0
^ permalink raw reply related [flat|nested] 25+ messages in thread
* [PATCH v2 3/7] migration/ram: Move RAM_SAVE_FLAG* into ram.h
2024-12-06 0:58 [PATCH v2 0/7] migration/multifd: Some VFIO / postcopy preparations on flush Peter Xu
2024-12-06 0:58 ` [PATCH v2 1/7] migration/multifd: Further remove the SYNC on complete Peter Xu
2024-12-06 0:58 ` [PATCH v2 2/7] migration/multifd: Allow to sync with sender threads only Peter Xu
@ 2024-12-06 0:58 ` Peter Xu
2024-12-06 13:43 ` Fabiano Rosas
2024-12-06 0:58 ` [PATCH v2 4/7] migration/multifd: Unify RAM_SAVE_FLAG_MULTIFD_FLUSH messages Peter Xu
` (3 subsequent siblings)
6 siblings, 1 reply; 25+ messages in thread
From: Peter Xu @ 2024-12-06 0:58 UTC (permalink / raw)
To: qemu-devel
Cc: Maciej S . Szmigiero, peterx, Cédric Le Goater, Avihai Horon,
Alex Williamson, Fabiano Rosas, Prasad Pandit
Firstly, we're going to use the multifd flag in multifd code soon, so
keeping it in ram.c isn't going to work.
Secondly, we have a separate RDMA flag dangling around, which is definitely
not obvious. There's one comment that helps, but not too much.
We should just put them all together, so nothing gets overlooked.
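For context, these flags are OR'ed into a be64 header word, typically
together with a page-aligned offset or a size, and the load side separates
them again with a mask. A standalone sketch of that pattern (not QEMU code;
4K pages assumed just for the example):
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#define PAGE_SIZE                0x1000ULL   /* assumed for this example */
#define PAGE_MASK                (~(PAGE_SIZE - 1))
#define RAM_SAVE_FLAG_ZERO       0x02ULL
#define RAM_SAVE_FLAG_CONTINUE   0x20ULL
int main(void)
{
    /* save side: OR the flags into a page-aligned offset */
    uint64_t offset = 42 * PAGE_SIZE;
    uint64_t header = offset | RAM_SAVE_FLAG_ZERO | RAM_SAVE_FLAG_CONTINUE;
    /* load side: split the word back into address and flags */
    uint64_t flags = header & ~PAGE_MASK;
    uint64_t addr  = header & PAGE_MASK;
    printf("addr=0x%" PRIx64 " flags=0x%" PRIx64 "\n", addr, flags);
    return 0;
}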
Signed-off-by: Peter Xu <peterx@redhat.com>
---
migration/ram.h | 25 +++++++++++++++++++++++++
migration/rdma.h | 7 -------
migration/ram.c | 21 ---------------------
3 files changed, 25 insertions(+), 28 deletions(-)
diff --git a/migration/ram.h b/migration/ram.h
index 0d1981f888..cfdcccd266 100644
--- a/migration/ram.h
+++ b/migration/ram.h
@@ -33,6 +33,31 @@
#include "exec/cpu-common.h"
#include "io/channel.h"
+/*
+ * RAM_SAVE_FLAG_ZERO used to be named RAM_SAVE_FLAG_COMPRESS, it
+ * worked for pages that were filled with the same char. We switched
+ * it to only search for the zero value. And to avoid confusion with
+ * RAM_SAVE_FLAG_COMPRESS_PAGE just rename it.
+ *
+ * RAM_SAVE_FLAG_FULL was obsoleted in 2009.
+ *
+ * RAM_SAVE_FLAG_COMPRESS_PAGE (0x100) was removed in QEMU 9.1.
+ */
+#define RAM_SAVE_FLAG_FULL 0x01
+#define RAM_SAVE_FLAG_ZERO 0x02
+#define RAM_SAVE_FLAG_MEM_SIZE 0x04
+#define RAM_SAVE_FLAG_PAGE 0x08
+#define RAM_SAVE_FLAG_EOS 0x10
+#define RAM_SAVE_FLAG_CONTINUE 0x20
+#define RAM_SAVE_FLAG_XBZRLE 0x40
+/*
+ * ONLY USED IN RDMA: Whenever this is found in the data stream, the flags
+ * will be passed to rdma functions in the incoming-migration side.
+ */
+#define RAM_SAVE_FLAG_HOOK 0x80
+#define RAM_SAVE_FLAG_MULTIFD_FLUSH 0x200
+/* We can't use any flag that is bigger than 0x200 */
+
extern XBZRLECacheStats xbzrle_counters;
/* Should be holding either ram_list.mutex, or the RCU lock. */
diff --git a/migration/rdma.h b/migration/rdma.h
index a8d27f33b8..f55f28bbed 100644
--- a/migration/rdma.h
+++ b/migration/rdma.h
@@ -33,13 +33,6 @@ void rdma_start_incoming_migration(InetSocketAddress *host_port, Error **errp);
#define RAM_CONTROL_ROUND 1
#define RAM_CONTROL_FINISH 3
-/*
- * Whenever this is found in the data stream, the flags
- * will be passed to rdma functions in the incoming-migration
- * side.
- */
-#define RAM_SAVE_FLAG_HOOK 0x80
-
#define RAM_SAVE_CONTROL_NOT_SUPP -1000
#define RAM_SAVE_CONTROL_DELAYED -2000
diff --git a/migration/ram.c b/migration/ram.c
index 7284c34bd8..44010ff325 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -71,27 +71,6 @@
/***********************************************************/
/* ram save/restore */
-/*
- * RAM_SAVE_FLAG_ZERO used to be named RAM_SAVE_FLAG_COMPRESS, it
- * worked for pages that were filled with the same char. We switched
- * it to only search for the zero value. And to avoid confusion with
- * RAM_SAVE_FLAG_COMPRESS_PAGE just rename it.
- *
- * RAM_SAVE_FLAG_FULL was obsoleted in 2009.
- *
- * RAM_SAVE_FLAG_COMPRESS_PAGE (0x100) was removed in QEMU 9.1.
- */
-#define RAM_SAVE_FLAG_FULL 0x01
-#define RAM_SAVE_FLAG_ZERO 0x02
-#define RAM_SAVE_FLAG_MEM_SIZE 0x04
-#define RAM_SAVE_FLAG_PAGE 0x08
-#define RAM_SAVE_FLAG_EOS 0x10
-#define RAM_SAVE_FLAG_CONTINUE 0x20
-#define RAM_SAVE_FLAG_XBZRLE 0x40
-/* 0x80 is reserved in rdma.h for RAM_SAVE_FLAG_HOOK */
-#define RAM_SAVE_FLAG_MULTIFD_FLUSH 0x200
-/* We can't use any flag that is bigger than 0x200 */
-
/*
* mapped-ram migration supports O_DIRECT, so we need to make sure the
* userspace buffer, the IO operation size and the file offset are
--
2.47.0
^ permalink raw reply related [flat|nested] 25+ messages in thread
* [PATCH v2 4/7] migration/multifd: Unify RAM_SAVE_FLAG_MULTIFD_FLUSH messages
2024-12-06 0:58 [PATCH v2 0/7] migration/multifd: Some VFIO / postcopy preparations on flush Peter Xu
` (2 preceding siblings ...)
2024-12-06 0:58 ` [PATCH v2 3/7] migration/ram: Move RAM_SAVE_FLAG* into ram.h Peter Xu
@ 2024-12-06 0:58 ` Peter Xu
2024-12-06 14:12 ` Fabiano Rosas
2024-12-06 0:58 ` [PATCH v2 5/7] migration/multifd: Remove sync processing on postcopy Peter Xu
` (2 subsequent siblings)
6 siblings, 1 reply; 25+ messages in thread
From: Peter Xu @ 2024-12-06 0:58 UTC (permalink / raw)
To: qemu-devel
Cc: Maciej S . Szmigiero, peterx, Cédric Le Goater, Avihai Horon,
Alex Williamson, Fabiano Rosas, Prasad Pandit
The RAM_SAVE_FLAG_MULTIFD_FLUSH message should always correlate with a sync
request on src. Unify the sending of this message into one place, and only
send it when necessary.
Signed-off-by: Peter Xu <peterx@redhat.com>
---
migration/multifd.h | 2 +-
migration/multifd-nocomp.c | 27 +++++++++++++++++++++++++--
migration/ram.c | 18 ++++--------------
3 files changed, 30 insertions(+), 17 deletions(-)
diff --git a/migration/multifd.h b/migration/multifd.h
index bd337631ec..c9ae57ea02 100644
--- a/migration/multifd.h
+++ b/migration/multifd.h
@@ -350,7 +350,7 @@ static inline uint32_t multifd_ram_page_count(void)
void multifd_ram_save_setup(void);
void multifd_ram_save_cleanup(void);
-int multifd_ram_flush_and_sync(void);
+int multifd_ram_flush_and_sync(QEMUFile *f);
size_t multifd_ram_payload_size(void);
void multifd_ram_fill_packet(MultiFDSendParams *p);
int multifd_ram_unfill_packet(MultiFDRecvParams *p, Error **errp);
diff --git a/migration/multifd-nocomp.c b/migration/multifd-nocomp.c
index 219f9e58ef..58372db0f4 100644
--- a/migration/multifd-nocomp.c
+++ b/migration/multifd-nocomp.c
@@ -20,6 +20,7 @@
#include "qemu/cutils.h"
#include "qemu/error-report.h"
#include "trace.h"
+#include "qemu-file.h"
static MultiFDSendData *multifd_ram_send;
@@ -343,9 +344,10 @@ retry:
return true;
}
-int multifd_ram_flush_and_sync(void)
+int multifd_ram_flush_and_sync(QEMUFile *f)
{
MultiFDSyncReq req;
+ int ret;
if (!migrate_multifd()) {
return 0;
@@ -361,7 +363,28 @@ int multifd_ram_flush_and_sync(void)
/* File migrations only need to sync with threads */
req = migrate_mapped_ram() ? MULTIFD_SYNC_LOCAL : MULTIFD_SYNC_ALL;
- return multifd_send_sync_main(req);
+ ret = multifd_send_sync_main(req);
+ if (ret) {
+ return ret;
+ }
+
+ /* If we don't need to sync with remote at all, nothing else to do */
+ if (req == MULTIFD_SYNC_LOCAL) {
+ return 0;
+ }
+
+ /*
+ * Old QEMUs don't understand RAM_SAVE_FLAG_MULTIFD_FLUSH, it relies
+ * on RAM_SAVE_FLAG_EOS instead.
+ */
+ if (migrate_multifd_flush_after_each_section()) {
+ return 0;
+ }
+
+ qemu_put_be64(f, RAM_SAVE_FLAG_MULTIFD_FLUSH);
+ qemu_fflush(f);
+
+ return 0;
}
bool multifd_send_prepare_common(MultiFDSendParams *p)
diff --git a/migration/ram.c b/migration/ram.c
index 44010ff325..90811aabd4 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -1306,15 +1306,10 @@ static int find_dirty_block(RAMState *rs, PageSearchStatus *pss)
(!migrate_multifd_flush_after_each_section() ||
migrate_mapped_ram())) {
QEMUFile *f = rs->pss[RAM_CHANNEL_PRECOPY].pss_channel;
- int ret = multifd_ram_flush_and_sync();
+ int ret = multifd_ram_flush_and_sync(f);
if (ret < 0) {
return ret;
}
-
- if (!migrate_mapped_ram()) {
- qemu_put_be64(f, RAM_SAVE_FLAG_MULTIFD_FLUSH);
- qemu_fflush(f);
- }
}
/* Hit the end of the list */
@@ -3044,18 +3039,13 @@ static int ram_save_setup(QEMUFile *f, void *opaque, Error **errp)
}
bql_unlock();
- ret = multifd_ram_flush_and_sync();
+ ret = multifd_ram_flush_and_sync(f);
bql_lock();
if (ret < 0) {
error_setg(errp, "%s: multifd synchronization failed", __func__);
return ret;
}
- if (migrate_multifd() && !migrate_multifd_flush_after_each_section()
- && !migrate_mapped_ram()) {
- qemu_put_be64(f, RAM_SAVE_FLAG_MULTIFD_FLUSH);
- }
-
qemu_put_be64(f, RAM_SAVE_FLAG_EOS);
ret = qemu_fflush(f);
if (ret < 0) {
@@ -3190,7 +3180,7 @@ out:
if (ret >= 0 && migration_is_running()) {
if (migrate_multifd() && migrate_multifd_flush_after_each_section() &&
!migrate_mapped_ram()) {
- ret = multifd_ram_flush_and_sync();
+ ret = multifd_ram_flush_and_sync(f);
if (ret < 0) {
return ret;
}
@@ -3268,7 +3258,7 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
* Only the old dest QEMU will need this sync, because each EOS
* will require one SYNC message on each channel.
*/
- ret = multifd_ram_flush_and_sync();
+ ret = multifd_ram_flush_and_sync(f);
if (ret < 0) {
return ret;
}
--
2.47.0
^ permalink raw reply related [flat|nested] 25+ messages in thread
* [PATCH v2 5/7] migration/multifd: Remove sync processing on postcopy
2024-12-06 0:58 [PATCH v2 0/7] migration/multifd: Some VFIO / postcopy preparations on flush Peter Xu
` (3 preceding siblings ...)
2024-12-06 0:58 ` [PATCH v2 4/7] migration/multifd: Unify RAM_SAVE_FLAG_MULTIFD_FLUSH messages Peter Xu
@ 2024-12-06 0:58 ` Peter Xu
2024-12-06 14:19 ` Fabiano Rosas
2024-12-06 0:58 ` [PATCH v2 6/7] migration/multifd: Cleanup src flushes on condition check Peter Xu
2024-12-06 0:58 ` [PATCH v2 7/7] migration/multifd: Document the reason to sync for save_setup() Peter Xu
6 siblings, 1 reply; 25+ messages in thread
From: Peter Xu @ 2024-12-06 0:58 UTC (permalink / raw)
To: qemu-devel
Cc: Maciej S . Szmigiero, peterx, Cédric Le Goater, Avihai Horon,
Alex Williamson, Fabiano Rosas, Prasad Pandit
Multifd never worked with postcopy, at least not so far.
Remove the sync processing there, because it's confusing, and these sync
messages should never appear there. Now if RAM_SAVE_FLAG_MULTIFD_FLUSH is
observed, we fail hard instead of trying to invoke multifd code.
Signed-off-by: Peter Xu <peterx@redhat.com>
---
migration/ram.c | 8 --------
1 file changed, 8 deletions(-)
diff --git a/migration/ram.c b/migration/ram.c
index 90811aabd4..154ff5abd4 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -3772,15 +3772,7 @@ int ram_load_postcopy(QEMUFile *f, int channel)
TARGET_PAGE_SIZE);
}
break;
- case RAM_SAVE_FLAG_MULTIFD_FLUSH:
- multifd_recv_sync_main();
- break;
case RAM_SAVE_FLAG_EOS:
- /* normal exit */
- if (migrate_multifd() &&
- migrate_multifd_flush_after_each_section()) {
- multifd_recv_sync_main();
- }
break;
default:
error_report("Unknown combination of migration flags: 0x%x"
--
2.47.0
^ permalink raw reply related [flat|nested] 25+ messages in thread
* [PATCH v2 6/7] migration/multifd: Cleanup src flushes on condition check
2024-12-06 0:58 [PATCH v2 0/7] migration/multifd: Some VFIO / postcopy preparations on flush Peter Xu
` (4 preceding siblings ...)
2024-12-06 0:58 ` [PATCH v2 5/7] migration/multifd: Remove sync processing on postcopy Peter Xu
@ 2024-12-06 0:58 ` Peter Xu
2024-12-06 14:18 ` Fabiano Rosas
2024-12-06 0:58 ` [PATCH v2 7/7] migration/multifd: Document the reason to sync for save_setup() Peter Xu
6 siblings, 1 reply; 25+ messages in thread
From: Peter Xu @ 2024-12-06 0:58 UTC (permalink / raw)
To: qemu-devel
Cc: Maciej S . Szmigiero, peterx, Cédric Le Goater, Avihai Horon,
Alex Williamson, Fabiano Rosas, Prasad Pandit
The src flush condition check is overcomplicated, and it will get even more
out of control once postcopy is involved.
In general, we have two modes to do the sync: legacy or modern ways.
Legacy uses per-section flush, modern uses per-round flush.
Mapped-ram always uses the modern way, which is per-round.
Introduce two helpers, which can greatly simplify the code, and hopefully
make it readable again.
Signed-off-by: Peter Xu <peterx@redhat.com>
---
migration/multifd.h | 2 ++
migration/multifd-nocomp.c | 42 ++++++++++++++++++++++++++++++++++++++
migration/ram.c | 10 +++------
3 files changed, 47 insertions(+), 7 deletions(-)
diff --git a/migration/multifd.h b/migration/multifd.h
index c9ae57ea02..582040922f 100644
--- a/migration/multifd.h
+++ b/migration/multifd.h
@@ -351,6 +351,8 @@ static inline uint32_t multifd_ram_page_count(void)
void multifd_ram_save_setup(void);
void multifd_ram_save_cleanup(void);
int multifd_ram_flush_and_sync(QEMUFile *f);
+bool multifd_ram_sync_per_round(void);
+bool multifd_ram_sync_per_section(void);
size_t multifd_ram_payload_size(void);
void multifd_ram_fill_packet(MultiFDSendParams *p);
int multifd_ram_unfill_packet(MultiFDRecvParams *p, Error **errp);
diff --git a/migration/multifd-nocomp.c b/migration/multifd-nocomp.c
index 58372db0f4..c1f686c0ce 100644
--- a/migration/multifd-nocomp.c
+++ b/migration/multifd-nocomp.c
@@ -344,6 +344,48 @@ retry:
return true;
}
+/*
+ * We have two modes for multifd flushes:
+ *
+ * - Per-section mode: this is the legacy way to flush, it requires one
+ * MULTIFD_FLAG_SYNC message for each RAM_SAVE_FLAG_EOS.
+ *
+ * - Per-round mode: this is the modern way to flush, it requires one
+ * MULTIFD_FLAG_SYNC message only for each round of RAM scan. Normally
+ * it's paired with a new RAM_SAVE_FLAG_MULTIFD_FLUSH message in network
+ * based migrations.
+ *
+ * One thing to mention is mapped-ram always use the modern way to sync.
+ */
+
+/* Do we need a per-section multifd flush (legacy way)? */
+bool multifd_ram_sync_per_section(void)
+{
+ if (!migrate_multifd()) {
+ return false;
+ }
+
+ if (migrate_mapped_ram()) {
+ return false;
+ }
+
+ return migrate_multifd_flush_after_each_section();
+}
+
+/* Do we need a per-round multifd flush (modern way)? */
+bool multifd_ram_sync_per_round(void)
+{
+ if (!migrate_multifd()) {
+ return false;
+ }
+
+ if (migrate_mapped_ram()) {
+ return true;
+ }
+
+ return !migrate_multifd_flush_after_each_section();
+}
+
int multifd_ram_flush_and_sync(QEMUFile *f)
{
MultiFDSyncReq req;
diff --git a/migration/ram.c b/migration/ram.c
index 154ff5abd4..5d4bdefe69 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -1302,9 +1302,7 @@ static int find_dirty_block(RAMState *rs, PageSearchStatus *pss)
pss->page = 0;
pss->block = QLIST_NEXT_RCU(pss->block, next);
if (!pss->block) {
- if (migrate_multifd() &&
- (!migrate_multifd_flush_after_each_section() ||
- migrate_mapped_ram())) {
+ if (multifd_ram_sync_per_round()) {
QEMUFile *f = rs->pss[RAM_CHANNEL_PRECOPY].pss_channel;
int ret = multifd_ram_flush_and_sync(f);
if (ret < 0) {
@@ -3178,8 +3176,7 @@ static int ram_save_iterate(QEMUFile *f, void *opaque)
out:
if (ret >= 0 && migration_is_running()) {
- if (migrate_multifd() && migrate_multifd_flush_after_each_section() &&
- !migrate_mapped_ram()) {
+ if (multifd_ram_sync_per_section()) {
ret = multifd_ram_flush_and_sync(f);
if (ret < 0) {
return ret;
@@ -3252,8 +3249,7 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
}
}
- if (migrate_multifd() &&
- migrate_multifd_flush_after_each_section()) {
+ if (multifd_ram_sync_per_section()) {
/*
* Only the old dest QEMU will need this sync, because each EOS
* will require one SYNC message on each channel.
--
2.47.0
^ permalink raw reply related [flat|nested] 25+ messages in thread
* [PATCH v2 7/7] migration/multifd: Document the reason to sync for save_setup()
2024-12-06 0:58 [PATCH v2 0/7] migration/multifd: Some VFIO / postcopy preparations on flush Peter Xu
` (5 preceding siblings ...)
2024-12-06 0:58 ` [PATCH v2 6/7] migration/multifd: Cleanup src flushes on condition check Peter Xu
@ 2024-12-06 0:58 ` Peter Xu
2024-12-06 14:40 ` Fabiano Rosas
6 siblings, 1 reply; 25+ messages in thread
From: Peter Xu @ 2024-12-06 0:58 UTC (permalink / raw)
To: qemu-devel
Cc: Maciej S . Szmigiero, peterx, Cédric Le Goater, Avihai Horon,
Alex Williamson, Fabiano Rosas, Prasad Pandit
It's not straightforward to see why src QEMU needs to sync multifd during
the setup() phase. After all, there are no pages queued at that point.
For old QEMUs, there's a solid reason: EOS requires it to work. It is less
obvious for new QEMUs, which do not take the EOS message as a sync request.
One will only figure that out when this is conditionally removed; in fact,
the author did try it out. Logically we could still avoid doing this on new
machine types, however that needs a separate compat field, and that would be
overkill just to save some trivial overhead in the setup() phase.
Let's instead document it completely, to avoid someone else trying this
again and redoing the debugging, or anyone being confused about why this
ever existed.
Signed-off-by: Peter Xu <peterx@redhat.com>
---
migration/ram.c | 27 +++++++++++++++++++++++++++
1 file changed, 27 insertions(+)
diff --git a/migration/ram.c b/migration/ram.c
index 5d4bdefe69..ddee703585 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -3036,6 +3036,33 @@ static int ram_save_setup(QEMUFile *f, void *opaque, Error **errp)
migration_ops->ram_save_target_page = ram_save_target_page_legacy;
}
+ /*
+ * This operation is unfortunate..
+ *
+ * For legacy QEMUs using per-section sync
+ * =======================================
+ *
+ * This must exist because the EOS below requires the SYNC messages
+ * per-channel to work.
+ *
+ * For modern QEMUs using per-round sync
+ * =====================================
+ *
+ * Logically this sync is not needed (because we know there's nothing
+ * in the multifd queue yet!). However as a side effect, this makes
+ * sure the dest side won't receive any data before it properly reaches
+ * ram_load_precopy().
+ *
+ * Without this sync, src QEMU can send data too soon so that dest may
+ * not have been ready to receive it (e.g., rb->receivedmap may be
+ * uninitialized, for example).
+ *
+ * Logically "wait for recv setup ready" shouldn't need to involve src
+ * QEMU at all, however to be compatible with old QEMUs, let's stick
+ * with this. Fortunately the overhead is low to sync during setup
+ * because the VM is running, so at least it's not accounted as part of
+ * downtime.
+ */
bql_unlock();
ret = multifd_ram_flush_and_sync(f);
bql_lock();
--
2.47.0
^ permalink raw reply related [flat|nested] 25+ messages in thread
* Re: [PATCH v2 1/7] migration/multifd: Further remove the SYNC on complete
2024-12-06 0:58 ` [PATCH v2 1/7] migration/multifd: Further remove the SYNC on complete Peter Xu
@ 2024-12-06 13:17 ` Fabiano Rosas
2024-12-06 14:40 ` Peter Xu
0 siblings, 1 reply; 25+ messages in thread
From: Fabiano Rosas @ 2024-12-06 13:17 UTC (permalink / raw)
To: Peter Xu, qemu-devel
Cc: Maciej S . Szmigiero, peterx, Cédric Le Goater, Avihai Horon,
Alex Williamson, Prasad Pandit
Peter Xu <peterx@redhat.com> writes:
> Commit 637280aeb2 ("migration/multifd: Avoid the final FLUSH in
> complete()") removed the FLUSH operation on complete() which should avoid
> one global sync on destination side, because it's not needed.
>
> However, that commit overlooked the multifd_ram_flush_and_sync() part of
> things, as that is always used together with the FLUSH message on the main
> channel.
>
> For very old binaries (multifd_flush_after_each_section==true), that's
> still needed because each EOS received on destination will enforce
> all-channel sync once.
>
> For new binaries (multifd_flush_after_each_section==false), the flush and
> sync shouldn't be needed just like the FLUSH, with the same reasoning.
>
> Currently, on new binaries we'll still send SYNC messages on multifd
> channels, even if FLUSH is omitted at the end. This makes all recv threads
> hang at the SYNC message.
>
> Multifd is still all working fine because luckily recv side cleanup
> code (mostly multifd_recv_sync_main()) is smart enough to make sure even if
> recv threads are stuck at SYNC it'll get kicked out. And since this is the
> completion phase of migration, nothing else will be sent after the SYNCs.
>
> This may be needed for VFIO in the future because VFIO can have data to
> push after ram_save_complete(), hence we don't want the recv thread to be
> stuck in the SYNC message. Removing this limitation also makes the src
> slightly faster by avoiding some extra code.
>
> Stable branches do not need this patch, as I cannot think of a real bug
> that would go wrong there, so no Fixes tag is attached, to make clear that
> a backport is not needed.
You forgot my comments on the commit message, so here's a brand-new one
with some things spelt out more explicitly:
--
Commit 637280aeb2 ("migration/multifd: Avoid the final FLUSH in
complete()") stopped sending the RAM_SAVE_FLAG_MULTIFD_FLUSH flag at
ram_save_complete(), because the sync on the destination side is not
needed due to the last iteration of find_dirty_block() having already
done it.
However, that commit overlooked that multifd_ram_flush_and_sync() on the
source side is also not needed at ram_save_complete(), for the same
reason.
Moreover, removing the RAM_SAVE_FLAG_MULTIFD_FLUSH but keeping the
multifd_ram_flush_and_sync() means that currently the recv threads will
hang when receiving the MULTIFD_FLAG_SYNC message, waiting for the
destination sync which only happens when RAM_SAVE_FLAG_MULTIFD_FLUSH is
received.
Luckily, multifd is still all working fine because recv side cleanup
code (mostly multifd_recv_sync_main()) is smart enough to make sure even
if recv threads are stuck at SYNC it'll get kicked out. And since this
is the completion phase of migration, nothing else will be sent after
the SYNCs.
This needs to be fixed because in the future VFIO will have data to push
after ram_save_complete() and we don't want the recv thread to be stuck
in the MULTIFD_FLAG_SYNC message.
Remove the unnecessary (and buggy) invocation of
multifd_ram_flush_and_sync().
For very old binaries (multifd_flush_after_each_section==true), the
flush_and_sync is still needed because each EOS received on destination
will enforce all-channel sync once.
Stable branches do not need this patch, as I cannot think of a real bug
that would go wrong there, so no Fixes tag is attached, to make clear that
a backport is not needed.
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
> ---
> migration/ram.c | 13 ++++++++++---
> 1 file changed, 10 insertions(+), 3 deletions(-)
>
> diff --git a/migration/ram.c b/migration/ram.c
> index 05ff9eb328..7284c34bd8 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -3283,9 +3283,16 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
> }
> }
>
> - ret = multifd_ram_flush_and_sync();
> - if (ret < 0) {
> - return ret;
> + if (migrate_multifd() &&
> + migrate_multifd_flush_after_each_section()) {
> + /*
> + * Only the old dest QEMU will need this sync, because each EOS
> + * will require one SYNC message on each channel.
> + */
> + ret = multifd_ram_flush_and_sync();
> + if (ret < 0) {
> + return ret;
> + }
> }
>
> if (migrate_mapped_ram()) {
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v2 2/7] migration/multifd: Allow to sync with sender threads only
2024-12-06 0:58 ` [PATCH v2 2/7] migration/multifd: Allow to sync with sender threads only Peter Xu
@ 2024-12-06 13:26 ` Fabiano Rosas
2024-12-06 14:50 ` Peter Xu
0 siblings, 1 reply; 25+ messages in thread
From: Fabiano Rosas @ 2024-12-06 13:26 UTC (permalink / raw)
To: Peter Xu, qemu-devel
Cc: Maciej S . Szmigiero, peterx, Cédric Le Goater, Avihai Horon,
Alex Williamson, Prasad Pandit
Peter Xu <peterx@redhat.com> writes:
> Teach multifd_send_sync_main() to sync with threads only.
>
> We already have such requests, which is when mapped-ram is enabled with
> multifd. In that case, no SYNC messages will be pushed to the stream when
> multifd syncs the sender threads because there are no destination threads
> waiting for that. The whole point of the sync is to make sure all threads
> flushed their jobs.
s/flushed/finished/ otherwise we risk confusing people.
>
> So fundamentally we have requests to do the sync in two different ways:
>
> - Either to sync the threads only,
> - Or to sync the threads but also with the destination side.
>
> Mapped-ram did it already because of the use_packet check in the sync
> handler of the sender thread. It works.
>
> However, it may stop working when e.g. VFIO starts to reuse multifd
> channels to push device states. In that case VFIO has a similar request
> for a "thread-only sync", but we can't just check a flag, because such a
> sync request can still come from RAM, which needs the on-wire
> notifications.
>
> Pave the way for that by allowing multifd_send_sync_main() to specify
> what kind of sync the caller needs. We can use it for mapped-ram already.
>
> No functional change intended.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> migration/multifd.h | 19 ++++++++++++++++---
> migration/multifd-nocomp.c | 7 ++++++-
> migration/multifd.c | 15 +++++++++------
> 3 files changed, 31 insertions(+), 10 deletions(-)
>
> diff --git a/migration/multifd.h b/migration/multifd.h
> index 50d58c0c9c..bd337631ec 100644
> --- a/migration/multifd.h
> +++ b/migration/multifd.h
> @@ -19,6 +19,18 @@
> typedef struct MultiFDRecvData MultiFDRecvData;
> typedef struct MultiFDSendData MultiFDSendData;
>
> +typedef enum {
> + /* No sync request */
> + MULTIFD_SYNC_NONE = 0,
> + /* Sync locally on the sender threads without pushing messages */
> + MULTIFD_SYNC_LOCAL,
> + /*
> + * Sync not only on the sender threads, but also push "SYNC" message to
> + * the wire (which is for a remote sync).
s/SYNC/MULTIFD_FLAG_SYNC/
Do we need to also mention that this needs to be paired with a
multifd_recv_sync_main() via the emission of the
RAM_SAVE_FLAG_MULTIFD_FLUSH flag on the stream?
> + */
> + MULTIFD_SYNC_ALL,
> +} MultiFDSyncReq;
> +
> bool multifd_send_setup(void);
> void multifd_send_shutdown(void);
> void multifd_send_channel_created(void);
> @@ -28,7 +40,7 @@ void multifd_recv_shutdown(void);
> bool multifd_recv_all_channels_created(void);
> void multifd_recv_new_channel(QIOChannel *ioc, Error **errp);
> void multifd_recv_sync_main(void);
> -int multifd_send_sync_main(void);
> +int multifd_send_sync_main(MultiFDSyncReq req);
> bool multifd_queue_page(RAMBlock *block, ram_addr_t offset);
> bool multifd_recv(void);
> MultiFDRecvData *multifd_get_recv_data(void);
> @@ -143,7 +155,7 @@ typedef struct {
> /* multifd flags for each packet */
> uint32_t flags;
> /*
> - * The sender thread has work to do if either of below boolean is set.
> + * The sender thread has work to do if either of below field is set.
> *
> * @pending_job: a job is pending
> * @pending_sync: a sync request is pending
> @@ -152,7 +164,8 @@ typedef struct {
> * cleared by the multifd sender threads.
> */
> bool pending_job;
> - bool pending_sync;
> + MultiFDSyncReq pending_sync;
> +
> MultiFDSendData *data;
>
> /* thread local variables. No locking required */
> diff --git a/migration/multifd-nocomp.c b/migration/multifd-nocomp.c
> index 55191152f9..219f9e58ef 100644
> --- a/migration/multifd-nocomp.c
> +++ b/migration/multifd-nocomp.c
> @@ -345,6 +345,8 @@ retry:
>
> int multifd_ram_flush_and_sync(void)
> {
> + MultiFDSyncReq req;
> +
> if (!migrate_multifd()) {
> return 0;
> }
> @@ -356,7 +358,10 @@ int multifd_ram_flush_and_sync(void)
> }
> }
>
> - return multifd_send_sync_main();
> + /* File migrations only need to sync with threads */
> + req = migrate_mapped_ram() ? MULTIFD_SYNC_LOCAL : MULTIFD_SYNC_ALL;
> +
> + return multifd_send_sync_main(req);
> }
>
> bool multifd_send_prepare_common(MultiFDSendParams *p)
> diff --git a/migration/multifd.c b/migration/multifd.c
> index 498e71fd10..2248bd2d46 100644
> --- a/migration/multifd.c
> +++ b/migration/multifd.c
> @@ -523,7 +523,7 @@ static int multifd_zero_copy_flush(QIOChannel *c)
> return ret;
> }
>
> -int multifd_send_sync_main(void)
> +int multifd_send_sync_main(MultiFDSyncReq req)
> {
> int i;
> bool flush_zero_copy;
assert(req != MULTIFD_SYNC_NONE) ?
> @@ -543,8 +543,8 @@ int multifd_send_sync_main(void)
> * We should be the only user so far, so not possible to be set by
> * others concurrently.
> */
> - assert(qatomic_read(&p->pending_sync) == false);
> - qatomic_set(&p->pending_sync, true);
> + assert(qatomic_read(&p->pending_sync) == MULTIFD_SYNC_NONE);
> + qatomic_set(&p->pending_sync, req);
> qemu_sem_post(&p->sem);
> }
> for (i = 0; i < migrate_multifd_channels(); i++) {
> @@ -635,14 +635,17 @@ static void *multifd_send_thread(void *opaque)
> */
> qatomic_store_release(&p->pending_job, false);
> } else {
> + MultiFDSyncReq req = qatomic_read(&p->pending_sync);
> +
> /*
> * If not a normal job, must be a sync request. Note that
> * pending_sync is a standalone flag (unlike pending_job), so
> * it doesn't require explicit memory barriers.
> */
> - assert(qatomic_read(&p->pending_sync));
> + assert(req != MULTIFD_SYNC_NONE);
>
> - if (use_packets) {
> + /* Only push the SYNC message if it involves a remote sync */
> + if (req == MULTIFD_SYNC_ALL) {
> p->flags = MULTIFD_FLAG_SYNC;
> multifd_send_fill_packet(p);
> ret = qio_channel_write_all(p->c, (void *)p->packet,
> @@ -654,7 +657,7 @@ static void *multifd_send_thread(void *opaque)
> stat64_add(&mig_stats.multifd_bytes, p->packet_len);
> }
>
> - qatomic_set(&p->pending_sync, false);
> + qatomic_set(&p->pending_sync, MULTIFD_SYNC_NONE);
> qemu_sem_post(&p->sem_sync);
> }
> }
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v2 3/7] migration/ram: Move RAM_SAVE_FLAG* into ram.h
2024-12-06 0:58 ` [PATCH v2 3/7] migration/ram: Move RAM_SAVE_FLAG* into ram.h Peter Xu
@ 2024-12-06 13:43 ` Fabiano Rosas
2024-12-06 15:03 ` Peter Xu
0 siblings, 1 reply; 25+ messages in thread
From: Fabiano Rosas @ 2024-12-06 13:43 UTC (permalink / raw)
To: Peter Xu, qemu-devel
Cc: Maciej S . Szmigiero, peterx, Cédric Le Goater, Avihai Horon,
Alex Williamson, Prasad Pandit
Peter Xu <peterx@redhat.com> writes:
> Firstly, we're going to use the multifd flag in multifd code soon, so
> keeping it in ram.c isn't going to work.
>
> Secondly, we have a separate RDMA flag dangling around, which is definitely
> not obvious. There's one comment that helps, but not too much.
>
> We should just put them all together, so nothing gets overlooked.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
just some comments about preexisting stuff:
> ---
> migration/ram.h | 25 +++++++++++++++++++++++++
> migration/rdma.h | 7 -------
> migration/ram.c | 21 ---------------------
> 3 files changed, 25 insertions(+), 28 deletions(-)
>
> diff --git a/migration/ram.h b/migration/ram.h
> index 0d1981f888..cfdcccd266 100644
> --- a/migration/ram.h
> +++ b/migration/ram.h
> @@ -33,6 +33,31 @@
> #include "exec/cpu-common.h"
> #include "io/channel.h"
>
> +/*
> + * RAM_SAVE_FLAG_ZERO used to be named RAM_SAVE_FLAG_COMPRESS, it
> + * worked for pages that were filled with the same char. We switched
> + * it to only search for the zero value. And to avoid confusion with
> + * RAM_SAVE_FLAG_COMPRESS_PAGE just rename it.
> + *
> + * RAM_SAVE_FLAG_FULL was obsoleted in 2009.
> + *
> + * RAM_SAVE_FLAG_COMPRESS_PAGE (0x100) was removed in QEMU 9.1.
Aren't these already covered by git log? The comment makes it seem like
some special situation, but I think we're just documenting history here,
no?
> + */
> +#define RAM_SAVE_FLAG_FULL 0x01
> +#define RAM_SAVE_FLAG_ZERO 0x02
> +#define RAM_SAVE_FLAG_MEM_SIZE 0x04
> +#define RAM_SAVE_FLAG_PAGE 0x08
> +#define RAM_SAVE_FLAG_EOS 0x10
> +#define RAM_SAVE_FLAG_CONTINUE 0x20
> +#define RAM_SAVE_FLAG_XBZRLE 0x40
> +/*
> + * ONLY USED IN RDMA: Whenever this is found in the data stream, the flags
> + * will be passed to rdma functions in the incoming-migration side.
> + */
> +#define RAM_SAVE_FLAG_HOOK 0x80
No 0x100?
> +#define RAM_SAVE_FLAG_MULTIFD_FLUSH 0x200
> +/* We can't use any flag that is bigger than 0x200 */
Where does that limitation come from again? I know that
RAM_SAVE_FLAG_MEM_SIZE shares a u64 with something else:
qemu_put_be64(f, ram_bytes_total_with_ignored() |
RAM_SAVE_FLAG_MEM_SIZE);
For RAM_SAVE_FLAG_ZERO and RAM_SAVE_FLAG_PAGE, it might be a u32 (offset
is ram_addr_t):
save_page_header(pss, file, pss->block, offset | RAM_SAVE_FLAG_ZERO);
But others just go by themselves:
qemu_put_be64(f, RAM_SAVE_FLAG_MULTIFD_FLUSH);
> +
> extern XBZRLECacheStats xbzrle_counters;
>
> /* Should be holding either ram_list.mutex, or the RCU lock. */
> diff --git a/migration/rdma.h b/migration/rdma.h
> index a8d27f33b8..f55f28bbed 100644
> --- a/migration/rdma.h
> +++ b/migration/rdma.h
> @@ -33,13 +33,6 @@ void rdma_start_incoming_migration(InetSocketAddress *host_port, Error **errp);
> #define RAM_CONTROL_ROUND 1
> #define RAM_CONTROL_FINISH 3
>
> -/*
> - * Whenever this is found in the data stream, the flags
> - * will be passed to rdma functions in the incoming-migration
> - * side.
> - */
> -#define RAM_SAVE_FLAG_HOOK 0x80
> -
> #define RAM_SAVE_CONTROL_NOT_SUPP -1000
> #define RAM_SAVE_CONTROL_DELAYED -2000
>
> diff --git a/migration/ram.c b/migration/ram.c
> index 7284c34bd8..44010ff325 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -71,27 +71,6 @@
> /***********************************************************/
> /* ram save/restore */
>
> -/*
> - * RAM_SAVE_FLAG_ZERO used to be named RAM_SAVE_FLAG_COMPRESS, it
> - * worked for pages that were filled with the same char. We switched
> - * it to only search for the zero value. And to avoid confusion with
> - * RAM_SAVE_FLAG_COMPRESS_PAGE just rename it.
> - *
> - * RAM_SAVE_FLAG_FULL was obsoleted in 2009.
> - *
> - * RAM_SAVE_FLAG_COMPRESS_PAGE (0x100) was removed in QEMU 9.1.
> - */
> -#define RAM_SAVE_FLAG_FULL 0x01
> -#define RAM_SAVE_FLAG_ZERO 0x02
> -#define RAM_SAVE_FLAG_MEM_SIZE 0x04
> -#define RAM_SAVE_FLAG_PAGE 0x08
> -#define RAM_SAVE_FLAG_EOS 0x10
> -#define RAM_SAVE_FLAG_CONTINUE 0x20
> -#define RAM_SAVE_FLAG_XBZRLE 0x40
> -/* 0x80 is reserved in rdma.h for RAM_SAVE_FLAG_HOOK */
> -#define RAM_SAVE_FLAG_MULTIFD_FLUSH 0x200
> -/* We can't use any flag that is bigger than 0x200 */
> -
> /*
> * mapped-ram migration supports O_DIRECT, so we need to make sure the
> * userspace buffer, the IO operation size and the file offset are
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v2 4/7] migration/multifd: Unify RAM_SAVE_FLAG_MULTIFD_FLUSH messages
2024-12-06 0:58 ` [PATCH v2 4/7] migration/multifd: Unify RAM_SAVE_FLAG_MULTIFD_FLUSH messages Peter Xu
@ 2024-12-06 14:12 ` Fabiano Rosas
0 siblings, 0 replies; 25+ messages in thread
From: Fabiano Rosas @ 2024-12-06 14:12 UTC (permalink / raw)
To: Peter Xu, qemu-devel
Cc: Maciej S . Szmigiero, peterx, Cédric Le Goater, Avihai Horon,
Alex Williamson, Prasad Pandit
Peter Xu <peterx@redhat.com> writes:
> The RAM_SAVE_FLAG_MULTIFD_FLUSH message should always correlate with a
> sync request on src. Unify the sending of this message into one place, and
> only send it when necessary.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v2 6/7] migration/multifd: Cleanup src flushes on condition check
2024-12-06 0:58 ` [PATCH v2 6/7] migration/multifd: Cleanup src flushes on condition check Peter Xu
@ 2024-12-06 14:18 ` Fabiano Rosas
2024-12-06 15:13 ` Peter Xu
0 siblings, 1 reply; 25+ messages in thread
From: Fabiano Rosas @ 2024-12-06 14:18 UTC (permalink / raw)
To: Peter Xu, qemu-devel
Cc: Maciej S . Szmigiero, peterx, Cédric Le Goater, Avihai Horon,
Alex Williamson, Prasad Pandit
Peter Xu <peterx@redhat.com> writes:
> The src flush condition check is overcomplicated, and it will get even
> more out of control once postcopy is involved.
>
> In general, we have two modes to do the sync: legacy or modern ways.
> Legacy uses per-section flush, modern uses per-round flush.
>
> Mapped-ram always uses the modern way, which is per-round.
>
> Introduce two helpers, which can greatly simplify the code, and hopefully
> make it readable again.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> migration/multifd.h | 2 ++
> migration/multifd-nocomp.c | 42 ++++++++++++++++++++++++++++++++++++++
> migration/ram.c | 10 +++------
> 3 files changed, 47 insertions(+), 7 deletions(-)
>
> diff --git a/migration/multifd.h b/migration/multifd.h
> index c9ae57ea02..582040922f 100644
> --- a/migration/multifd.h
> +++ b/migration/multifd.h
> @@ -351,6 +351,8 @@ static inline uint32_t multifd_ram_page_count(void)
> void multifd_ram_save_setup(void);
> void multifd_ram_save_cleanup(void);
> int multifd_ram_flush_and_sync(QEMUFile *f);
> +bool multifd_ram_sync_per_round(void);
> +bool multifd_ram_sync_per_section(void);
> size_t multifd_ram_payload_size(void);
> void multifd_ram_fill_packet(MultiFDSendParams *p);
> int multifd_ram_unfill_packet(MultiFDRecvParams *p, Error **errp);
> diff --git a/migration/multifd-nocomp.c b/migration/multifd-nocomp.c
> index 58372db0f4..c1f686c0ce 100644
> --- a/migration/multifd-nocomp.c
> +++ b/migration/multifd-nocomp.c
> @@ -344,6 +344,48 @@ retry:
> return true;
> }
>
> +/*
> + * We have two modes for multifd flushes:
> + *
> + * - Per-section mode: this is the legacy way to flush, it requires one
> + * MULTIFD_FLAG_SYNC message for each RAM_SAVE_FLAG_EOS.
> + *
> + * - Per-round mode: this is the modern way to flush, it requires one
> + * MULTIFD_FLAG_SYNC message only for each round of RAM scan. Normally
> + * it's paired with a new RAM_SAVE_FLAG_MULTIFD_FLUSH message in network
> + * based migrations.
> + *
> + * One thing to mention is mapped-ram always use the modern way to sync.
> + */
> +
> +/* Do we need a per-section multifd flush (legacy way)? */
> +bool multifd_ram_sync_per_section(void)
> +{
> + if (!migrate_multifd()) {
> + return false;
> + }
> +
> + if (migrate_mapped_ram()) {
> + return false;
> + }
> +
> + return migrate_multifd_flush_after_each_section();
> +}
> +
> +/* Do we need a per-round multifd flush (modern way)? */
> +bool multifd_ram_sync_per_round(void)
> +{
> + if (!migrate_multifd()) {
> + return false;
> + }
> +
> + if (migrate_mapped_ram()) {
> + return true;
> + }
> +
> + return !migrate_multifd_flush_after_each_section();
> +}
> +
> int multifd_ram_flush_and_sync(QEMUFile *f)
> {
> MultiFDSyncReq req;
> diff --git a/migration/ram.c b/migration/ram.c
> index 154ff5abd4..5d4bdefe69 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -1302,9 +1302,7 @@ static int find_dirty_block(RAMState *rs, PageSearchStatus *pss)
> pss->page = 0;
> pss->block = QLIST_NEXT_RCU(pss->block, next);
> if (!pss->block) {
> - if (migrate_multifd() &&
> - (!migrate_multifd_flush_after_each_section() ||
> - migrate_mapped_ram())) {
> + if (multifd_ram_sync_per_round()) {
If we're already implicitly declaring which parts of the code mean
"round" or "section", we could fold the flush into the function and call
it unconditionally.
We don't need ram.c code to be deciding about multifd. This could be all
hidden away in the multifd-nocomp code:
-- >8 --
diff --git a/migration/multifd-nocomp.c b/migration/multifd-nocomp.c
index c1f686c0ce..6a7eee4c25 100644
--- a/migration/multifd-nocomp.c
+++ b/migration/multifd-nocomp.c
@@ -358,32 +358,26 @@ retry:
* One thing to mention is mapped-ram always use the modern way to sync.
*/
-/* Do we need a per-section multifd flush (legacy way)? */
-bool multifd_ram_sync_per_section(void)
+int multifd_ram_sync_per_section(QEMUFile *f)
{
- if (!migrate_multifd()) {
- return false;
+ if (!migrate_multifd() || !migrate_multifd_flush_after_each_section()) {
+ return 0;
}
if (migrate_mapped_ram()) {
- return false;
+ return 0;
}
- return migrate_multifd_flush_after_each_section();
+ return multifd_ram_flush_and_sync(f);
}
-/* Do we need a per-round multifd flush (modern way)? */
-bool multifd_ram_sync_per_round(void)
+int multifd_ram_sync_per_round(QEMUFile *f)
{
- if (!migrate_multifd()) {
- return false;
+ if (!migrate_multifd() || migrate_multifd_flush_after_each_section()) {
+ return 0;
}
- if (migrate_mapped_ram()) {
- return true;
- }
-
- return !migrate_multifd_flush_after_each_section();
+ return multifd_ram_flush_and_sync(f);
}
int multifd_ram_flush_and_sync(QEMUFile *f)
diff --git a/migration/multifd.h b/migration/multifd.h
index 582040922f..3b42128167 100644
--- a/migration/multifd.h
+++ b/migration/multifd.h
@@ -351,8 +351,8 @@ static inline uint32_t multifd_ram_page_count(void)
void multifd_ram_save_setup(void);
void multifd_ram_save_cleanup(void);
int multifd_ram_flush_and_sync(QEMUFile *f);
-bool multifd_ram_sync_per_round(void);
-bool multifd_ram_sync_per_section(void);
+int multifd_ram_sync_per_round(QEMUFile *f);
+int multifd_ram_sync_per_section(QEMUFile *f);
size_t multifd_ram_payload_size(void);
void multifd_ram_fill_packet(MultiFDSendParams *p);
int multifd_ram_unfill_packet(MultiFDRecvParams *p, Error **errp);
diff --git a/migration/ram.c b/migration/ram.c
index ddee703585..fe33c8e0e8 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -1302,12 +1302,10 @@ static int find_dirty_block(RAMState *rs, PageSearchStatus *pss)
pss->page = 0;
pss->block = QLIST_NEXT_RCU(pss->block, next);
if (!pss->block) {
- if (multifd_ram_sync_per_round()) {
- QEMUFile *f = rs->pss[RAM_CHANNEL_PRECOPY].pss_channel;
- int ret = multifd_ram_flush_and_sync(f);
- if (ret < 0) {
- return ret;
- }
+ QEMUFile *f = rs->pss[RAM_CHANNEL_PRECOPY].pss_channel;
+ int ret = multifd_ram_sync_per_round(f);
+ if (ret < 0) {
+ return ret;
}
/* Hit the end of the list */
@@ -3203,11 +3201,9 @@ static int ram_save_iterate(QEMUFile *f, void *opaque)
out:
if (ret >= 0 && migration_is_running()) {
- if (multifd_ram_sync_per_section()) {
- ret = multifd_ram_flush_and_sync(f);
- if (ret < 0) {
- return ret;
- }
+ ret = multifd_ram_sync_per_section(f);
+ if (ret < 0) {
+ return ret;
}
qemu_put_be64(f, RAM_SAVE_FLAG_EOS);
@@ -3276,15 +3272,13 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
}
}
- if (multifd_ram_sync_per_section()) {
- /*
- * Only the old dest QEMU will need this sync, because each EOS
- * will require one SYNC message on each channel.
- */
- ret = multifd_ram_flush_and_sync(f);
- if (ret < 0) {
- return ret;
- }
+ /*
+ * Only the old dest QEMU will need this sync, because each EOS
+ * will require one SYNC message on each channel.
+ */
+ ret = multifd_ram_sync_per_section(f);
+ if (ret < 0) {
+ return ret;
}
if (migrate_mapped_ram()) {
--
2.35.3
^ permalink raw reply related [flat|nested] 25+ messages in thread
* Re: [PATCH v2 5/7] migration/multifd: Remove sync processing on postcopy
2024-12-06 0:58 ` [PATCH v2 5/7] migration/multifd: Remove sync processing on postcopy Peter Xu
@ 2024-12-06 14:19 ` Fabiano Rosas
0 siblings, 0 replies; 25+ messages in thread
From: Fabiano Rosas @ 2024-12-06 14:19 UTC (permalink / raw)
To: Peter Xu, qemu-devel
Cc: Maciej S . Szmigiero, peterx, Cédric Le Goater, Avihai Horon,
Alex Williamson, Prasad Pandit
Peter Xu <peterx@redhat.com> writes:
> Multifd never worked with postcopy, at least not so far.
>
> Remove the sync processing there, because it's confusing, and these sync
> messages should never appear there. Now if RAM_SAVE_FLAG_MULTIFD_FLUSH is
> observed, we fail hard instead of trying to invoke multifd code.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v2 7/7] migration/multifd: Document the reason to sync for save_setup()
2024-12-06 0:58 ` [PATCH v2 7/7] migration/multifd: Document the reason to sync for save_setup() Peter Xu
@ 2024-12-06 14:40 ` Fabiano Rosas
2024-12-06 15:36 ` Peter Xu
0 siblings, 1 reply; 25+ messages in thread
From: Fabiano Rosas @ 2024-12-06 14:40 UTC (permalink / raw)
To: Peter Xu, qemu-devel
Cc: Maciej S . Szmigiero, peterx, Cédric Le Goater, Avihai Horon,
Alex Williamson, Prasad Pandit
Peter Xu <peterx@redhat.com> writes:
> It's not straightforward to see why src QEMU needs to sync multifd during
> the setup() phase. After all, there are no pages queued at that point.
>
> For old QEMUs, there's a solid reason: EOS requires it to work. It is less
> obvious for new QEMUs, which do not take the EOS message as a sync request.
>
> One will only figure that out when this is conditionally removed; in fact,
> the author did try it out. Logically we could still avoid doing this on
> new machine types, however that needs a separate compat field, and that
> would be overkill just to save some trivial overhead in the setup() phase.
>
> Let's instead document it completely, to avoid someone else trying this
> again and redoing the debugging, or anyone being confused about why this
> ever existed.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> migration/ram.c | 27 +++++++++++++++++++++++++++
> 1 file changed, 27 insertions(+)
>
> diff --git a/migration/ram.c b/migration/ram.c
> index 5d4bdefe69..ddee703585 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -3036,6 +3036,33 @@ static int ram_save_setup(QEMUFile *f, void *opaque, Error **errp)
> migration_ops->ram_save_target_page = ram_save_target_page_legacy;
> }
>
> + /*
> + * This operation is unfortunate..
> + *
> + * For legacy QEMUs using per-section sync
> + * =======================================
> + *
> + * This must exist because the EOS below requires the SYNC messages
> + * per-channel to work.
> + *
> + * For modern QEMUs using per-round sync
> + * =====================================
> + *
> + * Logically this sync is not needed (because we know there's nothing
> + * in the multifd queue yet!).
This is a bit misleading because even today we could split the
multifd_ram_flush_and_sync() into _flush and _sync (haven't I seen a
patch doing this somewhere? Maybe from Maciej...) and call just the
_sync here, which is unrelated to any multifd queue.
I think we shouldn't tie "sync" with "wait for multifd threads to finish
sending their data (a kind of flush)" as this implies. The sync is as
much making sure the threads are ready to receive as it is making sure
the data is received in order with relation to ram scanning rounds.
IOW, the local sync is what ensures multifd send threads are flushed
while this code deals with the sync of src&dst threads, which is "just"
a synchronization point between the two QEMUs.
> However as a side effect, this makes
> + * sure the dest side won't receive any data before it properly reaches
> + * ram_load_precopy().
I'm not sure it's a side-effect. It seems deliberate to me, seeing that
multifd usually does its own synchronization. For instance, on the send
side we also need some sync to make sure ram.c doesn't send data to
multifd send threads that are not ready yet (i.e. the wait on
channels_ready at the start of multifd_send()).
> + *
> + * Without this sync, src QEMU can send data too soon so that dest may
> + * not have been ready to receive it (e.g., rb->receivedmap may be
> + * uninitialized, for example).
> + *
> + * Logically "wait for recv setup ready" shouldn't need to involve src
> + * QEMU at all, however to be compatible with old QEMUs, let's stick
I don't understand this statement, you're saying that QEMU src could
just start dumping data on the channel without a remote end? Certainly
for file migrations, but socket as well?
> + * with this. Fortunately the overhead is low to sync during setup
> + * because the VM is running, so at least it's not accounted as part of
> + * downtime.
> + */
> bql_unlock();
> ret = multifd_ram_flush_and_sync(f);
> bql_lock();
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v2 1/7] migration/multifd: Further remove the SYNC on complete
2024-12-06 13:17 ` Fabiano Rosas
@ 2024-12-06 14:40 ` Peter Xu
0 siblings, 0 replies; 25+ messages in thread
From: Peter Xu @ 2024-12-06 14:40 UTC (permalink / raw)
To: Fabiano Rosas
Cc: qemu-devel, Maciej S . Szmigiero, Cédric Le Goater,
Avihai Horon, Alex Williamson, Prasad Pandit
On Fri, Dec 06, 2024 at 10:17:22AM -0300, Fabiano Rosas wrote:
> Peter Xu <peterx@redhat.com> writes:
>
> > Commit 637280aeb2 ("migration/multifd: Avoid the final FLUSH in
> > complete()") removed the FLUSH operation on complete() which should avoid
> > one global sync on destination side, because it's not needed.
> >
> > However that commit overlooked multifd_ram_flush_and_sync() part of things,
> > as that's always used together with the FLUSH message on the main channel.
> >
> > For very old binaries (multifd_flush_after_each_section==true), that's
> > still needed because each EOS received on destination will enforce
> > all-channel sync once.
> >
> > For new binaries (multifd_flush_after_each_section==false), the flush and
> > sync shouldn't be needed just like the FLUSH, with the same reasoning.
> >
> > Currently, on new binaries we'll still send SYNC messages on multifd
> > channels, even if FLUSH is omitted at the end. It'll make all recv threads
> > hang at SYNC message.
> >
> > Multifd is still all working fine because luckily recv side cleanup
> > code (mostly multifd_recv_sync_main()) is smart enough to make sure even if
> > recv threads are stuck at SYNC it'll get kicked out. And since this is the
> > completion phase of migration, nothing else will be sent after the SYNCs.
> >
> > This may be needed for VFIO in the future because VFIO can have data to
> > push after ram_save_complete(), hence we don't want the recv thread to be
> > stuck in SYNC message. Removing this limitation will also make the src
> > slightly faster by avoiding some more code.
> >
> > Stable branches do not need this patch, as there's no real bug I can think
> > of that will go wrong there.. so not attaching Fixes to make clear that a
> > backport is not needed.
>
> You forgot my comments on the commit message, so here's a brand-new one
Yes I did.. :(
> with some things spelt out more explicitly:
>
> --
> Commit 637280aeb2 ("migration/multifd: Avoid the final FLUSH in
> complete()") stopped sending the RAM_SAVE_FLAG_MULTIFD_FLUSH flag at
> ram_save_complete(), because the sync on the destination side is not
> needed due to the last iteration of find_dirty_block() having already
> done it.
>
> However, that commit overlooked that multifd_ram_flush_and_sync() on the
> source side is also not needed at ram_save_complete(), for the same
> reason.
>
> Moreover, removing the RAM_SAVE_FLAG_MULTIFD_FLUSH but keeping the
> multifd_ram_flush_and_sync() means that currently the recv threads will
> hang when receiving the MULTIFD_FLAG_SYNC message, waiting for the
> destination sync which only happens when RAM_SAVE_FLAG_MULTIFD_FLUSH is
> received.
>
> Luckily, multifd is still all working fine because recv side cleanup
> code (mostly multifd_recv_sync_main()) is smart enough to make sure even
> if recv threads are stuck at SYNC it'll get kicked out. And since this
> is the completion phase of migration, nothing else will be sent after
> the SYNCs.
>
> This needs to be fixed because in the future VFIO will have data to push
> after ram_save_complete() and we don't want the recv thread to be stuck
> in the MULTIFD_FLAG_SYNC message.
>
> Remove the unnecessary (and buggy) invocation of
> multifd_ram_flush_and_sync().
>
> For very old binaries (multifd_flush_after_each_section==true), the
> flush_and_sync is still needed because each EOS received on destination
> will enforce all-channel sync once.
>
> Stable branches do not need this patch, as there's no real bug I can think
> of that will go wrong there.. so not attaching Fixes to make clear that a
> backport is not needed.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> Reviewed-by: Fabiano Rosas <farosas@suse.de>
Sure, I used it, thanks.
>
> > ---
> > migration/ram.c | 13 ++++++++++---
> > 1 file changed, 10 insertions(+), 3 deletions(-)
> >
> > diff --git a/migration/ram.c b/migration/ram.c
> > index 05ff9eb328..7284c34bd8 100644
> > --- a/migration/ram.c
> > +++ b/migration/ram.c
> > @@ -3283,9 +3283,16 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
> > }
> > }
> >
> > - ret = multifd_ram_flush_and_sync();
> > - if (ret < 0) {
> > - return ret;
> > + if (migrate_multifd() &&
> > + migrate_multifd_flush_after_each_section()) {
> > + /*
> > + * Only the old dest QEMU will need this sync, because each EOS
> > + * will require one SYNC message on each channel.
> > + */
> > + ret = multifd_ram_flush_and_sync();
> > + if (ret < 0) {
> > + return ret;
> > + }
> > }
> >
> > if (migrate_mapped_ram()) {
>
--
Peter Xu
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v2 2/7] migration/multifd: Allow to sync with sender threads only
2024-12-06 13:26 ` Fabiano Rosas
@ 2024-12-06 14:50 ` Peter Xu
2024-12-06 15:00 ` Fabiano Rosas
0 siblings, 1 reply; 25+ messages in thread
From: Peter Xu @ 2024-12-06 14:50 UTC (permalink / raw)
To: Fabiano Rosas
Cc: qemu-devel, Maciej S . Szmigiero, Cédric Le Goater,
Avihai Horon, Alex Williamson, Prasad Pandit
On Fri, Dec 06, 2024 at 10:26:06AM -0300, Fabiano Rosas wrote:
> Peter Xu <peterx@redhat.com> writes:
>
> > Teach multifd_send_sync_main() to sync with threads only.
> >
> > We already have such requests, which is when mapped-ram is enabled with
> > multifd. In that case, no SYNC messages will be pushed to the stream when
> > multifd syncs the sender threads because there's no destination threads
> > waiting for that. The whole point of the sync is to make sure all threads
> > flushed their jobs.
>
> s/flushed/finished/ otherwise we risk confusing people.
done.
>
> >
> > So fundamentally we have a request to do the sync in different ways:
> >
> > - Either to sync the threads only,
> > - Or to sync the threads but also with the destination side.
> >
> > Mapped-ram did it already because of the use_packet check in the sync
> > handler of the sender thread. It works.
> >
> > However it may stop working when e.g. VFIO starts to reuse multifd
> > channels to push device states. In that case VFIO has a similar request for
> > "thread-only sync", however we can't check a flag because such sync requests
> > can still come from RAM, which needs the on-wire notifications.
> >
> > Pave the way for that by allowing the multifd_send_sync_main() to specify
> > what kind of sync the caller needs. We can use it for mapped-ram already.
> >
> > No functional change intended.
> >
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> > migration/multifd.h | 19 ++++++++++++++++---
> > migration/multifd-nocomp.c | 7 ++++++-
> > migration/multifd.c | 15 +++++++++------
> > 3 files changed, 31 insertions(+), 10 deletions(-)
> >
> > diff --git a/migration/multifd.h b/migration/multifd.h
> > index 50d58c0c9c..bd337631ec 100644
> > --- a/migration/multifd.h
> > +++ b/migration/multifd.h
> > @@ -19,6 +19,18 @@
> > typedef struct MultiFDRecvData MultiFDRecvData;
> > typedef struct MultiFDSendData MultiFDSendData;
> >
> > +typedef enum {
> > + /* No sync request */
> > + MULTIFD_SYNC_NONE = 0,
> > + /* Sync locally on the sender threads without pushing messages */
> > + MULTIFD_SYNC_LOCAL,
> > + /*
> > + * Sync not only on the sender threads, but also push "SYNC" message to
> > + * the wire (which is for a remote sync).
>
> s/SYNC/MULTIFD_FLAG_SYNC/
>
> Do we need to also mention that this needs to be paired with a
> multifd_recv_sync_main() via the emission of the
> RAM_SAVE_FLAG_MULTIFD_FLUSH flag on the stream?
If we want to mention something, IMO it would be better to describe what
happens on the src, not the dest. It can be too hard to follow if we connect that
directly to the dest behavior.
Does this look good to you?
/*
* Sync not only on the sender threads, but also push MULTIFD_FLAG_SYNC
* message to the wire for each iochannel (which is for a remote sync).
*
* When remote sync is used, need to be paired with a follow up
* RAM_SAVE_FLAG_EOS / RAM_SAVE_FLAG_MULTIFD_FLUSH message on the main
* channel.
*/
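To illustrate the pairing (just a sketch of how the two halves line up on the
source side; the compat and mapped-ram conditions are omitted, and the helper
name is made up for this example only):

    /* Sketch only: error handling and conditions trimmed. */
    static int ram_multifd_remote_sync(QEMUFile *f)
    {
        /* multifd channels: each sender thread emits one MULTIFD_FLAG_SYNC */
        int ret = multifd_ram_flush_and_sync();
        if (ret < 0) {
            return ret;
        }
        /* main channel: tells the dest to run multifd_recv_sync_main() */
        qemu_put_be64(f, RAM_SAVE_FLAG_MULTIFD_FLUSH);
        return 0;
    }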
>
> > + */
> > + MULTIFD_SYNC_ALL,
> > +} MultiFDSyncReq;
> > +
> > bool multifd_send_setup(void);
> > void multifd_send_shutdown(void);
> > void multifd_send_channel_created(void);
> > @@ -28,7 +40,7 @@ void multifd_recv_shutdown(void);
> > bool multifd_recv_all_channels_created(void);
> > void multifd_recv_new_channel(QIOChannel *ioc, Error **errp);
> > void multifd_recv_sync_main(void);
> > -int multifd_send_sync_main(void);
> > +int multifd_send_sync_main(MultiFDSyncReq req);
> > bool multifd_queue_page(RAMBlock *block, ram_addr_t offset);
> > bool multifd_recv(void);
> > MultiFDRecvData *multifd_get_recv_data(void);
> > @@ -143,7 +155,7 @@ typedef struct {
> > /* multifd flags for each packet */
> > uint32_t flags;
> > /*
> > - * The sender thread has work to do if either of below boolean is set.
> > + * The sender thread has work to do if either of below field is set.
> > *
> > * @pending_job: a job is pending
> > * @pending_sync: a sync request is pending
> > @@ -152,7 +164,8 @@ typedef struct {
> > * cleared by the multifd sender threads.
> > */
> > bool pending_job;
> > - bool pending_sync;
> > + MultiFDSyncReq pending_sync;
> > +
> > MultiFDSendData *data;
> >
> > /* thread local variables. No locking required */
> > diff --git a/migration/multifd-nocomp.c b/migration/multifd-nocomp.c
> > index 55191152f9..219f9e58ef 100644
> > --- a/migration/multifd-nocomp.c
> > +++ b/migration/multifd-nocomp.c
> > @@ -345,6 +345,8 @@ retry:
> >
> > int multifd_ram_flush_and_sync(void)
> > {
> > + MultiFDSyncReq req;
> > +
> > if (!migrate_multifd()) {
> > return 0;
> > }
> > @@ -356,7 +358,10 @@ int multifd_ram_flush_and_sync(void)
> > }
> > }
> >
> > - return multifd_send_sync_main();
> > + /* File migrations only need to sync with threads */
> > + req = migrate_mapped_ram() ? MULTIFD_SYNC_LOCAL : MULTIFD_SYNC_ALL;
> > +
> > + return multifd_send_sync_main(req);
> > }
> >
> > bool multifd_send_prepare_common(MultiFDSendParams *p)
> > diff --git a/migration/multifd.c b/migration/multifd.c
> > index 498e71fd10..2248bd2d46 100644
> > --- a/migration/multifd.c
> > +++ b/migration/multifd.c
> > @@ -523,7 +523,7 @@ static int multifd_zero_copy_flush(QIOChannel *c)
> > return ret;
> > }
> >
> > -int multifd_send_sync_main(void)
> > +int multifd_send_sync_main(MultiFDSyncReq req)
> > {
> > int i;
> > bool flush_zero_copy;
>
> assert(req != MULTIFD_SYNC_NONE) ?
Sure.
>
> > @@ -543,8 +543,8 @@ int multifd_send_sync_main(void)
> > * We should be the only user so far, so not possible to be set by
> > * others concurrently.
> > */
> > - assert(qatomic_read(&p->pending_sync) == false);
> > - qatomic_set(&p->pending_sync, true);
> > + assert(qatomic_read(&p->pending_sync) == MULTIFD_SYNC_NONE);
> > + qatomic_set(&p->pending_sync, req);
> > qemu_sem_post(&p->sem);
> > }
> > for (i = 0; i < migrate_multifd_channels(); i++) {
> > @@ -635,14 +635,17 @@ static void *multifd_send_thread(void *opaque)
> > */
> > qatomic_store_release(&p->pending_job, false);
> > } else {
> > + MultiFDSyncReq req = qatomic_read(&p->pending_sync);
> > +
> > /*
> > * If not a normal job, must be a sync request. Note that
> > * pending_sync is a standalone flag (unlike pending_job), so
> > * it doesn't require explicit memory barriers.
> > */
> > - assert(qatomic_read(&p->pending_sync));
> > + assert(req != MULTIFD_SYNC_NONE);
> >
> > - if (use_packets) {
> > + /* Only push the SYNC message if it involves a remote sync */
> > + if (req == MULTIFD_SYNC_ALL) {
> > p->flags = MULTIFD_FLAG_SYNC;
> > multifd_send_fill_packet(p);
> > ret = qio_channel_write_all(p->c, (void *)p->packet,
> > @@ -654,7 +657,7 @@ static void *multifd_send_thread(void *opaque)
> > stat64_add(&mig_stats.multifd_bytes, p->packet_len);
> > }
> >
> > - qatomic_set(&p->pending_sync, false);
> > + qatomic_set(&p->pending_sync, MULTIFD_SYNC_NONE);
> > qemu_sem_post(&p->sem_sync);
> > }
> > }
>
--
Peter Xu
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v2 2/7] migration/multifd: Allow to sync with sender threads only
2024-12-06 14:50 ` Peter Xu
@ 2024-12-06 15:00 ` Fabiano Rosas
0 siblings, 0 replies; 25+ messages in thread
From: Fabiano Rosas @ 2024-12-06 15:00 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, Maciej S . Szmigiero, Cédric Le Goater,
Avihai Horon, Alex Williamson, Prasad Pandit
Peter Xu <peterx@redhat.com> writes:
> On Fri, Dec 06, 2024 at 10:26:06AM -0300, Fabiano Rosas wrote:
>> Peter Xu <peterx@redhat.com> writes:
>>
>> > Teach multifd_send_sync_main() to sync with threads only.
>> >
>> > We already have such requests, which is when mapped-ram is enabled with
>> > multifd. In that case, no SYNC messages will be pushed to the stream when
>> > multifd syncs the sender threads because there's no destination threads
>> > waiting for that. The whole point of the sync is to make sure all threads
>> > flushed their jobs.
>>
>> s/flushed/finished/ otherwise we risk confusing people.
>
> done.
>
>>
>> >
>> > So fundamentally we have a request to do the sync in different ways:
>> >
>> > - Either to sync the threads only,
>> > - Or to sync the threads but also with the destination side.
>> >
>> > Mapped-ram did it already because of the use_packet check in the sync
>> > handler of the sender thread. It works.
>> >
>> > However it may stop working when e.g. VFIO starts to reuse multifd
>> > channels to push device states. In that case VFIO has a similar request for
>> > "thread-only sync", however we can't check a flag because such sync requests
>> > can still come from RAM, which needs the on-wire notifications.
>> >
>> > Pave the way for that by allowing the multifd_send_sync_main() to specify
>> > what kind of sync the caller needs. We can use it for mapped-ram already.
>> >
>> > No functional change intended.
>> >
>> > Signed-off-by: Peter Xu <peterx@redhat.com>
>> > ---
>> > migration/multifd.h | 19 ++++++++++++++++---
>> > migration/multifd-nocomp.c | 7 ++++++-
>> > migration/multifd.c | 15 +++++++++------
>> > 3 files changed, 31 insertions(+), 10 deletions(-)
>> >
>> > diff --git a/migration/multifd.h b/migration/multifd.h
>> > index 50d58c0c9c..bd337631ec 100644
>> > --- a/migration/multifd.h
>> > +++ b/migration/multifd.h
>> > @@ -19,6 +19,18 @@
>> > typedef struct MultiFDRecvData MultiFDRecvData;
>> > typedef struct MultiFDSendData MultiFDSendData;
>> >
>> > +typedef enum {
>> > + /* No sync request */
>> > + MULTIFD_SYNC_NONE = 0,
>> > + /* Sync locally on the sender threads without pushing messages */
>> > + MULTIFD_SYNC_LOCAL,
>> > + /*
>> > + * Sync not only on the sender threads, but also push "SYNC" message to
>> > + * the wire (which is for a remote sync).
>>
>> s/SYNC/MULTIFD_FLAG_SYNC/
>>
>> Do we need to also mention that this needs to be paired with a
>> multifd_recv_sync_main() via the emission of the
>> RAM_SAVE_FLAG_MULTIFD_FLUSH flag on the stream?
>
> If we want to mention something, IMO it would be better to describe what
> happens on the src, not the dest. It can be too hard to follow if we connect that
> directly to the dest behavior.
>
> Does this look good to you?
>
> /*
> * Sync not only on the sender threads, but also push MULTIFD_FLAG_SYNC
> * message to the wire for each iochannel (which is for a remote sync).
> *
> * When remote sync is used, need to be paired with a follow up
> * RAM_SAVE_FLAG_EOS / RAM_SAVE_FLAG_MULTIFD_FLUSH message on the main
> * channel.
> */
Yes, thanks.
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v2 3/7] migration/ram: Move RAM_SAVE_FLAG* into ram.h
2024-12-06 13:43 ` Fabiano Rosas
@ 2024-12-06 15:03 ` Peter Xu
2024-12-06 15:10 ` Fabiano Rosas
0 siblings, 1 reply; 25+ messages in thread
From: Peter Xu @ 2024-12-06 15:03 UTC (permalink / raw)
To: Fabiano Rosas
Cc: qemu-devel, Maciej S . Szmigiero, Cédric Le Goater,
Avihai Horon, Alex Williamson, Prasad Pandit
On Fri, Dec 06, 2024 at 10:43:31AM -0300, Fabiano Rosas wrote:
> Peter Xu <peterx@redhat.com> writes:
>
> > Firstly, we're going to use the multifd flag soon in multifd code, so ram.c
> > isn't gonna work.
> >
> > Secondly, we have a separate RDMA flag dangling around, which is definitely
> > not obvious. There's one comment that helps, but not too much.
> >
> > We should just put it altogether, so nothing will get overlooked.
> >
> > Signed-off-by: Peter Xu <peterx@redhat.com>
>
> Reviewed-by: Fabiano Rosas <farosas@suse.de>
>
> just some comments about preexisting stuff:
>
> > ---
> > migration/ram.h | 25 +++++++++++++++++++++++++
> > migration/rdma.h | 7 -------
> > migration/ram.c | 21 ---------------------
> > 3 files changed, 25 insertions(+), 28 deletions(-)
> >
> > diff --git a/migration/ram.h b/migration/ram.h
> > index 0d1981f888..cfdcccd266 100644
> > --- a/migration/ram.h
> > +++ b/migration/ram.h
> > @@ -33,6 +33,31 @@
> > #include "exec/cpu-common.h"
> > #include "io/channel.h"
> >
> > +/*
> > + * RAM_SAVE_FLAG_ZERO used to be named RAM_SAVE_FLAG_COMPRESS, it
> > + * worked for pages that were filled with the same char. We switched
> > + * it to only search for the zero value. And to avoid confusion with
> > + * RAM_SAVE_FLAG_COMPRESS_PAGE just rename it.
> > + *
> > + * RAM_SAVE_FLAG_FULL was obsoleted in 2009.
> > + *
> > + * RAM_SAVE_FLAG_COMPRESS_PAGE (0x100) was removed in QEMU 9.1.
>
> Aren't these already covered by git log? The comment makes it seem like
> some special situation, but I think we're just documenting history here,
> no?
I guess so.
Maybe still useful when we hit a bug where some ancient QEMU manually
migrates to a new one and hits a weird 0x100 message.
>
> > + */
> > +#define RAM_SAVE_FLAG_FULL 0x01
> > +#define RAM_SAVE_FLAG_ZERO 0x02
> > +#define RAM_SAVE_FLAG_MEM_SIZE 0x04
> > +#define RAM_SAVE_FLAG_PAGE 0x08
> > +#define RAM_SAVE_FLAG_EOS 0x10
> > +#define RAM_SAVE_FLAG_CONTINUE 0x20
> > +#define RAM_SAVE_FLAG_XBZRLE 0x40
> > +/*
> > + * ONLY USED IN RDMA: Whenever this is found in the data stream, the flags
> > + * will be passed to rdma functions in the incoming-migration side.
> > + */
> > +#define RAM_SAVE_FLAG_HOOK 0x80
>
> No 0x100?
You just asked about it one min ago! ^^^^
>
> > +#define RAM_SAVE_FLAG_MULTIFD_FLUSH 0x200
> > +/* We can't use any flag that is bigger than 0x200 */
>
> Where does that limitation come from again? I know that
> RAM_SAVE_FLAG_MEM_SIZE shares a u64 with something else:
>
> qemu_put_be64(f, ram_bytes_total_with_ignored() |
> RAM_SAVE_FLAG_MEM_SIZE);
>
> For RAM_SAVE_FLAG_ZERO and RAM_SAVE_FLAG_PAGE, it might be a u32 (offset
> is ram_addr_t):
>
> save_page_header(pss, file, pss->block, offset | RAM_SAVE_FLAG_ZERO);
>
> But others just go by themselves:
>
> qemu_put_be64(f, RAM_SAVE_FLAG_MULTIFD_FLUSH);
No matter if it goes by itself, I am guessing migration was initially
developed by assuming each initial 8 bytes is an address, see:
ram_load_precopy():
addr = qemu_get_be64(f);
...
flags = addr & ~TARGET_PAGE_MASK;
addr &= TARGET_PAGE_MASK;
So it must be no more than 0x200, probably because it wants to work with
any architecture that has PAGE_SIZE>=1K (which is 0x400). Then the
offset will never use the low 10 bits.
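As a concrete (hypothetical) example of that packing, reusing the same masking
as the ram_load_precopy() lines above:

    /* Save side: the flag rides in the low bits of a page-aligned offset. */
    qemu_put_be64(f, offset | RAM_SAVE_FLAG_PAGE);

    /* Load side: split the same 64-bit word back into flags + offset. */
    addr = qemu_get_be64(f);
    flags = addr & ~TARGET_PAGE_MASK;   /* e.g. 0x008 == RAM_SAVE_FLAG_PAGE */
    addr &= TARGET_PAGE_MASK;           /* page-aligned offset, low bits zero */

so any flag bit above 0x200 would start colliding with the offset on a 1K-page
architecture.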
Want me to add a comment for that in this patch?
>
>
> > +
> > extern XBZRLECacheStats xbzrle_counters;
> >
> > /* Should be holding either ram_list.mutex, or the RCU lock. */
> > diff --git a/migration/rdma.h b/migration/rdma.h
> > index a8d27f33b8..f55f28bbed 100644
> > --- a/migration/rdma.h
> > +++ b/migration/rdma.h
> > @@ -33,13 +33,6 @@ void rdma_start_incoming_migration(InetSocketAddress *host_port, Error **errp);
> > #define RAM_CONTROL_ROUND 1
> > #define RAM_CONTROL_FINISH 3
> >
> > -/*
> > - * Whenever this is found in the data stream, the flags
> > - * will be passed to rdma functions in the incoming-migration
> > - * side.
> > - */
> > -#define RAM_SAVE_FLAG_HOOK 0x80
> > -
> > #define RAM_SAVE_CONTROL_NOT_SUPP -1000
> > #define RAM_SAVE_CONTROL_DELAYED -2000
> >
> > diff --git a/migration/ram.c b/migration/ram.c
> > index 7284c34bd8..44010ff325 100644
> > --- a/migration/ram.c
> > +++ b/migration/ram.c
> > @@ -71,27 +71,6 @@
> > /***********************************************************/
> > /* ram save/restore */
> >
> > -/*
> > - * RAM_SAVE_FLAG_ZERO used to be named RAM_SAVE_FLAG_COMPRESS, it
> > - * worked for pages that were filled with the same char. We switched
> > - * it to only search for the zero value. And to avoid confusion with
> > - * RAM_SAVE_FLAG_COMPRESS_PAGE just rename it.
> > - *
> > - * RAM_SAVE_FLAG_FULL was obsoleted in 2009.
> > - *
> > - * RAM_SAVE_FLAG_COMPRESS_PAGE (0x100) was removed in QEMU 9.1.
> > - */
> > -#define RAM_SAVE_FLAG_FULL 0x01
> > -#define RAM_SAVE_FLAG_ZERO 0x02
> > -#define RAM_SAVE_FLAG_MEM_SIZE 0x04
> > -#define RAM_SAVE_FLAG_PAGE 0x08
> > -#define RAM_SAVE_FLAG_EOS 0x10
> > -#define RAM_SAVE_FLAG_CONTINUE 0x20
> > -#define RAM_SAVE_FLAG_XBZRLE 0x40
> > -/* 0x80 is reserved in rdma.h for RAM_SAVE_FLAG_HOOK */
> > -#define RAM_SAVE_FLAG_MULTIFD_FLUSH 0x200
> > -/* We can't use any flag that is bigger than 0x200 */
> > -
> > /*
> > * mapped-ram migration supports O_DIRECT, so we need to make sure the
> > * userspace buffer, the IO operation size and the file offset are
>
--
Peter Xu
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v2 3/7] migration/ram: Move RAM_SAVE_FLAG* into ram.h
2024-12-06 15:03 ` Peter Xu
@ 2024-12-06 15:10 ` Fabiano Rosas
2024-12-06 15:46 ` Peter Xu
0 siblings, 1 reply; 25+ messages in thread
From: Fabiano Rosas @ 2024-12-06 15:10 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, Maciej S . Szmigiero, Cédric Le Goater,
Avihai Horon, Alex Williamson, Prasad Pandit
Peter Xu <peterx@redhat.com> writes:
> On Fri, Dec 06, 2024 at 10:43:31AM -0300, Fabiano Rosas wrote:
>> Peter Xu <peterx@redhat.com> writes:
>>
>> > Firstly, we're going to use the multifd flag soon in multifd code, so ram.c
>> > isn't gonna work.
>> >
>> > Secondly, we have a separate RDMA flag dangling around, which is definitely
>> > not obvious. There's one comment that helps, but not too much.
>> >
>> > We should just put it all together, so nothing will get overlooked.
>> >
>> > Signed-off-by: Peter Xu <peterx@redhat.com>
>>
>> Reviewed-by: Fabiano Rosas <farosas@suse.de>
>>
>> just some comments about preexisting stuff:
>>
>> > ---
>> > migration/ram.h | 25 +++++++++++++++++++++++++
>> > migration/rdma.h | 7 -------
>> > migration/ram.c | 21 ---------------------
>> > 3 files changed, 25 insertions(+), 28 deletions(-)
>> >
>> > diff --git a/migration/ram.h b/migration/ram.h
>> > index 0d1981f888..cfdcccd266 100644
>> > --- a/migration/ram.h
>> > +++ b/migration/ram.h
>> > @@ -33,6 +33,31 @@
>> > #include "exec/cpu-common.h"
>> > #include "io/channel.h"
>> >
>> > +/*
>> > + * RAM_SAVE_FLAG_ZERO used to be named RAM_SAVE_FLAG_COMPRESS, it
>> > + * worked for pages that were filled with the same char. We switched
>> > + * it to only search for the zero value. And to avoid confusion with
>> > + * RAM_SAVE_FLAG_COMPRESS_PAGE just rename it.
>> > + *
>> > + * RAM_SAVE_FLAG_FULL was obsoleted in 2009.
>> > + *
>> > + * RAM_SAVE_FLAG_COMPRESS_PAGE (0x100) was removed in QEMU 9.1.
>>
>> Aren't these already covered by git log? The comment makes it seem like
>> some special situation, but I think we're just documenting history here,
>> no?
>
> I guess so.
>
> Maybe still useful when we hit a bug where some ancient QEMU manually
> migrates to a new one and hits a weird 0x100 message.
>
>>
>> > + */
>> > +#define RAM_SAVE_FLAG_FULL 0x01
>> > +#define RAM_SAVE_FLAG_ZERO 0x02
>> > +#define RAM_SAVE_FLAG_MEM_SIZE 0x04
>> > +#define RAM_SAVE_FLAG_PAGE 0x08
>> > +#define RAM_SAVE_FLAG_EOS 0x10
>> > +#define RAM_SAVE_FLAG_CONTINUE 0x20
>> > +#define RAM_SAVE_FLAG_XBZRLE 0x40
>> > +/*
>> > + * ONLY USED IN RDMA: Whenever this is found in the data stream, the flags
>> > + * will be passed to rdma functions in the incoming-migration side.
>> > + */
>> > +#define RAM_SAVE_FLAG_HOOK 0x80
>>
>> No 0x100?
>
> You just asked about it one min ago! ^^^^
Ah, so RAM_SAVE_FLAG_FULL was left behind but
RAM_SAVE_FLAG_COMPRESS_PAGE was removed, I missed that.
>
>>
>> > +#define RAM_SAVE_FLAG_MULTIFD_FLUSH 0x200
>> > +/* We can't use any flag that is bigger than 0x200 */
>>
>> Where does that limitation come from again? I know that
>> RAM_SAVE_FLAG_MEM_SIZE shares a u64 with something else:
>>
>> qemu_put_be64(f, ram_bytes_total_with_ignored() |
>> RAM_SAVE_FLAG_MEM_SIZE);
>>
>> For RAM_SAVE_FLAG_ZERO and RAM_SAVE_FLAG_PAGE, it might be a u32 (offset
>> is ram_addr_t):
>>
>> save_page_header(pss, file, pss->block, offset | RAM_SAVE_FLAG_ZERO);
>>
>> But others just go by themselves:
>>
>> qemu_put_be64(f, RAM_SAVE_FLAG_MULTIFD_FLUSH);
>
> No matter if it goes by itself, I am guessing migration was initially
> developed by assuming each initial 8 bytes is an address, see:
>
> ram_load_precopy():
> addr = qemu_get_be64(f);
> ...
> flags = addr & ~TARGET_PAGE_MASK;
> addr &= TARGET_PAGE_MASK;
>
> So it must be no more than 0x200, probably because it wants to work with
> any architecture that has PAGE_SIZE>=1K (which is 0x400). Then the
> offset will never use the low 10 bits.
>
> Want me to add a comment for that in this patch?
Yes, please.
>
>>
>>
>> > +
>> > extern XBZRLECacheStats xbzrle_counters;
>> >
>> > /* Should be holding either ram_list.mutex, or the RCU lock. */
>> > diff --git a/migration/rdma.h b/migration/rdma.h
>> > index a8d27f33b8..f55f28bbed 100644
>> > --- a/migration/rdma.h
>> > +++ b/migration/rdma.h
>> > @@ -33,13 +33,6 @@ void rdma_start_incoming_migration(InetSocketAddress *host_port, Error **errp);
>> > #define RAM_CONTROL_ROUND 1
>> > #define RAM_CONTROL_FINISH 3
>> >
>> > -/*
>> > - * Whenever this is found in the data stream, the flags
>> > - * will be passed to rdma functions in the incoming-migration
>> > - * side.
>> > - */
>> > -#define RAM_SAVE_FLAG_HOOK 0x80
>> > -
>> > #define RAM_SAVE_CONTROL_NOT_SUPP -1000
>> > #define RAM_SAVE_CONTROL_DELAYED -2000
>> >
>> > diff --git a/migration/ram.c b/migration/ram.c
>> > index 7284c34bd8..44010ff325 100644
>> > --- a/migration/ram.c
>> > +++ b/migration/ram.c
>> > @@ -71,27 +71,6 @@
>> > /***********************************************************/
>> > /* ram save/restore */
>> >
>> > -/*
>> > - * RAM_SAVE_FLAG_ZERO used to be named RAM_SAVE_FLAG_COMPRESS, it
>> > - * worked for pages that were filled with the same char. We switched
>> > - * it to only search for the zero value. And to avoid confusion with
>> > - * RAM_SAVE_FLAG_COMPRESS_PAGE just rename it.
>> > - *
>> > - * RAM_SAVE_FLAG_FULL was obsoleted in 2009.
>> > - *
>> > - * RAM_SAVE_FLAG_COMPRESS_PAGE (0x100) was removed in QEMU 9.1.
>> > - */
>> > -#define RAM_SAVE_FLAG_FULL 0x01
>> > -#define RAM_SAVE_FLAG_ZERO 0x02
>> > -#define RAM_SAVE_FLAG_MEM_SIZE 0x04
>> > -#define RAM_SAVE_FLAG_PAGE 0x08
>> > -#define RAM_SAVE_FLAG_EOS 0x10
>> > -#define RAM_SAVE_FLAG_CONTINUE 0x20
>> > -#define RAM_SAVE_FLAG_XBZRLE 0x40
>> > -/* 0x80 is reserved in rdma.h for RAM_SAVE_FLAG_HOOK */
>> > -#define RAM_SAVE_FLAG_MULTIFD_FLUSH 0x200
>> > -/* We can't use any flag that is bigger than 0x200 */
>> > -
>> > /*
>> > * mapped-ram migration supports O_DIRECT, so we need to make sure the
>> > * userspace buffer, the IO operation size and the file offset are
>>
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v2 6/7] migration/multifd: Cleanup src flushes on condition check
2024-12-06 14:18 ` Fabiano Rosas
@ 2024-12-06 15:13 ` Peter Xu
0 siblings, 0 replies; 25+ messages in thread
From: Peter Xu @ 2024-12-06 15:13 UTC (permalink / raw)
To: Fabiano Rosas
Cc: qemu-devel, Maciej S . Szmigiero, Cédric Le Goater,
Avihai Horon, Alex Williamson, Prasad Pandit
On Fri, Dec 06, 2024 at 11:18:59AM -0300, Fabiano Rosas wrote:
> Peter Xu <peterx@redhat.com> writes:
>
> > The src flush condition check is over complicated, and it's getting more
> > out of control if postcopy will be involved.
> >
> > In general, we have two modes to do the sync: legacy or modern ways.
> > Legacy uses per-section flush, modern uses per-round flush.
> >
> > Mapped-ram always uses the modern, which is per-round.
> >
> > Introduce two helpers, which can greatly simplify the code, and hopefully
> > make it readable again.
> >
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> > migration/multifd.h | 2 ++
> > migration/multifd-nocomp.c | 42 ++++++++++++++++++++++++++++++++++++++
> > migration/ram.c | 10 +++------
> > 3 files changed, 47 insertions(+), 7 deletions(-)
> >
> > diff --git a/migration/multifd.h b/migration/multifd.h
> > index c9ae57ea02..582040922f 100644
> > --- a/migration/multifd.h
> > +++ b/migration/multifd.h
> > @@ -351,6 +351,8 @@ static inline uint32_t multifd_ram_page_count(void)
> > void multifd_ram_save_setup(void);
> > void multifd_ram_save_cleanup(void);
> > int multifd_ram_flush_and_sync(QEMUFile *f);
> > +bool multifd_ram_sync_per_round(void);
> > +bool multifd_ram_sync_per_section(void);
> > size_t multifd_ram_payload_size(void);
> > void multifd_ram_fill_packet(MultiFDSendParams *p);
> > int multifd_ram_unfill_packet(MultiFDRecvParams *p, Error **errp);
> > diff --git a/migration/multifd-nocomp.c b/migration/multifd-nocomp.c
> > index 58372db0f4..c1f686c0ce 100644
> > --- a/migration/multifd-nocomp.c
> > +++ b/migration/multifd-nocomp.c
> > @@ -344,6 +344,48 @@ retry:
> > return true;
> > }
> >
> > +/*
> > + * We have two modes for multifd flushes:
> > + *
> > + * - Per-section mode: this is the legacy way to flush, it requires one
> > + * MULTIFD_FLAG_SYNC message for each RAM_SAVE_FLAG_EOS.
> > + *
> > + * - Per-round mode: this is the modern way to flush, it requires one
> > + * MULTIFD_FLAG_SYNC message only for each round of RAM scan. Normally
> > + * it's paired with a new RAM_SAVE_FLAG_MULTIFD_FLUSH message in network
> > + * based migrations.
> > + *
> > + * One thing to mention is mapped-ram always use the modern way to sync.
> > + */
> > +
> > +/* Do we need a per-section multifd flush (legacy way)? */
> > +bool multifd_ram_sync_per_section(void)
> > +{
> > + if (!migrate_multifd()) {
> > + return false;
> > + }
> > +
> > + if (migrate_mapped_ram()) {
> > + return false;
> > + }
> > +
> > + return migrate_multifd_flush_after_each_section();
> > +}
> > +
> > +/* Do we need a per-round multifd flush (modern way)? */
> > +bool multifd_ram_sync_per_round(void)
> > +{
> > + if (!migrate_multifd()) {
> > + return false;
> > + }
> > +
> > + if (migrate_mapped_ram()) {
> > + return true;
> > + }
> > +
> > + return !migrate_multifd_flush_after_each_section();
> > +}
> > +
> > int multifd_ram_flush_and_sync(QEMUFile *f)
> > {
> > MultiFDSyncReq req;
> > diff --git a/migration/ram.c b/migration/ram.c
> > index 154ff5abd4..5d4bdefe69 100644
> > --- a/migration/ram.c
> > +++ b/migration/ram.c
> > @@ -1302,9 +1302,7 @@ static int find_dirty_block(RAMState *rs, PageSearchStatus *pss)
> > pss->page = 0;
> > pss->block = QLIST_NEXT_RCU(pss->block, next);
> > if (!pss->block) {
> > - if (migrate_multifd() &&
> > - (!migrate_multifd_flush_after_each_section() ||
> > - migrate_mapped_ram())) {
> > + if (multifd_ram_sync_per_round()) {
>
> If we're already implicitly declaring which parts of the code mean
> "round" or "section", we could fold the flush into the function and call
> it unconditionally.
That will add mystery when reading the callers, as one would not be able to
identify whether the flush was sent or not.
If you have a strong preference on it, you can reply with a formal patch
and I can include it when I repost. However one comment below:
>
> We don't need ram.c code to be deciding about multifd. This could be all
> hidden away in the multifd-nocomp code:
Side note: maybe I should have a pre-requisite patch moving flush things
out of multifd-nocomp.c and into multifd.c, because compressors will also
need it. Then I could add multifd_ram_sync_per_* into multifd.c too.
>
> -- >8 --
> diff --git a/migration/multifd-nocomp.c b/migration/multifd-nocomp.c
> index c1f686c0ce..6a7eee4c25 100644
> --- a/migration/multifd-nocomp.c
> +++ b/migration/multifd-nocomp.c
> @@ -358,32 +358,26 @@ retry:
> * One thing to mention is mapped-ram always use the modern way to sync.
> */
>
> -/* Do we need a per-section multifd flush (legacy way)? */
> -bool multifd_ram_sync_per_section(void)
> +int multifd_ram_sync_per_section(QEMUFile *f)
> {
> - if (!migrate_multifd()) {
> - return false;
> + if (!migrate_multifd() || !migrate_multifd_flush_after_each_section()) {
> + return 0;
If you're going to reply with the patch, please consider not queuing up the
if condition check again. That's one of the reasons why I introduced the
helpers anyway, so that each check is easy to see, and we can easily
add a comment to each check whenever it's necessary (though I unified the
comment above, because the two modes share it).
> }
>
> if (migrate_mapped_ram()) {
> - return false;
> + return 0;
> }
>
> - return migrate_multifd_flush_after_each_section();
> + return multifd_ram_flush_and_sync(f);
> }
>
> -/* Do we need a per-round multifd flush (modern way)? */
> -bool multifd_ram_sync_per_round(void)
> +int multifd_ram_sync_per_round(QEMUFile *f)
> {
> - if (!migrate_multifd()) {
> - return false;
> + if (!migrate_multifd() || migrate_multifd_flush_after_each_section()) {
Same here.
> + return 0;
> }
>
> - if (migrate_mapped_ram()) {
> - return true;
> - }
> -
> - return !migrate_multifd_flush_after_each_section();
> + return multifd_ram_flush_and_sync(f);
> }
>
> int multifd_ram_flush_and_sync(QEMUFile *f)
> diff --git a/migration/multifd.h b/migration/multifd.h
> index 582040922f..3b42128167 100644
> --- a/migration/multifd.h
> +++ b/migration/multifd.h
> @@ -351,8 +351,8 @@ static inline uint32_t multifd_ram_page_count(void)
> void multifd_ram_save_setup(void);
> void multifd_ram_save_cleanup(void);
> int multifd_ram_flush_and_sync(QEMUFile *f);
> -bool multifd_ram_sync_per_round(void);
> -bool multifd_ram_sync_per_section(void);
> +int multifd_ram_sync_per_round(QEMUFile *f);
> +int multifd_ram_sync_per_section(QEMUFile *f);
> size_t multifd_ram_payload_size(void);
> void multifd_ram_fill_packet(MultiFDSendParams *p);
> int multifd_ram_unfill_packet(MultiFDRecvParams *p, Error **errp);
> diff --git a/migration/ram.c b/migration/ram.c
> index ddee703585..fe33c8e0e8 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -1302,12 +1302,10 @@ static int find_dirty_block(RAMState *rs, PageSearchStatus *pss)
> pss->page = 0;
> pss->block = QLIST_NEXT_RCU(pss->block, next);
> if (!pss->block) {
> - if (multifd_ram_sync_per_round()) {
> - QEMUFile *f = rs->pss[RAM_CHANNEL_PRECOPY].pss_channel;
> - int ret = multifd_ram_flush_and_sync(f);
> - if (ret < 0) {
> - return ret;
> - }
> + QEMUFile *f = rs->pss[RAM_CHANNEL_PRECOPY].pss_channel;
> + int ret = multifd_ram_sync_per_round(f);
> + if (ret < 0) {
> + return ret;
> }
>
> /* Hit the end of the list */
> @@ -3203,11 +3201,9 @@ static int ram_save_iterate(QEMUFile *f, void *opaque)
>
> out:
> if (ret >= 0 && migration_is_running()) {
> - if (multifd_ram_sync_per_section()) {
> - ret = multifd_ram_flush_and_sync(f);
> - if (ret < 0) {
> - return ret;
> - }
> + ret = multifd_ram_sync_per_section(f);
> + if (ret < 0) {
> + return ret;
> }
>
> qemu_put_be64(f, RAM_SAVE_FLAG_EOS);
> @@ -3276,15 +3272,13 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
> }
> }
>
> - if (multifd_ram_sync_per_section()) {
> - /*
> - * Only the old dest QEMU will need this sync, because each EOS
> - * will require one SYNC message on each channel.
> - */
> - ret = multifd_ram_flush_and_sync(f);
> - if (ret < 0) {
> - return ret;
> - }
> + /*
> + * Only the old dest QEMU will need this sync, because each EOS
> + * will require one SYNC message on each channel.
> + */
> + ret = multifd_ram_sync_per_section(f);
> + if (ret < 0) {
> + return ret;
> }
>
> if (migrate_mapped_ram()) {
> --
> 2.35.3
>
--
Peter Xu
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v2 7/7] migration/multifd: Document the reason to sync for save_setup()
2024-12-06 14:40 ` Fabiano Rosas
@ 2024-12-06 15:36 ` Peter Xu
2024-12-06 17:01 ` Fabiano Rosas
0 siblings, 1 reply; 25+ messages in thread
From: Peter Xu @ 2024-12-06 15:36 UTC (permalink / raw)
To: Fabiano Rosas
Cc: qemu-devel, Maciej S . Szmigiero, Cédric Le Goater,
Avihai Horon, Alex Williamson, Prasad Pandit
On Fri, Dec 06, 2024 at 11:40:27AM -0300, Fabiano Rosas wrote:
> Peter Xu <peterx@redhat.com> writes:
>
> > It's not straightforward to see why src QEMU needs to sync multifd during
> > setup() phase. After all, there's no page queued at that point.
> >
> > For old QEMUs, there's a solid reason: EOS requires it to work. It's less
> > obvious for the new QEMUs, which do not take EOS messages as sync requests.
> >
> > One will figure that out only when this is conditionally removed. In fact,
> > the author did try it out. Logically we could still avoid doing this on
> > new machine types, however that needs a separate compat field, which would
> > be overkill just to save some trivial overhead in the setup() phase.
> >
> > Let's instead document it completely, to avoid someone else trying this
> > again and doing the debugging one more time, or anyone being confused about
> > why this ever existed.
> >
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> > migration/ram.c | 27 +++++++++++++++++++++++++++
> > 1 file changed, 27 insertions(+)
> >
> > diff --git a/migration/ram.c b/migration/ram.c
> > index 5d4bdefe69..ddee703585 100644
> > --- a/migration/ram.c
> > +++ b/migration/ram.c
> > @@ -3036,6 +3036,33 @@ static int ram_save_setup(QEMUFile *f, void *opaque, Error **errp)
> > migration_ops->ram_save_target_page = ram_save_target_page_legacy;
> > }
> >
> > + /*
> > + * This operation is unfortunate..
> > + *
> > + * For legacy QEMUs using per-section sync
> > + * =======================================
> > + *
> > + * This must exist because the EOS below requires the SYNC messages
> > + * per-channel to work.
> > + *
> > + * For modern QEMUs using per-round sync
> > + * =====================================
> > + *
> > + * Logically this sync is not needed (because we know there's nothing
> > + * in the multifd queue yet!).
>
> This is a bit misleading because even today we could split the
> multifd_ram_flush_and_sync() into _flush and _sync (haven't I seen a
> patch doing this somewhere? Maybe from Maciej...) and call just the
> _sync here, which is unrelated to any multifd queue.
Yeah you have a point, maybe at least I shouldn't mention the queues, they
can be irrelevant.
>
> I think we shouldn't tie "sync" with "wait for multifd threads to finish
> sending their data (a kind of flush)" as this implies. The sync is as
> much making sure the threads are ready to receive as it is making sure
> the data is received in order with relation to ram scanning rounds.
>
> IOW, the local sync is what ensures multifd send threads are flushed
> while this code deals with the sync of src&dst threads, which is "just"
> a synchronization point between the two QEMUs.
This is a remote sync, not a local sync. But yes, it can be the
synchronization point.
As I mentioned below, such sync point has nothing to do with src, so it can
be implemented in dest alone without such sync message. It works, though.
>
> > However as a side effect, this makes
> > + * sure the dest side won't receive any data before it properly reaches
> > + * ram_load_precopy().
>
> I'm not sure it's a side-effect. It seems deliberate to me, seeing that
> multifd usually does its own synchronization. For instance, on the send
> side we also need some sync to make sure ram.c doesn't send data to
> multifd send threads that are not ready yet (i.e. the wait on
> channels_ready at the start of multifd_send()).
Yes, and that's exactly what I wanted to express. If dest has that
"channels_ready" thing, I'm pretty sure we don't need this remote sync.
We're using a heavier sync to serve the purpose of "local sync for
dest". It's ok, but it's very unclear on what it really does.
>
> > + *
> > + * Without this sync, src QEMU can send data too soon so that dest may
> > + * not have been ready to receive it (e.g., rb->receivedmap may be
> > + * uninitialized, for example).
> > + *
> > + * Logically "wait for recv setup ready" shouldn't need to involve src
> > + * QEMU at all, however to be compatible with old QEMUs, let's stick
>
> I don't understand this statement, you're saying that QEMU src could
> just start dumping data on the channel without a remote end? Certainly
> for file migrations, but socket as well?
Yes.
When reaching here on sender side, all multifd channels are already there:
multifd_send_setup() guaranteed it. We can start dump things, AFAICT,
irrelevant of what dest is doing.
Maybe the recv threads are not even created, but that isn't relevant, IMO,
as long as they'll be there at some point and start collecting socket
buffers.
>
> > + * with this. Fortunately the overhead is low to sync during setup
> > + * because the VM is running, so at least it's not accounted as part of
> > + * downtime.
> > + */
> > bql_unlock();
> > ret = multifd_ram_flush_and_sync(f);
> > bql_lock();
>
I removed ambiguous wordings on "queue is empty", and simplified it a bit.
How's this one look?
/*
* This operation is unfortunate..
*
* For legacy QEMUs using per-section sync
* =======================================
*
* This must exist because the EOS below requires the SYNC messages
* per-channel to work.
*
* For modern QEMUs using per-round sync
* =====================================
*
* Logically such sync is not needed, and recv threads should not run
* until setup ready (using things like channels_ready on src). Then
* we should be all fine.
*
* However even if we add channels_ready to recv side in new QEMUs, old
* QEMU won't have them so this sync will still be needed to make sure
* multifd recv threads won't start processing guest pages early before
* ram_load_setup() is properly done.
*
* Let's stick with this. Fortunately the overhead is low to sync
* during setup because the VM is running, so at least it's not
* accounted as part of downtime.
*/
--
Peter Xu
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v2 3/7] migration/ram: Move RAM_SAVE_FLAG* into ram.h
2024-12-06 15:10 ` Fabiano Rosas
@ 2024-12-06 15:46 ` Peter Xu
2024-12-06 16:58 ` Fabiano Rosas
0 siblings, 1 reply; 25+ messages in thread
From: Peter Xu @ 2024-12-06 15:46 UTC (permalink / raw)
To: Fabiano Rosas
Cc: qemu-devel, Maciej S . Szmigiero, Cédric Le Goater,
Avihai Horon, Alex Williamson, Prasad Pandit
On Fri, Dec 06, 2024 at 12:10:46PM -0300, Fabiano Rosas wrote:
> > Want me to add a comment for that in this patch?
>
> Yes, please.
When I did it, I also did a few other things:
- Rearranged the comment section, put all things together
- Remove RAM_SAVE_FLAG_FULL directly
- Reindents
The plan is to squash below diff in v3 post, feel free to comment before
that.
===8<===
From 593227f1b9e678418f5b99ac51525884fa0adfc6 Mon Sep 17 00:00:00 2001
From: Peter Xu <peterx@redhat.com>
Date: Fri, 6 Dec 2024 10:43:25 -0500
Subject: [PATCH] fixup! migration/ram: Move RAM_SAVE_FLAG* into ram.h
Signed-off-by: Peter Xu <peterx@redhat.com>
---
migration/ram.h | 33 ++++++++++++++++++---------------
migration/ram.c | 22 ++++++++++------------
2 files changed, 28 insertions(+), 27 deletions(-)
diff --git a/migration/ram.h b/migration/ram.h
index cfdcccd266..921c39a2c5 100644
--- a/migration/ram.h
+++ b/migration/ram.h
@@ -39,24 +39,27 @@
* it to only search for the zero value. And to avoid confusion with
* RAM_SAVE_FLAG_COMPRESS_PAGE just rename it.
*
- * RAM_SAVE_FLAG_FULL was obsoleted in 2009.
+ * RAM_SAVE_FLAG_FULL (0x01) was obsoleted in 2009.
*
* RAM_SAVE_FLAG_COMPRESS_PAGE (0x100) was removed in QEMU 9.1.
+ *
+ * RAM_SAVE_FLAG_HOOK is only used in RDMA. Whenever this is found in the
+ * data stream, the flags will be passed to rdma functions in the
+ * incoming-migration side.
+ *
+ * We can't use any flag that is bigger than 0x200, because the flags are
+ * always assumed to be encoded in a ramblock address offset, which is
+ * multiple of PAGE_SIZE. Here it means QEMU supports migration with any
+ * architecture that has PAGE_SIZE>=1K (0x400).
*/
-#define RAM_SAVE_FLAG_FULL 0x01
-#define RAM_SAVE_FLAG_ZERO 0x02
-#define RAM_SAVE_FLAG_MEM_SIZE 0x04
-#define RAM_SAVE_FLAG_PAGE 0x08
-#define RAM_SAVE_FLAG_EOS 0x10
-#define RAM_SAVE_FLAG_CONTINUE 0x20
-#define RAM_SAVE_FLAG_XBZRLE 0x40
-/*
- * ONLY USED IN RDMA: Whenever this is found in the data stream, the flags
- * will be passed to rdma functions in the incoming-migration side.
- */
-#define RAM_SAVE_FLAG_HOOK 0x80
-#define RAM_SAVE_FLAG_MULTIFD_FLUSH 0x200
-/* We can't use any flag that is bigger than 0x200 */
+#define RAM_SAVE_FLAG_ZERO 0x002
+#define RAM_SAVE_FLAG_MEM_SIZE 0x004
+#define RAM_SAVE_FLAG_PAGE 0x008
+#define RAM_SAVE_FLAG_EOS 0x010
+#define RAM_SAVE_FLAG_CONTINUE 0x020
+#define RAM_SAVE_FLAG_XBZRLE 0x040
+#define RAM_SAVE_FLAG_HOOK 0x080
+#define RAM_SAVE_FLAG_MULTIFD_FLUSH 0x200
extern XBZRLECacheStats xbzrle_counters;
--
2.47.0
--
Peter Xu
^ permalink raw reply related [flat|nested] 25+ messages in thread
* Re: [PATCH v2 3/7] migration/ram: Move RAM_SAVE_FLAG* into ram.h
2024-12-06 15:46 ` Peter Xu
@ 2024-12-06 16:58 ` Fabiano Rosas
0 siblings, 0 replies; 25+ messages in thread
From: Fabiano Rosas @ 2024-12-06 16:58 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, Maciej S . Szmigiero, Cédric Le Goater,
Avihai Horon, Alex Williamson, Prasad Pandit
Peter Xu <peterx@redhat.com> writes:
> On Fri, Dec 06, 2024 at 12:10:46PM -0300, Fabiano Rosas wrote:
>> > Wanna me to add a comment for that in this patch?
>>
>> Yes, please.
>
> When I did it, I also did a few other things:
>
> - Rearranged the comment section, put all things together
> - Remove RAM_SAVE_FLAG_ZERO directly
> - Reindents
>
> The plan is to squash below diff in v3 post, feel free to comment before
> that.
>
> ===8<===
> From 593227f1b9e678418f5b99ac51525884fa0adfc6 Mon Sep 17 00:00:00 2001
> From: Peter Xu <peterx@redhat.com>
> Date: Fri, 6 Dec 2024 10:43:25 -0500
> Subject: [PATCH] fixup! migration/ram: Move RAM_SAVE_FLAG* into ram.h
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
LGTM
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v2 7/7] migration/multifd: Document the reason to sync for save_setup()
2024-12-06 15:36 ` Peter Xu
@ 2024-12-06 17:01 ` Fabiano Rosas
0 siblings, 0 replies; 25+ messages in thread
From: Fabiano Rosas @ 2024-12-06 17:01 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, Maciej S . Szmigiero, Cédric Le Goater,
Avihai Horon, Alex Williamson, Prasad Pandit
Peter Xu <peterx@redhat.com> writes:
> On Fri, Dec 06, 2024 at 11:40:27AM -0300, Fabiano Rosas wrote:
>> Peter Xu <peterx@redhat.com> writes:
>>
>> > It's not straightforward to see why src QEMU needs to sync multifd during
>> > setup() phase. After all, there's no page queued at that point.
>> >
>> > For old QEMUs, there's a solid reason: EOS requires it to work. It's less
>> > obvious for the new QEMUs, which do not take EOS messages as sync requests.
>> >
>> > One will figure that out only when this is conditionally removed. In fact,
>> > the author did try it out. Logically we could still avoid doing this on
>> > new machine types, however that needs a separate compat field, which would
>> > be overkill just to save some trivial overhead in the setup() phase.
>> >
>> > Let's instead document it completely, to avoid someone else trying this
>> > again and doing the debugging one more time, or anyone being confused about
>> > why this ever existed.
>> >
>> > Signed-off-by: Peter Xu <peterx@redhat.com>
>> > ---
>> > migration/ram.c | 27 +++++++++++++++++++++++++++
>> > 1 file changed, 27 insertions(+)
>> >
>> > diff --git a/migration/ram.c b/migration/ram.c
>> > index 5d4bdefe69..ddee703585 100644
>> > --- a/migration/ram.c
>> > +++ b/migration/ram.c
>> > @@ -3036,6 +3036,33 @@ static int ram_save_setup(QEMUFile *f, void *opaque, Error **errp)
>> > migration_ops->ram_save_target_page = ram_save_target_page_legacy;
>> > }
>> >
>> > + /*
>> > + * This operation is unfortunate..
>> > + *
>> > + * For legacy QEMUs using per-section sync
>> > + * =======================================
>> > + *
>> > + * This must exist because the EOS below requires the SYNC messages
>> > + * per-channel to work.
>> > + *
>> > + * For modern QEMUs using per-round sync
>> > + * =====================================
>> > + *
>> > + * Logically this sync is not needed (because we know there's nothing
>> > + * in the multifd queue yet!).
>>
>> This is a bit misleading because even today we could split the
>> multifd_ram_flush_and_sync() into _flush and _sync (haven't I seen a
>> patch doing this somewhere? Maybe from Maciej...) and call just the
>> _sync here, which is unrelated to any multifd queue.
>
> Yeah you have a point, maybe at least I shouldn't mention the queues, they
> can be irrelevant.
>
>>
>> I think we shouldn't tie "sync" with "wait for multifd threads to finish
>> sending their data (a kind of flush)" as this implies. The sync is as
>> much making sure the threads are ready to receive as it is making sure
>> the data is received in order with relation to ram scanning rounds.
>>
>> IOW, the local sync is what ensures multifd send threads are flushed
>> while this code deals with the sync of src&dst threads, which is "just"
>> a synchronization point between the two QEMUs.
>
> This is a remote sync, not a local sync. But yes, it can be the
> synchronization point.
>
> As I mentioned below, such sync point has nothing to do with src, so it can
> be implemented in dest alone without such sync message. It works, though.
>
>>
>> > However as a side effect, this makes
>> > + * sure the dest side won't receive any data before it properly reaches
>> > + * ram_load_precopy().
>>
>> I'm not sure it's a side-effect. It seems deliberate to me, seeing that
>> multifd usually does its own synchronization. For instance, on the send
>> side we also need some sync to make sure ram.c doesn't send data to
>> multifd send threads that are not ready yet (i.e. the wait on
>> channels_ready at the start of multifd_send()).
>
> Yes, and that's exactly what I wanted to express. If dest has that
> "channels_ready" thing, I'm pretty sure we don't need this remote sync.
> We're using a heavier sync to serve the purpose of "local sync for
> dest". It's ok, but it's very unclear on what it really does.
>
>>
>> > + *
>> > + * Without this sync, src QEMU can send data too soon so that dest may
>> > + * not have been ready to receive it (e.g., rb->receivedmap may be
>> > + * uninitialized, for example).
>> > + *
>> > + * Logically "wait for recv setup ready" shouldn't need to involve src
>> > + * QEMU at all, however to be compatible with old QEMUs, let's stick
>>
>> I don't understand this statement, you're saying that QEMU src could
>> just start dumping data on the channel without a remote end? Certainly
>> for file migrations, but socket as well?
>
> Yes.
>
> When reaching here on sender side, all multifd channels are already there:
> multifd_send_setup() guaranteed it. We can start dump things, AFAICT,
> irrelevant of what dest is doing.
>
> Maybe the recv threads are not even created, but that isn't relevant, IMO,
> as long as they'll be there at some point and start collecting socket
> buffers.
>
>>
>> > + * with this. Fortunately the overhead is low to sync during setup
>> > + * because the VM is running, so at least it's not accounted as part of
>> > + * downtime.
>> > + */
>> > bql_unlock();
>> > ret = multifd_ram_flush_and_sync(f);
>> > bql_lock();
>>
>
> I removed ambiguous wordings on "queue is empty", and simplified it a bit.
> How's this one look?
>
> /*
> * This operation is unfortunate..
> *
> * For legacy QEMUs using per-section sync
> * =======================================
> *
> * This must exist because the EOS below requires the SYNC messages
> * per-channel to work.
> *
> * For modern QEMUs using per-round sync
> * =====================================
> *
> * Logically such sync is not needed, and recv threads should not run
> * until setup ready (using things like channels_ready on src). Then
> * we should be all fine.
> *
> * However even if we add channels_ready to recv side in new QEMUs, old
> * QEMU won't have them so this sync will still be needed to make sure
> * multifd recv threads won't start processing guest pages early before
> * ram_load_setup() is properly done.
> *
> * Let's stick with this. Fortunately the overhead is low to sync
> * during setup because the VM is running, so at least it's not
> * accounted as part of downtime.
> */
Looks ok, thanks!
^ permalink raw reply [flat|nested] 25+ messages in thread
end of thread
Thread overview: 25+ messages
2024-12-06 0:58 [PATCH v2 0/7] migration/multifd: Some VFIO / postcopy preparations on flush Peter Xu
2024-12-06 0:58 ` [PATCH v2 1/7] migration/multifd: Further remove the SYNC on complete Peter Xu
2024-12-06 13:17 ` Fabiano Rosas
2024-12-06 14:40 ` Peter Xu
2024-12-06 0:58 ` [PATCH v2 2/7] migration/multifd: Allow to sync with sender threads only Peter Xu
2024-12-06 13:26 ` Fabiano Rosas
2024-12-06 14:50 ` Peter Xu
2024-12-06 15:00 ` Fabiano Rosas
2024-12-06 0:58 ` [PATCH v2 3/7] migration/ram: Move RAM_SAVE_FLAG* into ram.h Peter Xu
2024-12-06 13:43 ` Fabiano Rosas
2024-12-06 15:03 ` Peter Xu
2024-12-06 15:10 ` Fabiano Rosas
2024-12-06 15:46 ` Peter Xu
2024-12-06 16:58 ` Fabiano Rosas
2024-12-06 0:58 ` [PATCH v2 4/7] migration/multifd: Unify RAM_SAVE_FLAG_MULTIFD_FLUSH messages Peter Xu
2024-12-06 14:12 ` Fabiano Rosas
2024-12-06 0:58 ` [PATCH v2 5/7] migration/multifd: Remove sync processing on postcopy Peter Xu
2024-12-06 14:19 ` Fabiano Rosas
2024-12-06 0:58 ` [PATCH v2 6/7] migration/multifd: Cleanup src flushes on condition check Peter Xu
2024-12-06 14:18 ` Fabiano Rosas
2024-12-06 15:13 ` Peter Xu
2024-12-06 0:58 ` [PATCH v2 7/7] migration/multifd: Document the reason to sync for save_setup() Peter Xu
2024-12-06 14:40 ` Fabiano Rosas
2024-12-06 15:36 ` Peter Xu
2024-12-06 17:01 ` Fabiano Rosas