qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
From: Avihai Horon <avihaih@nvidia.com>
To: <qemu-devel@nongnu.org>
Cc: Peter Xu <peterx@redhat.com>, Fabiano Rosas <farosas@suse.de>,
	"Avihai Horon" <avihaih@nvidia.com>
Subject: [PATCH 05/17] migration/multifd: Wait for multifd channels creation before proceeding
Date: Thu, 25 Jan 2024 18:25:16 +0200	[thread overview]
Message-ID: <20240125162528.7552-6-avihaih@nvidia.com> (raw)
In-Reply-To: <20240125162528.7552-1-avihaih@nvidia.com>

Currently, multifd channels are created asynchronously without waiting
for their creation -- migration simply proceeds and may wait in
multifd_send_sync_main(), which is called by ram_save_setup(). This
hides in it some race conditions which can cause an unexpected behavior
if some channels creation fail.

For example, the following scenario of multifd migration with two
channels, where the first channel creation fails, will end in a
segmentation fault (time advances from top to bottom):

Thread           | Code execution
------------------------------------------------------------------------
Multifd 1        |
                 | multifd_new_send_channel_async (errors and quits)
                 |   multifd_new_send_channel_cleanup
                 |
Migration thread |
                 | qemu_savevm_state_setup
                 |   ram_save_setup
                 |     multifd_send_sync_main
                 |     (detects Multifd 1 error and quits)
                 | [...]
                 | migration_iteration_finish
                 |   migrate_fd_cleanup_schedule
                 |
Main thread      |
                 | migrate_fd_cleanup
                 |   multifd_save_cleanup (destroys Multifd 2 resources)
                 |
Multifd 2        |
                 | multifd_new_send_channel_async
                 | (accesses destroyed resources, segfault)

In another scenario, migration can hang indefinitely:
1. Main migration thread reaches multifd_send_sync_main() and waits on
   the semaphores.
2. Then, all multifd channels creation fails, so they post the
   semaphores and quit.
3. Main migration channel will not identify the error, proceed to send
   pages and will hang.

Fix it by waiting for all multifd channels to be created before
proceeding with migration.

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
 migration/multifd.h   |  3 +++
 migration/migration.c |  1 +
 migration/multifd.c   | 34 +++++++++++++++++++++++++++++++---
 migration/ram.c       |  7 +++++++
 4 files changed, 42 insertions(+), 3 deletions(-)

diff --git a/migration/multifd.h b/migration/multifd.h
index 35d11f103c..87a64e0a87 100644
--- a/migration/multifd.h
+++ b/migration/multifd.h
@@ -23,6 +23,7 @@ void multifd_recv_new_channel(QIOChannel *ioc, Error **errp);
 void multifd_recv_sync_main(void);
 int multifd_send_sync_main(void);
 int multifd_queue_page(RAMBlock *block, ram_addr_t offset);
+int multifd_send_channels_created(void);
 
 /* Multifd Compression flags */
 #define MULTIFD_FLAG_SYNC (1 << 0)
@@ -86,6 +87,8 @@ typedef struct {
     /* multifd flags for sending ram */
     int write_flags;
 
+    /* Syncs channel creation and migration thread */
+    QemuSemaphore create_sem;
     /* sem where to wait for more work */
     QemuSemaphore sem;
     /* syncs main thread and channels */
diff --git a/migration/migration.c b/migration/migration.c
index 9c769a1ecd..d81d96eaa5 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -3621,6 +3621,7 @@ void migrate_fd_connect(MigrationState *s, Error *error_in)
         error_report_err(local_err);
         migrate_set_state(&s->state, MIGRATION_STATUS_SETUP,
                           MIGRATION_STATUS_FAILED);
+        multifd_send_channels_created();
         migrate_fd_cleanup(s);
         return;
     }
diff --git a/migration/multifd.c b/migration/multifd.c
index 564e911b6c..f0c216f4f9 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -538,6 +538,7 @@ void multifd_save_cleanup(void)
         multifd_send_channel_destroy(p->c);
         p->c = NULL;
         qemu_mutex_destroy(&p->mutex);
+        qemu_sem_destroy(&p->create_sem);
         qemu_sem_destroy(&p->sem);
         qemu_sem_destroy(&p->sem_sync);
         g_free(p->name);
@@ -766,6 +767,29 @@ out:
     return NULL;
 }
 
+int multifd_send_channels_created(void)
+{
+    int ret = 0;
+
+    if (!migrate_multifd()) {
+        return 0;
+    }
+
+    for (int i = 0; i < migrate_multifd_channels(); i++) {
+        MultiFDSendParams *p = &multifd_send_state->params[i];
+
+        qemu_sem_wait(&p->create_sem);
+        WITH_QEMU_LOCK_GUARD(&p->mutex) {
+            if (p->quit) {
+                error_report("%s: channel %d has already quit", __func__, i);
+                ret = -1;
+            }
+        }
+    }
+
+    return ret;
+}
+
 static bool multifd_channel_connect(MultiFDSendParams *p,
                                     QIOChannel *ioc,
                                     Error **errp);
@@ -794,6 +818,7 @@ static void multifd_tls_outgoing_handshake(QIOTask *task,
     p->quit = true;
     qemu_sem_post(&multifd_send_state->channels_ready);
     qemu_sem_post(&p->sem_sync);
+    qemu_sem_post(&p->create_sem);
     error_free(err);
 }
 
@@ -857,6 +882,7 @@ static bool multifd_channel_connect(MultiFDSendParams *p,
     qemu_thread_create(&p->thread, p->name, multifd_send_thread, p,
                        QEMU_THREAD_JOINABLE);
     p->running = true;
+    qemu_sem_post(&p->create_sem);
     return true;
 }
 
@@ -864,15 +890,16 @@ static void multifd_new_send_channel_cleanup(MultiFDSendParams *p,
                                              QIOChannel *ioc, Error *err)
 {
      migrate_set_error(migrate_get_current(), err);
-     /* Error happen, we need to tell who pay attention to me */
-     qemu_sem_post(&multifd_send_state->channels_ready);
-     qemu_sem_post(&p->sem_sync);
      /*
+      * Error happen, we need to tell who pay attention to me.
       * Although multifd_send_thread is not created, but main migration
       * thread need to judge whether it is running, so we need to mark
       * its status.
       */
      p->quit = true;
+     qemu_sem_post(&multifd_send_state->channels_ready);
+     qemu_sem_post(&p->sem_sync);
+     qemu_sem_post(&p->create_sem);
      object_unref(OBJECT(ioc));
      error_free(err);
 }
@@ -921,6 +948,7 @@ int multifd_save_setup(Error **errp)
         MultiFDSendParams *p = &multifd_send_state->params[i];
 
         qemu_mutex_init(&p->mutex);
+        qemu_sem_init(&p->create_sem, 0);
         qemu_sem_init(&p->sem, 0);
         qemu_sem_init(&p->sem_sync, 0);
         p->quit = false;
diff --git a/migration/ram.c b/migration/ram.c
index c0cdcccb75..b3e864a22b 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -2937,6 +2937,13 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
     RAMBlock *block;
     int ret;
 
+    bql_unlock();
+    ret = multifd_send_channels_created();
+    bql_lock();
+    if (ret < 0) {
+        return ret;
+    }
+
     if (compress_threads_save_setup()) {
         return -1;
     }
-- 
2.26.3



  parent reply	other threads:[~2024-01-25 16:28 UTC|newest]

Thread overview: 37+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-01-25 16:25 [PATCH 00/17] migration: Add new migration channel connect and TLS upgrade APIs Avihai Horon
2024-01-25 16:25 ` [PATCH 01/17] migration: Fix logic of channels and transport compatibility check Avihai Horon
2024-01-26  3:09   ` Peter Xu
2024-01-25 16:25 ` [PATCH 02/17] migration: Move local_err check in migration_ioc_process_incoming() Avihai Horon
2024-01-26  3:10   ` Peter Xu
2024-01-25 16:25 ` [PATCH 03/17] migration: Rename default_channel to main_channel Avihai Horon
2024-01-26  3:11   ` Peter Xu
2024-01-25 16:25 ` [PATCH 04/17] migration/multifd: Set p->running = true in the right place Avihai Horon
2024-01-25 20:57   ` Fabiano Rosas
2024-01-28 15:43     ` Avihai Horon
2024-01-29  4:17       ` Peter Xu
2024-01-29 12:20         ` Avihai Horon
2024-01-30  5:57           ` Peter Xu
2024-01-30 18:44             ` Avihai Horon
2024-02-06 10:25               ` Peter Xu
2024-02-08 15:31                 ` Avihai Horon
2024-01-29 12:23         ` Fabiano Rosas
2024-01-25 16:25 ` Avihai Horon [this message]
2024-01-29 14:34   ` [PATCH 05/17] migration/multifd: Wait for multifd channels creation before proceeding Fabiano Rosas
2024-01-30 18:32     ` Avihai Horon
2024-01-30 21:32       ` Fabiano Rosas
2024-01-31  4:49         ` Peter Xu
2024-01-31 10:39         ` Avihai Horon
2024-01-25 16:25 ` [PATCH 06/17] migration/tls: Rename main migration channel TLS functions Avihai Horon
2024-01-25 16:25 ` [PATCH 07/17] migration/tls: Add new migration channel TLS upgrade API Avihai Horon
2024-01-25 16:25 ` [PATCH 08/17] migration: Use the new TLS upgrade API for main channel Avihai Horon
2024-01-25 16:25 ` [PATCH 09/17] migration/multifd: Use the new TLS upgrade API for multifd channels Avihai Horon
2024-01-25 16:25 ` [PATCH 10/17] migration/postcopy: Use the new TLS upgrade API for preempt channel Avihai Horon
2024-01-25 16:25 ` [PATCH 11/17] migration/tls: Make migration_tls_client_create() static Avihai Horon
2024-01-25 16:25 ` [PATCH 12/17] migration/multifd: Consolidate TLS/non-TLS multifd channel error flow Avihai Horon
2024-01-25 16:25 ` [PATCH 13/17] migration: Store MigrationAddress in MigrationState Avihai Horon
2024-01-25 16:25 ` [PATCH 14/17] migration: Rename migration_channel_connect() Avihai Horon
2024-01-25 16:25 ` [PATCH 15/17] migration: Add new migration channel connect API Avihai Horon
2024-01-25 16:25 ` [PATCH 16/17] migration/multifd: Use the new migration channel connect API for multifd Avihai Horon
2024-01-25 16:25 ` [PATCH 17/17] migration/postcopy: Use the new migration channel connect API for postcopy preempt Avihai Horon
2024-02-06 10:04 ` [PATCH 00/17] migration: Add new migration channel connect and TLS upgrade APIs Peter Xu
2024-02-06 13:10   ` Avihai Horon

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20240125162528.7552-6-avihaih@nvidia.com \
    --to=avihaih@nvidia.com \
    --cc=farosas@suse.de \
    --cc=peterx@redhat.com \
    --cc=qemu-devel@nongnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).