All of lore.kernel.org
 help / color / mirror / Atom feed
From: Fabiano Rosas <farosas@suse.de>
To: Peter Xu <peterx@redhat.com>, qemu-devel@nongnu.org
Cc: peterx@redhat.com, Juan Quintela <quintela@redhat.com>,
	Xiaohui Li <xiaohli@redhat.com>
Subject: Re: [PATCH 9/9] migration/postcopy: Allow network to fail even during recovery
Date: Mon, 11 Sep 2023 21:31:51 -0300	[thread overview]
Message-ID: <877cowmdu0.fsf@suse.de> (raw)
In-Reply-To: <20230829214235.69309-10-peterx@redhat.com>

Peter Xu <peterx@redhat.com> writes:

Hi, sorry it took me so long to get to this.

> Normally the postcopy recover phase should only exist for a super short
> period, that's the duration when QEMU is trying to recover from an
> interrupted postcopy migration, during which handshake will be carried out
> for continuing the procedure with state changes from PAUSED -> RECOVER ->
> POSTCOPY_ACTIVE again.
>
> Here RECOVER phase should be super small, that happens right after the
> admin specified a new but working network link for QEMU to reconnect to
> dest QEMU.
>
> However there can still be case where the channel is broken in this small
> RECOVER window.
>
> If it happens, with current code there's no way the src QEMU can got kicked
> out of RECOVER stage. No way either to retry the recover in another channel
> when established.
>
> This patch allows the RECOVER phase to fail itself too - we're mostly
> ready, just some small things missing, e.g. properly kick the main
> migration thread out when sleeping on rp_sem when we found that we're at
> RECOVER stage.  When this happens, it fails the RECOVER itself, and
> rollback to PAUSED stage.  Then the user can retry another round of
> recovery.
>
> To make it even stronger, teach QMP command migrate-pause to explicitly
> kick src/dst QEMU out when needed, so even if for some reason the migration
> thread didn't got kicked out already by a failing rethrn-path thread, the
> admin can also kick it out.
>
> This will be an super, super corner case, but still try to cover that.

It would be nice to have a test for this. Being such a corner case, it
will be hard to keep this scenario working.

I wrote two tests[1] that do the recovery each using a different URI:
1) fd: using a freshly opened file,
2) fd: using a socketpair that simply has nothing on the other end.

These might be too far from the original bug, but it seems to exercise
some of the same paths:

Scenario 1:
/x86_64/migration/postcopy/recovery/fail-twice

the stacks are:

Thread 8 (Thread 0x7fffd5ffe700 (LWP 30282) "live_migration"):
 qemu_sem_wait
 ram_dirty_bitmap_sync_all
 ram_resume_prepare
 qemu_savevm_state_resume_prepare
 postcopy_do_resume
 postcopy_pause
 migration_detect_error
 migration_thread

Thread 7 (Thread 0x7fffd67ff700 (LWP 30281) "return path"):
 qemu_sem_wait
 postcopy_pause_return_path_thread
 source_return_path_thread

This patch seems to fix it, although we cannot call qmp_migrate_recover
a second time because the mis state is now in RECOVER:

  "Migrate recover can only be run when postcopy is paused."

Do we maybe need to return the state to PAUSED, or allow
qmp_migrate_recover to run in RECOVER, like you did on the src side?


Scenario 2:
/x86_64/migration/postcopy/recovery/fail-twice/rp

Thread 8 (Thread 0x7fffd5ffe700 (LWP 30456) "live_migration"):
 qemu_sem_wait
 ram_dirty_bitmap_sync_all
 ram_resume_prepare
 qemu_savevm_state_resume_prepare
 postcopy_do_resume
 postcopy_pause
 migration_detect_error
 migration_thread

Thread 7 (Thread 0x7fffd67ff700 (LWP 30455) "return path"):
 recvmsg
 qio_channel_socket_readv
 qio_channel_readv_full
 qio_channel_read
 qemu_fill_buffer
 qemu_peek_byte
 qemu_get_byte
 qemu_get_be16
 source_return_path_thread

Here, with this patch the migration gets stuck unless we call
migrate_pause() one more time. After another round of migrate_pause +
recover, it finishes properly.


1- hacked-together test:
-->8--
From a34685c8795799350665a880cd2ddddbf53c5812 Mon Sep 17 00:00:00 2001
From: Fabiano Rosas <farosas@suse.de>
Date: Mon, 11 Sep 2023 20:45:33 -0300
Subject: [PATCH] test patch

---
 tests/qtest/migration-test.c | 87 ++++++++++++++++++++++++++++++++++++
 1 file changed, 87 insertions(+)

diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
index 1b43df5ca7..4d9d2209c1 100644
--- a/tests/qtest/migration-test.c
+++ b/tests/qtest/migration-test.c
@@ -695,6 +695,7 @@ typedef struct {
     /* Postcopy specific fields */
     void *postcopy_data;
     bool postcopy_preempt;
+    int postcopy_recovery_method;
 } MigrateCommon;
 
 static int test_migrate_start(QTestState **from, QTestState **to,
@@ -1357,6 +1358,61 @@ static void test_postcopy_preempt_tls_psk(void)
 }
 #endif
 
+static void postcopy_recover_fail(QTestState *from, QTestState *to, int method)
+{
+    int src, dst;
+
+    if (method == 1) {
+        /* give it some random fd to recover */
+        g_autofree char *uri = g_strdup_printf("%s/noop", tmpfs);
+        src = dst = open(uri, O_CREAT|O_RDWR);
+    } else if (method == 2) {
+        int ret, pair1[2], pair2[2];
+
+        /* create two unrelated socketpairs */
+        ret = qemu_socketpair(PF_LOCAL, SOCK_STREAM, 0, pair1);
+        g_assert_cmpint(ret, ==, 0);
+
+        ret = qemu_socketpair(PF_LOCAL, SOCK_STREAM, 0, pair2);
+        g_assert_cmpint(ret, ==, 0);
+
+        /* give the guests unpaired ends of the sockets */
+        src = pair1[0];
+        dst = pair2[0];
+    }
+
+    qtest_qmp_fds_assert_success(to, &src, 1,
+                                 "{ 'execute': 'getfd',"
+                                 "  'arguments': { 'fdname': 'fd-mig' }}");
+
+    qtest_qmp_fds_assert_success(from, &dst, 1,
+                                 "{ 'execute': 'getfd',"
+                                 "  'arguments': { 'fdname': 'fd-mig' }}");
+
+    migrate_recover(to, "fd:fd-mig");
+
+    wait_for_migration_status(from, "postcopy-paused",
+                              (const char * []) { "failed", "active",
+                                  "completed", NULL });
+    migrate_qmp(from, "fd:fd-mig", "{'resume': true}");
+
+    printf("WAIT\n");
+    if (method == 2) {
+        /* This would be issued by the admin upon noticing the hang */
+        migrate_pause(from);
+    }
+
+    wait_for_migration_status(from, "postcopy-paused",
+                              (const char * []) { "failed", "active",
+                                  "completed", NULL });
+    printf("PAUSED\n");
+
+    close(src);
+    if (method == 2) {
+        close(dst);
+    }
+}
+
 static void test_postcopy_recovery_common(MigrateCommon *args)
 {
     QTestState *from, *to;
@@ -1396,6 +1452,13 @@ static void test_postcopy_recovery_common(MigrateCommon *args)
                               (const char * []) { "failed", "active",
                                                   "completed", NULL });
 
+    if (args->postcopy_recovery_method) {
+        /* fail the recovery */
+        postcopy_recover_fail(from, to, args->postcopy_recovery_method);
+
+        /* continue with a good recovery */
+    }
+
     /*
      * Create a new socket to emulate a new channel that is different
      * from the broken migration channel; tell the destination to
@@ -1435,6 +1498,24 @@ static void test_postcopy_recovery_compress(void)
     test_postcopy_recovery_common(&args);
 }
 
+static void test_postcopy_recovery_fail(void)
+{
+    MigrateCommon args = {
+        .postcopy_recovery_method = 1,
+    };
+
+    test_postcopy_recovery_common(&args);
+}
+
+static void test_postcopy_recovery_fail_rp(void)
+{
+    MigrateCommon args = {
+        .postcopy_recovery_method = 2,
+    };
+
+    test_postcopy_recovery_common(&args);
+}
+
 #ifdef CONFIG_GNUTLS
 static void test_postcopy_recovery_tls_psk(void)
 {
@@ -2825,6 +2906,12 @@ int main(int argc, char **argv)
             qtest_add_func("/migration/postcopy/recovery/compress/plain",
                            test_postcopy_recovery_compress);
         }
+        qtest_add_func("/migration/postcopy/recovery/fail-twice",
+                       test_postcopy_recovery_fail);
+
+        qtest_add_func("/migration/postcopy/recovery/fail-twice/rp",
+                       test_postcopy_recovery_fail_rp);
+
     }
 
     qtest_add_func("/migration/bad_dest", test_baddest);
-- 
2.35.3



  reply	other threads:[~2023-09-12  0:32 UTC|newest]

Thread overview: 19+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-08-29 21:42 [PATCH 0/9] migration: Better error handling in rp thread, allow failures in recover Peter Xu
2023-08-29 21:42 ` [PATCH 1/9] migration: Display error in query-migrate irrelevant of status Peter Xu
2023-08-29 21:42 ` [PATCH 2/9] migration: Let migrate_set_error() take ownership Peter Xu
2023-09-12 19:40   ` Fabiano Rosas
2023-09-12 20:14     ` Peter Xu
2023-08-29 21:42 ` [PATCH 3/9] migration: Introduce migrate_has_error() Peter Xu
2023-08-29 21:42 ` [PATCH 4/9] migration: Refactor error handling in source return path Peter Xu
2023-08-29 21:42 ` [PATCH 5/9] migration: Deliver return path file error to migrate state too Peter Xu
2023-08-29 21:42 ` [PATCH 6/9] qemufile: Always return a verbose error Peter Xu
2023-08-29 21:42 ` [PATCH 7/9] migration: Remember num of ramblocks to sync during recovery Peter Xu
2023-09-12  0:33   ` Fabiano Rosas
2023-08-29 21:42 ` [PATCH 8/9] migration: Add migration_rp_wait|kick() Peter Xu
2023-09-12  0:32   ` Fabiano Rosas
2023-08-29 21:42 ` [PATCH 9/9] migration/postcopy: Allow network to fail even during recovery Peter Xu
2023-09-12  0:31   ` Fabiano Rosas [this message]
2023-09-12 20:05     ` Peter Xu
2023-09-12 22:16       ` Peter Xu
2023-09-12 22:49       ` Fabiano Rosas
2023-09-13  0:38         ` Peter Xu

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=877cowmdu0.fsf@suse.de \
    --to=farosas@suse.de \
    --cc=peterx@redhat.com \
    --cc=qemu-devel@nongnu.org \
    --cc=quintela@redhat.com \
    --cc=xiaohli@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.