From: Fabiano Rosas <farosas@suse.de>
To: Peter Xu <peterx@redhat.com>, qemu-devel@nongnu.org
Cc: peterx@redhat.com, Juan Quintela <quintela@redhat.com>,
Xiaohui Li <xiaohli@redhat.com>
Subject: Re: [PATCH 9/9] migration/postcopy: Allow network to fail even during recovery
Date: Mon, 11 Sep 2023 21:31:51 -0300 [thread overview]
Message-ID: <877cowmdu0.fsf@suse.de> (raw)
In-Reply-To: <20230829214235.69309-10-peterx@redhat.com>
Peter Xu <peterx@redhat.com> writes:
Hi, sorry it took me so long to get to this.
> Normally the postcopy recover phase should only exist for a super short
> period, that's the duration when QEMU is trying to recover from an
> interrupted postcopy migration, during which handshake will be carried out
> for continuing the procedure with state changes from PAUSED -> RECOVER ->
> POSTCOPY_ACTIVE again.
>
> Here RECOVER phase should be super small, that happens right after the
> admin specified a new but working network link for QEMU to reconnect to
> dest QEMU.
>
> However there can still be case where the channel is broken in this small
> RECOVER window.
>
> If it happens, with current code there's no way the src QEMU can got kicked
> out of RECOVER stage. No way either to retry the recover in another channel
> when established.
>
> This patch allows the RECOVER phase to fail itself too - we're mostly
> ready, just some small things missing, e.g. properly kick the main
> migration thread out when sleeping on rp_sem when we found that we're at
> RECOVER stage. When this happens, it fails the RECOVER itself, and
> rollback to PAUSED stage. Then the user can retry another round of
> recovery.
>
> To make it even stronger, teach QMP command migrate-pause to explicitly
> kick src/dst QEMU out when needed, so even if for some reason the migration
> thread didn't got kicked out already by a failing rethrn-path thread, the
> admin can also kick it out.
>
> This will be an super, super corner case, but still try to cover that.
It would be nice to have a test for this. Being such a corner case, it
will be hard to keep this scenario working.
I wrote two tests[1] that do the recovery each using a different URI:
1) fd: using a freshly opened file,
2) fd: using a socketpair that simply has nothing on the other end.
These might be too far from the original bug, but it seems to exercise
some of the same paths:
Scenario 1:
/x86_64/migration/postcopy/recovery/fail-twice
the stacks are:
Thread 8 (Thread 0x7fffd5ffe700 (LWP 30282) "live_migration"):
qemu_sem_wait
ram_dirty_bitmap_sync_all
ram_resume_prepare
qemu_savevm_state_resume_prepare
postcopy_do_resume
postcopy_pause
migration_detect_error
migration_thread
Thread 7 (Thread 0x7fffd67ff700 (LWP 30281) "return path"):
qemu_sem_wait
postcopy_pause_return_path_thread
source_return_path_thread
This patch seems to fix it, although we cannot call qmp_migrate_recover
a second time because the mis state is now in RECOVER:
"Migrate recover can only be run when postcopy is paused."
Do we maybe need to return the state to PAUSED, or allow
qmp_migrate_recover to run in RECOVER, like you did on the src side?
Scenario 2:
/x86_64/migration/postcopy/recovery/fail-twice/rp
Thread 8 (Thread 0x7fffd5ffe700 (LWP 30456) "live_migration"):
qemu_sem_wait
ram_dirty_bitmap_sync_all
ram_resume_prepare
qemu_savevm_state_resume_prepare
postcopy_do_resume
postcopy_pause
migration_detect_error
migration_thread
Thread 7 (Thread 0x7fffd67ff700 (LWP 30455) "return path"):
recvmsg
qio_channel_socket_readv
qio_channel_readv_full
qio_channel_read
qemu_fill_buffer
qemu_peek_byte
qemu_get_byte
qemu_get_be16
source_return_path_thread
Here, with this patch the migration gets stuck unless we call
migrate_pause() one more time. After another round of migrate_pause +
recover, it finishes properly.
1- hacked-together test:
-->8--
From a34685c8795799350665a880cd2ddddbf53c5812 Mon Sep 17 00:00:00 2001
From: Fabiano Rosas <farosas@suse.de>
Date: Mon, 11 Sep 2023 20:45:33 -0300
Subject: [PATCH] test patch
---
tests/qtest/migration-test.c | 87 ++++++++++++++++++++++++++++++++++++
1 file changed, 87 insertions(+)
diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
index 1b43df5ca7..4d9d2209c1 100644
--- a/tests/qtest/migration-test.c
+++ b/tests/qtest/migration-test.c
@@ -695,6 +695,7 @@ typedef struct {
/* Postcopy specific fields */
void *postcopy_data;
bool postcopy_preempt;
+ int postcopy_recovery_method;
} MigrateCommon;
static int test_migrate_start(QTestState **from, QTestState **to,
@@ -1357,6 +1358,61 @@ static void test_postcopy_preempt_tls_psk(void)
}
#endif
+static void postcopy_recover_fail(QTestState *from, QTestState *to, int method)
+{
+ int src, dst;
+
+ if (method == 1) {
+ /* give it some random fd to recover */
+ g_autofree char *uri = g_strdup_printf("%s/noop", tmpfs);
+ src = dst = open(uri, O_CREAT|O_RDWR);
+ } else if (method == 2) {
+ int ret, pair1[2], pair2[2];
+
+ /* create two unrelated socketpairs */
+ ret = qemu_socketpair(PF_LOCAL, SOCK_STREAM, 0, pair1);
+ g_assert_cmpint(ret, ==, 0);
+
+ ret = qemu_socketpair(PF_LOCAL, SOCK_STREAM, 0, pair2);
+ g_assert_cmpint(ret, ==, 0);
+
+ /* give the guests unpaired ends of the sockets */
+ src = pair1[0];
+ dst = pair2[0];
+ }
+
+ qtest_qmp_fds_assert_success(to, &src, 1,
+ "{ 'execute': 'getfd',"
+ " 'arguments': { 'fdname': 'fd-mig' }}");
+
+ qtest_qmp_fds_assert_success(from, &dst, 1,
+ "{ 'execute': 'getfd',"
+ " 'arguments': { 'fdname': 'fd-mig' }}");
+
+ migrate_recover(to, "fd:fd-mig");
+
+ wait_for_migration_status(from, "postcopy-paused",
+ (const char * []) { "failed", "active",
+ "completed", NULL });
+ migrate_qmp(from, "fd:fd-mig", "{'resume': true}");
+
+ printf("WAIT\n");
+ if (method == 2) {
+ /* This would be issued by the admin upon noticing the hang */
+ migrate_pause(from);
+ }
+
+ wait_for_migration_status(from, "postcopy-paused",
+ (const char * []) { "failed", "active",
+ "completed", NULL });
+ printf("PAUSED\n");
+
+ close(src);
+ if (method == 2) {
+ close(dst);
+ }
+}
+
static void test_postcopy_recovery_common(MigrateCommon *args)
{
QTestState *from, *to;
@@ -1396,6 +1452,13 @@ static void test_postcopy_recovery_common(MigrateCommon *args)
(const char * []) { "failed", "active",
"completed", NULL });
+ if (args->postcopy_recovery_method) {
+ /* fail the recovery */
+ postcopy_recover_fail(from, to, args->postcopy_recovery_method);
+
+ /* continue with a good recovery */
+ }
+
/*
* Create a new socket to emulate a new channel that is different
* from the broken migration channel; tell the destination to
@@ -1435,6 +1498,24 @@ static void test_postcopy_recovery_compress(void)
test_postcopy_recovery_common(&args);
}
+static void test_postcopy_recovery_fail(void)
+{
+ MigrateCommon args = {
+ .postcopy_recovery_method = 1,
+ };
+
+ test_postcopy_recovery_common(&args);
+}
+
+static void test_postcopy_recovery_fail_rp(void)
+{
+ MigrateCommon args = {
+ .postcopy_recovery_method = 2,
+ };
+
+ test_postcopy_recovery_common(&args);
+}
+
#ifdef CONFIG_GNUTLS
static void test_postcopy_recovery_tls_psk(void)
{
@@ -2825,6 +2906,12 @@ int main(int argc, char **argv)
qtest_add_func("/migration/postcopy/recovery/compress/plain",
test_postcopy_recovery_compress);
}
+ qtest_add_func("/migration/postcopy/recovery/fail-twice",
+ test_postcopy_recovery_fail);
+
+ qtest_add_func("/migration/postcopy/recovery/fail-twice/rp",
+ test_postcopy_recovery_fail_rp);
+
}
qtest_add_func("/migration/bad_dest", test_baddest);
--
2.35.3
next prev parent reply other threads:[~2023-09-12 0:32 UTC|newest]
Thread overview: 19+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-08-29 21:42 [PATCH 0/9] migration: Better error handling in rp thread, allow failures in recover Peter Xu
2023-08-29 21:42 ` [PATCH 1/9] migration: Display error in query-migrate irrelevant of status Peter Xu
2023-08-29 21:42 ` [PATCH 2/9] migration: Let migrate_set_error() take ownership Peter Xu
2023-09-12 19:40 ` Fabiano Rosas
2023-09-12 20:14 ` Peter Xu
2023-08-29 21:42 ` [PATCH 3/9] migration: Introduce migrate_has_error() Peter Xu
2023-08-29 21:42 ` [PATCH 4/9] migration: Refactor error handling in source return path Peter Xu
2023-08-29 21:42 ` [PATCH 5/9] migration: Deliver return path file error to migrate state too Peter Xu
2023-08-29 21:42 ` [PATCH 6/9] qemufile: Always return a verbose error Peter Xu
2023-08-29 21:42 ` [PATCH 7/9] migration: Remember num of ramblocks to sync during recovery Peter Xu
2023-09-12 0:33 ` Fabiano Rosas
2023-08-29 21:42 ` [PATCH 8/9] migration: Add migration_rp_wait|kick() Peter Xu
2023-09-12 0:32 ` Fabiano Rosas
2023-08-29 21:42 ` [PATCH 9/9] migration/postcopy: Allow network to fail even during recovery Peter Xu
2023-09-12 0:31 ` Fabiano Rosas [this message]
2023-09-12 20:05 ` Peter Xu
2023-09-12 22:16 ` Peter Xu
2023-09-12 22:49 ` Fabiano Rosas
2023-09-13 0:38 ` Peter Xu
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=877cowmdu0.fsf@suse.de \
--to=farosas@suse.de \
--cc=peterx@redhat.com \
--cc=qemu-devel@nongnu.org \
--cc=quintela@redhat.com \
--cc=xiaohli@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).