From: Fabiano Rosas <farosas@suse.de>
To: Peter Xu <peterx@redhat.com>, qemu-devel@nongnu.org
Cc: peterx@redhat.com, Juan Quintela <quintela@redhat.com>
Subject: Re: [PATCH v3 10/10] tests/migration-test: Add a test for postcopy hangs during RECOVER
Date: Thu, 05 Oct 2023 10:24:54 -0300
Message-ID: <87edi9fbh5.fsf@suse.de>
In-Reply-To: <20231004220240.167175-11-peterx@redhat.com>

Peter Xu <peterx@redhat.com> writes:

> From: Fabiano Rosas <farosas@suse.de>
>
> To do so, create two socket pairs that carry no real data.  Feed those
> fake sockets to the src/dst QEMUs for recovery so that they enter the
> RECOVER stage and stay there.  Test that we can always kick them out and
> recover again with the right ports.
>
> This patch is based on Fabiano's version here:
>
> https://lore.kernel.org/r/877cowmdu0.fsf@suse.de
>
> Signed-off-by: Fabiano Rosas <farosas@suse.de>
> [peterx: write commit message, remove case 1, fix bugs, and more]
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  tests/qtest/migration-test.c | 94 ++++++++++++++++++++++++++++++++++++
>  1 file changed, 94 insertions(+)
>
> diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
> index 46f1c275a2..fb7a3765e4 100644
> --- a/tests/qtest/migration-test.c
> +++ b/tests/qtest/migration-test.c
> @@ -729,6 +729,7 @@ typedef struct {
>      /* Postcopy specific fields */
>      void *postcopy_data;
>      bool postcopy_preempt;
> +    bool postcopy_recovery_test_fail;
>  } MigrateCommon;
>  
>  static int test_migrate_start(QTestState **from, QTestState **to,
> @@ -1381,6 +1382,78 @@ static void test_postcopy_preempt_tls_psk(void)
>  }
>  #endif
>  
> +static void wait_for_postcopy_status(QTestState *one, const char *status)
> +{
> +    wait_for_migration_status(one, status,
> +                              (const char * []) { "failed", "active",
> +                                                  "completed", NULL });
> +}
> +
> +static void postcopy_recover_fail(QTestState *from, QTestState *to)
> +{
> +    int ret, pair1[2], pair2[2];
> +    char c;
> +
> +    /* Create two unrelated socketpairs */
> +    ret = qemu_socketpair(PF_LOCAL, SOCK_STREAM, 0, pair1);
> +    g_assert_cmpint(ret, ==, 0);
> +
> +    ret = qemu_socketpair(PF_LOCAL, SOCK_STREAM, 0, pair2);
> +    g_assert_cmpint(ret, ==, 0);
> +
> +    /*
> +     * Give the guests unpaired ends of the sockets, so they'll all block
> +     * at reading.  This mimics a wrong channel being established.
> +     */
> +    qtest_qmp_fds_assert_success(from, &pair1[0], 1,
> +                                 "{ 'execute': 'getfd',"
> +                                 "  'arguments': { 'fdname': 'fd-mig' }}");
> +    qtest_qmp_fds_assert_success(to, &pair2[0], 1,
> +                                 "{ 'execute': 'getfd',"
> +                                 "  'arguments': { 'fdname': 'fd-mig' }}");
> +
> +    /*
> +     * Write the 1st byte as QEMU_VM_COMMAND (0x8) for the dest socket, to
> +     * emulate the 1st byte of a real recovery, but stop there to
> +     * keep dest QEMU in RECOVER.  This is needed so that we can kick off
> +     * the recover process on dest QEMU (by triggering the G_IO_IN event).
> +     *
> +     * NOTE: this trick is not needed on src QEMUs, because src doesn't
> +     * rely on a pre-existing G_IO_IN event, so it will always trigger the
> +     * upcoming recovery anyway even if it can read nothing.
> +     */
> +#define QEMU_VM_COMMAND              0x08
> +    c = QEMU_VM_COMMAND;
> +    ret = send(pair2[1], &c, 1, 0);
> +    g_assert_cmpint(ret, ==, 1);
> +
> +    migrate_recover(to, "fd:fd-mig");
> +    migrate_qmp(from, "fd:fd-mig", "{'resume': true}");
> +
> +    /*
> +     * Make sure both QEMU instances will go into RECOVER stage, then test
> +     * kicking them out using migrate-pause.
> +     */
> +    wait_for_postcopy_status(from, "postcopy-recover");
> +    wait_for_postcopy_status(to, "postcopy-recover");

Is this wait out of place? I think we're trying to resume too fast after
migrate_recover():

# {
#     "error": {
#         "class": "GenericError",
#         "desc": "Cannot resume if there is no paused migration"
#     }
# }

> +
> +    /*
> +     * This would be issued by the admin upon noticing the hang, we should
> +     * make sure we're able to kick this out.
> +     */
> +    migrate_pause(from);
> +    wait_for_postcopy_status(from, "postcopy-paused");
> +
> +    /* Do the same test on dest */
> +    migrate_pause(to);
> +    wait_for_postcopy_status(to, "postcopy-paused");
> +
> +    close(pair1[0]);
> +    close(pair1[1]);
> +    close(pair2[0]);
> +    close(pair2[1]);
> +}
> +
>  static void test_postcopy_recovery_common(MigrateCommon *args)
>  {
>      QTestState *from, *to;
> @@ -1420,6 +1493,15 @@ static void test_postcopy_recovery_common(MigrateCommon *args)
>                                (const char * []) { "failed", "active",
>                                                    "completed", NULL });
>  
> +    if (args->postcopy_recovery_test_fail) {
> +        /*
> +         * Test when a wrong socket is specified for recovery, and then
> +         * the ability to kick it out and continue with a correct socket.
> +         */
> +        postcopy_recover_fail(from, to);
> +        /* continue with a good recovery */
> +    }
> +
>      /*
>       * Create a new socket to emulate a new channel that is different
>       * from the broken migration channel; tell the destination to
> @@ -1459,6 +1541,15 @@ static void test_postcopy_recovery_compress(void)
>      test_postcopy_recovery_common(&args);
>  }
>  
> +static void test_postcopy_recovery_double_fail(void)
> +{
> +    MigrateCommon args = {
> +        .postcopy_recovery_test_fail = true,
> +    };
> +
> +    test_postcopy_recovery_common(&args);
> +}
> +
>  #ifdef CONFIG_GNUTLS
>  static void test_postcopy_recovery_tls_psk(void)
>  {
> @@ -2841,6 +2932,9 @@ int main(int argc, char **argv)
>              qtest_add_func("/migration/postcopy/recovery/compress/plain",
>                             test_postcopy_recovery_compress);
>          }
> +        qtest_add_func("/migration/postcopy/recovery/double-failures",
> +                       test_postcopy_recovery_double_fail);
> +
>      }
>  
>      qtest_add_func("/migration/bad_dest", test_baddest);
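
For anyone following the quoted hunk, the fake-channel trick used by
postcopy_recover_fail() can be tried outside QEMU. The block below is only
a minimal sketch under stated assumptions: it uses plain POSIX socketpair()
rather than qemu_socketpair(), and the helper names fd_has_data() and
fake_channel_demo() are hypothetical, not part of migration-test.c. It
shows that two unrelated pairs leave both "guest" ends with nothing to
read, and that one byte written to pair2[1] makes only the "dest" end
readable, which is what the patch relies on to raise G_IO_IN on dest.

```c
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Same value the quoted patch defines locally. */
#define QEMU_VM_COMMAND 0x08

/*
 * Hypothetical helper: peek at fd without blocking.
 * Returns 1 if a byte is pending, 0 if a read would block, -1 on error.
 */
static int fd_has_data(int fd)
{
    char c;
    ssize_t n = recv(fd, &c, 1, MSG_DONTWAIT | MSG_PEEK);

    if (n == 1) {
        return 1;
    }
    if (n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
        return 0;
    }
    return -1;
}

/*
 * Hypothetical demo: returns 0 if the sockets behave as the test expects,
 * -1 otherwise.
 */
static int fake_channel_demo(void)
{
    int pair1[2], pair2[2];
    char c = QEMU_VM_COMMAND;
    int ok;

    /* Two unrelated socketpairs, as in the quoted test. */
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, pair1) != 0) {
        return -1;
    }
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, pair2) != 0) {
        close(pair1[0]);
        close(pair1[1]);
        return -1;
    }

    /* "src" would get pair1[0], "dst" pair2[0]; neither has input yet. */
    ok = fd_has_data(pair1[0]) == 0 && fd_has_data(pair2[0]) == 0;

    /* One byte on pair2[1] makes only the "dst" end readable. */
    ok = ok && send(pair2[1], &c, 1, 0) == 1;
    ok = ok && fd_has_data(pair2[0]) == 1;
    ok = ok && fd_has_data(pair1[0]) == 0;

    close(pair1[0]);
    close(pair1[1]);
    close(pair2[0]);
    close(pair2[1]);

    return ok ? 0 : -1;
}
```

Calling fake_channel_demo() from a small main() should return 0 on a
system with AF_UNIX stream sockets. This says nothing about the ordering
question raised above; it only illustrates why both sides stay stuck in
RECOVER once they are handed these descriptors.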


