All of lore.kernel.org
 help / color / mirror / Atom feed
From: Peter Xu <peterx@redhat.com>
To: Shivam Kumar <shivam.kumar1@nutanix.com>
Cc: Fabiano Rosas <farosas@suse.de>,
	"qemu-devel@nongnu.org" <qemu-devel@nongnu.org>
Subject: Re: [RFC PATCH] Fix race in live migration failure path
Date: Wed, 22 Jan 2025 11:40:10 -0500	[thread overview]
Message-ID: <Z5EfapuXuV7oFL7A@x1n> (raw)
In-Reply-To: <0C92F4E5-56EE-4036-927C-2F06F9F29252@nutanix.com>

Hi, Shivam,

On Wed, Jan 22, 2025 at 10:54:17AM +0000, Shivam Kumar wrote:
> There is one place where we set the migration status to FAILED but we don’t set
> s->error, i.e. in save_snapshot workflow, please check qemu_savevm_state but
> not sure setting s->error in this case is possible (or required), as it seems a
> different workflow to me.

Yes it's very different indeed.  I may not yet fully get the challenge on
how switching to s->error (implies FAILING) would affect this one, though.
I mean, in qemu_savevm_state() it tries to fetch qemufile errors with:

    ret = qemu_file_get_error(f);

IIUC we could also try to fetch s->error just like what migration thread
does, and if it sets means it's failing?

> 
> In addition, one potentially real problem that I see is this comment in
> migration_detect_error:
> /*
>  * For postcopy, we allow the network to be down for a
>  * while. After that, it can be continued by a
>  * recovery phase.
>  */
> Let's say if we set s->error at some place and there was a file error on either
> source or destination (qemu_file_get_error_obj_any returns a positive value

This is trivial, but I suppose you meant s/positive/negative/ here.. as
qemufile's last_error should always be negative, iiuc.

> when called by migration_detect_error). We expect migration to fail in this
> case but migration will continue to run since post-copy migration is tolerant
> to file errors?

Yes it can halt at postcopy_pause().  I'm not yet understand why it's an
issue to using s->error, though.

In general, I'm thinking whether we could also check s->error in detect
error path like this:

===8<===
diff --git a/migration/migration.c b/migration/migration.c
index 2d1da917c7..fbd97395e0 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -3015,17 +3015,17 @@ static MigThrError migration_detect_error(MigrationState *s)
     ret = qemu_file_get_error_obj_any(s->to_dst_file,
                                       s->postcopy_qemufile_src,
                                       &local_error);
-    if (!ret) {
-        /* Everything is fine */
-        assert(!local_error);
-        return MIG_THR_ERR_NONE;
-    }
-
-    if (local_error) {
+    if (ret) {
+        /* Passover qemufile errors to s->error */
+        assert(local_error);
         migrate_set_error(s, local_error);
         error_free(local_error);
     }
 
+    if (!migrate_has_error(s)) {
+        return MIG_THR_ERR_NONE;
+    }
+
     if (state == MIGRATION_STATUS_POSTCOPY_ACTIVE && ret) {
         /*
          * For postcopy, we allow the network to be down for a
@@ -3037,6 +3037,8 @@ static MigThrError migration_detect_error(MigrationState *s)
         /*
          * For precopy (or postcopy with error outside IO), we fail
          * with no time.
+         *
+         * TODO: update FAILED only until the end of migration in BH.
          */
         migrate_set_state(&s->state, state, MIGRATION_STATUS_FAILED);
         trace_migration_thread_file_err();
===8<===

I kept a TODO above, I would hope if you reworked everything to route
errors to s->error, then we can move this to the cleanup BH to avoid the
race.

Do you think that could work?

Thanks,

-- 
Peter Xu



  reply	other threads:[~2025-01-22 16:41 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-01-10 10:07 [RFC PATCH] Fix race in live migration failure path Shivam Kumar
2025-01-10 13:09 ` Fabiano Rosas
2025-01-13 16:29   ` Peter Xu
2025-01-22 10:54     ` Shivam Kumar
2025-01-22 16:40       ` Peter Xu [this message]
2025-01-23  9:53         ` Shivam Kumar
2025-01-23 16:27           ` Peter Xu
2025-01-23 18:05             ` Shivam Kumar
2025-01-23 20:40               ` Peter Xu

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Z5EfapuXuV7oFL7A@x1n \
    --to=peterx@redhat.com \
    --cc=farosas@suse.de \
    --cc=qemu-devel@nongnu.org \
    --cc=shivam.kumar1@nutanix.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.