* [PATCH v2 0/6] migration/block: disk activation rewrite
@ 2024-12-06 23:08 Peter Xu
  2024-12-06 23:08 ` [PATCH v2 1/6] migration: Add helper to get target runstate Peter Xu
                   ` (6 more replies)
  0 siblings, 7 replies; 14+ messages in thread
From: Peter Xu @ 2024-12-06 23:08 UTC (permalink / raw)
  To: qemu-devel
  Cc: peterx, qemu-block, Vladimir Sementsov-Ogievskiy, Stefan Hajnoczi,
	Fabiano Rosas, Andrey Drobyshev, Eric Blake,
	Dr . David Alan Gilbert, Kevin Wolf, Daniel P . Berrangé

CI: https://gitlab.com/peterx/qemu/-/pipelines/1577280033
 (note: it's a pipeline of two patchsets, to save CI credits and time)

v1: https://lore.kernel.org/r/20241204005138.702289-1-peterx@redhat.com

This is v2 of the series, with the RFC tag dropped, because my goal is to
have these patches (or some newer version) merged.

The major change is that I merged the last three patches and made quite a
few changes here and there to make sure the global disk activation status
is always consistent.  The whole idea is still the same, so I don't think
a detailed changelog would help much.

I also temporarily dropped Fabiano's ping-pong test cases, to avoid having
different versions floating on the list (as I know a new version is coming
at some point.  Fabiano: you're taking over the 10.0 pulls, so I assume
you're aware and there's no concern about the order of merges).  I'll
review the test cases separately when they're ready, but this series is
still tested with that ping-pong test and it keeps working.

I started looking at this problem as a whole when reviewing Fabiano's
series, especially this patch (which addresses a QEMU crash [1]):

https://lore.kernel.org/r/20241125144612.16194-5-farosas@suse.de

The proposed patch could work, but it's unwanted to add such a side effect
to migration.  So I started to think about whether we can provide a
cleaner approach, because migration doesn't need the disks to be active to
work at all.  Hence we should try to avoid adding a migration ABI (which
doesn't matter now, but may matter some day) on disk activation status
into the prepare phase.  Migration should happen with the disks
inactivated.

It's also a pure wish that bdrv_inactivate_all() could be made benign when
called even if all disks are already inactive; then the bug would also be
gone.  After all, the similar call to bdrv_activate_all() on top of
all-active disks is fine.  I hope that wish is still fair, but I don't
know the block layer well enough to say anything meaningful there.
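
As an illustration of that wish, here is a minimal sketch of an idempotent
guard at the migration layer (essentially what migration_block_inactivate()
in patch 6 ends up doing; the names block_active / inactivate_once are only
for this sketch, and the flag is assumed to be BQL-protected):

    static bool block_active = true;   /* a freshly started QEMU owns its disks */

    static bool inactivate_once(void)
    {
        /* Skip the block layer call if the disks are already inactive */
        if (!block_active) {
            return true;
        }
        if (bdrv_inactivate_all()) {
            return false;   /* propagate the block layer failure */
        }
        block_active = false;
        return true;
    }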

And while I was looking at that, I found more disk activation handling
spread all over the place.  I decided to clean all of it up, while
hopefully fixing the QEMU crash [1] too.

For this v2 I did some more tests: I want to make sure all the past paths
keep working, at least in failure or cancel races, and also in postcopy
failure cases.  So I ran the tests below and they all pass (where I say
"emulated" below, I mean I hacked something in to trigger those races /
rare failures, because they aren't easy to trigger with a vanilla binary):

- Tested generic migrate_cancel during precopy; disk activation isn't
  affected, and the disk status reports correct values in tracepoints.

- Ran Fabiano's ping-pong migration tests on a PAUSED-state VM.

- Emulated a precopy failure before sending the non-iterable data: disk
  inactivation doesn't happen, and activation isn't triggered after the
  migration cleanups either (even though activation on top of an already
  active disk is benign, I checked the traces to make sure the disk status
  stays consistent, with activation being skipped).

- Emulated a precopy failure right after sending the non-iterable data:
  the disks are inactivated, but then get reactivated properly before the
  VM starts.

- Emulated a postcopy failure when sending the packaged data (which is
  after the disks are invalidated), making sure the src VM gets back to
  live properly and re-activates the disks before starting.

- Emulated a concurrent migrate_cancel at the end of migration_completion()
  in precopy, after the disks are inactivated.  The disks can be
  reactivated properly.

  NOTE: here, if the dest QEMU didn't quit before migrate_cancel,
  bdrv_activate_all() can crash the src QEMU.  This behavior is the same
  before and after this series.

Comments welcomed, thanks.

[1] https://gitlab.com/qemu-project/qemu/-/issues/2395

Peter Xu (6):
  migration: Add helper to get target runstate
  qmp/cont: Only activate disks if migration completed
  migration/block: Make late-block-active the default
  migration/block: Apply late-block-active behavior to postcopy
  migration/block: Fix possible race with block_inactive
  migration/block: Rewrite disk activation

 include/migration/misc.h |   4 ++
 migration/migration.h    |   6 +-
 migration/block-active.c |  94 +++++++++++++++++++++++++++
 migration/colo.c         |   2 +-
 migration/migration.c    | 136 +++++++++++++++------------------------
 migration/savevm.c       |  46 ++++++-------
 monitor/qmp-cmds.c       |  22 +++----
 migration/meson.build    |   1 +
 migration/trace-events   |   3 +
 9 files changed, 188 insertions(+), 126 deletions(-)
 create mode 100644 migration/block-active.c

-- 
2.47.0




* [PATCH v2 1/6] migration: Add helper to get target runstate
  2024-12-06 23:08 [PATCH v2 0/6] migration/block: disk activation rewrite Peter Xu
@ 2024-12-06 23:08 ` Peter Xu
  2024-12-06 23:08 ` [PATCH v2 2/6] qmp/cont: Only activate disks if migration completed Peter Xu
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 14+ messages in thread
From: Peter Xu @ 2024-12-06 23:08 UTC (permalink / raw)
  To: qemu-devel
  Cc: peterx, qemu-block, Vladimir Sementsov-Ogievskiy, Stefan Hajnoczi,
	Fabiano Rosas, Andrey Drobyshev, Eric Blake,
	Dr . David Alan Gilbert, Kevin Wolf, Daniel P . Berrangé

In 99% of cases, after QEMU migrates to the dest host, it tries to detect
the target VM runstate using global_state_get_runstate().

There's one outlier so far, which is Xen, that won't send the global
state.  That's the major reason why the global_state_received() check was
always there together with global_state_get_runstate().

However, it's utterly confusing why global_state_received() has anything
to do with "let's start the VM or not".

Provide a helper to explain it; then we have a unified entry point for
getting the target dest QEMU runstate after migration.
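
For illustration, the call sites then collapse from the two-condition
check into a single predicate (condensed from the diff below, with the
bodies shortened):

    /* Before: callers must handle the "no global state" (Xen) case: */
    if (!global_state_received() ||
        runstate_is_live(global_state_get_runstate())) {
        if (autostart) {
            vm_start();
        }
    }

    /* After: the Xen special case is hidden behind the helper: */
    if (runstate_is_live(migration_get_target_runstate())) {
        if (autostart) {
            vm_start();
        }
    }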

Suggested-by: Fabiano Rosas <farosas@suse.de>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/migration.c | 21 +++++++++++++++++----
 1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/migration/migration.c b/migration/migration.c
index 8c5bd0a75c..d2a6b939cf 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -135,6 +135,21 @@ static bool migration_needs_multiple_sockets(void)
     return migrate_multifd() || migrate_postcopy_preempt();
 }
 
+static RunState migration_get_target_runstate(void)
+{
+    /*
+     * When the global state is not migrated, it means we don't know the
+     * runstate of the src QEMU.  We don't have much choice but assuming
+     * the VM is running.  NOTE: this is pretty rare case, so far only Xen
+     * uses it.
+     */
+    if (!global_state_received()) {
+        return RUN_STATE_RUNNING;
+    }
+
+    return global_state_get_runstate();
+}
+
 static bool transport_supports_multi_channels(MigrationAddress *addr)
 {
     if (addr->transport == MIGRATION_ADDRESS_TYPE_SOCKET) {
@@ -735,8 +750,7 @@ static void process_incoming_migration_bh(void *opaque)
      * unless we really are starting the VM.
      */
     if (!migrate_late_block_activate() ||
-         (autostart && (!global_state_received() ||
-            runstate_is_live(global_state_get_runstate())))) {
+        (autostart && runstate_is_live(migration_get_target_runstate()))) {
         /* Make sure all file formats throw away their mutable metadata.
          * If we get an error here, just don't restart the VM yet. */
         bdrv_activate_all(&local_err);
@@ -759,8 +773,7 @@ static void process_incoming_migration_bh(void *opaque)
 
     dirty_bitmap_mig_before_vm_start();
 
-    if (!global_state_received() ||
-        runstate_is_live(global_state_get_runstate())) {
+    if (runstate_is_live(migration_get_target_runstate())) {
         if (autostart) {
             vm_start();
         } else {
-- 
2.47.0




* [PATCH v2 2/6] qmp/cont: Only activate disks if migration completed
  2024-12-06 23:08 [PATCH v2 0/6] migration/block: disk activation rewrite Peter Xu
  2024-12-06 23:08 ` [PATCH v2 1/6] migration: Add helper to get target runstate Peter Xu
@ 2024-12-06 23:08 ` Peter Xu
  2024-12-16 15:10   ` Fabiano Rosas
  2024-12-06 23:08 ` [PATCH v2 3/6] migration/block: Make late-block-active the default Peter Xu
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 14+ messages in thread
From: Peter Xu @ 2024-12-06 23:08 UTC (permalink / raw)
  To: qemu-devel
  Cc: peterx, qemu-block, Vladimir Sementsov-Ogievskiy, Stefan Hajnoczi,
	Fabiano Rosas, Andrey Drobyshev, Eric Blake,
	Dr . David Alan Gilbert, Kevin Wolf, Daniel P . Berrangé,
	Markus Armbruster

As the comment says, the activation of disks is for the case where
migration has completed, rather than when QEMU is still in the middle of
migration (RUN_STATE_INMIGRATE).

Move the code over to reflect what the comment is describing.

Cc: Kevin Wolf <kwolf@redhat.com>
Cc: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 monitor/qmp-cmds.c | 26 ++++++++++++++------------
 1 file changed, 14 insertions(+), 12 deletions(-)

diff --git a/monitor/qmp-cmds.c b/monitor/qmp-cmds.c
index f84a0dc523..76f21e8af3 100644
--- a/monitor/qmp-cmds.c
+++ b/monitor/qmp-cmds.c
@@ -96,21 +96,23 @@ void qmp_cont(Error **errp)
         }
     }
 
-    /* Continuing after completed migration. Images have been inactivated to
-     * allow the destination to take control. Need to get control back now.
-     *
-     * If there are no inactive block nodes (e.g. because the VM was just
-     * paused rather than completing a migration), bdrv_inactivate_all() simply
-     * doesn't do anything. */
-    bdrv_activate_all(&local_err);
-    if (local_err) {
-        error_propagate(errp, local_err);
-        return;
-    }
-
     if (runstate_check(RUN_STATE_INMIGRATE)) {
         autostart = 1;
     } else {
+        /*
+         * Continuing after completed migration. Images have been
+         * inactivated to allow the destination to take control. Need to
+         * get control back now.
+         *
+         * If there are no inactive block nodes (e.g. because the VM was
+         * just paused rather than completing a migration),
+         * bdrv_inactivate_all() simply doesn't do anything.
+         */
+        bdrv_activate_all(&local_err);
+        if (local_err) {
+            error_propagate(errp, local_err);
+            return;
+        }
         vm_start();
     }
 }
-- 
2.47.0




* [PATCH v2 3/6] migration/block: Make late-block-active the default
  2024-12-06 23:08 [PATCH v2 0/6] migration/block: disk activation rewrite Peter Xu
  2024-12-06 23:08 ` [PATCH v2 1/6] migration: Add helper to get target runstate Peter Xu
  2024-12-06 23:08 ` [PATCH v2 2/6] qmp/cont: Only activate disks if migration completed Peter Xu
@ 2024-12-06 23:08 ` Peter Xu
  2024-12-16 15:16   ` Fabiano Rosas
  2024-12-06 23:08 ` [PATCH v2 4/6] migration/block: Apply late-block-active behavior to postcopy Peter Xu
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 14+ messages in thread
From: Peter Xu @ 2024-12-06 23:08 UTC (permalink / raw)
  To: qemu-devel
  Cc: peterx, qemu-block, Vladimir Sementsov-Ogievskiy, Stefan Hajnoczi,
	Fabiano Rosas, Andrey Drobyshev, Eric Blake,
	Dr . David Alan Gilbert, Kevin Wolf, Daniel P . Berrangé

Migration capability 'late-block-active' controls when the block drives
will be activated.  If enabled, block drives will only be activated when
the VM starts: either right away, if the src runstate was "live" (RUNNING
or SUSPENDED), or postponed until qmp_cont().

Let's do this unconditionally.  There's no harm in delaying the activation
of block drives.  Meanwhile there's no ABI breakage if the dest does it,
because the src QEMU has nothing to do with it.

IIUC we could have avoided introducing this cap in the first place, but
it's still not too late to just always do it.  The cap is now a candidate
for removal, but that will be left for later patches.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/migration.c | 38 +++++++++++++++++++-------------------
 1 file changed, 19 insertions(+), 19 deletions(-)

diff --git a/migration/migration.c b/migration/migration.c
index d2a6b939cf..e6db9cfc50 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -743,24 +743,6 @@ static void process_incoming_migration_bh(void *opaque)
 
     trace_vmstate_downtime_checkpoint("dst-precopy-bh-enter");
 
-    /* If capability late_block_activate is set:
-     * Only fire up the block code now if we're going to restart the
-     * VM, else 'cont' will do it.
-     * This causes file locking to happen; so we don't want it to happen
-     * unless we really are starting the VM.
-     */
-    if (!migrate_late_block_activate() ||
-        (autostart && runstate_is_live(migration_get_target_runstate()))) {
-        /* Make sure all file formats throw away their mutable metadata.
-         * If we get an error here, just don't restart the VM yet. */
-        bdrv_activate_all(&local_err);
-        if (local_err) {
-            error_report_err(local_err);
-            local_err = NULL;
-            autostart = false;
-        }
-    }
-
     /*
      * This must happen after all error conditions are dealt with and
      * we're sure the VM is going to be running on this host.
@@ -775,7 +757,25 @@ static void process_incoming_migration_bh(void *opaque)
 
     if (runstate_is_live(migration_get_target_runstate())) {
         if (autostart) {
-            vm_start();
+            /*
+             * Block activation is always delayed until VM starts, either
+             * here (which means we need to start the dest VM right now..),
+             * or until qmp_cont() later.
+             *
+             * We used to have cap 'late-block-activate' but now we do this
+             * unconditionally, as it has no harm but only benefit.  E.g.,
+             * it's not part of migration ABI on the time of disk activation.
+             *
+             * Make sure all file formats throw away their mutable
+             * metadata.  If error, don't restart the VM yet.
+             */
+            bdrv_activate_all(&local_err);
+            if (local_err) {
+                error_report_err(local_err);
+                local_err = NULL;
+            } else {
+                vm_start();
+            }
         } else {
             runstate_set(RUN_STATE_PAUSED);
         }
-- 
2.47.0




* [PATCH v2 4/6] migration/block: Apply late-block-active behavior to postcopy
  2024-12-06 23:08 [PATCH v2 0/6] migration/block: disk activation rewrite Peter Xu
                   ` (2 preceding siblings ...)
  2024-12-06 23:08 ` [PATCH v2 3/6] migration/block: Make late-block-active the default Peter Xu
@ 2024-12-06 23:08 ` Peter Xu
  2024-12-16 15:17   ` Fabiano Rosas
  2024-12-06 23:08 ` [PATCH v2 5/6] migration/block: Fix possible race with block_inactive Peter Xu
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 14+ messages in thread
From: Peter Xu @ 2024-12-06 23:08 UTC (permalink / raw)
  To: qemu-devel
  Cc: peterx, qemu-block, Vladimir Sementsov-Ogievskiy, Stefan Hajnoczi,
	Fabiano Rosas, Andrey Drobyshev, Eric Blake,
	Dr . David Alan Gilbert, Kevin Wolf, Daniel P . Berrangé

Postcopy never cared about late-block-active.  However, there's no mention
in the capability that it doesn't apply to postcopy.

Considering that we _assumed_ late activation is always good, do it too
for postcopy unconditionally, just like precopy.  After this patch, we
should have unified the behavior across the board.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/savevm.c | 25 ++++++++++++-------------
 1 file changed, 12 insertions(+), 13 deletions(-)

diff --git a/migration/savevm.c b/migration/savevm.c
index 98821c8120..80726726fe 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -2137,22 +2137,21 @@ static void loadvm_postcopy_handle_run_bh(void *opaque)
 
     trace_vmstate_downtime_checkpoint("dst-postcopy-bh-announced");
 
-    /* Make sure all file formats throw away their mutable metadata.
-     * If we get an error here, just don't restart the VM yet. */
-    bdrv_activate_all(&local_err);
-    if (local_err) {
-        error_report_err(local_err);
-        local_err = NULL;
-        autostart = false;
-    }
-
-    trace_vmstate_downtime_checkpoint("dst-postcopy-bh-cache-invalidated");
-
     dirty_bitmap_mig_before_vm_start();
 
     if (autostart) {
-        /* Hold onto your hats, starting the CPU */
-        vm_start();
+        /*
+         * Make sure all file formats throw away their mutable metadata.
+         * If we get an error here, just don't restart the VM yet.
+         */
+        bdrv_activate_all(&local_err);
+        trace_vmstate_downtime_checkpoint("dst-postcopy-bh-cache-invalidated");
+        if (local_err) {
+            error_report_err(local_err);
+            local_err = NULL;
+        } else {
+            vm_start();
+        }
     } else {
         /* leave it paused and let management decide when to start the CPU */
         runstate_set(RUN_STATE_PAUSED);
-- 
2.47.0




* [PATCH v2 5/6] migration/block: Fix possible race with block_inactive
  2024-12-06 23:08 [PATCH v2 0/6] migration/block: disk activation rewrite Peter Xu
                   ` (3 preceding siblings ...)
  2024-12-06 23:08 ` [PATCH v2 4/6] migration/block: Apply late-block-active behavior to postcopy Peter Xu
@ 2024-12-06 23:08 ` Peter Xu
  2024-12-16 15:18   ` Fabiano Rosas
  2024-12-06 23:08 ` [PATCH v2 6/6] migration/block: Rewrite disk activation Peter Xu
  2024-12-17 15:26 ` [PATCH v2 0/6] migration/block: disk activation rewrite Fabiano Rosas
  6 siblings, 1 reply; 14+ messages in thread
From: Peter Xu @ 2024-12-06 23:08 UTC (permalink / raw)
  To: qemu-devel
  Cc: peterx, qemu-block, Vladimir Sementsov-Ogievskiy, Stefan Hajnoczi,
	Fabiano Rosas, Andrey Drobyshev, Eric Blake,
	Dr . David Alan Gilbert, Kevin Wolf, Daniel P . Berrangé

Src QEMU sets block_inactive=true very early, before the invalidation
takes place.  It means that if something goes wrong between setting the
flag and reaching qemu_savevm_state_complete_precopy_non_iterable(), where
the invalidation work is actually done, the block_inactive flag becomes
inconsistent.

For example, think about when qemu_savevm_state_complete_precopy_iterable()
can fail: it will leave block_inactive set to true even though all block
drives are still active.

Fix that by only updating the flag after the invalidation is done.

No Fixes tag for any commit, because it's not an issue as long as
bdrv_activate_all() is re-entrant upon all-active disks - a false positive
block_inactive brings nothing more than "trying to activate the blocks but
they're already active".  However, let's still do it right to avoid the
flag being inconsistent with reality.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/migration.c | 9 +++------
 migration/savevm.c    | 2 ++
 2 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/migration/migration.c b/migration/migration.c
index e6db9cfc50..ba5deec5bc 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -2749,14 +2749,11 @@ static int migration_completion_precopy(MigrationState *s,
         goto out_unlock;
     }
 
-    /*
-     * Inactivate disks except in COLO, and track that we have done so in order
-     * to remember to reactivate them if migration fails or is cancelled.
-     */
-    s->block_inactive = !migrate_colo();
     migration_rate_set(RATE_LIMIT_DISABLED);
+
+    /* Inactivate disks except in COLO */
     ret = qemu_savevm_state_complete_precopy(s->to_dst_file, false,
-                                             s->block_inactive);
+                                             !migrate_colo());
 out_unlock:
     bql_unlock();
     return ret;
diff --git a/migration/savevm.c b/migration/savevm.c
index 80726726fe..706b77ffab 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -1558,6 +1558,8 @@ int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f,
             qemu_file_set_error(f, ret);
             return ret;
         }
+        /* Remember that we did this */
+        s->block_inactive = true;
     }
     if (!in_postcopy) {
         /* Postcopy stream will still be going */
-- 
2.47.0




* [PATCH v2 6/6] migration/block: Rewrite disk activation
  2024-12-06 23:08 [PATCH v2 0/6] migration/block: disk activation rewrite Peter Xu
                   ` (4 preceding siblings ...)
  2024-12-06 23:08 ` [PATCH v2 5/6] migration/block: Fix possible race with block_inactive Peter Xu
@ 2024-12-06 23:08 ` Peter Xu
  2024-12-16 15:58   ` Fabiano Rosas
  2024-12-17 15:26 ` [PATCH v2 0/6] migration/block: disk activation rewrite Fabiano Rosas
  6 siblings, 1 reply; 14+ messages in thread
From: Peter Xu @ 2024-12-06 23:08 UTC (permalink / raw)
  To: qemu-devel
  Cc: peterx, qemu-block, Vladimir Sementsov-Ogievskiy, Stefan Hajnoczi,
	Fabiano Rosas, Andrey Drobyshev, Eric Blake,
	Dr . David Alan Gilbert, Kevin Wolf, Daniel P . Berrangé

This patch proposes a flag to maintain the disk activation status
globally.  It mostly rewrites the disk activation management for QEMU,
including COLO and the QMP command xen_save_devices_state.

Backgrounds
===========

We have two problems on disk activations, one resolved, one not.

Problem 1: disk activation recover (for switchover interruptions)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When migration is either cancelled or fails during switchover, especially
after the disks have been inactivated, QEMU needs to remember to
re-activate the disks before the VM starts.

It used to be done separately in two paths: one in qmp_migrate_cancel(),
the other one in the failure path of migration_completion().

Over time this was fixed in different commits, all over the place in QEMU.
These are the relevant changes I saw; I'm not sure whether the list is
complete:

 - In 2016, commit fe904ea824 ("migration: regain control of images when
   migration fails to complete")

 - In 2017, commit 1d2acc3162 ("migration: re-active images while migration
   been canceled after inactive them")

 - In 2023, commit 6dab4c93ec ("migration: Attempt disk reactivation in
   more failure scenarios")

Now that we have a slightly better picture, maybe we can unify the
reactivation into a single path.

One side benefit of doing so is that we can move the disk operation out of
the QMP command "migrate_cancel".  It's possible that in the future we may
want to make "migrate_cancel" OOB-compatible, which requires the command
to not need the BQL in the first place.  This already does that and makes
the migrate_cancel command lightweight.

Problem 2: disk invalidation on top of invalidated disks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This is an unresolved bug in current QEMU.  Link in "Resolves:" at the
end.  It turns out that besides the src switchover phase (problem 1
above), QEMU also needs to remember the block activation status on the
destination.

Consider two migrations in a row, where the VM stays paused the whole
time.  In that scenario, the disks are not activated even after migration
completes in the 1st round.  When the 2nd round starts, if QEMU doesn't
know the status of the disks, it needs to try to inactivate the disks
again.

Here the issue is that the block layer API bdrv_inactivate_all() will
crash QEMU if invoked on already inactive disks for the 2nd migration.
For details, see the bug link at the end.

Implementation
==============

This patch proposes to maintain disk activation with a global flag, so we
know:

  - If we used to inactivate disks for migration, but migration got
  cancelled or failed, QEMU will know it should reactivate the disks.

  - On the incoming side, if the disks were never activated but then
  another migration is triggered, QEMU should be able to tell that
  inactivation is not needed for the 2nd migration.

We used to have block_inactive, but it only solves the 1st issue, not the
2nd.  Also, it's maintained in completely separate paths, so it's
extremely hard to follow either how the flag changes, or for how long the
flag stays valid, and when we will reactivate the disks.

Convert the existing block_inactive flag into that global flag (also
invert its naming), and maintain the disk activation status for the whole
lifecycle of QEMU.  That includes the incoming QEMU.

Put both of the error cases of source migration (failure, cancelled)
together into migration_iteration_finish(), which will be invoked in
either scenario.  So from that part QEMU should behave the same as before.
However, with such global maintenance of the disk activation status, we
not only clean up quite a few ad-hoc paths that used to maintain the disk
activation status (e.g. in the postcopy code), but also fix the crash for
problem 2 in one shot.

For a freshly started QEMU, the flag is initialized to TRUE, showing that
this QEMU owns the disks by default.

For an incoming migrated QEMU, the flag will be initialized to FALSE once
and for all, showing that the dest QEMU doesn't own the disks until
switchover.  That is guaranteed by the "once" variable.
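
For readers skimming the diff, this is the intended call pattern in
condensed form (the comments note where each call lands in the hunks
below):

    /* Freshly started QEMU owns all the block devices
     * (migration_instance_init): */
    migration_block_active_setup(true);

    /* Newly set up incoming QEMU: the src still owns the disks
     * (qmp_migrate_incoming): */
    migration_block_active_setup(false);

    /* Switchover on src, postcopy_start, or xen_save_devices_state:
     * give up the disks; a no-op if they are already inactive: */
    if (!migration_block_inactivate()) {
        /* fail the migration */
    }

    /* Starting the VM on dest, qmp_cont, or recovering the src after a
     * failed/cancelled migration: take the disks back; a no-op if they
     * are already active: */
    migration_block_activate(NULL);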

Resolves: https://gitlab.com/qemu-project/qemu/-/issues/2395
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/migration/misc.h |  4 ++
 migration/migration.h    |  6 +--
 migration/block-active.c | 94 ++++++++++++++++++++++++++++++++++++++++
 migration/colo.c         |  2 +-
 migration/migration.c    | 80 ++++++++--------------------------
 migration/savevm.c       | 33 ++++++--------
 monitor/qmp-cmds.c       |  8 +---
 migration/meson.build    |  1 +
 migration/trace-events   |  3 ++
 9 files changed, 140 insertions(+), 91 deletions(-)
 create mode 100644 migration/block-active.c

diff --git a/include/migration/misc.h b/include/migration/misc.h
index 804eb23c06..e68a473feb 100644
--- a/include/migration/misc.h
+++ b/include/migration/misc.h
@@ -106,4 +106,8 @@ bool migration_incoming_postcopy_advised(void);
 /* True if background snapshot is active */
 bool migration_in_bg_snapshot(void);
 
+/* Wrapper for block active/inactive operations */
+bool migration_block_activate(Error **errp);
+bool migration_block_inactivate(void);
+
 #endif
diff --git a/migration/migration.h b/migration/migration.h
index 3857905c0e..fab3cad2b9 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -370,9 +370,6 @@ struct MigrationState {
     /* Flag set once the migration thread is running (and needs joining) */
     bool migration_thread_running;
 
-    /* Flag set once the migration thread called bdrv_inactivate_all */
-    bool block_inactive;
-
     /* Migration is waiting for guest to unplug device */
     QemuSemaphore wait_unplug_sem;
 
@@ -556,4 +553,7 @@ void migration_bitmap_sync_precopy(bool last_stage);
 /* migration/block-dirty-bitmap.c */
 void dirty_bitmap_mig_init(void);
 
+/* migration/block-active.c */
+void migration_block_active_setup(bool active);
+
 #endif
diff --git a/migration/block-active.c b/migration/block-active.c
new file mode 100644
index 0000000000..d477cf8182
--- /dev/null
+++ b/migration/block-active.c
@@ -0,0 +1,94 @@
+/*
+ * Block activation tracking for migration purpose
+ *
+ * SPDX-License-Identifier: GPL-2.0-or-later
+ *
+ * Copyright (C) 2024 Red Hat, Inc.
+ */
+#include "qemu/osdep.h"
+#include "block/block.h"
+#include "qapi/error.h"
+#include "migration/migration.h"
+#include "qemu/error-report.h"
+#include "trace.h"
+
+/*
+ * Migration-only cache to remember the block layer activation status.
+ * Protected by BQL.
+ *
+ * We need this because..
+ *
+ * - Migration can fail after block devices are invalidated (during
+ *   switchover phase).  When that happens, we need to be able to recover
+ *   the block drive status by re-activating them.
+ *
+ * - Currently bdrv_inactivate_all() is not safe to be invoked on top of
+ *   invalidated drives (even if bdrv_activate_all() is actually safe to be
+ *   called any time!).  It means remembering this could help migration to
+ *   make sure it won't invalidate twice in a row, crashing QEMU.  It can
+ *   happen when we migrate a PAUSED VM from host1 to host2, then migrate
+ *   again to host3 without starting it.  TODO: a cleaner solution is to
+ *   allow safe invoke of bdrv_inactivate_all() at anytime, like
+ *   bdrv_activate_all().
+ *
+ * For freshly started QEMU, the flag is initialized to TRUE reflecting the
+ * scenario where QEMU owns block device ownerships.
+ *
+ * For incoming QEMU taking a migration stream, the flag is initialized to
+ * FALSE reflecting that the incoming side doesn't own the block devices,
+ * not until switchover happens.
+ */
+static bool migration_block_active;
+
+/* Setup the disk activation status */
+void migration_block_active_setup(bool active)
+{
+    migration_block_active = active;
+}
+
+bool migration_block_activate(Error **errp)
+{
+    ERRP_GUARD();
+
+    assert(bql_locked());
+
+    if (migration_block_active) {
+        trace_migration_block_activation("active-skipped");
+        return true;
+    }
+
+    trace_migration_block_activation("active");
+
+    bdrv_activate_all(errp);
+    if (*errp) {
+        error_report_err(error_copy(*errp));
+        return false;
+    }
+
+    migration_block_active = true;
+    return true;
+}
+
+bool migration_block_inactivate(void)
+{
+    int ret;
+
+    assert(bql_locked());
+
+    if (!migration_block_active) {
+        trace_migration_block_activation("inactive-skipped");
+        return true;
+    }
+
+    trace_migration_block_activation("inactive");
+
+    ret = bdrv_inactivate_all();
+    if (ret) {
+        error_report("%s: bdrv_inactivate_all() failed: %d",
+                     __func__, ret);
+        return false;
+    }
+
+    migration_block_active = false;
+    return true;
+}
diff --git a/migration/colo.c b/migration/colo.c
index 9590f281d0..ae3387a7a4 100644
--- a/migration/colo.c
+++ b/migration/colo.c
@@ -836,7 +836,7 @@ static void *colo_process_incoming_thread(void *opaque)
 
     /* Make sure all file formats throw away their mutable metadata */
     bql_lock();
-    bdrv_activate_all(&local_err);
+    migration_block_activate(&local_err);
     bql_unlock();
     if (local_err) {
         error_report_err(local_err);
diff --git a/migration/migration.c b/migration/migration.c
index ba5deec5bc..d755ccb03d 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -738,7 +738,6 @@ static void qemu_start_incoming_migration(const char *uri, bool has_channels,
 
 static void process_incoming_migration_bh(void *opaque)
 {
-    Error *local_err = NULL;
     MigrationIncomingState *mis = opaque;
 
     trace_vmstate_downtime_checkpoint("dst-precopy-bh-enter");
@@ -769,11 +768,7 @@ static void process_incoming_migration_bh(void *opaque)
              * Make sure all file formats throw away their mutable
              * metadata.  If error, don't restart the VM yet.
              */
-            bdrv_activate_all(&local_err);
-            if (local_err) {
-                error_report_err(local_err);
-                local_err = NULL;
-            } else {
+            if (migration_block_activate(NULL)) {
                 vm_start();
             }
         } else {
@@ -1552,16 +1547,6 @@ static void migrate_fd_cancel(MigrationState *s)
             }
         }
     }
-    if (s->state == MIGRATION_STATUS_CANCELLING && s->block_inactive) {
-        Error *local_err = NULL;
-
-        bdrv_activate_all(&local_err);
-        if (local_err) {
-            error_report_err(local_err);
-        } else {
-            s->block_inactive = false;
-        }
-    }
 }
 
 void migration_add_notifier_mode(NotifierWithReturn *notify,
@@ -1860,6 +1845,12 @@ void qmp_migrate_incoming(const char *uri, bool has_channels,
         return;
     }
 
+    /*
+     * Newly setup incoming QEMU.  Mark the block active state to reflect
+     * that the src currently owns the disks.
+     */
+    migration_block_active_setup(false);
+
     once = false;
 }
 
@@ -2512,7 +2503,6 @@ static int postcopy_start(MigrationState *ms, Error **errp)
     QIOChannelBuffer *bioc;
     QEMUFile *fb;
     uint64_t bandwidth = migrate_max_postcopy_bandwidth();
-    bool restart_block = false;
     int cur_state = MIGRATION_STATUS_ACTIVE;
 
     if (migrate_postcopy_preempt()) {
@@ -2548,13 +2538,10 @@ static int postcopy_start(MigrationState *ms, Error **errp)
         goto fail;
     }
 
-    ret = bdrv_inactivate_all();
-    if (ret < 0) {
-        error_setg_errno(errp, -ret, "%s: Failed in bdrv_inactivate_all()",
-                         __func__);
+    if (!migration_block_inactivate()) {
+        error_setg(errp, "%s: Failed in bdrv_inactivate_all()", __func__);
         goto fail;
     }
-    restart_block = true;
 
     /*
      * Cause any non-postcopiable, but iterative devices to
@@ -2624,8 +2611,6 @@ static int postcopy_start(MigrationState *ms, Error **errp)
         goto fail_closefb;
     }
 
-    restart_block = false;
-
     /* Now send that blob */
     if (qemu_savevm_send_packaged(ms->to_dst_file, bioc->data, bioc->usage)) {
         error_setg(errp, "%s: Failed to send packaged data", __func__);
@@ -2670,17 +2655,7 @@ fail_closefb:
 fail:
     migrate_set_state(&ms->state, MIGRATION_STATUS_POSTCOPY_ACTIVE,
                           MIGRATION_STATUS_FAILED);
-    if (restart_block) {
-        /* A failure happened early enough that we know the destination hasn't
-         * accessed block devices, so we're safe to recover.
-         */
-        Error *local_err = NULL;
-
-        bdrv_activate_all(&local_err);
-        if (local_err) {
-            error_report_err(local_err);
-        }
-    }
+    migration_block_activate(NULL);
     migration_call_notifiers(ms, MIG_EVENT_PRECOPY_FAILED, NULL);
     bql_unlock();
     return -1;
@@ -2778,31 +2753,6 @@ static void migration_completion_postcopy(MigrationState *s)
     trace_migration_completion_postcopy_end_after_complete();
 }
 
-static void migration_completion_failed(MigrationState *s,
-                                        int current_active_state)
-{
-    if (s->block_inactive && (s->state == MIGRATION_STATUS_ACTIVE ||
-                              s->state == MIGRATION_STATUS_DEVICE)) {
-        /*
-         * If not doing postcopy, vm_start() will be called: let's
-         * regain control on images.
-         */
-        Error *local_err = NULL;
-
-        bql_lock();
-        bdrv_activate_all(&local_err);
-        if (local_err) {
-            error_report_err(local_err);
-        } else {
-            s->block_inactive = false;
-        }
-        bql_unlock();
-    }
-
-    migrate_set_state(&s->state, current_active_state,
-                      MIGRATION_STATUS_FAILED);
-}
-
 /**
  * migration_completion: Used by migration_thread when there's not much left.
  *   The caller 'breaks' the loop when this returns.
@@ -2856,7 +2806,8 @@ fail:
         error_free(local_err);
     }
 
-    migration_completion_failed(s, current_active_state);
+    migrate_set_state(&s->state, current_active_state,
+                      MIGRATION_STATUS_FAILED);
 }
 
 /**
@@ -3286,6 +3237,11 @@ static void migration_iteration_finish(MigrationState *s)
     case MIGRATION_STATUS_FAILED:
     case MIGRATION_STATUS_CANCELLED:
     case MIGRATION_STATUS_CANCELLING:
+        /*
+         * Re-activate the block drives if they're inactivated.  Note, COLO
+         * shouldn't use block_active at all, so it should be no-op there.
+         */
+        migration_block_activate(NULL);
         if (runstate_is_live(s->vm_old_state)) {
             if (!runstate_check(RUN_STATE_SHUTDOWN)) {
                 vm_start();
@@ -3858,6 +3814,8 @@ static void migration_instance_init(Object *obj)
     ms->state = MIGRATION_STATUS_NONE;
     ms->mbps = -1;
     ms->pages_per_second = -1;
+    /* Freshly started QEMU owns all the block devices */
+    migration_block_active_setup(true);
     qemu_sem_init(&ms->pause_sem, 0);
     qemu_mutex_init(&ms->error_mutex);
 
diff --git a/migration/savevm.c b/migration/savevm.c
index 706b77ffab..969a994a85 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -1547,19 +1547,18 @@ int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f,
     }
 
     if (inactivate_disks) {
-        /* Inactivate before sending QEMU_VM_EOF so that the
-         * bdrv_activate_all() on the other end won't fail. */
-        ret = bdrv_inactivate_all();
-        if (ret) {
-            error_setg(&local_err, "%s: bdrv_inactivate_all() failed (%d)",
-                       __func__, ret);
+        /*
+         * Inactivate before sending QEMU_VM_EOF so that the
+         * bdrv_activate_all() on the other end won't fail.
+         */
+        if (!migration_block_inactivate()) {
+            error_setg(&local_err, "%s: bdrv_inactivate_all() failed",
+                       __func__);
             migrate_set_error(ms, local_err);
             error_report_err(local_err);
-            qemu_file_set_error(f, ret);
+            qemu_file_set_error(f, -EFAULT);
             return ret;
         }
-        /* Remember that we did this */
-        s->block_inactive = true;
     }
     if (!in_postcopy) {
         /* Postcopy stream will still be going */
@@ -2123,7 +2122,6 @@ static int loadvm_postcopy_handle_listen(MigrationIncomingState *mis)
 
 static void loadvm_postcopy_handle_run_bh(void *opaque)
 {
-    Error *local_err = NULL;
     MigrationIncomingState *mis = opaque;
 
     trace_vmstate_downtime_checkpoint("dst-postcopy-bh-enter");
@@ -2146,12 +2144,11 @@ static void loadvm_postcopy_handle_run_bh(void *opaque)
          * Make sure all file formats throw away their mutable metadata.
          * If we get an error here, just don't restart the VM yet.
          */
-        bdrv_activate_all(&local_err);
+        bool success = migration_block_activate(NULL);
+
         trace_vmstate_downtime_checkpoint("dst-postcopy-bh-cache-invalidated");
-        if (local_err) {
-            error_report_err(local_err);
-            local_err = NULL;
-        } else {
+
+        if (success) {
             vm_start();
         }
     } else {
@@ -3193,11 +3190,7 @@ void qmp_xen_save_devices_state(const char *filename, bool has_live, bool live,
          * side of the migration take control of the images.
          */
         if (live && !saved_vm_running) {
-            ret = bdrv_inactivate_all();
-            if (ret) {
-                error_setg(errp, "%s: bdrv_inactivate_all() failed (%d)",
-                           __func__, ret);
-            }
+            migration_block_inactivate();
         }
     }
 
diff --git a/monitor/qmp-cmds.c b/monitor/qmp-cmds.c
index 76f21e8af3..6f76d9beaf 100644
--- a/monitor/qmp-cmds.c
+++ b/monitor/qmp-cmds.c
@@ -31,6 +31,7 @@
 #include "qapi/type-helpers.h"
 #include "hw/mem/memory-device.h"
 #include "hw/intc/intc.h"
+#include "migration/misc.h"
 
 NameInfo *qmp_query_name(Error **errp)
 {
@@ -103,13 +104,8 @@ void qmp_cont(Error **errp)
          * Continuing after completed migration. Images have been
          * inactivated to allow the destination to take control. Need to
          * get control back now.
-         *
-         * If there are no inactive block nodes (e.g. because the VM was
-         * just paused rather than completing a migration),
-         * bdrv_inactivate_all() simply doesn't do anything.
          */
-        bdrv_activate_all(&local_err);
-        if (local_err) {
+        if (!migration_block_activate(&local_err)) {
             error_propagate(errp, local_err);
             return;
         }
diff --git a/migration/meson.build b/migration/meson.build
index d53cf3417a..dac687ee3a 100644
--- a/migration/meson.build
+++ b/migration/meson.build
@@ -11,6 +11,7 @@ migration_files = files(
 
 system_ss.add(files(
   'block-dirty-bitmap.c',
+  'block-active.c',
   'channel.c',
   'channel-block.c',
   'cpu-throttle.c',
diff --git a/migration/trace-events b/migration/trace-events
index bb0e0cc6dc..b82a1c5e40 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -383,3 +383,6 @@ migration_pagecache_insert(void) "Error allocating page"
 # cpu-throttle.c
 cpu_throttle_set(int new_throttle_pct)  "set guest CPU throttled by %d%%"
 cpu_throttle_dirty_sync(void) ""
+
+# block-active.c
+migration_block_activation(const char *name) "%s"
-- 
2.47.0




* Re: [PATCH v2 2/6] qmp/cont: Only activate disks if migration completed
  2024-12-06 23:08 ` [PATCH v2 2/6] qmp/cont: Only activate disks if migration completed Peter Xu
@ 2024-12-16 15:10   ` Fabiano Rosas
  0 siblings, 0 replies; 14+ messages in thread
From: Fabiano Rosas @ 2024-12-16 15:10 UTC (permalink / raw)
  To: Peter Xu, qemu-devel
  Cc: peterx, qemu-block, Vladimir Sementsov-Ogievskiy, Stefan Hajnoczi,
	Andrey Drobyshev, Eric Blake, Dr . David Alan Gilbert, Kevin Wolf,
	Daniel P . Berrangé, Markus Armbruster

Peter Xu <peterx@redhat.com> writes:

> As the comment says, the activation of disks is for the case where
> migration has completed, rather than when QEMU is still during
> migration (RUN_STATE_INMIGRATE).
>
> Move the code over to reflect what the comment is describing.
>
> Cc: Kevin Wolf <kwolf@redhat.com>
> Cc: Markus Armbruster <armbru@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Fabiano Rosas <farosas@suse.de>



* Re: [PATCH v2 3/6] migration/block: Make late-block-active the default
  2024-12-06 23:08 ` [PATCH v2 3/6] migration/block: Make late-block-active the default Peter Xu
@ 2024-12-16 15:16   ` Fabiano Rosas
  0 siblings, 0 replies; 14+ messages in thread
From: Fabiano Rosas @ 2024-12-16 15:16 UTC (permalink / raw)
  To: Peter Xu, qemu-devel
  Cc: peterx, qemu-block, Vladimir Sementsov-Ogievskiy, Stefan Hajnoczi,
	Andrey Drobyshev, Eric Blake, Dr . David Alan Gilbert, Kevin Wolf,
	Daniel P . Berrangé

Peter Xu <peterx@redhat.com> writes:

> Migration capability 'late-block-active' controls when the block drives
> will be activated.  If enabled, block drives will only be activated until
> VM starts, either src runstate was "live" (RUNNING, or SUSPENDED), or it'll
> be postponed until qmp_cont().
>
> Let's do this unconditionally.  There's no harm to delay activation of
> block drives.  Meanwhile there's no ABI breakage if dest does it, because
> src QEMU has nothing to do with it, so it's no concern on ABI breakage.
>
> IIUC we could avoid introducing this cap when introducing it before, but
> now it's still not too late to just always do it.  Cap now prone to
> removal, but it'll be for later patches.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Fabiano Rosas <farosas@suse.de>



* Re: [PATCH v2 4/6] migration/block: Apply late-block-active behavior to postcopy
  2024-12-06 23:08 ` [PATCH v2 4/6] migration/block: Apply late-block-active behavior to postcopy Peter Xu
@ 2024-12-16 15:17   ` Fabiano Rosas
  0 siblings, 0 replies; 14+ messages in thread
From: Fabiano Rosas @ 2024-12-16 15:17 UTC (permalink / raw)
  To: Peter Xu, qemu-devel
  Cc: peterx, qemu-block, Vladimir Sementsov-Ogievskiy, Stefan Hajnoczi,
	Andrey Drobyshev, Eric Blake, Dr . David Alan Gilbert, Kevin Wolf,
	Daniel P . Berrangé

Peter Xu <peterx@redhat.com> writes:

> Postcopy never cared about late-block-active.  However there's no mention
> in the capability that it doesn't apply to postcopy.
>
> Considering that we _assumed_ late activation is always good, do that too
> for postcopy unconditionally, just like precopy.  After this patch, we
> should have unified the behavior across all.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Fabiano Rosas <farosas@suse.de>



* Re: [PATCH v2 5/6] migration/block: Fix possible race with block_inactive
  2024-12-06 23:08 ` [PATCH v2 5/6] migration/block: Fix possible race with block_inactive Peter Xu
@ 2024-12-16 15:18   ` Fabiano Rosas
  0 siblings, 0 replies; 14+ messages in thread
From: Fabiano Rosas @ 2024-12-16 15:18 UTC (permalink / raw)
  To: Peter Xu, qemu-devel
  Cc: peterx, qemu-block, Vladimir Sementsov-Ogievskiy, Stefan Hajnoczi,
	Andrey Drobyshev, Eric Blake, Dr . David Alan Gilbert, Kevin Wolf,
	Daniel P . Berrangé

Peter Xu <peterx@redhat.com> writes:

> Src QEMU sets block_inactive=true very early before the invalidation takes
> place.  It means if something wrong happened during setting the flag but
> before reaching qemu_savevm_state_complete_precopy_non_iterable() where it
> did the invalidation work, it'll make block_inactive flag inconsistent.
>
> For example, think about when qemu_savevm_state_complete_precopy_iterable()
> can fail: it will have block_inactive set to true even if all block drives
> are active.
>
> Fix that by only update the flag after the invalidation is done.
>
> No Fixes for any commit, because it's not an issue if bdrv_activate_all()
> is re-entrant upon all-active disks - false positive block_inactive can
> bring nothing more than "trying to active the blocks but they're already
> active".  However let's still do it right to avoid the inconsistent flag
> v.s. reality.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Fabiano Rosas <farosas@suse.de>



* Re: [PATCH v2 6/6] migration/block: Rewrite disk activation
  2024-12-06 23:08 ` [PATCH v2 6/6] migration/block: Rewrite disk activation Peter Xu
@ 2024-12-16 15:58   ` Fabiano Rosas
  2024-12-16 18:33     ` Peter Xu
  0 siblings, 1 reply; 14+ messages in thread
From: Fabiano Rosas @ 2024-12-16 15:58 UTC (permalink / raw)
  To: Peter Xu, qemu-devel
  Cc: peterx, qemu-block, Vladimir Sementsov-Ogievskiy, Stefan Hajnoczi,
	Andrey Drobyshev, Eric Blake, Dr . David Alan Gilbert, Kevin Wolf,
	Daniel P . Berrangé

Peter Xu <peterx@redhat.com> writes:

> This patch proposes a flag to maintain disk activation status globally.  It
> mostly rewrites disk activation mgmt for QEMU, including COLO and QMP
> command xen_save_devices_state.
>
> Backgrounds
> ===========
>
> We have two problems on disk activations, one resolved, one not.
>
> Problem 1: disk activation recover (for switchover interruptions)
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> When migration is either cancelled or failed during switchover, especially
> when after the disks are inactivated, QEMU needs to remember re-activate
> the disks again before vm starts.
>
> It used to be done separately in two paths: one in qmp_migrate_cancel(),
> the other one in the failure path of migration_completion().
>
> It used to be fixed in different commits, all over the places in QEMU.  So
> these are the relevant changes I saw, I'm not sure if it's complete list:
>
>  - In 2016, commit fe904ea824 ("migration: regain control of images when
>    migration fails to complete")
>
>  - In 2017, commit 1d2acc3162 ("migration: re-active images while migration
>    been canceled after inactive them")
>
>  - In 2023, commit 6dab4c93ec ("migration: Attempt disk reactivation in
>    more failure scenarios")
>
> Now since we have a slightly better picture maybe we can unify the
> reactivation in a single path.
>
> One side benefit of doing so is, we can move the disk operation outside QMP
> command "migrate_cancel".  It's possible that in the future we may want to
> make "migrate_cancel" be OOB-compatible, while that requires the command
> doesn't need BQL in the first place.  This will already do that and make
> migrate_cancel command lightweight.
>
> Problem 2: disk invalidation on top of invalidated disks
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> This is an unresolved bug for current QEMU.  Link in "Resolves:" at the
> end.  It turns out besides the src switchover phase (problem 1 above), QEMU
> also needs to remember block activation on destination.
>
> Consider two continuous migration in a row, where the VM was always paused.
> In that scenario, the disks are not activated even until migration
> completed in the 1st round.  When the 2nd round starts, if QEMU doesn't
> know the status of the disks, it needs to try inactivate the disk again.
>
> Here the issue is the block layer API bdrv_inactivate_all() will crash a
> QEMU if invoked on already inactive disks for the 2nd migration.  For
> detail, see the bug link at the end.
>
> Implementation
> ==============
>
> This patch proposes to maintain disk activation with a global flag, so we
> know:
>
>   - If we used to inactivate disks for migration, but migration got
>   cancelled, or failed, QEMU will know it should reactivate the disks.
>
>   - On incoming side, if the disks are never activated but then another
>   migration is triggered, QEMU should be able to tell that inactivate is
>   not needed for the 2nd migration.
>
> We used to have disk_inactive, but it only solves the 1st issue, not the
> 2nd.  Also, it's done in completely separate paths so it's extremely hard
> to follow either how the flag changes, or the duration that the flag is
> valid, and when we will reactivate the disks.
>
> Convert the existing disk_inactive flag into that global flag (also invert
> its naming), and maintain the disk activation status for the whole
> lifecycle of qemu.  That includes the incoming QEMU.
>
> Put both of the error cases of source migration (failure, cancelled)
> together into migration_iteration_finish(), which will be invoked for
> either of the scenario.  So from that part QEMU should behave the same as
> before.  However with such global maintenance on disk activation status, we
> not only cleanup quite a few temporary paths that we try to maintain the
> disk activation status (e.g. in postcopy code), meanwhile it fixes the
> crash for problem 2 in one shot.
>
> For freshly started QEMU, the flag is initialized to TRUE showing that the
> QEMU owns the disks by default.
>
> For incoming migrated QEMU, the flag will be initialized to FALSE once and
> for all showing that the dest QEMU doesn't own the disks until switchover.
> That is guaranteed by the "once" variable.
>
> Resolves: https://gitlab.com/qemu-project/qemu/-/issues/2395
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Fabiano Rosas <farosas@suse.de>

Just a note about errp below and a comment about a pre-existing
condition, but nothing that requires action.

> ---
>  include/migration/misc.h |  4 ++
>  migration/migration.h    |  6 +--
>  migration/block-active.c | 94 ++++++++++++++++++++++++++++++++++++++++
>  migration/colo.c         |  2 +-
>  migration/migration.c    | 80 ++++++++--------------------------
>  migration/savevm.c       | 33 ++++++--------
>  monitor/qmp-cmds.c       |  8 +---
>  migration/meson.build    |  1 +
>  migration/trace-events   |  3 ++
>  9 files changed, 140 insertions(+), 91 deletions(-)
>  create mode 100644 migration/block-active.c
>
> diff --git a/include/migration/misc.h b/include/migration/misc.h
> index 804eb23c06..e68a473feb 100644
> --- a/include/migration/misc.h
> +++ b/include/migration/misc.h
> @@ -106,4 +106,8 @@ bool migration_incoming_postcopy_advised(void);
>  /* True if background snapshot is active */
>  bool migration_in_bg_snapshot(void);
>  
> +/* Wrapper for block active/inactive operations */
> +bool migration_block_activate(Error **errp);
> +bool migration_block_inactivate(void);
> +
>  #endif
> diff --git a/migration/migration.h b/migration/migration.h
> index 3857905c0e..fab3cad2b9 100644
> --- a/migration/migration.h
> +++ b/migration/migration.h
> @@ -370,9 +370,6 @@ struct MigrationState {
>      /* Flag set once the migration thread is running (and needs joining) */
>      bool migration_thread_running;
>  
> -    /* Flag set once the migration thread called bdrv_inactivate_all */
> -    bool block_inactive;
> -
>      /* Migration is waiting for guest to unplug device */
>      QemuSemaphore wait_unplug_sem;
>  
> @@ -556,4 +553,7 @@ void migration_bitmap_sync_precopy(bool last_stage);
>  /* migration/block-dirty-bitmap.c */
>  void dirty_bitmap_mig_init(void);
>  
> +/* migration/block-active.c */
> +void migration_block_active_setup(bool active);
> +
>  #endif
> diff --git a/migration/block-active.c b/migration/block-active.c
> new file mode 100644
> index 0000000000..d477cf8182
> --- /dev/null
> +++ b/migration/block-active.c
> @@ -0,0 +1,94 @@
> +/*
> + * Block activation tracking for migration purpose
> + *
> + * SPDX-License-Identifier: GPL-2.0-or-later
> + *
> + * Copyright (C) 2024 Red Hat, Inc.
> + */
> +#include "qemu/osdep.h"
> +#include "block/block.h"
> +#include "qapi/error.h"
> +#include "migration/migration.h"
> +#include "qemu/error-report.h"
> +#include "trace.h"
> +
> +/*
> + * Migration-only cache to remember the block layer activation status.
> + * Protected by BQL.
> + *
> + * We need this because..
> + *
> + * - Migration can fail after block devices are invalidated (during
> + *   switchover phase).  When that happens, we need to be able to recover
> + *   the block drive status by re-activating them.
> + *
> + * - Currently bdrv_inactivate_all() is not safe to be invoked on top of
> + *   invalidated drives (even if bdrv_activate_all() is actually safe to be
> + *   called any time!).  It means remembering this could help migration to
> + *   make sure it won't invalidate twice in a row, crashing QEMU.  It can
> + *   happen when we migrate a PAUSED VM from host1 to host2, then migrate
> + *   again to host3 without starting it.  TODO: a cleaner solution is to
> + *   allow safe invoke of bdrv_inactivate_all() at anytime, like
> + *   bdrv_activate_all().
> + *
> + * For freshly started QEMU, the flag is initialized to TRUE reflecting the
> + * scenario where QEMU owns block device ownerships.
> + *
> + * For incoming QEMU taking a migration stream, the flag is initialized to
> + * FALSE reflecting that the incoming side doesn't own the block devices,
> + * not until switchover happens.
> + */
> +static bool migration_block_active;
> +
> +/* Setup the disk activation status */
> +void migration_block_active_setup(bool active)
> +{
> +    migration_block_active = active;
> +}
> +
> +bool migration_block_activate(Error **errp)
> +{
> +    ERRP_GUARD();
> +
> +    assert(bql_locked());
> +
> +    if (migration_block_active) {
> +        trace_migration_block_activation("active-skipped");
> +        return true;
> +    }
> +
> +    trace_migration_block_activation("active");
> +
> +    bdrv_activate_all(errp);
> +    if (*errp) {
> +        error_report_err(error_copy(*errp));
> +        return false;
> +    }
> +
> +    migration_block_active = true;
> +    return true;
> +}
> +
> +bool migration_block_inactivate(void)
> +{
> +    int ret;
> +
> +    assert(bql_locked());
> +
> +    if (!migration_block_active) {
> +        trace_migration_block_activation("inactive-skipped");
> +        return true;
> +    }
> +
> +    trace_migration_block_activation("inactive");
> +
> +    ret = bdrv_inactivate_all();
> +    if (ret) {
> +        error_report("%s: bdrv_inactivate_all() failed: %d",
> +                     __func__, ret);
> +        return false;
> +    }
> +
> +    migration_block_active = false;
> +    return true;
> +}
> diff --git a/migration/colo.c b/migration/colo.c
> index 9590f281d0..ae3387a7a4 100644
> --- a/migration/colo.c
> +++ b/migration/colo.c
> @@ -836,7 +836,7 @@ static void *colo_process_incoming_thread(void *opaque)
>  
>      /* Make sure all file formats throw away their mutable metadata */
>      bql_lock();
> -    bdrv_activate_all(&local_err);
> +    migration_block_activate(&local_err);
>      bql_unlock();
>      if (local_err) {
>          error_report_err(local_err);
> diff --git a/migration/migration.c b/migration/migration.c
> index ba5deec5bc..d755ccb03d 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -738,7 +738,6 @@ static void qemu_start_incoming_migration(const char *uri, bool has_channels,
>  
>  static void process_incoming_migration_bh(void *opaque)
>  {
> -    Error *local_err = NULL;
>      MigrationIncomingState *mis = opaque;
>  
>      trace_vmstate_downtime_checkpoint("dst-precopy-bh-enter");
> @@ -769,11 +768,7 @@ static void process_incoming_migration_bh(void *opaque)
>               * Make sure all file formats throw away their mutable
>               * metadata.  If error, don't restart the VM yet.
>               */
> -            bdrv_activate_all(&local_err);
> -            if (local_err) {
> -                error_report_err(local_err);
> -                local_err = NULL;
> -            } else {
> +            if (migration_block_activate(NULL)) {
>                  vm_start();
>              }
>          } else {
> @@ -1552,16 +1547,6 @@ static void migrate_fd_cancel(MigrationState *s)
>              }
>          }
>      }
> -    if (s->state == MIGRATION_STATUS_CANCELLING && s->block_inactive) {
> -        Error *local_err = NULL;
> -
> -        bdrv_activate_all(&local_err);
> -        if (local_err) {
> -            error_report_err(local_err);
> -        } else {
> -            s->block_inactive = false;
> -        }
> -    }
>  }
>  
>  void migration_add_notifier_mode(NotifierWithReturn *notify,
> @@ -1860,6 +1845,12 @@ void qmp_migrate_incoming(const char *uri, bool has_channels,
>          return;
>      }
>  
> +    /*
> +     * Newly set up incoming QEMU.  Set the block-active state to false
> +     * to reflect that the src currently owns the disks.
> +     */
> +    migration_block_active_setup(false);
> +
>      once = false;
>  }
>  
> @@ -2512,7 +2503,6 @@ static int postcopy_start(MigrationState *ms, Error **errp)
>      QIOChannelBuffer *bioc;
>      QEMUFile *fb;
>      uint64_t bandwidth = migrate_max_postcopy_bandwidth();
> -    bool restart_block = false;
>      int cur_state = MIGRATION_STATUS_ACTIVE;
>  
>      if (migrate_postcopy_preempt()) {
> @@ -2548,13 +2538,10 @@ static int postcopy_start(MigrationState *ms, Error **errp)
>          goto fail;
>      }
>  
> -    ret = bdrv_inactivate_all();
> -    if (ret < 0) {
> -        error_setg_errno(errp, -ret, "%s: Failed in bdrv_inactivate_all()",
> -                         __func__);
> +    if (!migration_block_inactivate()) {
> +        error_setg(errp, "%s: Failed in bdrv_inactivate_all()", __func__);
>          goto fail;
>      }
> -    restart_block = true;
>  
>      /*
>       * Cause any non-postcopiable, but iterative devices to
> @@ -2624,8 +2611,6 @@ static int postcopy_start(MigrationState *ms, Error **errp)
>          goto fail_closefb;
>      }
>  
> -    restart_block = false;
> -
>      /* Now send that blob */
>      if (qemu_savevm_send_packaged(ms->to_dst_file, bioc->data, bioc->usage)) {
>          error_setg(errp, "%s: Failed to send packaged data", __func__);
> @@ -2670,17 +2655,7 @@ fail_closefb:
>  fail:
>      migrate_set_state(&ms->state, MIGRATION_STATUS_POSTCOPY_ACTIVE,
>                            MIGRATION_STATUS_FAILED);
> -    if (restart_block) {
> -        /* A failure happened early enough that we know the destination hasn't
> -         * accessed block devices, so we're safe to recover.
> -         */
> -        Error *local_err = NULL;
> -
> -        bdrv_activate_all(&local_err);
> -        if (local_err) {
> -            error_report_err(local_err);
> -        }
> -    }
> +    migration_block_activate(NULL);
>      migration_call_notifiers(ms, MIG_EVENT_PRECOPY_FAILED, NULL);
>      bql_unlock();
>      return -1;
> @@ -2778,31 +2753,6 @@ static void migration_completion_postcopy(MigrationState *s)
>      trace_migration_completion_postcopy_end_after_complete();
>  }
>  
> -static void migration_completion_failed(MigrationState *s,
> -                                        int current_active_state)
> -{
> -    if (s->block_inactive && (s->state == MIGRATION_STATUS_ACTIVE ||
> -                              s->state == MIGRATION_STATUS_DEVICE)) {
> -        /*
> -         * If not doing postcopy, vm_start() will be called: let's
> -         * regain control on images.
> -         */
> -        Error *local_err = NULL;
> -
> -        bql_lock();
> -        bdrv_activate_all(&local_err);
> -        if (local_err) {
> -            error_report_err(local_err);
> -        } else {
> -            s->block_inactive = false;
> -        }
> -        bql_unlock();
> -    }
> -
> -    migrate_set_state(&s->state, current_active_state,
> -                      MIGRATION_STATUS_FAILED);
> -}
> -
>  /**
>   * migration_completion: Used by migration_thread when there's not much left.
>   *   The caller 'breaks' the loop when this returns.
> @@ -2856,7 +2806,8 @@ fail:
>          error_free(local_err);
>      }
>  
> -    migration_completion_failed(s, current_active_state);
> +    migrate_set_state(&s->state, current_active_state,
> +                      MIGRATION_STATUS_FAILED);
>  }
>  
>  /**
> @@ -3286,6 +3237,11 @@ static void migration_iteration_finish(MigrationState *s)
>      case MIGRATION_STATUS_FAILED:
>      case MIGRATION_STATUS_CANCELLED:
>      case MIGRATION_STATUS_CANCELLING:

Pre-existing, but can we even reach here with CANCELLED? If we can start
the VM with both CANCELLED and CANCELLING, that means the
MIG_EVENT_PRECOPY_FAILED notifier is not being consistently called. So I
think CANCELLED here must be unreachable...

> +        /*
> +         * Re-activate the block drives if they're inactivated.  Note that
> +         * COLO shouldn't use block_active at all, so this should be a
> +         * no-op there.
> +         */
> +        migration_block_activate(NULL);
>          if (runstate_is_live(s->vm_old_state)) {
>              if (!runstate_check(RUN_STATE_SHUTDOWN)) {
>                  vm_start();
> @@ -3858,6 +3814,8 @@ static void migration_instance_init(Object *obj)
>      ms->state = MIGRATION_STATUS_NONE;
>      ms->mbps = -1;
>      ms->pages_per_second = -1;
> +    /* Freshly started QEMU owns all the block devices */
> +    migration_block_active_setup(true);
>      qemu_sem_init(&ms->pause_sem, 0);
>      qemu_mutex_init(&ms->error_mutex);
>  
> diff --git a/migration/savevm.c b/migration/savevm.c
> index 706b77ffab..969a994a85 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -1547,19 +1547,18 @@ int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f,
>      }
>  
>      if (inactivate_disks) {
> -        /* Inactivate before sending QEMU_VM_EOF so that the
> -         * bdrv_activate_all() on the other end won't fail. */
> -        ret = bdrv_inactivate_all();
> -        if (ret) {
> -            error_setg(&local_err, "%s: bdrv_inactivate_all() failed (%d)",
> -                       __func__, ret);
> +        /*
> +         * Inactivate before sending QEMU_VM_EOF so that the
> +         * bdrv_activate_all() on the other end won't fail.
> +         */
> +        if (!migration_block_inactivate()) {
> +            error_setg(&local_err, "%s: bdrv_inactivate_all() failed",
> +                       __func__);
>              migrate_set_error(ms, local_err);
>              error_report_err(local_err);
> -            qemu_file_set_error(f, ret);
> +            qemu_file_set_error(f, -EFAULT);
>              return ret;
>          }
> -        /* Remember that we did this */
> -        s->block_inactive = true;
>      }
>      if (!in_postcopy) {
>          /* Postcopy stream will still be going */
> @@ -2123,7 +2122,6 @@ static int loadvm_postcopy_handle_listen(MigrationIncomingState *mis)
>  
>  static void loadvm_postcopy_handle_run_bh(void *opaque)
>  {
> -    Error *local_err = NULL;
>      MigrationIncomingState *mis = opaque;
>  
>      trace_vmstate_downtime_checkpoint("dst-postcopy-bh-enter");
> @@ -2146,12 +2144,11 @@ static void loadvm_postcopy_handle_run_bh(void *opaque)
>           * Make sure all file formats throw away their mutable metadata.
>           * If we get an error here, just don't restart the VM yet.
>           */
> -        bdrv_activate_all(&local_err);
> +        bool success = migration_block_activate(NULL);
> +
>          trace_vmstate_downtime_checkpoint("dst-postcopy-bh-cache-invalidated");
> -        if (local_err) {
> -            error_report_err(local_err);
> -            local_err = NULL;
> -        } else {
> +
> +        if (success) {
>              vm_start();
>          }
>      } else {
> @@ -3193,11 +3190,7 @@ void qmp_xen_save_devices_state(const char *filename, bool has_live, bool live,
>           * side of the migration take control of the images.
>           */
>          if (live && !saved_vm_running) {
> -            ret = bdrv_inactivate_all();
> -            if (ret) {
> -                error_setg(errp, "%s: bdrv_inactivate_all() failed (%d)",
> -                           __func__, ret);
> -            }
> +            migration_block_inactivate();
>          }
>      }
>  
> diff --git a/monitor/qmp-cmds.c b/monitor/qmp-cmds.c
> index 76f21e8af3..6f76d9beaf 100644
> --- a/monitor/qmp-cmds.c
> +++ b/monitor/qmp-cmds.c
> @@ -31,6 +31,7 @@
>  #include "qapi/type-helpers.h"
>  #include "hw/mem/memory-device.h"
>  #include "hw/intc/intc.h"
> +#include "migration/misc.h"
>  
>  NameInfo *qmp_query_name(Error **errp)
>  {
> @@ -103,13 +104,8 @@ void qmp_cont(Error **errp)
>           * Continuing after completed migration. Images have been
>           * inactivated to allow the destination to take control. Need to
>           * get control back now.
> -         *
> -         * If there are no inactive block nodes (e.g. because the VM was
> -         * just paused rather than completing a migration),
> -         * bdrv_inactivate_all() simply doesn't do anything.
>           */
> -        bdrv_activate_all(&local_err);
> -        if (local_err) {
> +        if (!migration_block_activate(&local_err)) {
>              error_propagate(errp, local_err);

Could use errp directly here.

>              return;
>          }
> diff --git a/migration/meson.build b/migration/meson.build
> index d53cf3417a..dac687ee3a 100644
> --- a/migration/meson.build
> +++ b/migration/meson.build
> @@ -11,6 +11,7 @@ migration_files = files(
>  
>  system_ss.add(files(
>    'block-dirty-bitmap.c',
> +  'block-active.c',
>    'channel.c',
>    'channel-block.c',
>    'cpu-throttle.c',
> diff --git a/migration/trace-events b/migration/trace-events
> index bb0e0cc6dc..b82a1c5e40 100644
> --- a/migration/trace-events
> +++ b/migration/trace-events
> @@ -383,3 +383,6 @@ migration_pagecache_insert(void) "Error allocating page"
>  # cpu-throttle.c
>  cpu_throttle_set(int new_throttle_pct)  "set guest CPU throttled by %d%%"
>  cpu_throttle_dirty_sync(void) ""
> +
> +# block-active.c
> +migration_block_activation(const char *name) "%s"



* Re: [PATCH v2 6/6] migration/block: Rewrite disk activation
  2024-12-16 15:58   ` Fabiano Rosas
@ 2024-12-16 18:33     ` Peter Xu
  0 siblings, 0 replies; 14+ messages in thread
From: Peter Xu @ 2024-12-16 18:33 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: qemu-devel, qemu-block, Vladimir Sementsov-Ogievskiy,
	Stefan Hajnoczi, Andrey Drobyshev, Eric Blake,
	Dr . David Alan Gilbert, Kevin Wolf, Daniel P . Berrangé

On Mon, Dec 16, 2024 at 12:58:58PM -0300, Fabiano Rosas wrote:
> > @@ -3286,6 +3237,11 @@ static void migration_iteration_finish(MigrationState *s)
> >      case MIGRATION_STATUS_FAILED:
> >      case MIGRATION_STATUS_CANCELLED:
> >      case MIGRATION_STATUS_CANCELLING:
> 
> Pre-existing, but can we even reach here with CANCELLED? If we can start
> the VM with both CANCELLED and CANCELLING, that means the
> MIG_EVENT_PRECOPY_FAILED notifier is not being consistently called. So I
> think CANCELLED here must be unreachable...

Yeah, I can't see how it's reachable, because the only place where we can
set it to CANCELLED is:

migrate_fd_cleanup():
    if (s->state == MIGRATION_STATUS_CANCELLING) {
        migrate_set_state(&s->state, MIGRATION_STATUS_CANCELLING,
                          MIGRATION_STATUS_CANCELLED);
    }

In this specific context, it (as a bottom half) can only be scheduled after
this path.

Looks like the MIG_EVENT_PRECOPY_FAILED event will work regardless, though,
as that's also done after the above:

    type = migration_has_failed(s) ? MIG_EVENT_PRECOPY_FAILED :
                                     MIG_EVENT_PRECOPY_DONE;
    migration_call_notifiers(s, type, NULL);

So it looks like no matter whether it was CANCELLING or CANCELLED, it'll
always be CANCELLED by the time migration_has_failed() is reached.
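
For reference, the shape of the check I'm assuming is roughly this (from
memory, so worth double-checking against the actual source):

bool migration_has_failed(MigrationState *s)
{
    /* Both terminal "bad" states count as failure here. */
    return (s->state == MIGRATION_STATUS_CANCELLED ||
            s->state == MIGRATION_STATUS_FAILED);
}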

[...]

> > @@ -103,13 +104,8 @@ void qmp_cont(Error **errp)
> >           * Continuing after completed migration. Images have been
> >           * inactivated to allow the destination to take control. Need to
> >           * get control back now.
> > -         *
> > -         * If there are no inactive block nodes (e.g. because the VM was
> > -         * just paused rather than completing a migration),
> > -         * bdrv_inactivate_all() simply doesn't do anything.
> >           */
> > -        bdrv_activate_all(&local_err);
> > -        if (local_err) {
> > +        if (!migration_block_activate(&local_err)) {
> >              error_propagate(errp, local_err);
> 
> Could use errp directly here.

True..
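
i.e. something like this (untested):

    if (!migration_block_activate(errp)) {
        return;
    }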

-- 
Peter Xu




* Re: [PATCH v2 0/6] migration/block: disk activation rewrite
  2024-12-06 23:08 [PATCH v2 0/6] migration/block: disk activation rewrite Peter Xu
                   ` (5 preceding siblings ...)
  2024-12-06 23:08 ` [PATCH v2 6/6] migration/block: Rewrite disk activation Peter Xu
@ 2024-12-17 15:26 ` Fabiano Rosas
  6 siblings, 0 replies; 14+ messages in thread
From: Fabiano Rosas @ 2024-12-17 15:26 UTC (permalink / raw)
  To: Peter Xu, qemu-devel
  Cc: peterx, qemu-block, Vladimir Sementsov-Ogievskiy, Stefan Hajnoczi,
	Andrey Drobyshev, Eric Blake, Dr . David Alan Gilbert, Kevin Wolf,
	Daniel P . Berrangé

Peter Xu <peterx@redhat.com> writes:

> CI: https://gitlab.com/peterx/qemu/-/pipelines/1577280033
>  (note: it's a pipeline of two patchsets, to save CI credits and time)
>
> v1: https://lore.kernel.org/r/20241204005138.702289-1-peterx@redhat.com
>
> This is v2 of the series, removing RFC tag, because my goal is to have them
> (or some newer version) merged.
>
> The major change is I merged last three patches, and did quite some changes
> here and there, to make sure the global disk activation status is always
> consistent.  The whole idea is still the same.  I say changelog won't help.
>
> I also temporarily dropped Fabiano's ping-pong test cases to avoid
> different versions floating on the list (as I know a new version is coming
> at some point. Fabiano: you're taking over the 10.0 pulls, so I assume
> you're aware so there's no concern on order of merges).  I'll review the
> test cases separately when they're ready, but this series is still tested
> with that pingpong test and it keeps working.
>
> I started looking at this problem as a whole when reviewing Fabiano's
> series, especially the patch (for a QEMU crash [1]):
>
> https://lore.kernel.org/r/20241125144612.16194-5-farosas@suse.de
>
> The proposed patch could work, but it's unwanted to add such side effect to
> migration.  So I start to think about whether we can provide a cleaner
> approach, because migration doesn't need the disks to be active to work at
> all.  Hence we should try to avoid adding a migration ABI (which doesn't
> matter now, but may matter some day) into prepare phase on disk activation
> status.  Migration should happen with disks inactivated.
>
> It's also a pure wish that, if bdrv_inactivate_all() could be benign to be
> called even if all disks are already inactive.  Then the bug is also gone.
> After all, similar call on bdrv_activate_all() upon all-active disks is all
> fine.  I hope that wish could still be fair.  But I don't know well on
> block layer to say anything meaningful.
>
> And when I was looking at that, I found more things spread all over the
> place on disk activation.  I decided to clean all of them up, while
> hopefully fixing the QEMU crash [1] too.
>
> For this v2, I did some more tests, I want to make sure all the past paths
> keep working at least on failure or cancel races, also in postcopy failure
> cases.  So I did below and they all run pass (when I said "emulated" below,
> I meant I hacked something to trigger those race / rare failures, because
> they aren't easy to trigger with vanilla binary):
>
> - Tested generic migrate_cancel during precopy, disk activation won't be
>   affected.  Disk status reports correct values in tracepoints.
>
> - Test Fabiano's ping-pong migration tests on PAUSED state VM.
>
> - Emulated precopy failure before sending non-iterable, disk inactivation
>   won't happen, and also activation won't trigger after migration cleanups
>   (even if activation on top of activate disk is benign, I checked traces
>   to make sure it'll provide consistent disk status, skipping activation).
>
> - Emulated precopy failure right after sending non-iterable. Disks will be
>   inactivated, but then can be reactivated properly before VM starts.
>
> - Emulated postcopy failure when sending the packed data (which is after
>   disk invalidated), and making sure src VM will get back to live properly,
>   re-activate the disks before starting.
>
> - Emulated concurrent migrate_cancel at the end of migration_completion()
>   of precopy, after disk inactivated.  Disks can be reactivated properly.
>
>   NOTE: here if dest QEMU didn't quit before migrate_cancel,
>   bdrv_activate_all() can crash src QEMU.  This behavior should be the same
>   before/after this patch.
>
> Comments welcomed, thanks.
>
> [1] https://gitlab.com/qemu-project/qemu/-/issues/2395
>
> Peter Xu (6):
>   migration: Add helper to get target runstate
>   qmp/cont: Only activate disks if migration completed
>   migration/block: Make late-block-active the default
>   migration/block: Apply late-block-active behavior to postcopy
>   migration/block: Fix possible race with block_inactive
>   migration/block: Rewrite disk activation
>
>  include/migration/misc.h |   4 ++
>  migration/migration.h    |   6 +-
>  migration/block-active.c |  94 +++++++++++++++++++++++++++
>  migration/colo.c         |   2 +-
>  migration/migration.c    | 136 +++++++++++++++------------------------
>  migration/savevm.c       |  46 ++++++-------
>  monitor/qmp-cmds.c       |  22 +++----
>  migration/meson.build    |   1 +
>  migration/trace-events   |   3 +
>  9 files changed, 188 insertions(+), 126 deletions(-)
>  create mode 100644 migration/block-active.c

Queued, thanks!



