From: Fabiano Rosas <farosas@suse.de>
To: Peter Xu <peterx@redhat.com>, qemu-devel@nongnu.org
Cc: peterx@redhat.com, qemu-block@nongnu.org,
"Vladimir Sementsov-Ogievskiy" <vsementsov@yandex-team.ru>,
"Stefan Hajnoczi" <stefanha@redhat.com>,
"Andrey Drobyshev" <andrey.drobyshev@virtuozzo.com>,
"Eric Blake" <eblake@redhat.com>,
"Dr . David Alan Gilbert" <dave@treblig.org>,
"Kevin Wolf" <kwolf@redhat.com>,
"Daniel P . Berrangé" <berrange@redhat.com>
Subject: Re: [PATCH v2 0/6] migration/block: disk activation rewrite
Date: Tue, 17 Dec 2024 12:26:44 -0300 [thread overview]
Message-ID: <87h672wcsr.fsf@suse.de> (raw)
In-Reply-To: <20241206230838.1111496-1-peterx@redhat.com>
Peter Xu <peterx@redhat.com> writes:
> CI: https://gitlab.com/peterx/qemu/-/pipelines/1577280033
> (note: it's a pipeline of two patchsets, to save CI credits and time)
>
> v1: https://lore.kernel.org/r/20241204005138.702289-1-peterx@redhat.com
>
> This is v2 of the series, removing RFC tag, because my goal is to have them
> (or some newer version) merged.
>
> The major change is I merged last three patches, and did quite some changes
> here and there, to make sure the global disk activation status is always
> consistent. The whole idea is still the same. I say changelog won't help.
>
> I also temporarily dropped Fabiano's ping-pong test cases to avoid
> different versions floating on the list (as I know a new version is coming
> at some point. Fabiano: you're taking over the 10.0 pulls, so I assume
> you're aware so there's no concern on order of merges). I'll review the
> test cases separately when they're ready, but this series is still tested
> with that pingpong test and it keeps working.
>
> I started looking at this problem as a whole when reviewing Fabiano's
> series, especially the patch (for a QEMU crash [1]):
>
> https://lore.kernel.org/r/20241125144612.16194-5-farosas@suse.de
>
> The proposed patch could work, but it's unwanted to add such side effect to
> migration. So I start to think about whether we can provide a cleaner
> approach, because migration doesn't need the disks to be active to work at
> all. Hence we should try to avoid adding a migration ABI (which doesn't
> matter now, but may matter some day) into prepare phase on disk activation
> status. Migration should happen with disks inactivated.
>
> It's also a pure wish that, if bdrv_inactivate_all() could be benign to be
> called even if all disks are already inactive. Then the bug is also gone.
> After all, similar call on bdrv_activate_all() upon all-active disks is all
> fine. I hope that wish could still be fair. But I don't know well on
> block layer to say anything meaningful.
>
> And when I was looking at that, I found more things spread all over the
> place on disk activation. I decided to clean all of them up, while
> hopefully fixing the QEMU crash [1] too.
>
> For this v2, I did some more tests, I want to make sure all the past paths
> keep working at least on failure or cancel races, also in postcopy failure
> cases. So I did below and they all run pass (when I said "emulated" below,
> I meant I hacked something to trigger those race / rare failures, because
> they aren't easy to trigger with vanilla binary):
>
> - Tested generic migrate_cancel during precopy, disk activation won't be
> affected. Disk status reports correct values in tracepoints.
>
> - Test Fabiano's ping-pong migration tests on PAUSED state VM.
>
> - Emulated precopy failure before sending non-iterable, disk inactivation
> won't happen, and also activation won't trigger after migration cleanups
> (even if activation on top of activate disk is benign, I checked traces
> to make sure it'll provide consistent disk status, skipping activation).
>
> - Emulated precopy failure right after sending non-iterable. Disks will be
> inactivated, but then can be reactivated properly before VM starts.
>
> - Emulated postcopy failure when sending the packed data (which is after
> disk invalidated), and making sure src VM will get back to live properly,
> re-activate the disks before starting.
>
> - Emulated concurrent migrate_cancel at the end of migration_completion()
> of precopy, after disk inactivated. Disks can be reactivated properly.
>
> NOTE: here if dest QEMU didn't quit before migrate_cancel,
> bdrv_activate_all() can crash src QEMU. This behavior should be the same
> before/after this patch.
>
> Comments welcomed, thanks.
>
> [1] https://gitlab.com/qemu-project/qemu/-/issues/2395
>
> Peter Xu (6):
> migration: Add helper to get target runstate
> qmp/cont: Only activate disks if migration completed
> migration/block: Make late-block-active the default
> migration/block: Apply late-block-active behavior to postcopy
> migration/block: Fix possible race with block_inactive
> migration/block: Rewrite disk activation
>
> include/migration/misc.h | 4 ++
> migration/migration.h | 6 +-
> migration/block-active.c | 94 +++++++++++++++++++++++++++
> migration/colo.c | 2 +-
> migration/migration.c | 136 +++++++++++++++------------------------
> migration/savevm.c | 46 ++++++-------
> monitor/qmp-cmds.c | 22 +++----
> migration/meson.build | 1 +
> migration/trace-events | 3 +
> 9 files changed, 188 insertions(+), 126 deletions(-)
> create mode 100644 migration/block-active.c
Queued, thanks!
prev parent reply other threads:[~2024-12-17 15:27 UTC|newest]
Thread overview: 14+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-12-06 23:08 [PATCH v2 0/6] migration/block: disk activation rewrite Peter Xu
2024-12-06 23:08 ` [PATCH v2 1/6] migration: Add helper to get target runstate Peter Xu
2024-12-06 23:08 ` [PATCH v2 2/6] qmp/cont: Only activate disks if migration completed Peter Xu
2024-12-16 15:10 ` Fabiano Rosas
2024-12-06 23:08 ` [PATCH v2 3/6] migration/block: Make late-block-active the default Peter Xu
2024-12-16 15:16 ` Fabiano Rosas
2024-12-06 23:08 ` [PATCH v2 4/6] migration/block: Apply late-block-active behavior to postcopy Peter Xu
2024-12-16 15:17 ` Fabiano Rosas
2024-12-06 23:08 ` [PATCH v2 5/6] migration/block: Fix possible race with block_inactive Peter Xu
2024-12-16 15:18 ` Fabiano Rosas
2024-12-06 23:08 ` [PATCH v2 6/6] migration/block: Rewrite disk activation Peter Xu
2024-12-16 15:58 ` Fabiano Rosas
2024-12-16 18:33 ` Peter Xu
2024-12-17 15:26 ` Fabiano Rosas [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=87h672wcsr.fsf@suse.de \
--to=farosas@suse.de \
--cc=andrey.drobyshev@virtuozzo.com \
--cc=berrange@redhat.com \
--cc=dave@treblig.org \
--cc=eblake@redhat.com \
--cc=kwolf@redhat.com \
--cc=peterx@redhat.com \
--cc=qemu-block@nongnu.org \
--cc=qemu-devel@nongnu.org \
--cc=stefanha@redhat.com \
--cc=vsementsov@yandex-team.ru \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).