From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
To: Kevin Wolf <kwolf@redhat.com>
Cc: "wangjie (P)" <wangjie88@huawei.com>,
"qemu-devel@nongnu.org" <qemu-devel@nongnu.org>,
mreitz@redhat.com, quintela@redhat.com,
"pbonzini@redhat.com" <pbonzini@redhat.com>,
"fuweiwei (C)" <fuweiwei2@huawei.com>,
"eblake@redhat.com" <eblake@redhat.com>,
"berrange@redhat.com kchamart@redhat.com" <kchamart@redhat.com>,
"famz@redhat.com" <famz@redhat.com>
Subject: Re: [Qemu-devel] ping RE: question: I found a qemu crash about migration
Date: Mon, 9 Oct 2017 12:55:37 +0100
Message-ID: <20171009115536.GJ2374@work-vm>
In-Reply-To: <20170929204400.GA13218@localhost.localdomain>
* Kevin Wolf (kwolf@redhat.com) wrote:
> On 29.09.2017 at 21:06, Dr. David Alan Gilbert wrote:
> > * Kevin Wolf (kwolf@redhat.com) wrote:
> > > On 28.09.2017 at 19:01, Dr. David Alan Gilbert wrote:
> > > > Hi,
> > > > This is a 'fun' bug; I had a good chat to kwolf about it earlier.
> > > > A proper fix really needs to be done together with libvirt so that we
> > > > can sequence:
> > > > a) The stopping of the CPU on the source
> > > > b) The termination of the mirroring block job
> > > > c) The inactivation of the block devices on the source
> > > > (bdrv_inactivate_all)
> > > > d) The activation of the block devices on the destination
> > > > (bdrv_invalidate_cache_all)
> > > > e) The start of the CPU on the destination
> > > >
> > > > It looks like you're hitting a race between b/c; we've had races
> > > > between c/d in the past and moved the bdrv_inactivate_all.
> > > >
> > > > During the discussion we ended up with two proposed solutions;
> > > > both of them require one extra command and one extra migration
> > > > capability.
> > > >
> > > > The block way
> > > > -------------
> > > > 1) Add a new migration capability pause-at-complete
> > > > 2) Add a new migration state almost-complete
> > > > 3) After saving devices, if pause-at-complete is set,
> > > > transition to almost-complete
> > > > 4) Add a new command (migration-continue) that
> > > > causes the migration to inactivate the devices (c)
> > > > and send the final EOF to the destination.
> > > >
> > > > You set pause-at-complete, wait until migrate hits almost-complete,
> > > > clean up the mirror job, and then do migration-continue. When it
> > > > completes, do 'cont' on the destination.
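> > > >
> > > > As a rough QMP sketch ('pause-at-complete', 'almost-complete' and
> > > > 'migration-continue' are the names proposed above, not existing
> > > > API, and the job name is made up):
> > > >
> > > >   -> { "execute": "migrate-set-capabilities",
> > > >        "arguments": { "capabilities": [
> > > >          { "capability": "pause-at-complete", "state": true } ] } }
> > > >   -> { "execute": "migrate", "arguments": { "uri": "tcp:dst:4444" } }
> > > >   ... poll query-migrate until "status" is "almost-complete" ...
> > > >   -> { "execute": "block-job-cancel",
> > > >        "arguments": { "device": "mirror0" } }
> > > >   ... wait for the BLOCK_JOB_COMPLETED event ...
> > > >   -> { "execute": "migration-continue" }
> > > >   (then 'cont' on the destination)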
> > >
> > > Especially since you're concerned about the exact position of the pause,
> > > the following would be a little safer (using the opportunity to suggest
> > > slightly different names, too):
> > >
> > > 1) Add a new migration capability pause-at-complete
> > > 2) Add a new migration state ready
> > > 3) After migration has converged (i.e. the background phase of the
> > > migration has completed) and we are ready to actually switch to the
> > > destination, stop the VM, transition to 'ready' and emit a QMP event
> > > 4) Continue processing QMP commands in the 'ready' phase. This is where
> > > libvirt is supposed to tear down any running non-VM activities like
> > > block jobs, NBD servers, etc.
> > > 5) Add a new command (migration-complete) that does the final part of
> > > migration, including sending the remaining dirty memory, device state
> > > and the final EOF that signals the destination to resume the VM if -S
> > > isn't given.
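> > >
> > > As a sketch, the flow from the management tool's side might then look
> > > like this ('pause-at-complete', the 'ready' status and
> > > 'migration-complete' all being the proposed interface, nothing that
> > > exists today):
> > >
> > >   -> { "execute": "migrate-set-capabilities",
> > >        "arguments": { "capabilities": [
> > >          { "capability": "pause-at-complete", "state": true } ] } }
> > >   -> { "execute": "migrate", "arguments": { "uri": "tcp:dst:4444" } }
> > >   <- { "event": "MIGRATION", "data": { "status": "ready" } }
> > >      (VM is stopped now; tear down block jobs, NBD servers, etc. here)
> > >   -> { "execute": "migration-complete" }
> > >   <- { "return": {} }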
> >
> > What worries me here is whether some other subsystem is going to say
> > that this pause must happen after the device state save, e.g. if
> > saving the device state causes devices to push out the last
> > block/packet/etc - and I've got no real way to say whether that point
> > is any better or worse than the point I was suggesting.
> > And the position in the cycle at which the pause happens is part of the API.
>
> The important point here is that the VM is stopped. If device emulation
> is doing anything by itself while the VM is stopped, that is a bug.
> In other words, that last block/packet/etc. must already have been
> pushed out when we stopped the VM, so there is nothing left to push out
> when we save the state.
>
> With correctly written device code, there is no difference whether we
> save the device state first or last; without external influence it will
> be the same.
>
> The obvious external thing that I can see is monitor commands, e.g. the
> monitor allows access to I/O ports, which can change the device state.
> However, monitor commands are completely under the control of the user,
> so they can always choose the order that they need.
>
> In any case, saving the device state only immediately before doing the
> switch makes sure that we consider any changes that have been made in
> the meantime.
>
> > > The main reason why I advocate this "block way" is that it looks a lot
> > > less hacky, but more future-proof to me. Instead of adding a one-off
> > > hack for the problem at hand, it introduces a phase where libvirt can
> > > cleanly tear down any activity it has still running on the source qemu.
> >
> > Yes, if we get that phase right.
> >
> > > We got away without doing this because we never did a clean handover of
> > > resources from the source to the destination. From the perspective of a
> > > management tool, we had these distinct phases during migration:
> > >
> > > a) 'migrate' command:
> > > Source VM is still running and in control of all resources (like disk
> > > images), migrating some state in the background
> > >
> > > b) Migration converges:
> > > Both VMs are stopped (assuming -S is given on the destination,
> > > otherwise this phase is skipped), the destination is in control of
> > > the resources
> > >
> > > c) 'cont' command:
> > > Destination VM is running and in control of the resources
> > >
> > > There is an asymmetry here because there is no phase where the VMs are
> > > stopped, but the source is in control of the resources (this is the
> > > additional state that my proposal introduces).
> > >
> > > As long as we didn't really manage control of the resources with locks,
> > > we could abuse b) for things where the source still accessed the
> > > resources if we knew that the destination wouldn't use them yet (it's
> > > not an assumption that I'm willing to trust too much) and it would still
> > > kind of work in the common cases. But it's not really the correct thing
> > > to do, it has probably always caused bugs in corner cases, and the
> > > locking mechanisms are pointing that out now.
> > >
> > > We're now noticing this with block devices because we're using locking,
> > > but I wouldn't be surprised if we found sooner or later that there are
> > > more resources that should really be handed over in a cleanly designed
> > > way and that work only by accident as long as you don't do that. So I
> > > don't think a local hack for mirroring (or even all block jobs) is
> > > sufficient to be future-proof.
> >
> > Without the block-job we were actually pretty safe - up until the EOF
> > we knew the destination wouldn't start, and similarly we wouldn't start
> > using a device on the destination before the source had saved that
> > device.
>
> s/Without the block-job/With only the guest device/
>
> As long as only a guest device is using a resource, we're pretty safe,
> yes. This is because the VM is stopped when we actually migrate, so the
> resource's user is already shut down without management tool interaction
> and the resource isn't actually in use at this point.
>
> Anything that keeps using a resource while the VM is stopped is a
> potential problem. Apart from block jobs, this includes at least the
> built-in NBD server, which also needs to be shut down before we can
> complete migration. And probably other things I'm forgetting right now.
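>
> (For the built-in NBD server at least, the teardown command already
> exists today:
>
>   -> { "execute": "nbd-server-stop" }
>   <- { "return": {} }
>
> so there it's "only" a matter of calling it at the right point.)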
>
> > My worry is where the edge of 'source is in control of the device'
> > is - is that up to the end of the device save? Before that? After
> > that? What? It seems to be pretty arbitrary. Unless we can
> > understand why it's where it is, I'm not sure it's a 'clean
> > migration phase model'.
>
> I can tell you where in the phase model it is: The source gives up
> control of the image during 'migration-complete', i.e. at the transition
> between the 'ready' and the 'completed' phase on the source.
>
> Within this transition, it's less clear where exactly it has to be, but
> that's equally true of its current place. I expect all of this code to stay
> mostly unchanged, but just called a bit later, so the place where we
> inactivate images today sounds like a good candidate. It's important
> that it's done before sending the EOF packet that triggers the
> destination to try and take control of the images.
>
> > > > The migration way
> > > > -----------------
> > > > 1) Stop doing (d) when the destination is started with -S
> > > > since it happens anyway when 'cont' is issued
> > > > 2) Add a new migration capability ext-manage-storage
> > > > 3) When 'ext-manage-storage' is set, we don't bother doing (c)
> > > > 4) Add a new command 'block-inactivate' on the source
> > > >
> > > > You set ext-manage-storage, do the migration, and when it's finished
> > > > clean up the block job, run block-inactivate on the source, and
> > > > then 'cont' on the destination.
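> > > >
> > > > Again as a sketch ('ext-manage-storage' and 'block-inactivate'
> > > > being the proposed additions, the job name made up):
> > > >
> > > >   -> { "execute": "migrate-set-capabilities",
> > > >        "arguments": { "capabilities": [
> > > >          { "capability": "ext-manage-storage", "state": true } ] } }
> > > >   -> { "execute": "migrate", "arguments": { "uri": "tcp:dst:4444" } }
> > > >   ... wait for "status": "completed" in query-migrate ...
> > > >   -> { "execute": "block-job-cancel",
> > > >        "arguments": { "device": "mirror0" } }
> > > >   -> { "execute": "block-inactivate" }
> > > >   (then 'cont' on the destination)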
> > >
> > > You would have to inactivate every single block node, which is not
> > > a thing libvirt can currently do (they don't even really manage
> > > individual nodes yet). Apart from that I believe it is much too
> > > low-level and users shouldn't have to worry about what implications
> > > migration has to block devices. A clean migration phase model should be
> > > much easier to understand.
> >
> > Yep, I see that.
> > (Having said that, this extra round-trip with the management tool adds
> > extra downtime, and we'll have to work out things like what happens if
> > you cancel or fail in that time.)
>
> libvirt can opt in to the additional phase only when it actually needs
> it.
>
> I'm not sure if the requirement can change while migration is already
> running. I don't think so currently, but if yes, we might need a way for
> management tools to change the capability setting after the fact.
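>
> (Presumably that would be migrate-set-capabilities, which I think is
> currently rejected while a migration is active, e.g.:
>
>   -> { "execute": "migrate-set-capabilities",
>        "arguments": { "capabilities": [
>          { "capability": "pause-at-complete", "state": true } ] } }
>
> so relaxing that check would be part of such a change. The capability
> name here is of course just the proposed one from above.)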
OK, I'm going to look at doing this in the next few days.
Dave
> Kevin
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK