Re: Properly quitting qemu immediately after failing migration

All of lore.kernel.org
 help / color / mirror / Atom feed

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
To: Max Reitz <mreitz@redhat.com>
Cc: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>,
	"qemu-devel@nongnu.org" <qemu-devel@nongnu.org>,
	Juan Quintela <quintela@redhat.com>
Subject: Re: Properly quitting qemu immediately after failing migration
Date: Mon, 29 Jun 2020 17:46:18 +0100	[thread overview]
Message-ID: <20200629164618.GL2908@work-vm> (raw)
In-Reply-To: <fb611d49-4fd9-ef1a-1383-1d15137f658d@redhat.com>

* Max Reitz (mreitz@redhat.com) wrote:
> On 29.06.20 17:41, Dr. David Alan Gilbert wrote:
> > * Max Reitz (mreitz@redhat.com) wrote:
> >> Hi,
> >>
> >> In an iotest, I’m trying to quit qemu immediately after a migration has
> >> failed.  Unfortunately, that doesn’t seem to be possible in a clean way:
> >> migrate_fd_cleanup() runs only at some point after the migration state
> >> is already “failed”, so if I just wait for that “failed” state and
> >> immediately quit, some cleanup functions may not have been run yet.
> > 
> > Yeh this is hard; I always take the end of migrate_fd_cleanup to be the
> > real end.
> 
> Yes, unfortunately I don’t seem to have a way to look for that end. :(
> 
> > It always happens on the main thread I think (it's done as a bh in some
> > cases).
> > 
> >> This is a problem with dirty bitmap migration at least, because it
> >> increases the refcount on all block devices that are to be migrated, so
> >> if we don’t call the cleanup function before quitting, the refcount will
> >> stay elevated and bdrv_close_all() will hit an assertion because those
> >> block devices are still around after blk_remove_all_bs() and
> >> blockdev_close_all_bdrv_states().
> >>
> >> In practice this particular issue might not be that big of a problem,
> >> because it just means qemu aborts when the user intended to let it quit
> >> anyway.  But on one hand I could imagine that there are other clean-up
> >> paths that should definitely run before qemu quits (although I don’t
> >> know), and on the other, it’s a problem for my test.
> > 
> > 'quit' varies - there are a lot of incoming failures that just assert;
> > very few of them cause a clean exit (I think there are more clean ones
> > after Peter's work on restartable postcopy a year or two ago).
> 
> Well, my problem is about the source side, where there is still a VM
> running that I would expect to be in a sane state even after a failed
> migration.

Oh! Source side; the source side I worry much more about; if the
destination implodes after a failed migration I'm not too worried - but
the source side *shall not* fail.

> > I do see the end of migrate_fd_cleanup calls the notifier list; but it's
> > not clear to me that it's alwyas going to see the first transition to
> > 'failed' at that point.
> 
> What exactly do you mean?  It appears to me that both query-status and
> the MIGRATION events signal the failed state before migrate_fd_cleanup()
> is invoked.

OK, I was getting confused between event sending and the notifiers; the
event happens a bit before.  The state gets set to failed and then we
call fd_cleanup in most cases; the state is what query-status is keyed
off, and is also what causes the event to be sent.

> If you mean I could add a notifier to that list to do something™, I’m
> not sure what exactly it is I’d so.

Yeh, that's why the notifiers are there....
However, if you need to do some cleanup at the end of a migration, then
I think adding a call inside migrate_fd_cleanup is perfectly fine.

>  My test can’t do it, because it’s
> an iotest, and even if it could, I suppose I’d want to wait until even
> after all notifiers have been invoked (which isn’t guaranteed if I’d add
> a notifier myself).
> 
> >> I tried working around the problem for my test by waiting on “Unable to
> >> write” appearing on stderr, because that indicates that
> >> migrate_fd_cleanup()’s error_report_err() has been reached.  But on one
> >> hand, that isn’t really nice, and on the other, it doesn’t even work
> >> when the failure is on the source side (because then there is no
> >> s->error for migrate_fd_cleanup() to report).
> >>
> >> In all, I’m asking:
> >> (1) Is there a nice solution for me now to delay quitting qemu until the
> >> failed migration has been fully resolved, including the clean-up?
> > 
> > In vl.c, I added a call to migration_shutdown in qemu_cleanup - although
> > that seems to be mostly about cleaning up the *outgoing* side; you could
> > add some incoming cleanup there.
> 
> So you mean waiting until migrate_fd_cleanup() has run?  Maybe I’ll try
> that tomorrow, although I’d hoped I could get this done without having
> to modify the code base...  (I.e., I’d hoped there would be some
> QMP-queriable flag somewhere that could tell me whether the
> migrate_fd_cleanup() has run)

Please be a little careful around there; i.e. try as far as possible not
to hang on dead block devices.

> >> (2) Isn’t it a problem if qemu crashes when you issue “quit” via QMP at
> >> the wrong time?  Like, maybe lingering subprocesses when using “exec”?
> > 
> > Yeh that should be cleaner, but isn't.
> 
> :(

Randomly quitting is safer than it used to be (again that qemu_cleanup
code is what should make it so).

Dave

> OK then.  Thanks for your insights!
> 
> Max
> 



--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

next prev parent reply	other threads:[~2020-06-29 16:48 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-06-29 13:48 Properly quitting qemu immediately after failing migration Max Reitz
2020-06-29 14:18 ` Vladimir Sementsov-Ogievskiy
2020-06-29 15:00   ` Max Reitz
2020-07-01 16:16     ` Vladimir Sementsov-Ogievskiy
2020-07-02  7:23       ` Max Reitz
2020-07-02 11:44         ` Vladimir Sementsov-Ogievskiy
2020-07-02 12:57           ` Vladimir Sementsov-Ogievskiy
2020-06-29 15:41 ` Dr. David Alan Gilbert
2020-06-29 16:08   ` Max Reitz
2020-06-29 16:46     ` Dr. David Alan Gilbert [this message]
2020-06-29 15:45 ` Daniel P. Berrangé
2020-06-29 16:00   ` Max Reitz

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200629164618.GL2908@work-vm \
    --to=dgilbert@redhat.com \
    --cc=mreitz@redhat.com \
    --cc=qemu-devel@nongnu.org \
    --cc=quintela@redhat.com \
    --cc=vsementsov@virtuozzo.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.