Re: [PATCH 4/4] migration: Introduce POSTCOPY_DEVICE state


From: "Dr. David Alan Gilbert" <dave@treblig.org>
To: "Jiří Denemark" <jdenemar@redhat.com>
Cc: Peter Xu <peterx@redhat.com>, Juraj Marcin <jmarcin@redhat.com>,
	qemu-devel@nongnu.org, Fabiano Rosas <farosas@suse.de>
Subject: Re: [PATCH 4/4] migration: Introduce POSTCOPY_DEVICE state
Date: Wed, 1 Oct 2025 15:53:46 +0000
Message-ID: <aN1OijFpZpu-EssC@gallifrey>
In-Reply-To: <aN06MaKywizt1VbF@orkuz.int.mamuti.net>

* Jiří Denemark (jdenemar@redhat.com) wrote:
> On Wed, Oct 01, 2025 at 11:05:59 +0000, Dr. David Alan Gilbert wrote:
> > * Jiří Denemark (jdenemar@redhat.com) wrote:
> > > On Tue, Sep 30, 2025 at 16:04:54 -0400, Peter Xu wrote:
> > > > On Tue, Sep 30, 2025 at 09:53:31AM +0200, Jiří Denemark wrote:
> > > > > On Thu, Sep 25, 2025 at 14:22:06 -0400, Peter Xu wrote:
> > > > > > On Thu, Sep 25, 2025 at 01:54:40PM +0200, Jiří Denemark wrote:
> > > > > > > On Mon, Sep 15, 2025 at 13:59:15 +0200, Juraj Marcin wrote:
> > > > > > > So far, dest QEMU will try to resume the VM after getting the RUN command;
> > > > > > > that is what loadvm_postcopy_handle_run_bh() does. When autostart=1 is
> > > > > > > set, it will (1) first try to activate all block devices and, iff that
> > > > > > > succeeds, (2) do vm_start(), at the end of which the RESUME event is
> > > > > > > generated.  So RESUME currently implies both that disk activation
> > > > > > > succeeded and that vm start worked.
> > > > > > 
> > > > > > > may still fail when locking disks fails (not sure if this is the only
> > > > > > > way cont may fail). In this case we cannot cancel the migration on the
> > > > > > 
> > > > > > > Is there any known issue with locking disks that would make the dest fail?
> > > > > > > This really sounds like we should have the admin take a look.
> > > > > 
> > > > > > Oh definitely, it would be some kind of a storage access issue on the
> > > > > > destination. But we'd like to give the admin an option to actually do
> > > > > > something other than just killing the VM :-) Either by automatically
> > > > > > canceling the migration or allowing recovery once storage issues are
> > > > > > solved.
> > > > 
> > > > > The problem is, if the storage locking stopped working properly, then how
> > > > > can we guarantee the shared storage itself is working properly?
> > > > > 
> > > > > When I was replying previously, I was expecting the admin to take a look and
> > > > > fix the storage; I didn't expect that the VM could still be recovered if
> > > > > there's no confidence that the block devices will all work fine.  The
> > > > > locking errors to me may imply block corruption already, or should I not
> > > > > see it like that?
> > > 
> > > If the storage itself is broken, there's clearly nothing we can do. But
> > > the thing is we're accessing it from two distinct hosts. So while it may
> > > > work on the source, it can be broken on the destination. For example,
> > > > the connection between the destination host and the storage may be broken.
> > > > Not sure how often this can happen in real life, but we have a bug
> > > > report that (artificially) breaking storage access on the destination
> > > > results in a paused VM on the source which can only be killed.
> > 
> > I've got a vague memory that a tricky case is when some of your storage
> > devices are broken on the destination, but not all.
> > > So you tell the block layer you want to take them on the destination;
> > > some take their lock, one fails; now what state are you in?
> > I'm not sure if the block layer had a way of telling you what state
> > you were in when I was last involved in that.
> 
> Wouldn't those locks be automatically released when we kill QEMU on the
> destination as a reaction to a failure to start vCPUs?

Oh hmm, yeh that might work OK.

Dave

> Jirka
> 
-- 
 -----Open up your eyes, open up your mind, open up your code -------   
/ Dr. David Alan Gilbert    |       Running GNU/Linux       | Happy  \ 
\        dave @ treblig.org |                               | In Hex /
 \ _________________________|_____ http://www.treblig.org   |_______/
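
The sequence discussed in this thread (activate all block devices, start the
VM only if every activation succeeded, and the "some take their lock, one
fails" case) can be modelled with a small sketch. This is only a toy model in
Python, not QEMU code: the Disk and Destination classes, the reachable flag
and the ACTIVATION_FAILED event name are all hypothetical. The explicit
release loop stands in for whatever actually drops the locks; the thread
suggests that killing the destination QEMU would do that.

```python
from dataclasses import dataclass, field


class ActivationError(Exception):
    """Raised when a (hypothetical) disk cannot take its storage lock."""


@dataclass
class Disk:
    name: str
    reachable: bool = True   # hypothetical: can the destination reach the storage?
    locked: bool = False

    def activate(self) -> None:
        # Taking the lock fails if the destination cannot reach the storage.
        if not self.reachable:
            raise ActivationError(f"cannot lock {self.name}")
        self.locked = True

    def release(self) -> None:
        self.locked = False


@dataclass
class Destination:
    disks: list
    events: list = field(default_factory=list)
    running: bool = False

    def handle_run(self) -> bool:
        """Model of the RUN handling: activate disks, then start the VM."""
        taken = []
        try:
            # Step (1): try to activate every block device.
            for disk in self.disks:
                disk.activate()
                taken.append(disk)
        except ActivationError as err:
            # Partial failure: drop the locks that were already taken so the
            # destination ends up in a known state, and never start the VM.
            for disk in reversed(taken):
                disk.release()
            self.events.append(f"ACTIVATION_FAILED: {err}")  # hypothetical event
            return False
        # Step (2): only now start the VM; RESUME therefore implies that
        # both disk activation and the start itself succeeded.
        self.running = True
        self.events.append("RESUME")
        return True
```

In this model a destination with one unreachable disk returns False from
handle_run(), holds no locks afterwards and never emits RESUME, which is the
"known state" the partial-failure question above is asking about.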
