From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
To: Peter Xu <peterx@redhat.com>
Cc: qemu-devel@nongnu.org, Juan Quintela <quintela@redhat.com>,
Sean Christopherson <seanjc@google.com>,
Leonardo Bras Soares Passos <lsoaresp@redhat.com>,
Paolo Bonzini <pbonzini@redhat.com>,
Richard Henderson <rth@twiddle.net>,
Igor Mammedov <imammedo@redhat.com>
Subject: Re: [PATCH RFC 4/5] cpu: Allow cpu_synchronize_all_post_init() to take an errp
Date: Mon, 13 Jun 2022 12:13:05 +0100
Message-ID: <Yqcbwemb7I/MpGWG@work-vm>
In-Reply-To: <YqNTDSV4P05pb+9l@xz-m1.local>
* Peter Xu (peterx@redhat.com) wrote:
> On Thu, Jun 09, 2022 at 05:02:29PM -0400, Peter Xu wrote:
> > On Wed, Jun 08, 2022 at 06:05:28PM +0100, Dr. David Alan Gilbert wrote:
> > > > @@ -2005,7 +2005,17 @@ static void loadvm_postcopy_handle_run_bh(void *opaque)
> > > >      /* TODO we should move all of this lot into postcopy_ram.c or a shared code
> > > >       * in migration.c
> > > >       */
> > > > -    cpu_synchronize_all_post_init();
> > > > +    cpu_synchronize_all_post_init(&local_err);
> > > > +    if (local_err) {
> > > > +        /*
> > > > +         * TODO: a better way to do this is to tell the src that we cannot
> > > > +         * run the VM here so hopefully we can keep the VM running on src
> > > > +         * and immediately halt the switch-over.  But that needs work.
> > >
> > > Yes, I think it is possible; unlike some of the later errors in the same
> > > function, in this case we know no disks/network/etc have been touched,
> > > so we should be able to recover.
> > > I wonder if we can move the postcopy_state_set(POSTCOPY_INCOMING_RUNNING)
> > > out of loadvm_postcopy_handle_run to after this point.
> > >
> > > We've already got the return path, so we should be able to signal the
> > > failure unless we're very unlucky.
> >
> > > Right.  It's just that for the new ACK we may need to modify the return
> > > path protocol for sure, because none of the existing messages can carry
> > > such information.
> >
> > One idea is to reuse MIG_RP_MSG_RESUME_ACK, it was only used for postcopy
> > recovery before to do the final handshake with offload=1 only (which is
> > defined as MIGRATION_RESUME_ACK_VALUE). We could try to fill in the
> > > payload with some !1 value, to tell the source that we NACK the migration,
> > > then src fails the migration as soon as possible?
> > >
> > > That seems to be even compatible with an old qemu migrating to a new qemu
> > scenario, because when the old qemu notices the MIG_RP_MSG_RESUME_ACK
> > message with !1 payload, it'll mark the rp bad:
>
> Oh it won't be compatible.. The clean way to do this is we need to modify
> the src qemu to halt in postcopy_start() to wait for that ack before
> continuing.  That may need another cap/param to enable.

OK; I was wondering about sending a RP_MSG_SHUT with a failure; but if
you'd need to change the source it's still a problem.

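To make the compatibility problem concrete, here is a rough standalone
sketch (not QEMU code) of how a source might classify the resume-ack
payload. MIGRATION_RESUME_ACK_VALUE being 1 comes from the discussion
above; HYPOTHETICAL_RESUME_NACK_VALUE, RpVerdict and both functions are
made-up names for illustration only. It only models the payload check; it
does not address the separate problem that the source would have to wait
for the message before continuing.

```c
#include <stdint.h>

/* From the thread: 1 is the only payload the source accepts today. */
#define MIGRATION_RESUME_ACK_VALUE 1u
/* Assumption for illustration: no such constant exists in QEMU. */
#define HYPOTHETICAL_RESUME_NACK_VALUE 2u

typedef enum {
    RP_ACKED,        /* destination accepted, carry on */
    RP_DEST_REFUSED, /* destination explicitly refused to run the VM */
    RP_BAD           /* protocol violation, return path unusable */
} RpVerdict;

/* Models the existing check: every value other than the ACK is illegal. */
static RpVerdict old_source_verdict(uint32_t value)
{
    return value == MIGRATION_RESUME_ACK_VALUE ? RP_ACKED : RP_BAD;
}

/* Models a source that has been taught about the hypothetical NACK. */
static RpVerdict new_source_verdict(uint32_t value)
{
    switch (value) {
    case MIGRATION_RESUME_ACK_VALUE:
        return RP_ACKED;
    case HYPOTHETICAL_RESUME_NACK_VALUE:
        return RP_DEST_REFUSED;
    default:
        return RP_BAD;
    }
}
```

In this model an old source cannot tell a deliberate NACK apart from a
corrupted message; it just sees a bad return path, which is why the source
side needs changing either way.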
> The thing is I'm not very sure whether this will be worth it.
>
> Non-compatible migrations should be rare on put register failures. For the
> issue I was working on, it was actually a kernel bug that triggered it but
> it's just hard to figure out what's wrong.  With properly working kernels
> and matching hosts they should just not really happen.  I'm worried adding
> too much complexity could over-engineer things without much benefit.

OK that makes sense.

> In that case, I'd think it proper if we start with what this patchset
> provides, which at least allows us to fail in a crystal clear way?

Yes, the clear error is important.

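For reference, the shape of "fail clearly" I have in mind is roughly the
following standalone sketch. It is not the patch itself: the Error struct
here is a stand-in for QEMU's real Error object, and
fake_put_registers() / fake_synchronize_all_post_init() are hypothetical
names used only to show the first failing vCPU being reported through an
errp-style out parameter.

```c
#include <stdio.h>
#include <stdlib.h>

/* Stand-in for QEMU's Error object; it only carries a message. */
typedef struct Error {
    char msg[128];
} Error;

/* Hypothetical per-vCPU register write: fails on exactly one CPU index. */
static int fake_put_registers(int cpu_index, int failing_cpu)
{
    return cpu_index == failing_cpu ? -22 : 0;
}

/* Stops at the first failure and reports it through errp, if provided. */
static void fake_synchronize_all_post_init(int nr_cpus, int failing_cpu,
                                           Error **errp)
{
    for (int i = 0; i < nr_cpus; i++) {
        int ret = fake_put_registers(i, failing_cpu);

        if (ret < 0) {
            if (errp) {
                Error *err = calloc(1, sizeof(*err));

                if (!err) {
                    abort();
                }
                snprintf(err->msg, sizeof(err->msg),
                         "CPU %d: put registers failed (%d)", i, ret);
                *errp = err;
            }
            return;
        }
    }
}
```

The caller checks the out parameter after the call and can then print the
message and stop, rather than carrying on with a vCPU in an unknown state.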
Dave
> >
> > >         if (migrate_handle_rp_resume_ack(ms, tmp32)) {
> > >             mark_source_rp_bad(ms);
> > >             goto out;
> > >         }
> > >
> > > static int migrate_handle_rp_resume_ack(MigrationState *s, uint32_t value)
> > > {
> > >     trace_source_return_path_thread_resume_ack(value);
> > >
> > >     if (value != MIGRATION_RESUME_ACK_VALUE) {
> > >         error_report("%s: illegal resume_ack value %"PRIu32,
> > >                      __func__, value);
> > >         return -1;
> > >     }
> > >     ...
> > > }
> >
> > If it looks generally good, I can try with such a change in v2.
> >
> > Thanks,
> >
> > --
> > Peter Xu
>
> --
> Peter Xu
>
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Thread overview:
2022-06-07 23:06 [PATCH RFC 0/5] CPU: Detect put cpu register errors for migrations Peter Xu
2022-06-07 23:06 ` [PATCH RFC 1/5] cpus-common: Introduce run_on_cpu_func2 which allows error returns Peter Xu
2022-06-07 23:06 ` [PATCH RFC 2/5] cpus-common: Add run_on_cpu2() Peter Xu
2022-06-07 23:06 ` [PATCH RFC 3/5] accel: Allow synchronize_post_init() to take an Error** Peter Xu
2022-06-07 23:06 ` [PATCH RFC 4/5] cpu: Allow cpu_synchronize_all_post_init() to take an errp Peter Xu
2022-06-08 17:05 ` Dr. David Alan Gilbert
2022-06-09 21:02 ` Peter Xu
2022-06-10 14:19 ` Peter Xu
2022-06-13 11:13 ` Dr. David Alan Gilbert [this message]
2022-06-07 23:06 ` [PATCH RFC 5/5] KVM: Hook kvm_arch_put_registers() errors to the caller Peter Xu