From: Peter Xu <peterx@redhat.com>
To: Steven Sistare <steven.sistare@oracle.com>
Cc: Fabiano Rosas <farosas@suse.de>, QEMU Developers <qemu-devel@nongnu.org>
Subject: Re: [PATCH V1 00/26] Live update: cpr-exec
Date: Tue, 28 May 2024 12:42:40 -0400
Message-ID: <ZlYJgHJwiVkGA7nG@x1n>
In-Reply-To: <102b10d6-034f-453a-8f06-97e8d5869364@oracle.com>
On Tue, May 28, 2024 at 11:10:27AM -0400, Steven Sistare wrote:
> On 5/27/2024 1:45 PM, Peter Xu wrote:
> > On Tue, May 21, 2024 at 07:46:12AM -0400, Steven Sistare wrote:
> > > I understand, thanks. If I can help with any of your todo list,
> > > just ask - steve
> >
> > Thanks for offering the help, Steve. I started looking at this today,
> > and found that I'm missing something high-level. Let me ask here, and
> > apologize in advance for throwing multiple questions at you..
> >
> > IIUC the whole idea of this patchset is to allow efficient QEMU upgrades:
> > in this case not the host kernel, but QEMU-only, and/or the layers above.
> >
> > Is there any justification for why the complexity is needed here? This
> > one looks more involved than cpr-reboot, so I'm wondering how much we
> > gain from the complexity, and whether it's worthwhile. 1000+ LOC is the
> > minimum to support it, and if we expect even more to come, that really
> > matters, IMHO.
> >
> > For example, what's the major motivation of this whole work? Is it more
> > about performance, or more about supporting special devices like VFIO
> > which we previously did not support, or something else? I can't find
> > this in any of the cover letters, including this one.
> >
> > Firstly, regarding performance, IMHO it would always be nice to share
> > even some very basic downtime measurements comparing the new exec mode
> > v.s. the old migration way of upgrading the QEMU binary. Do you perhaps
> > have some numbers on hand from when you started working on this feature
> > years ago? Or maybe some old links on the list would help too, as I
> > haven't followed this work since the start.
> >
> > On VFIO, IIUC you started this project before VFIO migration existed.
> > Now we have VFIO migration, so I'm not sure how much it helps the
> > upgrade use case. Even with current VFIO migration, we may not want to
> > migrate device state for a local upgrade, I suppose, as that can be a
> > lot of data depending on the type of device assigned. However, it would
> > be nice to discuss this too if it is the major purpose of the series.
> >
> > I think one other challenge for QEMU upgrade with VFIO devices is that
> > the dest QEMU won't be able to open the VFIO device while the src QEMU
> > still owns it. IIUC this is similar to the case where QEMU wants proper
> > ownership transfer of a shared block device, and AFAIR we resolved that
> > issue using some form of file lock on the image file. That won't easily
> > apply to a VFIO dev fd, but maybe there are other approaches; I'm not
> > sure whether you investigated any. E.g. could the VFIO handle be passed
> > over using unix SCM_RIGHTS? I think this might remove one reason for
> > using exec(), which otherwise makes this quite different from a generic
> > migration (in that regard, cpr-reboot is still a pretty generic
> > migration).
> >
> > You also mentioned vhost/tap: is that also a major goal of this series
> > and its follow-up patchsets? Is it a problem only because this solution
> > does exec()? Could it work if either the exec()ed qemu or the dst qemu
> > created the vhost/tap fds at boot?
> >
> > Meanwhile, could you elaborate a bit on the implications for chardevs?
> > From what I read in the doc update, it looks like a major part of the
> > future work, but I don't yet understand the issue.. Is it also relevant
> > to the exec() approach?
> >
> > In any case, some discussion of this would be really appreciated. And if
> > you considered other approaches to solve this problem, it would be great
> > to mention why you chose this one. Considering this work contains so
> > many things, it would be nice if such discussion could start with the
> > fundamentals, e.g. why exec() is a must.
>
> The main goal of cpr-exec is providing a fast and reliable way to update
> qemu. cpr-reboot is not fast enough or general enough. It requires the
> guest to support suspend and resume for all devices, and that takes seconds.
> If one actually reboots the host, that adds more seconds, depending on
> system services. cpr-exec takes 0.1 secs, and works every time, unlike
> live migration, which can fail to converge on a busy system. Live migration
> also consumes more system and network resources.
Right, but note that when I was thinking of a comparison between cpr-exec
v.s. normal migration, I didn't mean a "normal live migration". It's more
the question of whether exec() can be avoided. I had a feeling that this
exec() will cause a major part of the work elsewhere, but maybe I am wrong,
as I haven't seen the whole branch.
AFAIU, "cpr-exec takes 0.1 secs" is a conditional result. I think it
should at least depend on what devices are attached to the VM, right?
E.g., I have observed at least two things that can drastically enlarge the
blackout window:

1) vcpu save/load can sometimes take a ridiculously long time, even if 99%
of them are fine. I haven't yet spent time looking at this issue, but the
outlier (a single vcpu save or load; I don't remember which, but both
contribute to the downtime anyway) can alone cost 100+ms for that single
vcpu, which already exceeds 0.1 sec.
2) virtio device loads can sometimes be very slow due to virtqueue
manipulations. We used to have developers working in this area,
e.g. this thread:

https://lore.kernel.org/r/20230317081904.24389-1-xuchuangxclwt@bytedance.com

I haven't yet had time to look further. Since you mentioned vhost, I was
wondering whether you hit similar issues, and if not, why not. IIRC it
happened only during VM load, so on dest QEMU only. Again, that
contributes to the overall downtime too and can also be 100ms or more,
though it may depend on the VM memory topology and device setup.
When we compare the solutions, we definitely don't need to make it "live":
it could be a migration starting with the VM already paused, skipping all
dirty tracking just like cpr-reboot, but otherwise a relatively normal
migration, so that we still invoke a new qemu binary and load the state on
the fly, perhaps taking the fds via SCM_RIGHTS. Then compare these two
solutions with/without exec(). Note that I'm not requesting such data;
that wouldn't be fair if it first takes a lot of work to implement the
idea. What I wanted to say is that it might be interesting to first
analyze what causes the downtime, and whether that can also be resolved
without exec(); hence the question below on "why exec()" in the first
place, as I still feel that's something we should avoid unless absolutely
necessary..
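For concreteness, what I have in mind would look roughly like this over
QMP (a sketch only; the socket path and exact flow are illustrative, not
from this series):

  # src QEMU, via QMP: pause the guest, then migrate; with the VM
  # stopped there are no dirty rounds to converge.
  {"execute": "stop"}
  {"execute": "migrate", "arguments": {"uri": "unix:/tmp/mig.sock"}}

  # dst: start the *new* qemu binary with the same configuration, plus:
  #   -incoming unix:/tmp/mig.sock -S
  # then issue {"execute": "cont"} once the migration completes.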
> cpr-exec seamlessly preserves client connections by preserving chardevs,
> and overall provides a much nicer user experience.
I see. However, isn't this a common issue for migration in general? I
mean, if we have some chardevs on the src host and we migrate the VM from
src to dst, a reconnect will be needed anyway. It looks to me that as long
as the old live migration is supported, there's already a solution, and
apps are OK with reconnecting to the new ports. From that POV, I am
curious whether this can be seen as a (kind of separate) piece of work
beside cpr-exec, or perhaps a new feature that is only valid for cpr-exec?

Meanwhile, could you elaborate on what the major user-experience
improvements of the new solution would be?
>
> chardevs are preserved by keeping their fd open across the exec, and
> remembering the value of the fd in precreate vmstate so that new qemu
> can associate the fd with the chardev rather than opening a new one.
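OK. For anyone following along, the generic trick for keeping an fd open
across exec() is just clearing FD_CLOEXEC on it; a minimal sketch of my
own below, not the actual helper this series adds:

  #include <fcntl.h>

  /* Clear FD_CLOEXEC so 'fd' stays open across exec() into the new
   * binary; the new process only needs to learn the fd number (here,
   * from precreate vmstate) to keep using the open file. */
  static int keep_fd_across_exec(int fd)
  {
      int flags = fcntl(fd, F_GETFD);

      if (flags < 0) {
          return -1;
      }
      return fcntl(fd, F_SETFD, flags & ~FD_CLOEXEC);
  }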
>
> The approach of preserving open file descriptors is very general and applicable
> to all kinds of devices, regardless of whether they support live migration
> in hardware. Device fd's are preserved using the same mechanism as for
> chardevs.
>
> Devices that support live migration in hardware do not like to live migrate
> in place to the same node. It is not what they are designed for, and some
> implementations will flat out fail because the source and target interfaces
> are the same.
>
> For vhost/tap, sometimes the management layer opens the dev and passes an
> fd to qemu, and sometimes qemu opens the dev. The upcoming vhost/tap support
> allows both. For the case where qemu opens the dev, the fd is preserved
> using the same mechanism as for chardevs.
>
> The fundamental requirements of this work are:
> - precreate vmstate
> - preserve open file descriptors
>
> Direct exec from old to new qemu is not a hard requirement.
Great to know..
> However, it is simple, with few complications, and works with Oracle's
> cloud containers, so it is the method I am most interested in finishing
> first.
>
> I believe everything could also be made to work by using SCM_RIGHTS to
> send fd's to a new qemu process that is started by some external means.
> It would be requested with MIG_MODE_CPR_SCM (or some better name), and
> would co-exist with MIG_MODE_CPR_EXEC.
That sounds better to me, so that the live migration framework is not
changed as drastically. I still feel that exec() is too powerful, and evil
can reside in it, just like the black magic in fairy tales; magicians try
to avoid using it unless absolutely necessary.
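For reference, the sending side of such a MIG_MODE_CPR_SCM would
presumably be the standard SCM_RIGHTS dance. A minimal sketch of my own
(not code from the series), assuming a connected AF_UNIX socket and with
error handling trimmed:

  #include <string.h>
  #include <sys/socket.h>
  #include <sys/uio.h>

  /* Send one open fd to the peer process over a unix domain socket;
   * the peer receives a fresh fd referring to the same open file. */
  static int send_fd(int sock, int fd)
  {
      char dummy = 'F';            /* must carry at least one data byte */
      char buf[CMSG_SPACE(sizeof(int))] = { 0 };
      struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
      struct msghdr msg = { 0 };
      struct cmsghdr *cmsg;

      msg.msg_iov = &iov;
      msg.msg_iovlen = 1;
      msg.msg_control = buf;
      msg.msg_controllen = sizeof(buf);

      cmsg = CMSG_FIRSTHDR(&msg);
      cmsg->cmsg_level = SOL_SOCKET;
      cmsg->cmsg_type = SCM_RIGHTS;
      cmsg->cmsg_len = CMSG_LEN(sizeof(int));
      memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

      return sendmsg(sock, &msg, 0) < 0 ? -1 : 0;
  }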
I think the next step for my review is to understand what exec() implies.
I'll wait for you to push your tree somewhere so I can read it and
understand better. A base commit would work too, if you can share one, so
I can apply the series; it doesn't seem to apply to master now.
Thanks,
--
Peter Xu