Re: [PATCH V3 7/9] migration: cpr-exec mode

From: Peter Xu <peterx@redhat.com>
To: Steven Sistare <steven.sistare@oracle.com>
Cc: qemu-devel@nongnu.org, Fabiano Rosas <farosas@suse.de>,
	Markus Armbruster <armbru@redhat.com>,
	Paolo Bonzini <pbonzini@redhat.com>,
	Eric Blake <eblake@redhat.com>,
	"Dr. David Alan Gilbert" <dave@treblig.org>
Subject: Re: [PATCH V3 7/9] migration: cpr-exec mode
Date: Tue, 9 Sep 2025 15:27:46 -0400
Message-ID: <aMB_stJSkPgHzug0@x1.local>
In-Reply-To: <60f44830-d306-4dec-8f0d-65d3b32b3a2e@oracle.com>

On Tue, Sep 09, 2025 at 02:10:14PM -0400, Steven Sistare wrote:
> On 9/9/2025 12:32 PM, Peter Xu wrote:
> > On Thu, Aug 14, 2025 at 10:17:21AM -0700, Steve Sistare wrote:
> > > Add the cpr-exec migration mode.  Usage:
> > >    qemu-system-$arch -machine aux-ram-share=on ...
> > >    migrate_set_parameter mode cpr-exec
> > >    migrate_set_parameter cpr-exec-command \
> > >      <arg1> <arg2> ... -incoming <uri-1> \
> > >    migrate -d <uri-1>
> > > 
> > > The migrate command stops the VM, saves state to uri-1,
> > > directly exec's a new version of QEMU on the same host,
> > > replacing the original process while retaining its PID, and
> > > loads state from uri-1.  Guest RAM is preserved in place,
> > > albeit with new virtual addresses.
> > > 
> > > The new QEMU process is started by exec'ing the command
> > > specified by the @cpr-exec-command parameter.  The first word of
> > > the command is the binary, and the remaining words are its
> > > arguments.  The command may be a direct invocation of new QEMU,
> > > or may be a non-QEMU command that exec's the new QEMU binary.
> > > 
> > > This mode creates a second migration channel that is not visible
> > > to the user.  At the start of migration, old QEMU saves CPR state
> > > to the second channel, and at the end of migration, it tells the
> > > main loop to call cpr_exec.  New QEMU loads CPR state early, before
> > > objects are created.
> > > 
> > > Because old QEMU terminates when new QEMU starts, one cannot
> > > stream data between the two, so uri-1 must be a type,
> > > such as a file, that accepts all data before old QEMU exits.
> > > Otherwise, old QEMU may quietly block writing to the channel.
> > > 
> > > Memory-backend objects must have the share=on attribute, but
> > > memory-backend-epc is not supported.  The VM must be started with
> > > the '-machine aux-ram-share=on' option, which allows anonymous
> > > memory to be transferred in place to the new process.  The memfds
> > > are kept open across exec by clearing the close-on-exec flag, their
> > > values are saved in CPR state, and they are mmap'd in new QEMU.
> > 
> > Some generic questions around exec..
> > 
> > How do we know we can already safely kill all threads?
> > 
> > IIUC vcpu threads must be all stopped.  I wonder if we want to assert that
> > in the exec helper below.
> > 
> > What about the rest of the threads?  RCU threads should only be freeing
> > resources, so it looks OK to ignore them.  But others?
> 
> These threads are dormant, just as they are in the post migration state.
> There is no difference.  They can be safely killed, just as they can be
> post migration.
> 
> > Or would process states still matter in some cases? e.g. when QEMU is
> > talking to another vhost-user, or vfio-user, or virtio-fs, or ... whatever
> > other process, then suddenly the other process doesn't recognize this QEMU
> > anymore?
> 
> These cases need more development to work with cpr.  The external process
> can be used by new qemu if the socket connection (fd) is preserved in new QEMU.
> 
> > What about file locks or similar shared locks that can be running in an
> > iothread?  Is it possible that old QEMU took some shared locks, suddenly
> > qemu exec(), then the lock is never released?
> 
> Same as the post-migrate state.

IIUC the difference is that "migrate" for cpr-transfer triggers migration only;
another "quit" is required from mgmt to gracefully stop the src QEMU instance.
But for cpr-exec, migration cleanup leads straight into exec, all in one go.

I'm not sure whether something could be missed within that period.  For
example, libvirt may have logic making sure "quit" runs only after dest QEMU
emits some event.  But I confess I don't have an explicit example of what
would cause issues, so it's a pure question.

> > > Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> > > ---
> > >   qapi/migration.json       | 25 +++++++++++++++-
> > >   include/migration/cpr.h   |  1 +
> > >   migration/cpr-exec.c      | 74 +++++++++++++++++++++++++++++++++++++++++++++++
> > >   migration/cpr.c           | 26 ++++++++++++++++-
> > >   migration/migration.c     | 10 ++++++-
> > >   migration/ram.c           |  1 +
> > >   migration/vmstate-types.c |  8 +++++
> > >   migration/trace-events    |  1 +
> > >   8 files changed, 143 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/qapi/migration.json b/qapi/migration.json
> > > index ea410fd..cbc90e8 100644
> > > --- a/qapi/migration.json
> > > +++ b/qapi/migration.json
> > > @@ -694,9 +694,32 @@
> > >   #     until you issue the `migrate-incoming` command.
> > >   #
> > >   #     (since 10.0)
> > > +#
> > > +# @cpr-exec: The migrate command stops the VM, saves state to the
> > > +#     migration channel, directly exec's a new version of QEMU on the
> > > +#     same host, replacing the original process while retaining its
> > > +#     PID, and loads state from the channel.  Guest RAM is preserved
> > > +#     in place.  Devices and their pinned pages are also preserved for
> > > +#     VFIO and IOMMUFD.
> > > +#
> > > +#     Old QEMU starts new QEMU by exec'ing the command specified by
> > > +#     the @cpr-exec-command parameter.  The command may be a direct
> > > +#     invocation of new QEMU, or may be a non-QEMU command that exec's
> > > +#     the new QEMU binary.
> > > +#
> > > +#     Because old QEMU terminates when new QEMU starts, one cannot
> > > +#     stream data between the two, so the channel must be a type,
> > > +#     such as a file, that accepts all data before old QEMU exits.
> > > +#     Otherwise, old QEMU may quietly block writing to the channel.
> > 
> > The CPR channel (in case of exec mode) is persisted via env var.  Why not
> > do that too for the main migration stream?
> > 
> > > Does it have something to do with the size of the binary chunk to store all
> > device states (and some private mem)?  Or other concerns?
> 
> It was not necessary to add code for a new way to move migration data for
> the main stream when the existing code and interface works just fine.  One
> of the design principles pushed on me was to make cpr look as much like live
> migration as possible, and cpr-exec does that.  It has no issues juggling
> 2 streams, and no delayed start of the monitor. cpr-transfer is actually the
> oddball.

It just feels like it would look cleaner for cpr-exec to not need -incoming
XXX at all, e.g. if the series already used envvar anyway, we can use that
too so new QEMU would know it's cpr-exec incoming migration, without
-incoming parameter at all.

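To make the envvar idea concrete, here is a rough sketch of handing an fd
number across exec through the environment, so that the presence of the
variable alone tells the new process it is an exec-style incoming migration.
This is only an illustration: the helper names and the variable name passed
in are hypothetical and are not taken from the series.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: record an fd number in the environment before exec. */
static int persist_fd_env(const char *name, int fd)
{
    char val[16];

    assert(fd >= 0);
    snprintf(val, sizeof(val), "%d", fd);
    return setenv(name, val, 1);
}

/* Hypothetical helper: recover the fd after exec; -1 if unset or malformed. */
static int lookup_fd_env(const char *name)
{
    const char *val = getenv(name);
    char *end;
    long fd;

    if (!val || !*val) {
        return -1;
    }
    errno = 0;
    fd = strtol(val, &end, 10);
    if (errno || *end != '\0' || fd < 0 || fd > INT_MAX) {
        return -1;
    }
    return (int)fd;
}

/* Hypothetical: the variable being set implies an exec-style incoming side. */
static int is_exec_incoming(const char *name)
{
    return lookup_fd_env(name) >= 0;
}
```

The fd itself would still need close-on-exec cleared before the exec, as the
patch already does for preserved fds.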
> > >> +#
> > > +#     Memory-backend objects must have the share=on attribute, but
> > > +#     memory-backend-epc is not supported.  The VM must be started
> > > +#     with the '-machine aux-ram-share=on' option.
> > > +#
> > > +#     (since 10.2)
> > >   ##
> > >   { 'enum': 'MigMode',
> > > -  'data': [ 'normal', 'cpr-reboot', 'cpr-transfer' ] }
> > > +  'data': [ 'normal', 'cpr-reboot', 'cpr-transfer', 'cpr-exec' ] }
> > >   ##
> > >   # @ZeroPageDetection:
> > > diff --git a/include/migration/cpr.h b/include/migration/cpr.h
> > > index aaeec02..e99e48e 100644
> > > --- a/include/migration/cpr.h
> > > +++ b/include/migration/cpr.h
> > > @@ -54,6 +54,7 @@ int cpr_get_fd_param(const char *name, const char *fdname, int index, bool cpr,
> > >   QEMUFile *cpr_transfer_output(MigrationChannel *channel, Error **errp);
> > >   QEMUFile *cpr_transfer_input(MigrationChannel *channel, Error **errp);
> > > +void cpr_exec_init(void);
> > >   QEMUFile *cpr_exec_output(Error **errp);
> > >   QEMUFile *cpr_exec_input(Error **errp);
> > >   void cpr_exec_persist_state(QEMUFile *f);
> > > diff --git a/migration/cpr-exec.c b/migration/cpr-exec.c
> > > index 2c32e9c..7d0429f 100644
> > > --- a/migration/cpr-exec.c
> > > +++ b/migration/cpr-exec.c
> > > @@ -6,15 +6,20 @@
> > >   #include "qemu/osdep.h"
> > >   #include "qemu/cutils.h"
> > > +#include "qemu/error-report.h"
> > >   #include "qemu/memfd.h"
> > >   #include "qapi/error.h"
> > >   #include "io/channel-file.h"
> > >   #include "io/channel-socket.h"
> > > +#include "block/block-global-state.h"
> > > +#include "qemu/main-loop.h"
> > >   #include "migration/cpr.h"
> > >   #include "migration/qemu-file.h"
> > > +#include "migration/migration.h"
> > >   #include "migration/misc.h"
> > >   #include "migration/vmstate.h"
> > >   #include "system/runstate.h"
> > > +#include "trace.h"
> > >   #define CPR_EXEC_STATE_NAME "QEMU_CPR_EXEC_STATE"
> > > @@ -92,3 +97,72 @@ QEMUFile *cpr_exec_input(Error **errp)
> > >       lseek(mfd, 0, SEEK_SET);
> > >       return qemu_file_new_fd_input(mfd, CPR_EXEC_STATE_NAME);
> > >   }
> > > +
> > > +static bool preserve_fd(int fd)
> > > +{
> > > +    qemu_clear_cloexec(fd);
> > > +    return true;
> > > +}
> > > +
> > > +static bool unpreserve_fd(int fd)
> > > +{
> > > +    qemu_set_cloexec(fd);
> > > +    return true;
> > > +}
> > > +
> > > +static void cpr_exec(char **argv)
> > > +{
> > > +    MigrationState *s = migrate_get_current();
> > > +    Error *err = NULL;
> > > +
> > > +    /*
> > > +     * Clear the close-on-exec flag for all preserved fd's.  We cannot do so
> > > +     * earlier because they should not persist across miscellaneous fork and
> > > +     * exec calls that are performed during normal operation.
> > > +     */
> > > +    cpr_walk_fd(preserve_fd);
> > > +
> > > +    trace_cpr_exec();
> > > +    execvp(argv[0], argv);
> > > +
> > > +    cpr_walk_fd(unpreserve_fd);
> > > +
> > > +    error_setg_errno(&err, errno, "execvp %s failed", argv[0]);
> > > +    error_report_err(error_copy(err));
> > 
> > Feel free to ignore my question in the other patch, so we dump some errors
> > here.. which makes sense.
> > 
> > > +    migrate_set_state(&s->state, s->state, MIGRATION_STATUS_FAILED);
> > 
> > This is indeed FAILED migration, however it seems to imply it can catch
> > whatever possible failures that incoming could have.  Strictly speaking
> > this is not migration failure, but exec failure..  Maybe we need a comment
> > above this one explaining that we won't be able to capture any migration
> > issues, it's too late after exec() succeeded, so there's higher risk of
> > crashing the VM.
> 
> exec() can fail if the user provided a bogus cpr-exec-command, in which case
> recovery is possible.  exec() should never fail for valid exec arguments,
> unless the system is very sick and running out of resources, in which case
> all bets are off.

I really don't expect that to fail... bogus cpr-exec-command is more or
less a programming bug.  After all, I don't expect normal QEMU users would
use cpr-exec without a proper mgmt providing cpr-exec-command.

How about adding a comment here on what FAILED can capture (and what it cannot)?
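
To spell out the boundary, here is a standalone sketch (plain POSIX, not the
QEMU code; set_cloexec() and exec_or_recover() are made-up names) of the only
failure this path can see.  Control returns to the caller only when execvp()
itself fails; anything that goes wrong in the new process after a successful
exec cannot be reported back here.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Set or clear FD_CLOEXEC on fd; returns 0 on success, -1 on error. */
static int set_cloexec(int fd, int on)
{
    int flags = fcntl(fd, F_GETFD);

    if (flags < 0) {
        return -1;
    }
    flags = on ? (flags | FD_CLOEXEC) : (flags & ~FD_CLOEXEC);
    return fcntl(fd, F_SETFD, flags);
}

/*
 * Try to exec argv[0] with fd kept open across the exec.
 * Returns only if execvp() failed: the errno value is returned and
 * close-on-exec is restored so the caller can carry on as before.
 */
static int exec_or_recover(int fd, char **argv)
{
    int saved_errno;

    assert(argv && argv[0]);
    if (set_cloexec(fd, 0) < 0) {
        return errno;
    }
    execvp(argv[0], argv);
    saved_errno = errno;    /* only reached when execvp() failed */
    set_cloexec(fd, 1);
    return saved_errno;
}
```

With a bogus command such as a nonexistent path, exec_or_recover() returns
ENOENT and the fd is close-on-exec again, which is the recoverable case.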

> 
> > Luckily we are still on the same host, so things like mismatched kernel
> > versions at least won't crash this migration.. aka it is indeed not as easy
> > to fail a migration as in the cross-host case.  But still, I'd say I agree
> > with Vladimir that this is a major flaw of the design if so.
> > 
> > > +    migrate_set_error(s, err);
> > > +
> > > +    migration_call_notifiers(s, MIG_EVENT_PRECOPY_FAILED, NULL);
> > > +
> > > +    err = NULL;
> > > +    if (!migration_block_activate(&err)) {
> > > +        /* error was already reported */
> > > +        return;
> > > +    }
> > > +
> > > +    if (runstate_is_live(s->vm_old_state)) {
> > > +        vm_start();
> > > +    }
> > > +}
> > > +
> > > +static int cpr_exec_notifier(NotifierWithReturn *notifier, MigrationEvent *e,
> > > +                             Error **errp)
> > > +{
> > > +    MigrationState *s = migrate_get_current();
> > > +
> > > +    if (e->type == MIG_EVENT_PRECOPY_DONE) {
> > > +        assert(s->state == MIGRATION_STATUS_COMPLETED);
> > > +        qemu_system_exec_request(cpr_exec, s->parameters.cpr_exec_command);
> > > +    } else if (e->type == MIG_EVENT_PRECOPY_FAILED) {
> > > +        cpr_exec_unpersist_state();
> > > +    }
> > > +    return 0;
> > > +}
> > > +
> > > +void cpr_exec_init(void)
> > > +{
> > > +    static NotifierWithReturn exec_notifier;
> > > +
> > > +    migration_add_notifier_mode(&exec_notifier, cpr_exec_notifier,
> > > +                                MIG_MODE_CPR_EXEC);
> > 
> > Why use a notifier?  IMHO exec() is something important enough to not be
> > hiding in a notifier..  and CPR is already a major part of migration in the
> > framework, IMHO it'll be cleaner to invoke any CPR request in the migration
> > subsystem.  AFAIU notifiers are normally only for outside migration/ purposes.
> 
> This minimizes the number of control flow conditionals in the core migration code.
> That's a good thing, and I thought you would like it.
> 
> The alternative is to add code right after notifiers are called to check the
> mode, and call cpr_exec_notifier.  Seems silly when we have this generic
> mechanism to define callouts to occur at well-defined points during execution.
> 
> Note that cpr_exec_notifier does not directly call exec.  It posts the exec
> request.  It also recovers if cpr failed.

OK, I don't think I feel strongly on this one.

Initially I was concerned that some of the notifiers might not get invoked,
and which ones looks to be completely random.  But I kind of agree you chose
the spot late enough so whatever should really have been done before an
exec() should hopefully have been processed already, maybe during or around
the vm_stop() phase.

Feel free to keep it then if nobody else asks.

> 
> > > +}
> > > diff --git a/migration/cpr.c b/migration/cpr.c
> > > index 021bd6a..2078d05 100644
> > > --- a/migration/cpr.c
> > > +++ b/migration/cpr.c
> > > @@ -198,6 +198,8 @@ int cpr_state_save(MigrationChannel *channel, Error **errp)
> > >       if (mode == MIG_MODE_CPR_TRANSFER) {
> > >           g_assert(channel);
> > >           f = cpr_transfer_output(channel, errp);
> > > +    } else if (mode == MIG_MODE_CPR_EXEC) {
> > > +        f = cpr_exec_output(errp);
> > >       } else {
> > >           return 0;
> > >       }
> > > @@ -215,6 +217,10 @@ int cpr_state_save(MigrationChannel *channel, Error **errp)
> > >           return ret;
> > >       }
> > > +    if (migrate_mode() == MIG_MODE_CPR_EXEC) {
> > > +        cpr_exec_persist_state(f);
> > > +    }
> > > +
> > >       /*
> > >        * Close the socket only partially so we can later detect when the other
> > >        * end closes by getting a HUP event.
> > > @@ -226,6 +232,12 @@ int cpr_state_save(MigrationChannel *channel, Error **errp)
> > >       return 0;
> > >   }
> > > +static bool unpreserve_fd(int fd)
> > > +{
> > > +    qemu_set_cloexec(fd);
> > > +    return true;
> > > +}
> > > +
> > >   int cpr_state_load(MigrationChannel *channel, Error **errp)
> > >   {
> > >       int ret;
> > > @@ -237,6 +249,12 @@ int cpr_state_load(MigrationChannel *channel, Error **errp)
> > >           mode = MIG_MODE_CPR_TRANSFER;
> > >           cpr_set_incoming_mode(mode);
> > >           f = cpr_transfer_input(channel, errp);
> > > +    } else if (cpr_exec_has_state()) {
> > > +        mode = MIG_MODE_CPR_EXEC;
> > > +        f = cpr_exec_input(errp);
> > > +        if (channel) {
> > > +            warn_report("ignoring cpr channel for migration mode cpr-exec");
> > 
> > This looks like dead code?  channel can't be set when reaching here, AFAIU..
> 
> The user could define a cpr channel in qemu command line arguments, and it would
> reach here.  In that case the user is confused, but I warn instead of abort, to
> keep new QEMU alive.  I perform this sanity check here, rather than at top level,
> because I have localized awareness of cpr_exec state to here.

The code (after this patch applied) looks like this:

    if (channel) {                                            <------- [*]
        mode = MIG_MODE_CPR_TRANSFER;
        cpr_set_incoming_mode(mode);
        f = cpr_transfer_input(channel, errp);
    } else if (cpr_exec_has_state()) {
        mode = MIG_MODE_CPR_EXEC;
        f = cpr_exec_input(errp);
        if (channel) {
            warn_report("ignoring cpr channel for migration mode cpr-exec");
        }
    } else {
        return 0;
    }

IIUC [*] will capture any channel!=NULL case.
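
The same point as a small standalone model (illustrative names only, not the
QEMU code): with the posted branch order the warning can never fire, while
testing for exec state first, which is just one possible reordering and an
assumption on my side, makes it reachable.

```c
#include <assert.h>
#include <stdbool.h>

enum sketch_mode { SKETCH_NONE, SKETCH_TRANSFER, SKETCH_EXEC };

/* Same branch order as the excerpt above: the channel test comes first. */
static enum sketch_mode pick_as_posted(bool channel, bool exec_state,
                                       bool *warned)
{
    assert(warned);
    *warned = false;
    if (channel) {
        return SKETCH_TRANSFER;
    } else if (exec_state) {
        if (channel) {      /* unreachable: channel is false here */
            *warned = true;
        }
        return SKETCH_EXEC;
    }
    return SKETCH_NONE;
}

/* Possible reordering: check exec state first so the warning can fire. */
static enum sketch_mode pick_reordered(bool channel, bool exec_state,
                                       bool *warned)
{
    assert(warned);
    *warned = false;
    if (exec_state) {
        *warned = channel;  /* a channel was given but will be ignored */
        return SKETCH_EXEC;
    } else if (channel) {
        return SKETCH_TRANSFER;
    }
    return SKETCH_NONE;
}
```

Note the reordering also changes which mode wins when both a channel and exec
state are present, so it is a behavior change and not only a cleanup.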

-- 
Peter Xu
