qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
From: Steven Sistare <steven.sistare@oracle.com>
To: Peter Xu <peterx@redhat.com>
Cc: qemu-devel@nongnu.org, Fabiano Rosas <farosas@suse.de>,
	Markus Armbruster <armbru@redhat.com>,
	Paolo Bonzini <pbonzini@redhat.com>,
	Eric Blake <eblake@redhat.com>,
	"Dr. David Alan Gilbert" <dave@treblig.org>,
	Cedric Le Goater <clg@redhat.com>,
	Alex Williamson <alex.williamson@redhat.com>
Subject: Re: [PATCH V4 6/8] migration: cpr-exec mode
Date: Tue, 30 Sep 2025 14:29:37 -0400	[thread overview]
Message-ID: <71126cc7-934a-4a68-a3c8-a8aa4f113151@oracle.com> (raw)
In-Reply-To: <aNwfakXJCoptr1p8@x1.local>

On 9/30/2025 2:20 PM, Peter Xu wrote:
> On Tue, Sep 30, 2025 at 01:18:34PM -0400, Steven Sistare wrote:
>> On 9/30/2025 12:39 PM, Peter Xu wrote:
>>> On Mon, Sep 22, 2025 at 06:49:43AM -0700, Steve Sistare wrote:
>>>> Add the cpr-exec migration mode.  Usage:
>>>>     qemu-system-$arch -machine aux-ram-share=on ...
>>>>     migrate_set_parameter mode cpr-exec
>>>>     migrate_set_parameter cpr-exec-command \
>>>>       <arg1> <arg2> ... -incoming <uri-1> \
>>>>     migrate -d <uri-1>
>>>>
>>>> The migrate command stops the VM, saves state to uri-1,
>>>> directly exec's a new version of QEMU on the same host,
>>>> replacing the original process while retaining its PID, and
>>>> loads state from uri-1.  Guest RAM is preserved in place,
>>>> albeit with new virtual addresses.
>>>>
>>>> The new QEMU process is started by exec'ing the command
>>>> specified by the @cpr-exec-command parameter.  The first word of
>>>> the command is the binary, and the remaining words are its
>>>> arguments.  The command may be a direct invocation of new QEMU,
>>>> or may be a non-QEMU command that exec's the new QEMU binary.
>>>>
>>>> This mode creates a second migration channel that is not visible
>>>> to the user.  At the start of migration, old QEMU saves CPR state
>>>> to the second channel, and at the end of migration, it tells the
>>>> main loop to call cpr_exec.  New QEMU loads CPR state early, before
>>>> objects are created.
>>>>
>>>> Because old QEMU terminates when new QEMU starts, one cannot
>>>> stream data between the two, so uri-1 must be a type,
>>>> such as a file, that accepts all data before old QEMU exits.
>>>> Otherwise, old QEMU may quietly block writing to the channel.
>>>>
>>>> Memory-backend objects must have the share=on attribute, but
>>>> memory-backend-epc is not supported.  The VM must be started with
>>>> the '-machine aux-ram-share=on' option, which allows anonymous
>>>> memory to be transferred in place to the new process.  The memfds
>>>> are kept open across exec by clearing the close-on-exec flag, their
>>>> values are saved in CPR state, and they are mmap'd in new QEMU.
>>>>
>>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>>> Acked-by: Markus Armbruster <armbru@redhat.com>
>>>> ---
>>>>    qapi/migration.json       | 25 +++++++++++++-
>>>>    include/migration/cpr.h   |  1 +
>>>>    migration/cpr-exec.c      | 84 +++++++++++++++++++++++++++++++++++++++++++++++
>>>>    migration/cpr.c           | 28 ++++++++++++++--
>>>>    migration/migration.c     | 10 +++++-
>>>>    migration/ram.c           |  1 +
>>>>    migration/vmstate-types.c |  8 +++++
>>>>    system/vl.c               |  4 ++-
>>>>    migration/trace-events    |  1 +
>>>>    9 files changed, 157 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/qapi/migration.json b/qapi/migration.json
>>>> index 2be8fa1..be0f3fc 100644
>>>> --- a/qapi/migration.json
>>>> +++ b/qapi/migration.json
>>>> @@ -694,9 +694,32 @@
>>>>    #     until you issue the `migrate-incoming` command.
>>>>    #
>>>>    #     (since 10.0)
>>>> +#
>>>> +# @cpr-exec: The migrate command stops the VM, saves state to the
>>>> +#     migration channel, directly exec's a new version of QEMU on the
>>>> +#     same host, replacing the original process while retaining its
>>>> +#     PID, and loads state from the channel.  Guest RAM is preserved
>>>> +#     in place.  Devices and their pinned pages are also preserved for
>>>> +#     VFIO and IOMMUFD.
>>>> +#
>>>> +#     Old QEMU starts new QEMU by exec'ing the command specified by
>>>> +#     the @cpr-exec-command parameter.  The command may be a direct
>>>> +#     invocation of new QEMU, or may be a wrapper that exec's the new
>>>> +#     QEMU binary.
>>>> +#
>>>> +#     Because old QEMU terminates when new QEMU starts, one cannot
>>>> +#     stream data between the two, so the channel must be a type,
>>>> +#     such as a file, that accepts all data before old QEMU exits.
>>>> +#     Otherwise, old QEMU may quietly block writing to the channel.
>>>> +#
>>>> +#     Memory-backend objects must have the share=on attribute, but
>>>> +#     memory-backend-epc is not supported.  The VM must be started
>>>> +#     with the '-machine aux-ram-share=on' option.
>>>> +#
>>>> +#     (since 10.2)
>>>>    ##
>>>>    { 'enum': 'MigMode',
>>>> -  'data': [ 'normal', 'cpr-reboot', 'cpr-transfer' ] }
>>>> +  'data': [ 'normal', 'cpr-reboot', 'cpr-transfer', 'cpr-exec' ] }
>>>>    ##
>>>>    # @ZeroPageDetection:
>>>> diff --git a/include/migration/cpr.h b/include/migration/cpr.h
>>>> index b84389f..beed392 100644
>>>> --- a/include/migration/cpr.h
>>>> +++ b/include/migration/cpr.h
>>>> @@ -53,6 +53,7 @@ int cpr_get_fd_param(const char *name, const char *fdname, int index,
>>>>    QEMUFile *cpr_transfer_output(MigrationChannel *channel, Error **errp);
>>>>    QEMUFile *cpr_transfer_input(MigrationChannel *channel, Error **errp);
>>>> +void cpr_exec_init(void);
>>>>    QEMUFile *cpr_exec_output(Error **errp);
>>>>    QEMUFile *cpr_exec_input(Error **errp);
>>>>    void cpr_exec_persist_state(QEMUFile *f);
>>>> diff --git a/migration/cpr-exec.c b/migration/cpr-exec.c
>>>> index 2c32e9c..8cf55a3 100644
>>>> --- a/migration/cpr-exec.c
>>>> +++ b/migration/cpr-exec.c
>>>> @@ -6,15 +6,21 @@
>>>>    #include "qemu/osdep.h"
>>>>    #include "qemu/cutils.h"
>>>> +#include "qemu/error-report.h"
>>>>    #include "qemu/memfd.h"
>>>>    #include "qapi/error.h"
>>>> +#include "qapi/type-helpers.h"
>>>>    #include "io/channel-file.h"
>>>>    #include "io/channel-socket.h"
>>>> +#include "block/block-global-state.h"
>>>> +#include "qemu/main-loop.h"
>>>>    #include "migration/cpr.h"
>>>>    #include "migration/qemu-file.h"
>>>> +#include "migration/migration.h"
>>>>    #include "migration/misc.h"
>>>>    #include "migration/vmstate.h"
>>>>    #include "system/runstate.h"
>>>> +#include "trace.h"
>>>>    #define CPR_EXEC_STATE_NAME "QEMU_CPR_EXEC_STATE"
>>>> @@ -92,3 +98,81 @@ QEMUFile *cpr_exec_input(Error **errp)
>>>>        lseek(mfd, 0, SEEK_SET);
>>>>        return qemu_file_new_fd_input(mfd, CPR_EXEC_STATE_NAME);
>>>>    }
>>>> +
>>>> +static bool preserve_fd(int fd)
>>>> +{
>>>> +    qemu_clear_cloexec(fd);
>>>> +    return true;
>>>> +}
>>>> +
>>>> +static bool unpreserve_fd(int fd)
>>>> +{
>>>> +    qemu_set_cloexec(fd);
>>>> +    return true;
>>>> +}
>>>> +
>>>> +static void cpr_exec_cb(void *opaque)
>>>> +{
>>>> +    MigrationState *s = migrate_get_current();
>>>> +    char **argv = strv_from_str_list(s->parameters.cpr_exec_command);
>>>> +    Error *err = NULL;
>>>> +
>>>> +    /*
>>>> +     * Clear the close-on-exec flag for all preserved fd's.  We cannot do so
>>>> +     * earlier because they should not persist across miscellaneous fork and
>>>> +     * exec calls that are performed during normal operation.
>>>> +     */
>>>> +    cpr_walk_fd(preserve_fd);
>>>> +
>>>> +    trace_cpr_exec();
>>>> +    execvp(argv[0], argv);
>>>> +
>>>> +    /*
>>>> +     * exec should only fail if argv[0] is bogus, or has a permissions problem,
>>>> +     * or the system is very short on resources.
>>>> +     */
>>>> +    g_strfreev(argv);
>>>> +    cpr_walk_fd(unpreserve_fd);
>>>> +
>>>> +    error_setg_errno(&err, errno, "execvp %s failed", argv[0]);
>>>> +    error_report_err(error_copy(err));
>>>> +    migrate_set_state(&s->state, s->state, MIGRATION_STATUS_FAILED);
>>>
>>> I believe this is the only place we can have the state machine from
>>> COMPLETED->FAILED.  It's pretty hacky.  Maybe add a quick comment?
>>
>> OK.
>>>> +    migrate_set_error(s, err);
>>>> +
>>>> +    migration_call_notifiers(s, MIG_EVENT_PRECOPY_FAILED, NULL);
>>>> +
>>>> +    err = NULL;
>>>> +    if (!migration_block_activate(&err)) {
>>>> +        /* error was already reported */
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    if (runstate_is_live(s->vm_old_state)) {
>>>> +        vm_start();
>>>> +    }
>>>
>>> We have rollback logic in migration_iteration_finish().  Make a small
>>> helper and reuse the code?
>> Hmm.  There is some overlap, but also subtle differences.  For so littlecode, it does not feel worth any risk of regression (or worth the time to
>> test and verify all conditions).
> 
> We have a fix not yet landed but should likely land soon one way or
> another:
> 
> https://lore.kernel.org/all/20250915115918.3520735-2-jmarcin@redhat.com/
> 
> That should close one gap.
> 
> There's definitely reasons on sharing code, e.g. when we fix the path we
> fix all users, not one.  We also don't make mistake in one path but not in
> the other.  One solid example is, I feel like err is leaked above..
> 
> I'm fine if you prefer landing this first, but I'll still suggest a cleanup
> on top after above patch lands.

OK, let's do that.

>>>> +}
>>>> +
>>>> +static int cpr_exec_notifier(NotifierWithReturn *notifier, MigrationEvent *e,
>>>> +                             Error **errp)
>>>> +{
>>>> +    MigrationState *s = migrate_get_current();
>>>> +
>>>> +    if (e->type == MIG_EVENT_PRECOPY_DONE) {
>>>> +        QEMUBH *cpr_exec_bh = qemu_bh_new(cpr_exec_cb, NULL);
>>>> +        assert(s->state == MIGRATION_STATUS_COMPLETED);
>>>> +        qemu_bh_schedule(cpr_exec_bh);
>>>> +        qemu_notify_event();
>>>> +
>>>
>>> Newline can be dropped.
>> OK.
>>
>>>> +    } else if (e->type == MIG_EVENT_PRECOPY_FAILED) {
>>>> +        cpr_exec_unpersist_state();
>>>> +    }
>>>> +    return 0;
>>>> +}
>>>> +
>>>> +void cpr_exec_init(void)
>>>> +{
>>>> +    static NotifierWithReturn exec_notifier;
>>>> +
>>>> +    migration_add_notifier_mode(&exec_notifier, cpr_exec_notifier,
>>>> +                                MIG_MODE_CPR_EXEC);
>>>> +}
>>>> diff --git a/migration/cpr.c b/migration/cpr.c
>>>> index d3e370e..eea3773 100644
>>>> --- a/migration/cpr.c
>>>> +++ b/migration/cpr.c
>>>> @@ -185,6 +185,8 @@ int cpr_state_save(MigrationChannel *channel, Error **errp)
>>>>        if (mode == MIG_MODE_CPR_TRANSFER) {
>>>>            g_assert(channel);
>>>>            f = cpr_transfer_output(channel, errp);
>>>> +    } else if (mode == MIG_MODE_CPR_EXEC) {
>>>> +        f = cpr_exec_output(errp);
>>>>        } else {
>>>>            return 0;
>>>>        }
>>>> @@ -202,6 +204,10 @@ int cpr_state_save(MigrationChannel *channel, Error **errp)
>>>>            return ret;
>>>>        }
>>>> +    if (migrate_mode() == MIG_MODE_CPR_EXEC) {
>>>> +        cpr_exec_persist_state(f);
>>>> +    }
>>>> +
>>>>        /*
>>>>         * Close the socket only partially so we can later detect when the other
>>>>         * end closes by getting a HUP event.
>>>> @@ -213,6 +219,12 @@ int cpr_state_save(MigrationChannel *channel, Error **errp)
>>>>        return 0;
>>>>    }
>>>> +static bool unpreserve_fd(int fd)
>>>> +{
>>>> +    qemu_set_cloexec(fd);
>>>> +    return true;
>>>> +}
>>>
>>> Is this function defined twice?
>>
>> Yes, since it is tiny.  I judged that defining this small helper twice, near each
>> of its call sites, was better for the reader.
> 
> I still think we should avoid doing that.
> 
> Btw, I even think this helper should be removed on both places because
> they're almost only used for a cpr_walk_fd() context, so instead looks like
> we need cpr_unpreserve_fds(), which does:
> 
>          cpr_walk_fd(unpreserve_fd);
> 
> Then it can be defined in cpr.c once and export it in cpr.h.  Would that be
> better?

OK, I'll do that.

- Steve



  reply	other threads:[~2025-09-30 18:31 UTC|newest]

Thread overview: 30+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-09-22 13:49 [PATCH V4 0/8] Live update: cpr-exec Steve Sistare
2025-09-22 13:49 ` [PATCH V4 1/8] migration: multi-mode notifier Steve Sistare
2025-09-22 15:18   ` Cédric Le Goater
2025-09-24 18:15     ` Steven Sistare
2025-09-22 13:49 ` [PATCH V4 2/8] migration: add cpr_walk_fd Steve Sistare
2025-09-22 13:49 ` [PATCH V4 3/8] oslib: qemu_clear_cloexec Steve Sistare
2025-09-22 13:49 ` [PATCH V4 4/8] migration: cpr-exec-command parameter Steve Sistare
2025-09-22 13:49 ` [PATCH V4 5/8] migration: cpr-exec save and load Steve Sistare
2025-09-22 16:00   ` Cédric Le Goater
2025-09-24 18:16     ` Steven Sistare
2025-09-25  7:11       ` Cédric Le Goater
2025-09-25 20:38         ` Steven Sistare
2025-09-30 16:19         ` Peter Xu
2025-09-30 16:39           ` Steven Sistare
2025-09-22 13:49 ` [PATCH V4 6/8] migration: cpr-exec mode Steve Sistare
2025-09-22 15:28   ` Cédric Le Goater
2025-09-24 18:16     ` Steven Sistare
2025-09-25  7:12       ` Cédric Le Goater
2025-09-30 16:39   ` Peter Xu
2025-09-30 17:18     ` Steven Sistare
2025-09-30 18:20       ` Peter Xu
2025-09-30 18:29         ` Steven Sistare [this message]
2025-09-22 13:49 ` [PATCH V4 7/8] migration: cpr-exec docs Steve Sistare
2025-09-22 13:49 ` [PATCH V4 8/8] vfio: cpr-exec mode Steve Sistare
2025-09-22 15:28   ` Cédric Le Goater
2025-09-30 15:28 ` [PATCH V4 0/8] Live update: cpr-exec Steven Sistare
2025-09-30 16:42   ` Peter Xu
2025-09-30 16:52     ` Steven Sistare
2025-09-30 19:49     ` Steven Sistare
2025-09-30 20:40       ` Peter Xu

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=71126cc7-934a-4a68-a3c8-a8aa4f113151@oracle.com \
    --to=steven.sistare@oracle.com \
    --cc=alex.williamson@redhat.com \
    --cc=armbru@redhat.com \
    --cc=clg@redhat.com \
    --cc=dave@treblig.org \
    --cc=eblake@redhat.com \
    --cc=farosas@suse.de \
    --cc=pbonzini@redhat.com \
    --cc=peterx@redhat.com \
    --cc=qemu-devel@nongnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).