From: Steven Sistare <steven.sistare@oracle.com>
To: Peter Xu <peterx@redhat.com>
Cc: qemu-devel@nongnu.org, Fabiano Rosas <farosas@suse.de>,
Markus Armbruster <armbru@redhat.com>,
Paolo Bonzini <pbonzini@redhat.com>,
Eric Blake <eblake@redhat.com>,
"Dr. David Alan Gilbert" <dave@treblig.org>,
Cedric Le Goater <clg@redhat.com>,
Alex Williamson <alex.williamson@redhat.com>
Subject: Re: [PATCH V4 6/8] migration: cpr-exec mode
Date: Tue, 30 Sep 2025 14:29:37 -0400 [thread overview]
Message-ID: <71126cc7-934a-4a68-a3c8-a8aa4f113151@oracle.com> (raw)
In-Reply-To: <aNwfakXJCoptr1p8@x1.local>
On 9/30/2025 2:20 PM, Peter Xu wrote:
> On Tue, Sep 30, 2025 at 01:18:34PM -0400, Steven Sistare wrote:
>> On 9/30/2025 12:39 PM, Peter Xu wrote:
>>> On Mon, Sep 22, 2025 at 06:49:43AM -0700, Steve Sistare wrote:
>>>> Add the cpr-exec migration mode. Usage:
>>>> qemu-system-$arch -machine aux-ram-share=on ...
>>>> migrate_set_parameter mode cpr-exec
>>>> migrate_set_parameter cpr-exec-command \
>>>> <arg1> <arg2> ... -incoming <uri-1> \
>>>> migrate -d <uri-1>
>>>>
>>>> The migrate command stops the VM, saves state to uri-1,
>>>> directly exec's a new version of QEMU on the same host,
>>>> replacing the original process while retaining its PID, and
>>>> loads state from uri-1. Guest RAM is preserved in place,
>>>> albeit with new virtual addresses.
>>>>
>>>> The new QEMU process is started by exec'ing the command
>>>> specified by the @cpr-exec-command parameter. The first word of
>>>> the command is the binary, and the remaining words are its
>>>> arguments. The command may be a direct invocation of new QEMU,
>>>> or may be a non-QEMU command that exec's the new QEMU binary.
>>>>
>>>> This mode creates a second migration channel that is not visible
>>>> to the user. At the start of migration, old QEMU saves CPR state
>>>> to the second channel, and at the end of migration, it tells the
>>>> main loop to call cpr_exec. New QEMU loads CPR state early, before
>>>> objects are created.
>>>>
>>>> Because old QEMU terminates when new QEMU starts, one cannot
>>>> stream data between the two, so uri-1 must be a type,
>>>> such as a file, that accepts all data before old QEMU exits.
>>>> Otherwise, old QEMU may quietly block writing to the channel.
>>>>
>>>> Memory-backend objects must have the share=on attribute, but
>>>> memory-backend-epc is not supported. The VM must be started with
>>>> the '-machine aux-ram-share=on' option, which allows anonymous
>>>> memory to be transferred in place to the new process. The memfds
>>>> are kept open across exec by clearing the close-on-exec flag, their
>>>> values are saved in CPR state, and they are mmap'd in new QEMU.
>>>>
>>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>>> Acked-by: Markus Armbruster <armbru@redhat.com>
>>>> ---
>>>> qapi/migration.json | 25 +++++++++++++-
>>>> include/migration/cpr.h | 1 +
>>>> migration/cpr-exec.c | 84 +++++++++++++++++++++++++++++++++++++++++++++++
>>>> migration/cpr.c | 28 ++++++++++++++--
>>>> migration/migration.c | 10 +++++-
>>>> migration/ram.c | 1 +
>>>> migration/vmstate-types.c | 8 +++++
>>>> system/vl.c | 4 ++-
>>>> migration/trace-events | 1 +
>>>> 9 files changed, 157 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/qapi/migration.json b/qapi/migration.json
>>>> index 2be8fa1..be0f3fc 100644
>>>> --- a/qapi/migration.json
>>>> +++ b/qapi/migration.json
>>>> @@ -694,9 +694,32 @@
>>>> # until you issue the `migrate-incoming` command.
>>>> #
>>>> # (since 10.0)
>>>> +#
>>>> +# @cpr-exec: The migrate command stops the VM, saves state to the
>>>> +# migration channel, directly exec's a new version of QEMU on the
>>>> +# same host, replacing the original process while retaining its
>>>> +# PID, and loads state from the channel. Guest RAM is preserved
>>>> +# in place. Devices and their pinned pages are also preserved for
>>>> +# VFIO and IOMMUFD.
>>>> +#
>>>> +# Old QEMU starts new QEMU by exec'ing the command specified by
>>>> +# the @cpr-exec-command parameter. The command may be a direct
>>>> +# invocation of new QEMU, or may be a wrapper that exec's the new
>>>> +# QEMU binary.
>>>> +#
>>>> +# Because old QEMU terminates when new QEMU starts, one cannot
>>>> +# stream data between the two, so the channel must be a type,
>>>> +# such as a file, that accepts all data before old QEMU exits.
>>>> +# Otherwise, old QEMU may quietly block writing to the channel.
>>>> +#
>>>> +# Memory-backend objects must have the share=on attribute, but
>>>> +# memory-backend-epc is not supported. The VM must be started
>>>> +# with the '-machine aux-ram-share=on' option.
>>>> +#
>>>> +# (since 10.2)
>>>> ##
>>>> { 'enum': 'MigMode',
>>>> - 'data': [ 'normal', 'cpr-reboot', 'cpr-transfer' ] }
>>>> + 'data': [ 'normal', 'cpr-reboot', 'cpr-transfer', 'cpr-exec' ] }
>>>> ##
>>>> # @ZeroPageDetection:
>>>> diff --git a/include/migration/cpr.h b/include/migration/cpr.h
>>>> index b84389f..beed392 100644
>>>> --- a/include/migration/cpr.h
>>>> +++ b/include/migration/cpr.h
>>>> @@ -53,6 +53,7 @@ int cpr_get_fd_param(const char *name, const char *fdname, int index,
>>>> QEMUFile *cpr_transfer_output(MigrationChannel *channel, Error **errp);
>>>> QEMUFile *cpr_transfer_input(MigrationChannel *channel, Error **errp);
>>>> +void cpr_exec_init(void);
>>>> QEMUFile *cpr_exec_output(Error **errp);
>>>> QEMUFile *cpr_exec_input(Error **errp);
>>>> void cpr_exec_persist_state(QEMUFile *f);
>>>> diff --git a/migration/cpr-exec.c b/migration/cpr-exec.c
>>>> index 2c32e9c..8cf55a3 100644
>>>> --- a/migration/cpr-exec.c
>>>> +++ b/migration/cpr-exec.c
>>>> @@ -6,15 +6,21 @@
>>>> #include "qemu/osdep.h"
>>>> #include "qemu/cutils.h"
>>>> +#include "qemu/error-report.h"
>>>> #include "qemu/memfd.h"
>>>> #include "qapi/error.h"
>>>> +#include "qapi/type-helpers.h"
>>>> #include "io/channel-file.h"
>>>> #include "io/channel-socket.h"
>>>> +#include "block/block-global-state.h"
>>>> +#include "qemu/main-loop.h"
>>>> #include "migration/cpr.h"
>>>> #include "migration/qemu-file.h"
>>>> +#include "migration/migration.h"
>>>> #include "migration/misc.h"
>>>> #include "migration/vmstate.h"
>>>> #include "system/runstate.h"
>>>> +#include "trace.h"
>>>> #define CPR_EXEC_STATE_NAME "QEMU_CPR_EXEC_STATE"
>>>> @@ -92,3 +98,81 @@ QEMUFile *cpr_exec_input(Error **errp)
>>>> lseek(mfd, 0, SEEK_SET);
>>>> return qemu_file_new_fd_input(mfd, CPR_EXEC_STATE_NAME);
>>>> }
>>>> +
>>>> +static bool preserve_fd(int fd)
>>>> +{
>>>> + qemu_clear_cloexec(fd);
>>>> + return true;
>>>> +}
>>>> +
>>>> +static bool unpreserve_fd(int fd)
>>>> +{
>>>> + qemu_set_cloexec(fd);
>>>> + return true;
>>>> +}
>>>> +
>>>> +static void cpr_exec_cb(void *opaque)
>>>> +{
>>>> + MigrationState *s = migrate_get_current();
>>>> + char **argv = strv_from_str_list(s->parameters.cpr_exec_command);
>>>> + Error *err = NULL;
>>>> +
>>>> + /*
>>>> + * Clear the close-on-exec flag for all preserved fd's. We cannot do so
>>>> + * earlier because they should not persist across miscellaneous fork and
>>>> + * exec calls that are performed during normal operation.
>>>> + */
>>>> + cpr_walk_fd(preserve_fd);
>>>> +
>>>> + trace_cpr_exec();
>>>> + execvp(argv[0], argv);
>>>> +
>>>> + /*
>>>> + * exec should only fail if argv[0] is bogus, or has a permissions problem,
>>>> + * or the system is very short on resources.
>>>> + */
>>>> + g_strfreev(argv);
>>>> + cpr_walk_fd(unpreserve_fd);
>>>> +
>>>> + error_setg_errno(&err, errno, "execvp %s failed", argv[0]);
>>>> + error_report_err(error_copy(err));
>>>> + migrate_set_state(&s->state, s->state, MIGRATION_STATUS_FAILED);
>>>
>>> I believe this is the only place we can have the state machine from
>>> COMPLETED->FAILED. It's pretty hacky. Maybe add a quick comment?
>>
>> OK.
>>>> + migrate_set_error(s, err);
>>>> +
>>>> + migration_call_notifiers(s, MIG_EVENT_PRECOPY_FAILED, NULL);
>>>> +
>>>> + err = NULL;
>>>> + if (!migration_block_activate(&err)) {
>>>> + /* error was already reported */
>>>> + return;
>>>> + }
>>>> +
>>>> + if (runstate_is_live(s->vm_old_state)) {
>>>> + vm_start();
>>>> + }
>>>
>>> We have rollback logic in migration_iteration_finish(). Make a small
>>> helper and reuse the code?
>> Hmm. There is some overlap, but also subtle differences. For so littlecode, it does not feel worth any risk of regression (or worth the time to
>> test and verify all conditions).
>
> We have a fix not yet landed but should likely land soon one way or
> another:
>
> https://lore.kernel.org/all/20250915115918.3520735-2-jmarcin@redhat.com/
>
> That should close one gap.
>
> There's definitely reasons on sharing code, e.g. when we fix the path we
> fix all users, not one. We also don't make mistake in one path but not in
> the other. One solid example is, I feel like err is leaked above..
>
> I'm fine if you prefer landing this first, but I'll still suggest a cleanup
> on top after above patch lands.
OK, let's do that.
>>>> +}
>>>> +
>>>> +static int cpr_exec_notifier(NotifierWithReturn *notifier, MigrationEvent *e,
>>>> + Error **errp)
>>>> +{
>>>> + MigrationState *s = migrate_get_current();
>>>> +
>>>> + if (e->type == MIG_EVENT_PRECOPY_DONE) {
>>>> + QEMUBH *cpr_exec_bh = qemu_bh_new(cpr_exec_cb, NULL);
>>>> + assert(s->state == MIGRATION_STATUS_COMPLETED);
>>>> + qemu_bh_schedule(cpr_exec_bh);
>>>> + qemu_notify_event();
>>>> +
>>>
>>> Newline can be dropped.
>> OK.
>>
>>>> + } else if (e->type == MIG_EVENT_PRECOPY_FAILED) {
>>>> + cpr_exec_unpersist_state();
>>>> + }
>>>> + return 0;
>>>> +}
>>>> +
>>>> +void cpr_exec_init(void)
>>>> +{
>>>> + static NotifierWithReturn exec_notifier;
>>>> +
>>>> + migration_add_notifier_mode(&exec_notifier, cpr_exec_notifier,
>>>> + MIG_MODE_CPR_EXEC);
>>>> +}
>>>> diff --git a/migration/cpr.c b/migration/cpr.c
>>>> index d3e370e..eea3773 100644
>>>> --- a/migration/cpr.c
>>>> +++ b/migration/cpr.c
>>>> @@ -185,6 +185,8 @@ int cpr_state_save(MigrationChannel *channel, Error **errp)
>>>> if (mode == MIG_MODE_CPR_TRANSFER) {
>>>> g_assert(channel);
>>>> f = cpr_transfer_output(channel, errp);
>>>> + } else if (mode == MIG_MODE_CPR_EXEC) {
>>>> + f = cpr_exec_output(errp);
>>>> } else {
>>>> return 0;
>>>> }
>>>> @@ -202,6 +204,10 @@ int cpr_state_save(MigrationChannel *channel, Error **errp)
>>>> return ret;
>>>> }
>>>> + if (migrate_mode() == MIG_MODE_CPR_EXEC) {
>>>> + cpr_exec_persist_state(f);
>>>> + }
>>>> +
>>>> /*
>>>> * Close the socket only partially so we can later detect when the other
>>>> * end closes by getting a HUP event.
>>>> @@ -213,6 +219,12 @@ int cpr_state_save(MigrationChannel *channel, Error **errp)
>>>> return 0;
>>>> }
>>>> +static bool unpreserve_fd(int fd)
>>>> +{
>>>> + qemu_set_cloexec(fd);
>>>> + return true;
>>>> +}
>>>
>>> Is this function defined twice?
>>
>> Yes, since it is tiny. I judged that defining this small helper twice, near each
>> of its call sites, was better for the reader.
>
> I still think we should avoid doing that.
>
> Btw, I even think this helper should be removed on both places because
> they're almost only used for a cpr_walk_fd() context, so instead looks like
> we need cpr_unpreserve_fds(), which does:
>
> cpr_walk_fd(unpreserve_fd);
>
> Then it can be defined in cpr.c once and export it in cpr.h. Would that be
> better?
OK, I'll do that.
- Steve
next prev parent reply other threads:[~2025-09-30 18:31 UTC|newest]
Thread overview: 30+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-09-22 13:49 [PATCH V4 0/8] Live update: cpr-exec Steve Sistare
2025-09-22 13:49 ` [PATCH V4 1/8] migration: multi-mode notifier Steve Sistare
2025-09-22 15:18 ` Cédric Le Goater
2025-09-24 18:15 ` Steven Sistare
2025-09-22 13:49 ` [PATCH V4 2/8] migration: add cpr_walk_fd Steve Sistare
2025-09-22 13:49 ` [PATCH V4 3/8] oslib: qemu_clear_cloexec Steve Sistare
2025-09-22 13:49 ` [PATCH V4 4/8] migration: cpr-exec-command parameter Steve Sistare
2025-09-22 13:49 ` [PATCH V4 5/8] migration: cpr-exec save and load Steve Sistare
2025-09-22 16:00 ` Cédric Le Goater
2025-09-24 18:16 ` Steven Sistare
2025-09-25 7:11 ` Cédric Le Goater
2025-09-25 20:38 ` Steven Sistare
2025-09-30 16:19 ` Peter Xu
2025-09-30 16:39 ` Steven Sistare
2025-09-22 13:49 ` [PATCH V4 6/8] migration: cpr-exec mode Steve Sistare
2025-09-22 15:28 ` Cédric Le Goater
2025-09-24 18:16 ` Steven Sistare
2025-09-25 7:12 ` Cédric Le Goater
2025-09-30 16:39 ` Peter Xu
2025-09-30 17:18 ` Steven Sistare
2025-09-30 18:20 ` Peter Xu
2025-09-30 18:29 ` Steven Sistare [this message]
2025-09-22 13:49 ` [PATCH V4 7/8] migration: cpr-exec docs Steve Sistare
2025-09-22 13:49 ` [PATCH V4 8/8] vfio: cpr-exec mode Steve Sistare
2025-09-22 15:28 ` Cédric Le Goater
2025-09-30 15:28 ` [PATCH V4 0/8] Live update: cpr-exec Steven Sistare
2025-09-30 16:42 ` Peter Xu
2025-09-30 16:52 ` Steven Sistare
2025-09-30 19:49 ` Steven Sistare
2025-09-30 20:40 ` Peter Xu
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=71126cc7-934a-4a68-a3c8-a8aa4f113151@oracle.com \
--to=steven.sistare@oracle.com \
--cc=alex.williamson@redhat.com \
--cc=armbru@redhat.com \
--cc=clg@redhat.com \
--cc=dave@treblig.org \
--cc=eblake@redhat.com \
--cc=farosas@suse.de \
--cc=pbonzini@redhat.com \
--cc=peterx@redhat.com \
--cc=qemu-devel@nongnu.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).