* [PATCH 0/4] migration: Pass network packets received during switchover to dest VM
@ 2026-01-27 14:03 Juraj Marcin
From: Juraj Marcin @ 2026-01-27 14:03 UTC
To: qemu-devel
Cc: Juraj Marcin, Fabiano Rosas, Michael S. Tsirkin, Peter Xu,
Jason Wang, Vladimir Sementsov-Ogievskiy
During switchover there is a period during which both the source and
destination VMs are paused. During this period, all network packets
are still routed to the source side, which will never process them.
Once the destination resumes, it is not aware of these packets and they
are lost. This can cause packet loss in unreliable protocols and
extended delays due to retransmission in reliable protocols.

This series resolves this problem by caching packets received once the
source VM pauses and then passing them to the destination side, where
they are injected. This feature is implemented in the last patch. The
caching and injecting are implemented using the network filter interface
and should work with any backend with vhost=off, but only the TAP network
backend was explicitly tested.

This series also introduces an RP_VM_STARTED message on the return-path
channel, which is used to correctly calculate downtime for both precopy
and postcopy, and also as a trigger for netpass to forward packets to
the destination. With more data sent through the migration channel after
the destination VM starts, using RP_SHUT wouldn't be accurate anymore,
and in postcopy the downtime calculation was always incorrect.

As netpass requires the return-path capability, its capability is also off
by default, but I am open to discussion about making it on by default
as long as return-path is enabled (i.e. enabling return-path would also
enable netpass unless it is explicitly disabled).
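
The cache-then-inject idea described above can be sketched in isolation. This is a minimal illustration under stated assumptions, not the code from this series: `PacketCache`, `cache_push`, `cache_replay`, `cache_clear` and `sum_sizes` are hypothetical names, and the record layout (a 32-bit length followed by the payload) is an assumption chosen for simplicity.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a per-backend packet cache (not QEMU code). */
typedef struct PacketCache {
    uint8_t *buf;      /* records stored back to back: [u32 size][payload] */
    uint32_t length;   /* bytes currently used */
    uint32_t capacity; /* bytes currently allocated */
    size_t count;      /* number of cached packets */
} PacketCache;

/* Append one packet; returns 0 on success, -1 on overflow or OOM. */
static int cache_push(PacketCache *c, const uint8_t *pkt, uint32_t size)
{
    uint64_t need = (uint64_t)c->length + sizeof(uint32_t) + size;

    if (need > UINT32_MAX) {
        return -1;
    }
    if (need > c->capacity) {
        uint64_t cap = c->capacity ? c->capacity : 4096;
        while (cap < need) {
            cap *= 2;
        }
        if (cap > UINT32_MAX) {
            cap = UINT32_MAX;
        }
        uint8_t *nb = realloc(c->buf, (size_t)cap);
        if (!nb) {
            return -1;
        }
        c->buf = nb;
        c->capacity = (uint32_t)cap;
    }
    memcpy(c->buf + c->length, &size, sizeof(size));
    if (size) {
        memcpy(c->buf + c->length + sizeof(size), pkt, size);
    }
    c->length = (uint32_t)need;
    c->count++;
    return 0;
}

typedef void (*inject_fn)(const uint8_t *pkt, uint32_t size, void *opaque);

/* Replay packets in arrival order; returns how many were delivered. */
static size_t cache_replay(const PacketCache *c, inject_fn inject, void *opaque)
{
    uint32_t off = 0;
    size_t delivered = 0;

    while ((uint64_t)off + sizeof(uint32_t) <= c->length) {
        uint32_t size;
        memcpy(&size, c->buf + off, sizeof(size));
        off += sizeof(size);
        if (size > c->length - off) {
            break; /* truncated record, stop replaying */
        }
        inject(c->buf + off, size, opaque);
        off += size;
        delivered++;
    }
    return delivered;
}

/* Release the buffer and reset all counters. */
static void cache_clear(PacketCache *c)
{
    free(c->buf);
    memset(c, 0, sizeof(*c));
}

/* Example inject callback: adds each payload size to a size_t total. */
static void sum_sizes(const uint8_t *pkt, uint32_t size, void *opaque)
{
    (void)pkt;
    *(size_t *)opaque += size;
}
```

In this sketch the source would call `cache_push()` for every frame seen while paused, and the destination would call `cache_replay()` with a callback that hands each frame to the NIC. Keeping everything in one flat buffer makes it simple to ship over a byte stream, at the cost of reallocations as it grows.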
Juraj Marcin (4):
migration/qemu-file: Add ability to clear error
migration: Introduce VM_STARTED return-path message
migration: Convert VMSD early_setup into VMStateSavePhase enum
migration: Pass network packets received during switchover to dest VM
hw/core/machine.c | 4 +-
hw/virtio/virtio-mem.c | 2 +-
include/migration/vmstate.h | 33 +++--
include/net/net.h | 5 +
migration/meson.build | 1 +
migration/migration.c | 83 +++++++++++-
migration/migration.h | 11 ++
migration/netpass.c | 246 ++++++++++++++++++++++++++++++++++++
migration/netpass.h | 14 ++
migration/options.c | 29 +++++
migration/options.h | 2 +
migration/qemu-file.c | 6 +
migration/qemu-file.h | 1 +
migration/savevm.c | 44 ++++++-
migration/savevm.h | 2 +
migration/trace-events | 9 ++
net/net.c | 11 ++
net/tap.c | 11 +-
qapi/migration.json | 7 +-
19 files changed, 501 insertions(+), 20 deletions(-)
create mode 100644 migration/netpass.c
create mode 100644 migration/netpass.h
--
2.52.0
* [PATCH 1/4] migration/qemu-file: Add ability to clear error
From: Juraj Marcin @ 2026-01-27 14:03 UTC
To: qemu-devel
Cc: Juraj Marcin, Fabiano Rosas, Michael S. Tsirkin, Peter Xu,
Jason Wang, Vladimir Sementsov-Ogievskiy

From: Juraj Marcin <jmarcin@redhat.com>

Signed-off-by: Juraj Marcin <jmarcin@redhat.com>
---
 migration/qemu-file.c | 6 ++++++
 migration/qemu-file.h | 1 +
 2 files changed, 7 insertions(+)

diff --git a/migration/qemu-file.c b/migration/qemu-file.c
index 9cf7dc3bd5..bdf6c73d3d 100644
--- a/migration/qemu-file.c
+++ b/migration/qemu-file.c
@@ -227,6 +227,12 @@ void qemu_file_set_error(QEMUFile *f, int ret)
     qemu_file_set_error_obj(f, ret, NULL);
 }
 
+void qemu_file_clear_error(QEMUFile *f)
+{
+    f->last_error = 0;
+    error_free(f->last_error_obj);
+}
+
 static bool qemu_file_is_writable(QEMUFile *f)
 {
     return f->is_writable;
diff --git a/migration/qemu-file.h b/migration/qemu-file.h
index a8e9bb2ccb..aa24196ffb 100644
--- a/migration/qemu-file.h
+++ b/migration/qemu-file.h
@@ -68,6 +68,7 @@ int qemu_file_get_error_obj_any(QEMUFile *f1, QEMUFile *f2, Error **errp);
 void qemu_file_set_error_obj(QEMUFile *f, int ret, Error *err);
 int qemu_file_get_error_obj(QEMUFile *f, Error **errp);
 void qemu_file_set_error(QEMUFile *f, int ret);
+void qemu_file_clear_error(QEMUFile *f);
 int qemu_file_shutdown(QEMUFile *f);
 QEMUFile *qemu_file_get_return_path(QEMUFile *f);
 int qemu_fflush(QEMUFile *f);
--
2.52.0
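
For readers unfamiliar with the "sticky error" pattern that patch 1/4 touches, here is a minimal standalone sketch. It is not QEMU code: `Stream`, `stream_set_error` and `stream_clear_error` are hypothetical names, and a plain heap string stands in for an `Error` object. The sketch also resets the pointer after freeing it, which is an assumption about what a reusable clear helper generally wants; the hunk in the patch frees the object without resetting the field.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical miniature of a stream with a sticky error (not QEMUFile). */
typedef struct Stream {
    int last_error;       /* 0 means no error; the first error wins */
    char *last_error_msg; /* owned heap string describing it, or NULL */
} Stream;

/* Small strdup replacement so the sketch only needs standard C. */
static char *dup_str(const char *s)
{
    size_t n = strlen(s) + 1;
    char *p = malloc(n);
    if (p) {
        memcpy(p, s, n);
    }
    return p;
}

/* Record an error only if none is pending, mirroring sticky semantics. */
static void stream_set_error(Stream *f, int ret, const char *msg)
{
    if (f->last_error == 0 && ret != 0) {
        f->last_error = ret;
        f->last_error_msg = msg ? dup_str(msg) : NULL;
    }
}

/* Clear the pending error so the stream can be used again. */
static void stream_clear_error(Stream *f)
{
    f->last_error = 0;
    free(f->last_error_msg);
    f->last_error_msg = NULL; /* avoids a dangling pointer on reuse */
}
```

With the pointer reset, clearing twice in a row or setting a new error after a clear is safe in this sketch.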
* [PATCH 2/4] migration: Introduce VM_STARTED return-path message
From: Juraj Marcin @ 2026-01-27 14:03 UTC
To: qemu-devel
Cc: Juraj Marcin, Fabiano Rosas, Michael S. Tsirkin, Peter Xu,
Jason Wang, Vladimir Sementsov-Ogievskiy

From: Juraj Marcin <jmarcin@redhat.com>

Currently there is no universal way for the destination to tell the
source it has started. In precopy it could be deduced from the RP_SHUT
message and in postcopy from the response to the ping just before the
POSTCOPY_RUN command, but neither method is precise. Moreover, there is
no way to send more data after the destination has started with precopy
migration.

This patch adds a new message type to the return-path which tells the
source that the destination VM has just started (or can be started if
autostart is false). The source VM can use this message to precisely
calculate the downtime regardless of whether postcopy is used and can
also send more data, for example network packets.

Signed-off-by: Juraj Marcin <jmarcin@redhat.com>
---
 hw/core/machine.c     |  4 +++-
 migration/migration.c | 34 ++++++++++++++++++++++++++++++----
 migration/migration.h |  9 +++++++++
 migration/options.c   |  8 ++++++++
 migration/options.h   |  1 +
 migration/savevm.c    |  3 +++
 6 files changed, 54 insertions(+), 5 deletions(-)

diff --git a/hw/core/machine.c b/hw/core/machine.c
index 6411e68856..dc73217a5f 100644
--- a/hw/core/machine.c
+++ b/hw/core/machine.c
@@ -38,7 +38,9 @@
 #include "hw/acpi/generic_event_device.h"
 #include "qemu/audio.h"
 
-GlobalProperty hw_compat_10_2[] = {};
+GlobalProperty hw_compat_10_2[] = {
+    { "migration", "send-vm-started", "off" },
+};
 const size_t hw_compat_10_2_len = G_N_ELEMENTS(hw_compat_10_2);
 
 GlobalProperty hw_compat_10_1[] = {
diff --git a/migration/migration.c b/migration/migration.c
index b103a82fc0..4871db2365 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -82,6 +82,7 @@ enum mig_rp_message_type {
     MIG_RP_MSG_RECV_BITMAP, /* send recved_bitmap back to source */
     MIG_RP_MSG_RESUME_ACK, /* tell source that we are ready to resume */
     MIG_RP_MSG_SWITCHOVER_ACK, /* Tell source it's OK to do switchover */
+    MIG_RP_MSG_VM_STARTED, /* tell source destination has started */
 
     MIG_RP_MSG_MAX
 };
@@ -750,6 +751,10 @@ static void process_incoming_migration_bh(void *opaque)
         runstate_set(global_state_get_runstate());
     }
     trace_vmstate_downtime_checkpoint("dst-precopy-bh-vm-started");
+    if (mis->to_src_file && migrate_send_vm_started()) {
+        migrate_send_rp_vm_started(mis);
+    }
+
     /*
      * This must happen after any state changes since as soon as an external
      * observer sees this event they might start to prod at the VM assuming
@@ -996,6 +1001,11 @@ void migrate_send_rp_resume_ack(MigrationIncomingState *mis, uint32_t value)
     migrate_send_rp_message(mis, MIG_RP_MSG_RESUME_ACK, sizeof(buf), &buf);
 }
 
+void migrate_send_rp_vm_started(MigrationIncomingState *mis)
+{
+    migrate_send_rp_message(mis, MIG_RP_MSG_VM_STARTED, 0, NULL);
+}
+
 bool migration_is_running(void)
 {
     MigrationState *s = current_migration;
@@ -1660,6 +1670,9 @@ int migrate_init(MigrationState *s, Error **errp)
     s->postcopy_package_loaded = false;
     qemu_event_reset(&s->postcopy_package_loaded_event);
 
+    s->dest_vm_started = false;
+    qemu_event_reset(&s->dest_vm_started_event);
+
     return 0;
 }
 
@@ -2368,6 +2381,12 @@ static void *source_return_path_thread(void *opaque)
             trace_source_return_path_thread_switchover_acked();
             break;
 
+        case MIG_RP_MSG_VM_STARTED:
+            migration_downtime_end(ms);
+            ms->dest_vm_started = true;
+            qemu_event_set(&ms->dest_vm_started_event);
+            break;
+
         default:
             break;
         }
@@ -2591,7 +2610,9 @@ static int postcopy_start(MigrationState *ms, Error **errp)
      */
     migration_call_notifiers(MIG_EVENT_PRECOPY_DONE, NULL);
 
-    migration_downtime_end(ms);
+    if (!ms->rp_state.rp_thread_created || !migrate_send_vm_started()) {
+        migration_downtime_end(ms);
+    }
 
     if (migrate_postcopy_ram()) {
         /*
@@ -3086,7 +3107,9 @@ static void migration_completion_end(MigrationState *s)
      * - correct ordering of s->mbps update vs. s->state;
      */
     bql_lock();
-    migration_downtime_end(s);
+    if (!s->rp_state.rp_thread_created || !migrate_send_vm_started()) {
+        migration_downtime_end(s);
+    }
     s->total_time = end_time - s->start_time;
     transfer_time = s->total_time - s->setup_time;
     if (transfer_time) {
@@ -3300,9 +3323,10 @@ static void migration_iteration_finish(MigrationState *s)
     case MIGRATION_STATUS_FAILED:
     case MIGRATION_STATUS_CANCELLED:
     case MIGRATION_STATUS_CANCELLING:
-        if (!migration_block_activate(&local_err)) {
+        if (s->dest_vm_started || !migration_block_activate(&local_err)) {
             /*
-             * Re-activate the block drives if they're inactivated.
+             * Re-activate the block drives if they're inactivated and the dest
+             * vm has not reported that it has started.
              *
              * If it fails (e.g. in case of a split brain, where dest QEMU
              * might have taken some of the drive locks and running!), do
@@ -3853,6 +3877,7 @@ static void migration_instance_finalize(Object *obj)
     qemu_sem_destroy(&ms->postcopy_qemufile_src_sem);
     error_free(ms->error);
     qemu_event_destroy(&ms->postcopy_package_loaded_event);
+    qemu_event_destroy(&ms->dest_vm_started_event);
 }
 
 static void migration_instance_init(Object *obj)
@@ -3875,6 +3900,7 @@ static void migration_instance_init(Object *obj)
     qemu_sem_init(&ms->postcopy_qemufile_src_sem, 0);
     qemu_mutex_init(&ms->qemu_file_lock);
     qemu_event_init(&ms->postcopy_package_loaded_event, 0);
+    qemu_event_init(&ms->dest_vm_started_event, false);
 }
 
 /*
diff --git a/migration/migration.h b/migration/migration.h
index b6888daced..a3fab4f27e 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -522,6 +522,14 @@ struct MigrationState {
      * anything as input.
      */
     bool has_block_bitmap_mapping;
+
+    /*
+     * Do send VM_START message on the return-path when dest VM finishes
+     * loading device state and switches out of INMIGRATE run state.
+     */
+    bool send_vm_started;
+    bool dest_vm_started;
+    QemuEvent dest_vm_started_event;
 };
 
 void migrate_set_state(MigrationStatus *state, MigrationStatus old_state,
@@ -564,6 +572,7 @@ void migrate_send_rp_recv_bitmap(MigrationIncomingState *mis,
                                  char *block_name);
 void migrate_send_rp_resume_ack(MigrationIncomingState *mis, uint32_t value);
 int migrate_send_rp_switchover_ack(MigrationIncomingState *mis);
+void migrate_send_rp_vm_started(MigrationIncomingState *mis);
 
 void dirty_bitmap_mig_before_vm_start(void);
 void dirty_bitmap_mig_cancel_outgoing(void);
diff --git a/migration/options.c b/migration/options.c
index 1ffe85a2d8..a5a233183b 100644
--- a/migration/options.c
+++ b/migration/options.c
@@ -108,6 +108,7 @@ const Property migration_properties[] = {
                      preempt_pre_7_2, false),
     DEFINE_PROP_BOOL("multifd-clean-tls-termination", MigrationState,
                      multifd_clean_tls_termination, true),
+    DEFINE_PROP_BOOL("send-vm-started", MigrationState, send_vm_started, true),
 
     /* Migration parameters */
     DEFINE_PROP_UINT8("x-throttle-trigger-threshold", MigrationState,
@@ -434,6 +435,13 @@ bool migrate_zero_copy_send(void)
     return s->capabilities[MIGRATION_CAPABILITY_ZERO_COPY_SEND];
 }
 
+bool migrate_send_vm_started(void)
+{
+    MigrationState *s = migrate_get_current();
+
+    return s->send_vm_started;
+}
+
 /* pseudo capabilities */
 
 bool migrate_multifd_flush_after_each_section(void)
diff --git a/migration/options.h b/migration/options.h
index b502871097..5fdc8fc6fe 100644
--- a/migration/options.h
+++ b/migration/options.h
@@ -42,6 +42,7 @@ bool migrate_return_path(void);
 bool migrate_validate_uuid(void);
 bool migrate_xbzrle(void);
 bool migrate_zero_copy_send(void);
+bool migrate_send_vm_started(void);
 
 /*
  * pseudo capabilities
diff --git a/migration/savevm.c b/migration/savevm.c
index 3dc812a7bb..1020094fc8 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -2157,6 +2157,9 @@ static void loadvm_postcopy_handle_run_bh(void *opaque)
     }
 
     trace_vmstate_downtime_checkpoint("dst-postcopy-bh-vm-started");
+    if (mis->to_src_file && migrate_send_vm_started()) {
+        migrate_send_rp_vm_started(mis);
+    }
 }
 
 /* After all discards we can start running and asking for pages */
--
2.52.0
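
The downtime accounting that patch 2/4 changes can be modelled in a few lines. The following is a simplified sketch, not the QEMU implementation: `SourceState`, `source_completion` and `source_rp_message` are hypothetical names, timestamps are passed in explicitly instead of being read from a clock, and the message IDs are made up. It shows the rule visible in the diff: the source ends the downtime itself only when no return path exists or the feature is off; otherwise it waits for the destination's VM_STARTED message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical message IDs; the real ones live in enum mig_rp_message_type. */
enum rp_msg { RP_MSG_OTHER = 0, RP_MSG_VM_STARTED = 1 };

/* Hypothetical, reduced view of the source-side migration state. */
typedef struct SourceState {
    bool rp_available;         /* a return-path thread exists */
    bool send_vm_started;      /* destination is expected to send VM_STARTED */
    bool dest_vm_started;      /* set once VM_STARTED has arrived */
    int64_t downtime_start_ms; /* when the source VM was stopped */
    int64_t downtime_ms;       /* -1 until the downtime has been computed */
} SourceState;

static void source_init(SourceState *s, bool rp_available, bool send_vm_started)
{
    s->rp_available = rp_available;
    s->send_vm_started = send_vm_started;
    s->dest_vm_started = false;
    s->downtime_start_ms = 0;
    s->downtime_ms = -1;
}

/* Called when the source VM is paused for switchover. */
static void source_vm_stopped(SourceState *s, int64_t now_ms)
{
    s->downtime_start_ms = now_ms;
}

/* Compute the downtime once; later calls keep the first result. */
static void downtime_end(SourceState *s, int64_t now_ms)
{
    assert(now_ms >= s->downtime_start_ms);
    if (s->downtime_ms < 0) {
        s->downtime_ms = now_ms - s->downtime_start_ms;
    }
}

/* Source finished sending: fall back to the local estimate if needed. */
static void source_completion(SourceState *s, int64_t now_ms)
{
    if (!s->rp_available || !s->send_vm_started) {
        downtime_end(s, now_ms);
    }
}

/* Return-path handler: VM_STARTED ends the downtime precisely. */
static void source_rp_message(SourceState *s, enum rp_msg msg, int64_t now_ms)
{
    if (msg == RP_MSG_VM_STARTED) {
        downtime_end(s, now_ms);
        s->dest_vm_started = true;
    }
}
```

In the first case below the measured downtime includes the time the destination needed to actually start, which the local estimate at completion time cannot see.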
* Re: [PATCH 2/4] migration: Introduce VM_STARTED return-path message
From: Michael S. Tsirkin @ 2026-01-27 22:29 UTC
To: Juraj Marcin
Cc: qemu-devel, Fabiano Rosas, Peter Xu, Jason Wang,
Vladimir Sementsov-Ogievskiy

On Tue, Jan 27, 2026 at 03:03:08PM +0100, Juraj Marcin wrote:
> From: Juraj Marcin <jmarcin@redhat.com>
>
> Currently there is no universal way for the destination to tell the
> source it has started. In precopy it could be deduced from the RP_SHUT
> message and in postcopy from the response to the ping just before the
> POSTCOPY_RUN command, but neither method is precise. Moreover, there is
> no way to send more data after the destination has started with precopy
> migration.
>
> This patch adds new message type to the return-path which tells the
> source that the destination VM has just started (or can be started if
> autostart is false). Source VM can use this message to precisely
> calculate the downtime regardless of if postcopy is used and can also
> send more data, for example network packets.
>
> Signed-off-by: Juraj Marcin <jmarcin@redhat.com>

I do not think it matters that VM started, at least not for the issue
in question. What matters is that a packet is transmitted on behalf of
the VM, on the specific interface.

* [PATCH 3/4] migration: Convert VMSD early_setup into VMStateSavePhase enum
From: Juraj Marcin @ 2026-01-27 14:03 UTC
To: qemu-devel
Cc: Juraj Marcin, Fabiano Rosas, Michael S. Tsirkin, Peter Xu,
Jason Wang, Vladimir Sementsov-Ogievskiy

From: Juraj Marcin <jmarcin@redhat.com>

This allows devices to specify when during migration their fields should
be saved. For now there are two save phases defined:

EARLY_SETUP
  These devices are saved during qemu_savevm_state_setup(), which
  corresponds to the migration SETUP state. This is the same behavior as
  the former early_setup flag.

COMPLETE
  These devices are saved during migration completion or switch-over with
  qemu_savevm_state_complete_precopy_non_iterable(), which corresponds to
  the migration DEVICE state. This is the default phase if none is
  specified explicitly.

This also allows the introduction of other phases in the future, for
example ITERATE_LIVE and POSTCOPY once support for iterative devices and
postcopy is implemented in VMSD, and the NETPASS phase implemented in
this series.

Signed-off-by: Juraj Marcin <jmarcin@redhat.com>
---
 hw/virtio/virtio-mem.c      |  2 +-
 include/migration/vmstate.h | 27 +++++++++++++++++++--------
 migration/savevm.c          |  4 ++--
 3 files changed, 22 insertions(+), 11 deletions(-)

diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c
index c1e2defb68..6d3e3746e5 100644
--- a/hw/virtio/virtio-mem.c
+++ b/hw/virtio/virtio-mem.c
@@ -1438,7 +1438,7 @@ static const VMStateDescription vmstate_virtio_mem_device_early = {
     .name = "virtio-mem-device-early",
     .minimum_version_id = 1,
     .version_id = 1,
-    .early_setup = true,
+    .phase = VMS_PHASE_EARLY_SETUP,
     .post_load = virtio_mem_post_load_early,
     .fields = (const VMStateField[]) {
         VMSTATE_WITH_TMP(VirtIOMEM, VirtIOMEMMigSanityChecks,
diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
index ed9095a466..62d7e9fe38 100644
--- a/include/migration/vmstate.h
+++ b/include/migration/vmstate.h
@@ -186,18 +186,29 @@ struct VMStateField {
     bool (*field_exists)(void *opaque, int version_id);
 };
 
-struct VMStateDescription {
-    const char *name;
-    bool unmigratable;
+typedef enum {
     /*
-     * This VMSD describes something that should be sent during setup phase
-     * of migration. It plays similar role as save_setup() for explicitly
+     * Specifies a VMSD of a device that should be migrated during the migration
+     * completion phase (switch-over). (Default behavior, same behavior as
+     * before the introduction of save phase.)
+     */
+    VMS_PHASE_COMPLETE = 0,
+    /*
+     * Specifies a VMSD of a device that should be saved during setup phase of
+     * migration. It plays similar role as save_setup() for explicitly
      * registered vmstate entries, so it can be seen as a way to describe
      * save_setup() in VMSD structures.
-     *
+     */
+    VMS_PHASE_EARLY_SETUP,
+} VMStateSavePhase;
+
+struct VMStateDescription {
+    const char *name;
+    bool unmigratable;
+    /*
      * Note that for now, a SaveStateEntry cannot have a VMSD and
      * operations (e.g., save_setup()) set at the same time. Consequently,
-     * save_setup() and a VMSD with early_setup set to true are mutually
+     * save_setup() and a VMSD with phase set to EARLY_SETUP are mutually
      * exclusive. For this reason, also early_setup VMSDs are migrated in a
      * QEMU_VM_SECTION_FULL section, while save_setup() data is migrated in
      * a QEMU_VM_SECTION_START section.
@@ -213,7 +224,7 @@ struct VMStateDescription {
      * <0 on error where -value is an error number from errno.h
      */
 
-    bool early_setup;
+    VMStateSavePhase phase;
     int version_id;
     int minimum_version_id;
     MigrationPriority priority;
diff --git a/migration/savevm.c b/migration/savevm.c
index 1020094fc8..78eb1d6165 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -1370,7 +1370,7 @@ int qemu_savevm_state_setup(QEMUFile *f, Error **errp)
 
     trace_savevm_state_setup();
     QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
-        if (se->vmsd && se->vmsd->early_setup) {
+        if (se->vmsd && se->vmsd->phase == VMS_PHASE_EARLY_SETUP) {
             ret = vmstate_save(f, se, vmdesc, errp);
             if (ret) {
                 migrate_error_propagate(ms, error_copy(*errp));
@@ -1672,7 +1672,7 @@ int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f,
     cpu_synchronize_all_states();
 
     QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
-        if (se->vmsd && se->vmsd->early_setup) {
+        if (se->vmsd && se->vmsd->phase != VMS_PHASE_COMPLETE) {
             /* Already saved during qemu_savevm_state_setup(). */
             continue;
         }
--
2.52.0
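
The effect of replacing a boolean flag with a phase enum in patch 3/4 can be illustrated with a small standalone sketch. It is not QEMU code: `SavePhase`, `Entry` and `select_phase` are hypothetical names that only mirror the idea. The point it shows is that putting the default phase at value 0 means descriptors that never mention a phase keep the old "save at completion" behavior, while each save step simply filters on the phase it is responsible for.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical mirror of a save-phase enum; 0 is the implicit default. */
typedef enum {
    PHASE_COMPLETE = 0,
    PHASE_EARLY_SETUP,
} SavePhase;

/* Hypothetical, reduced state descriptor. */
typedef struct Entry {
    const char *name;
    SavePhase phase; /* zero-initialised entries land in PHASE_COMPLETE */
} Entry;

/*
 * Collect the names of all entries that belong to the given phase.
 * Stores at most out_cap names and returns how many were stored.
 */
static size_t select_phase(const Entry *entries, size_t n, SavePhase phase,
                           const char **out, size_t out_cap)
{
    size_t found = 0;

    assert(out != NULL || out_cap == 0);
    for (size_t i = 0; i < n; i++) {
        if (entries[i].phase == phase && found < out_cap) {
            out[found++] = entries[i].name;
        }
    }
    return found;
}
```

A setup step would call `select_phase(..., PHASE_EARLY_SETUP, ...)` and the completion step `select_phase(..., PHASE_COMPLETE, ...)`. Adding a further phase later only requires a new enum value and a new caller, not another boolean per descriptor.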
* [PATCH 4/4] migration: Pass network packets received during switchover to dest VM
From: Juraj Marcin @ 2026-01-27 14:03 UTC
To: qemu-devel
Cc: Juraj Marcin, Fabiano Rosas, Michael S. Tsirkin, Peter Xu,
Jason Wang, Vladimir Sementsov-Ogievskiy

From: Juraj Marcin <jmarcin@redhat.com>

During migration switchover both the source and the destination machines
are paused (compute downtime). During this period the network still routes
packets to the source machine, as this is the last place where the
recipient MAC address has been seen. Once the destination side starts and
sends a network announcement, all subsequent frames are routed correctly.
However, frames delivered to the source machine are never processed and
are lost. This also causes network downtime of roughly the same duration
as the compute downtime.

This can cause problems not only for protocols that cannot handle packet
loss, but can also introduce delays in protocols that can handle it.

To resolve this, this feature instantiates a network filter for each
network backend present during migration setup on both migration sides.
On the source side, this filter caches all packets received from the
backend during switchover. Once the destination machine starts, all cached
packets are sent through the migration channel and the respective filter
object on the destination side injects them into the NIC attached to the
backend.

Signed-off-by: Juraj Marcin <jmarcin@redhat.com>
---
 include/migration/vmstate.h |   6 +
 include/net/net.h           |   5 +
 migration/meson.build       |   1 +
 migration/migration.c       |  49 ++++++-
 migration/migration.h       |   2 +
 migration/netpass.c         | 246 ++++++++++++++++++++++++++++++++++++
 migration/netpass.h         |  14 ++
 migration/options.c         |  21 +++
 migration/options.h         |   1 +
 migration/savevm.c          |  37 ++++++
 migration/savevm.h          |   2 +
 migration/trace-events      |   9 ++
 net/net.c                   |  11 ++
 net/tap.c                   |  11 +-
 qapi/migration.json         |   7 +-
 15 files changed, 418 insertions(+), 4 deletions(-)
 create mode 100644 migration/netpass.c
 create mode 100644 migration/netpass.h

diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
index 62d7e9fe38..7987e6c85a 100644
--- a/include/migration/vmstate.h
+++ b/include/migration/vmstate.h
@@ -200,6 +200,12 @@ typedef enum {
      * save_setup() in VMSD structures.
      */
     VMS_PHASE_EARLY_SETUP,
+    /*
+     * Specifies a netpass VMSD, these devices are copied right after the
+     * destination is started regardless of precopy/postcopy. Failure in this
+     * phase does not fail the migration in case of precopy.
+     */
+    VMS_PHASE_NETPASS,
 } VMStateSavePhase;
 
 struct VMStateDescription {
diff --git a/include/net/net.h b/include/net/net.h
index 45bc86fc86..510908845b 100644
--- a/include/net/net.h
+++ b/include/net/net.h
@@ -82,6 +82,7 @@ typedef void (NetAnnounce)(NetClientState *);
 typedef bool (SetSteeringEBPF)(NetClientState *, int);
 typedef bool (NetCheckPeerType)(NetClientState *, ObjectClass *, Error **);
 typedef struct vhost_net *(GetVHostNet)(NetClientState *nc);
+typedef void (NetpassEnabledNotify)(NetClientState *nc, void *opaque);
 
 typedef struct NetClientInfo {
     NetClientDriver type;
@@ -130,6 +131,9 @@ struct NetClientState {
     bool is_netdev;
     bool do_not_pad; /* do not pad to the minimum ethernet frame length */
     bool is_datapath;
+    bool netpass_enabled;
+    NetpassEnabledNotify *netpass_enabled_notify;
+    void *netpass_enabled_notify_opaque;
     QTAILQ_HEAD(, NetFilterState) filters;
 };
 
@@ -198,6 +202,7 @@ void qemu_flush_queued_packets(NetClientState *nc);
 void qemu_flush_or_purge_queued_packets(NetClientState *nc, bool purge);
 void qemu_set_info_str(NetClientState *nc,
                        const char *fmt, ...) G_GNUC_PRINTF(2, 3);
+void qemu_set_netpass_enabled(NetClientState *nc, bool enabled);
 void qemu_format_nic_info_str(NetClientState *nc, uint8_t macaddr[6]);
 bool qemu_has_ufo(NetClientState *nc);
 bool qemu_has_uso(NetClientState *nc);
diff --git a/migration/meson.build b/migration/meson.build
index c7f39bdb55..a501256979 100644
--- a/migration/meson.build
+++ b/migration/meson.build
@@ -30,6 +30,7 @@ system_ss.add(files(
   'multifd-nocomp.c',
   'multifd-zlib.c',
   'multifd-zero-page.c',
+  'netpass.c',
   'options.c',
   'postcopy-ram.c',
   'ram.c',
diff --git a/migration/migration.c b/migration/migration.c
index 4871db2365..959719dd61 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -63,6 +63,7 @@
 #include "system/dirtylimit.h"
 #include "qemu/sockets.h"
 #include "system/kvm.h"
+#include "netpass.h"
 
 #define NOTIFIER_ELEM_INIT(array, elem) \
     [elem] = NOTIFIER_WITH_RETURN_LIST_INITIALIZER((array)[elem])
@@ -488,6 +489,10 @@ void migration_incoming_state_destroy(void)
         mis->postcopy_qemufile_dst = NULL;
     }
 
+    if (migrate_netpass()) {
+        migration_netpass_cleanup();
+    }
+
     cpr_set_incoming_mode(MIG_MODE_NONE);
     yank_unregister_instance(MIGRATION_YANK_INSTANCE);
 }
@@ -755,6 +760,10 @@ static void process_incoming_migration_bh(void *opaque)
         migrate_send_rp_vm_started(mis);
     }
 
+    if (migrate_netpass()) {
+        qemu_loadvm_state_netpass(mis->from_src_file, mis);
+    }
+
     /*
      * This must happen after any state changes since as soon as an external
      * observer sees this event they might start to prod at the VM assuming
@@ -775,6 +784,13 @@ process_incoming_migration_co(void *opaque)
 
     assert(mis->from_src_file);
 
+    if (migrate_netpass()) {
+        ret = migration_netpass_setup(&local_err);
+        if (ret < 0) {
+            goto fail;
+        }
+    }
+
     mis->largest_page_size = qemu_ram_pagesize_largest();
     postcopy_state_set(POSTCOPY_INCOMING_NONE);
     migrate_set_state(&mis->state, MIGRATION_STATUS_SETUP,
@@ -811,8 +827,7 @@ process_incoming_migration_co(void *opaque)
     goto out;
 
 fail:
-    migrate_set_state(&mis->state, MIGRATION_STATUS_ACTIVE,
-                      MIGRATION_STATUS_FAILED);
+    migrate_set_state(&mis->state, mis->state, MIGRATION_STATUS_FAILED);
     migrate_error_propagate(s, local_err);
     migration_incoming_state_destroy();
 
@@ -1336,6 +1351,10 @@ static void migration_cleanup(MigrationState *s)
         qemu_fclose(tmp);
     }
 
+    if (migrate_netpass()) {
+        migration_netpass_cleanup();
+    }
+
     assert(!migration_is_active());
 
     if (s->state == MIGRATION_STATUS_CANCELLING) {
@@ -1673,6 +1692,8 @@ int migrate_init(MigrationState *s, Error **errp)
     s->dest_vm_started = false;
     qemu_event_reset(&s->dest_vm_started_event);
 
+    s->netpass_state_sent = false;
+
     return 0;
 }
 
@@ -2729,6 +2750,10 @@ static bool migration_switchover_start(MigrationState *s, Error **errp)
 {
     ERRP_GUARD();
 
+    if (migrate_netpass()) {
+        migration_netpass_activate();
+    }
+
     if (!migration_switchover_prepare(s)) {
         error_setg(errp, "Switchover is interrupted");
         return false;
@@ -2821,6 +2846,14 @@ static void migration_completion(MigrationState *s)
         goto fail;
     }
 
+    if (migrate_netpass() && !s->netpass_state_sent) {
+        qemu_event_wait(&s->dest_vm_started_event);
+        qemu_savevm_state_netpass(s->to_dst_file);
+        s->netpass_state_sent = true;
+        qemu_put_byte(s->to_dst_file, QEMU_VM_EOF);
+        qemu_fflush(s->to_dst_file);
+    }
+
     if (close_return_path_on_source(s)) {
         goto fail;
     }
@@ -3251,6 +3284,11 @@ static MigIterateState migration_iteration_run(MigrationState *s)
             migrate_set_state(&s->state, MIGRATION_STATUS_POSTCOPY_DEVICE,
                               MIGRATION_STATUS_POSTCOPY_ACTIVE);
         }
+
+        if (s->dest_vm_started && migrate_netpass() && !s->netpass_state_sent) {
+            qemu_savevm_state_netpass(s->to_dst_file);
+            s->netpass_state_sent = true;
+        }
     } else {
         /*
          * Exact pending reporting is only needed for precopy. Taking RAM
@@ -3774,6 +3812,13 @@ void migration_start_outgoing(MigrationState *s)
 
     s->expected_downtime = migrate_downtime_limit();
 
+    if (migrate_netpass()) {
+        ret = migration_netpass_setup(&local_err);
+        if (ret < 0) {
+            goto fail;
+        }
+    }
+
     if (resume) {
         /* This is a resumed migration */
         rate_limit = migrate_max_postcopy_bandwidth();
diff --git a/migration/migration.h b/migration/migration.h
index a3fab4f27e..a0d9560254 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -530,6 +530,8 @@ struct MigrationState {
     bool send_vm_started;
     bool dest_vm_started;
     QemuEvent dest_vm_started_event;
+
+    bool netpass_state_sent;
 };
 
 void migrate_set_state(MigrationStatus *state, MigrationStatus old_state,
diff --git a/migration/netpass.c b/migration/netpass.c
new file mode 100644
index 0000000000..92b2522c83
--- /dev/null
+++ b/migration/netpass.c
@@ -0,0 +1,246 @@
+#include "qemu/osdep.h"
+#include "netpass.h"
+
+#include "migration/migration.h"
+#include "migration/vmstate.h"
+#include "net/queue.h"
+#include "net/filter.h"
+#include "net/net.h"
+#include "net/vhost_net.h"
+#include "qapi/error.h"
+#include "qemu/error-report.h"
+#include "qemu/iov.h"
+#include "qemu/typedefs.h"
+#include "qom/object.h"
+#include "trace.h"
+
+struct NetPassState {
+    NetFilterState parent_obj;
+    bool active;
+    size_t packet_count;
+    uint32_t qlength;
+    uint32_t qcapacity;
+    uint8_t *qbuffer;
+    SocketReadState rs;
+    QTAILQ_ENTRY(NetPassState) next;
+};
+
+static void netpass_queue_clear(NetPassState *s)
+{
+    g_free(s->qbuffer);
+    s->qbuffer = NULL;
+    s->qcapacity = 0;
+    s->qlength = 0;
+    s->packet_count = 0;
+}
+
+OBJECT_DEFINE_SIMPLE_TYPE_WITH_INTERFACES(NetPassState, filter_netpass,
+                                          FILTER_NETPASS, NETFILTER,
+                                          { TYPE_VMSTATE_IF }, { } )
+
+static bool netpass_vmstate_pre_save(void *opaque, Error **errp)
+{
+    NetPassState *s = opaque;
+    s->active = false;
+    return true;
+}
+
+static int netpass_vmstate_post_save(void *opaque)
+{
+    NetPassState *s = opaque;
+
trace_migration_netpass_passed_packet_count(NETFILTER(s)->netdev_id, s->packet_count); + netpass_queue_clear(s); + return 0; +} + +static void netpass_vmstate_post_load_bh(void *opaque) +{ + NetPassState *s = opaque; + + int ret = net_fill_rstate(&s->rs, s->qbuffer, s->qlength); + if (ret == -1) { + warn_report("migration: Failed to fill netpass rstate during load"); + } + trace_migration_netpass_received_packet_count(NETFILTER(s)->netdev_id, s->packet_count); + netpass_queue_clear(s); +} + +static bool netpass_vmstate_post_load(void *opaque, int version_id, Error **errp) +{ + /* + * Schedule on the main thread in case this function is running on the + * postcopy listen thread and there is a fault during packet injection. + */ + migration_bh_schedule(netpass_vmstate_post_load_bh, opaque); + return true; +} + +static char *filter_netpass_vmstate_if_get_id(VMStateIf *obj) +{ + NetFilterState *nf = NETFILTER(obj); + return g_strconcat("filter-netpass/", nf->netdev_id, NULL); +} + +static const VMStateDescription vmstate_netpass = { + .name = "filter-netpass", + .version_id = 1, + .minimum_version_id = 1, + .phase = VMS_PHASE_NETPASS, + .fields = (const VMStateField[]) { + VMSTATE_UINT32(qlength, NetPassState), + VMSTATE_UINT32(qcapacity, NetPassState), + VMSTATE_VBUFFER_ALLOC_UINT32(qbuffer, NetPassState, 0, NULL, qcapacity), + VMSTATE_END_OF_LIST(), + }, + .pre_save_errp = netpass_vmstate_pre_save, + .post_save = netpass_vmstate_post_save, + .post_load_errp = netpass_vmstate_post_load, +}; + +QTAILQ_HEAD(, NetPassState) filters = QTAILQ_HEAD_INITIALIZER(filters); + +static void netpass_rs_finalize(SocketReadState *rs) +{ + NetPassState *s = container_of(rs, NetPassState, rs); + NetFilterState *nf = NETFILTER(s); + + struct iovec iov = { + .iov_len = rs->packet_len, + .iov_base = rs->buf, + }; + qemu_netfilter_pass_to_next(nf->netdev, 0, &iov, 1, nf); + s->packet_count++; +} + +static void filter_netpass_setup(NetFilterState *nf, Error **errp) +{ + NetPassState *s = 
FILTER_NETPASS(nf); + + s->active = false; + s->qbuffer = NULL; + s->qcapacity = 0; + s->qlength = 0; + s->packet_count = 0; + net_socket_rs_init(&s->rs, netpass_rs_finalize, true); +} + +static void filter_netpass_cleanup(NetFilterState *nf) +{ + NetPassState *s = FILTER_NETPASS(nf); + + s->active = false; + netpass_queue_clear(s); + if (nf->netdev) { + qemu_set_netpass_enabled(nf->netdev, false); + } +} + +static ssize_t filter_netpass_receive_iov(NetFilterState *nf, + NetClientState *sender, + unsigned flags, + const struct iovec *iov, + int iovcnt, + NetPacketSent *sent_cb) +{ + NetPassState *s = FILTER_NETPASS(nf); + + if (!s->active) { + return 0; + } + + uint32_t total_size = iov_size(iov, iovcnt); + size_t req_cap = sizeof(uint32_t) + sizeof(uint32_t) + total_size; + if (s->qcapacity - s->qlength < req_cap) { + size_t new_capacity = s->qcapacity; + while (new_capacity - s->qlength < req_cap) { + new_capacity += 4096; + } + s->qbuffer = g_realloc(s->qbuffer, new_capacity); + s->qcapacity = new_capacity; + } + uint32_t total_size_be = htonl(total_size); + memcpy(&s->qbuffer[s->qlength], &total_size_be, sizeof(uint32_t)); + s->qlength += sizeof(uint32_t); + uint32_t vnet_hdr_len_be = htonl(sender->vnet_hdr_len); + memcpy(&s->qbuffer[s->qlength], &vnet_hdr_len_be, sizeof(uint32_t)); + s->qlength += sizeof(uint32_t); + iov_to_buf_full(iov, iovcnt, 0, &s->qbuffer[s->qlength], total_size); + s->qlength += total_size; + s->packet_count++; + + return 0; +} + +static void filter_netpass_class_init(ObjectClass *oc, const void *data) +{ + NetFilterClass *nfc = NETFILTER_CLASS(oc); + VMStateIfClass *vc = VMSTATE_IF_CLASS(oc); + + nfc->setup = filter_netpass_setup; + nfc->cleanup = filter_netpass_cleanup; + nfc->receive_iov = filter_netpass_receive_iov; + + vc->get_id = filter_netpass_vmstate_if_get_id; +} + +static void filter_netpass_init(Object *obj) +{ +} + +static void filter_netpass_finalize(Object *obj) +{ + NetPassState *s = FILTER_NETPASS(obj); + (void)s; +} + 
+int migration_netpass_setup(Error **errp) +{ + NetClientState *nc; + + QTAILQ_FOREACH(nc, &net_clients, next) { + if (!nc->is_netdev) { + continue; + } + if (get_vhost_net(nc)) { + warn_report("migration: netpass is not supported with vhost=on"); + continue; + } + g_autofree char *filter_id = g_strconcat("netpass-", nc->name, NULL); + Object *obj = object_new_with_props(TYPE_FILTER_NETPASS, + object_get_objects_root(), + filter_id, errp, + "netdev", nc->name, + "queue", "tx", + NULL); + if (!obj) { + error_prepend(errp, "Failed to setup migration netpass"); + return -1; + } + trace_migration_netpass_setup_created_filter(nc->name); + object_ref(obj); + QTAILQ_INSERT_TAIL(&filters, FILTER_NETPASS(obj), next); + vmstate_register(VMSTATE_IF(obj), VMSTATE_INSTANCE_ID_ANY, + &vmstate_netpass, obj); + } + return 0; +} + +void migration_netpass_activate(void) +{ + NetPassState *s; + QTAILQ_FOREACH(s, &filters, next) { + s->packet_count = 0; + s->active = true; + qemu_set_netpass_enabled(NETFILTER(s)->netdev, true); + } +} + +void migration_netpass_cleanup(void) +{ + NetPassState *s, *ns; + QTAILQ_FOREACH_SAFE(s, &filters, next, ns) { + QTAILQ_REMOVE(&filters, s, next); + vmstate_unregister(VMSTATE_IF(s), &vmstate_netpass, s); + object_unref(s); + } +} diff --git a/migration/netpass.h b/migration/netpass.h new file mode 100644 index 0000000000..8618cf4c73 --- /dev/null +++ b/migration/netpass.h @@ -0,0 +1,14 @@ +#ifndef QEMU_MIGRATION_NETPASS_H +#define QEMU_MIGRATION_NETPASS_H + +#include "qemu/typedefs.h" +#include "qom/object.h" + +#define TYPE_FILTER_NETPASS "filter-netpass" +OBJECT_DECLARE_SIMPLE_TYPE(NetPassState, FILTER_NETPASS) + +int migration_netpass_setup(Error **errp); +void migration_netpass_activate(void); +void migration_netpass_cleanup(void); + +#endif diff --git a/migration/options.c b/migration/options.c index a5a233183b..e6e2d441b0 100644 --- a/migration/options.c +++ b/migration/options.c @@ -211,6 +211,7 @@ const Property migration_properties[] = { 
DEFINE_PROP_MIG_CAP("mapped-ram", MIGRATION_CAPABILITY_MAPPED_RAM), DEFINE_PROP_MIG_CAP("x-ignore-shared", MIGRATION_CAPABILITY_X_IGNORE_SHARED), + DEFINE_PROP_MIG_CAP("netpass", MIGRATION_CAPABILITY_NETPASS), }; const size_t migration_properties_count = ARRAY_SIZE(migration_properties); @@ -442,6 +443,13 @@ bool migrate_send_vm_started(void) return s->send_vm_started; } +bool migrate_netpass(void) +{ + MigrationState *s = migrate_get_current(); + + return s->capabilities[MIGRATION_CAPABILITY_NETPASS]; +} + /* pseudo capabilities */ bool migrate_multifd_flush_after_each_section(void) @@ -723,6 +731,19 @@ bool migrate_caps_check(bool *old_caps, bool *new_caps, Error **errp) } } + if (new_caps[MIGRATION_CAPABILITY_NETPASS]) { + if (!new_caps[MIGRATION_CAPABILITY_RETURN_PATH]) { + error_setg(errp, "Capability 'netpass' requires capability " + "'return-path'"); + return false; + } + if (!migrate_send_vm_started()) { + error_setg(errp, "Capability 'netpass' requires support for VM_STARTED " + "return-path message"); + return false; + } + } + /* * On destination side, check the cases that capability is being set * after incoming thread has started. 
diff --git a/migration/options.h b/migration/options.h index 5fdc8fc6fe..151eaef86c 100644 --- a/migration/options.h +++ b/migration/options.h @@ -43,6 +43,7 @@ bool migrate_validate_uuid(void); bool migrate_xbzrle(void); bool migrate_zero_copy_send(void); bool migrate_send_vm_started(void); +bool migrate_netpass(void); /* * pseudo capabilities diff --git a/migration/savevm.c b/migration/savevm.c index 78eb1d6165..b930f27fa9 100644 --- a/migration/savevm.c +++ b/migration/savevm.c @@ -279,6 +279,7 @@ static bool should_validate_capability(int capability) switch (capability) { case MIGRATION_CAPABILITY_X_IGNORE_SHARED: case MIGRATION_CAPABILITY_MAPPED_RAM: + case MIGRATION_CAPABILITY_NETPASS: return true; default: return false; @@ -1731,6 +1732,29 @@ int qemu_savevm_state_complete_precopy(QEMUFile *f, bool iterable_only) return qemu_fflush(f); } +void qemu_savevm_state_netpass(QEMUFile *f) +{ + MigrationState *ms = migrate_get_current(); + JSONWriter *vmdesc = ms->vmdesc; + SaveStateEntry *se; + Error *local_err = NULL; + int ret = 0; + + trace_savevm_state_netpass_begin(); + QTAILQ_FOREACH(se, &savevm_state.handlers, entry) { + if (!se->vmsd || se->vmsd->phase != VMS_PHASE_NETPASS) { + continue; + } + ret = vmstate_save(f, se, vmdesc, &local_err); + if (ret) { + warn_report_err(local_err); + qemu_file_clear_error(f); + break; + } + } + trace_savevm_state_netpass_end(ret); +} + /* Give an estimate of the amount left to be transferred, * the result is split into the amount for units that can and * for units that can't do postcopy. 
@@ -3148,6 +3172,19 @@ int qemu_load_device_state(QEMUFile *f, Error **errp) return 0; } +void qemu_loadvm_state_netpass(QEMUFile *f, MigrationIncomingState *mis) +{ + Error *local_errp = NULL; + trace_loadvm_state_netpass_begin(); + int ret = qemu_loadvm_state_main(mis->from_src_file, mis, &local_errp); + trace_loadvm_state_netpass_end(ret); + if (ret < 0) { + warn_reportf_err(local_errp, + "Error while loading netpass data, this error will be ignored"); + qemu_file_clear_error(f); + } +} + int qemu_loadvm_approve_switchover(void) { MigrationIncomingState *mis = migration_incoming_get_current(); diff --git a/migration/savevm.h b/migration/savevm.h index 125a2507b7..53220c40cf 100644 --- a/migration/savevm.h +++ b/migration/savevm.h @@ -42,6 +42,7 @@ int qemu_savevm_state_iterate(QEMUFile *f, bool postcopy); void qemu_savevm_state_cleanup(void); void qemu_savevm_state_complete_postcopy(QEMUFile *f); int qemu_savevm_state_complete_precopy(QEMUFile *f, bool iterable_only); +void qemu_savevm_state_netpass(QEMUFile *f); void qemu_savevm_state_pending_exact(uint64_t *must_precopy, uint64_t *can_postcopy); void qemu_savevm_state_pending_estimate(uint64_t *must_precopy, @@ -71,6 +72,7 @@ void qemu_loadvm_state_cleanup(MigrationIncomingState *mis); int qemu_loadvm_state_main(QEMUFile *f, MigrationIncomingState *mis, Error **errp); int qemu_load_device_state(QEMUFile *f, Error **errp); +void qemu_loadvm_state_netpass(QEMUFile *f, MigrationIncomingState *mis); int qemu_loadvm_approve_switchover(void); int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f, bool in_postcopy); diff --git a/migration/trace-events b/migration/trace-events index 91d7506634..eb25944d1b 100644 --- a/migration/trace-events +++ b/migration/trace-events @@ -10,6 +10,8 @@ qemu_savevm_send_packaged(void) "" loadvm_state_switchover_ack_needed(unsigned int switchover_ack_pending_num) "Switchover ack pending num=%u" loadvm_state_setup(void) "" loadvm_state_cleanup(void) "" 
+loadvm_state_netpass_begin(void) "" +loadvm_state_netpass_end(int ret) "ret=%d" loadvm_handle_cmd_packaged(unsigned int length) "%u" loadvm_handle_cmd_packaged_main(int ret) "%d" loadvm_handle_cmd_packaged_received(int ret) "%d" @@ -45,6 +47,8 @@ savevm_state_resume_prepare(void) "" savevm_state_header(void) "" savevm_state_iterate(void) "" savevm_state_cleanup(void) "" +savevm_state_netpass_begin(void) "" +savevm_state_netpass_end(int ret) "ret=%d" vmstate_save(const char *idstr, const char *vmsd_name) "%s, %s" vmstate_load(const char *idstr, const char *vmsd_name) "%s, %s" vmstate_downtime_save(const char *type, const char *idstr, uint32_t instance_id, int64_t downtime) "type=%s idstr=%s instance_id=%d downtime=%"PRIi64 @@ -401,3 +405,8 @@ cpu_throttle_dirty_sync(void) "" # block-active.c migration_block_activation(const char *name) "%s" + +# netpass.c +migration_netpass_setup_created_filter(const char *netdev) "netdev=%s" +migration_netpass_passed_packet_count(const char *netdev, size_t count) "netdev=%s count=%zu" +migration_netpass_received_packet_count(const char *netdev, size_t count) "netdev=%s count=%zu" diff --git a/net/net.c b/net/net.c index a176936f9b..81540fefc1 100644 --- a/net/net.c +++ b/net/net.c @@ -158,6 +158,14 @@ void qemu_set_info_str(NetClientState *nc, const char *fmt, ...) 
va_end(ap); } +void qemu_set_netpass_enabled(NetClientState *nc, bool enabled) +{ + nc->netpass_enabled = enabled; + if (nc->netpass_enabled_notify) { + nc->netpass_enabled_notify(nc, nc->netpass_enabled_notify_opaque); + } +} + void qemu_format_nic_info_str(NetClientState *nc, uint8_t macaddr[6]) { qemu_set_info_str(nc, "model=%s,macaddr=%02x:%02x:%02x:%02x:%02x:%02x", @@ -287,6 +295,9 @@ static void qemu_net_client_setup(NetClientState *nc, nc->incoming_queue = qemu_new_net_queue(qemu_deliver_packet_iov, nc); nc->destructor = destructor; nc->is_datapath = is_datapath; + nc->netpass_enabled = false; + nc->netpass_enabled_notify = NULL; + nc->netpass_enabled_notify_opaque = NULL; QTAILQ_INIT(&nc->filters); } diff --git a/net/tap.c b/net/tap.c index 8d7ab6ba6f..dcc03a3f03 100644 --- a/net/tap.c +++ b/net/tap.c @@ -109,7 +109,8 @@ static char *tap_parse_script(const char *script_arg, const char *default_path) static void tap_update_fd_handler(TAPState *s) { qemu_set_fd_handler(s->fd, - s->read_poll && s->enabled ? tap_send : NULL, + (s->read_poll || s->nc.netpass_enabled) && s->enabled ? + tap_send : NULL, s->write_poll && s->enabled ? tap_writable : NULL, s); } @@ -412,6 +413,11 @@ static NetClientInfo net_tap_info = { .get_vhost_net = tap_get_vhost_net, }; +static void tap_netpass_enabled_nofity(NetClientState *nc, void *opaque) +{ + tap_update_fd_handler(opaque); +} + static TAPState *net_tap_fd_init(NetClientState *peer, const char *model, const char *name, @@ -444,6 +450,9 @@ static TAPState *net_tap_fd_init(NetClientState *peer, tap_read_poll(s, true); s->vhost_net = NULL; + nc->netpass_enabled_notify = &tap_netpass_enabled_nofity; + nc->netpass_enabled_notify_opaque = s; + return s; } diff --git a/qapi/migration.json b/qapi/migration.json index f925e5541b..d637b22c80 100644 --- a/qapi/migration.json +++ b/qapi/migration.json @@ -520,6 +520,11 @@ # each RAM page. Requires a migration URI that supports seeking, # such as a file. 
(since 9.0) # +# @netpass: Collect packets received by network backedns after source +# VM is paused and send them to the destination once it resumes. +# This (almost) completely eliminates packet loss caused by +# switchover. (since 11.0) +# # Features: # # @unstable: Members @x-colo and @x-ignore-shared are experimental. @@ -536,7 +541,7 @@ { 'name': 'x-ignore-shared', 'features': [ 'unstable' ] }, 'validate-uuid', 'background-snapshot', 'zero-copy-send', 'postcopy-preempt', 'switchover-ack', - 'dirty-limit', 'mapped-ram'] } + 'dirty-limit', 'mapped-ram', 'netpass'] } ## # @MigrationCapabilityStatus: -- 2.52.0 ^ permalink raw reply related [flat|nested] 23+ messages in thread
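For readers following the patch above: filter_netpass_receive_iov() appends each captured packet to qbuffer as a 32-bit big-endian payload size, a 32-bit big-endian vnet header length, and then the payload bytes. The following is only a minimal standalone sketch of that record layout, not code from the series. The names pkt_queue, queue_append and queue_count are hypothetical, and plain realloc() stands in for g_realloc().

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the qbuffer/qlength/qcapacity fields. */
struct pkt_queue {
    uint8_t *buf;
    uint32_t len;
    uint32_t cap;
};

/* Append one record: BE payload size, BE vnet header length, payload. */
static void queue_append(struct pkt_queue *q, const void *data,
                         uint32_t size, uint32_t vnet_hdr_len)
{
    size_t need = 2 * sizeof(uint32_t) + (size_t)size;

    if (q->cap - q->len < need) {
        /* Grow in 4 KiB steps, as the patch does. */
        size_t new_cap = q->cap;
        while (new_cap - q->len < need) {
            new_cap += 4096;
        }
        assert(new_cap <= UINT32_MAX);
        uint8_t *p = realloc(q->buf, new_cap);
        assert(p != NULL);
        q->buf = p;
        q->cap = (uint32_t)new_cap;
    }

    uint32_t be = htonl(size);
    memcpy(q->buf + q->len, &be, sizeof(be));
    q->len += sizeof(be);
    be = htonl(vnet_hdr_len);
    memcpy(q->buf + q->len, &be, sizeof(be));
    q->len += sizeof(be);
    memcpy(q->buf + q->len, data, size);
    q->len += size;
}

/* Walk the buffer and count complete records; stop at a truncated one. */
static size_t queue_count(const struct pkt_queue *q)
{
    size_t off = 0;
    size_t count = 0;

    while (q->len - off >= 2 * sizeof(uint32_t)) {
        uint32_t size_be;
        memcpy(&size_be, q->buf + off, sizeof(size_be));
        off += 2 * sizeof(uint32_t);      /* skip size and vnet_hdr_len */
        uint32_t size = ntohl(size_be);
        if (size > q->len - off) {
            break;                        /* truncated payload */
        }
        off += size;
        count++;
    }
    return count;
}
```

The destination side in the patch hands the whole buffer to net_fill_rstate(), with the read state initialized for vnet headers, so it presumably expects the same length-prefixed framing. That reading is an assumption drawn from the patch, not something the sketch verifies.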
* Re: [PATCH 4/4] migration: Pass network packets received during switchover to dest VM 2026-01-27 14:03 ` [PATCH 4/4] migration: Pass network packets received during switchover to dest VM Juraj Marcin @ 2026-01-27 14:25 ` Daniel P. Berrangé 2026-01-27 22:27 ` Michael S. Tsirkin 2026-01-28 12:23 ` Juraj Marcin 2026-01-28 2:55 ` Jason Wang 1 sibling, 2 replies; 23+ messages in thread From: Daniel P. Berrangé @ 2026-01-27 14:25 UTC (permalink / raw) To: Juraj Marcin Cc: qemu-devel, Fabiano Rosas, Michael S. Tsirkin, Peter Xu, Jason Wang, Vladimir Sementsov-Ogievskiy On Tue, Jan 27, 2026 at 03:03:10PM +0100, Juraj Marcin wrote: > From: Juraj Marcin <jmarcin@redhat.com> > > During migration switchover both the source and the destination machines > are paused (compute downtime). During this period network still routes > network packets to the source machine, as this is the last place where > the recipient MAC address has been seen. Once the destination side > starts and sends network announcement, all subsequent frames are routed > correctly. However, frames delivered to the source machine are never > processed and lost. This causes also a network downtime with roughly the > same duration as compute downtime. > > This can cause problems not only for protocols that cannot handle packet > loss, but can also introduce delays in protocols that can handle them. > > To resolve this, this feature instantiates a network filter for each > network backend present during migration setup on both migration sides. > On the source side, this filter caches all packets received from the > backend during switchover. Once the destination machine starts, all > cached packets are sent through the migration channel and the respective > filter object on the destination side injects them to the NIC attached > to the backend. If the dest QEMU has started, I presume this means that the ARP announcement has been sent ? 
IOW, the packets being forwarded over the migration stream are guaranteed to be delivered "out of order" wrt the sender. Should be safe for TCP, but may have an impact on other protocols. Though apps should be aware of that risk in general, they may not frequently encounter it, and it could still cause service disruption.

> diff --git a/qapi/migration.json b/qapi/migration.json > index f925e5541b..d637b22c80 100644 > --- a/qapi/migration.json > +++ b/qapi/migration.json > @@ -520,6 +520,11 @@ > # each RAM page. Requires a migration URI that supports seeking, > # such as a file. (since 9.0) > # > +# @netpass: Collect packets received by network backedns after source > +# VM is paused and send them to the destination once it resumes. > +# This (almost) completely eliminates packet loss caused by > +# switchover. (since 11.0)

Should mention they will be delivered "out of order".

> +# > # Features: > # > # @unstable: Members @x-colo and @x-ignore-shared are experimental. > @@ -536,7 +541,7 @@ > { 'name': 'x-ignore-shared', 'features': [ 'unstable' ] }, > 'validate-uuid', 'background-snapshot', > 'zero-copy-send', 'postcopy-preempt', 'switchover-ack', > - 'dirty-limit', 'mapped-ram'] } > + 'dirty-limit', 'mapped-ram', 'netpass'] } > > ## > # @MigrationCapabilityStatus: > -- > 2.52.0 > >

With regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :| ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 4/4] migration: Pass network packets received during switchover to dest VM 2026-01-27 14:25 ` Daniel P. Berrangé @ 2026-01-27 22:27 ` Michael S. Tsirkin 2026-01-28 12:23 ` Juraj Marcin 1 sibling, 0 replies; 23+ messages in thread From: Michael S. Tsirkin @ 2026-01-27 22:27 UTC (permalink / raw) To: Daniel P. Berrangé Cc: Juraj Marcin, qemu-devel, Fabiano Rosas, Peter Xu, Jason Wang, Vladimir Sementsov-Ogievskiy On Tue, Jan 27, 2026 at 02:25:23PM +0000, Daniel P. Berrangé wrote: > On Tue, Jan 27, 2026 at 03:03:10PM +0100, Juraj Marcin wrote: > > From: Juraj Marcin <jmarcin@redhat.com> > > > > During migration switchover both the source and the destination machines > > are paused (compute downtime). During this period network still routes > > network packets to the source machine, as this is the last place where > > the recipient MAC address has been seen. Once the destination side > > starts and sends network announcement, all subsequent frames are routed > > correctly. However, frames delivered to the source machine are never > > processed and lost. This causes also a network downtime with roughly the > > same duration as compute downtime. > > > > This can cause problems not only for protocols that cannot handle packet > > loss, but can also introduce delays in protocols that can handle them. > > > > To resolve this, this feature instantiates a network filter for each > > network backend present during migration setup on both migration sides. > > On the source side, this filter caches all packets received from the > > backend during switchover. Once the destination machine starts, all > > cached packets are sent through the migration channel and the respective > > filter object on the destination side injects them to the NIC attached > > to the backend. > > If the dest QEMU has started, I presume this means that the ARP > announcement has been sent ? For example, with virtio guest announcements, it's sent by the dest VM. 
Besides, ARP "announcements" are not necessary to reprogram the network. But if you want to absolutely avoid reordering, you can wait until there's an attempt to transfer something, buffer that something, process everything from the source (pass it to the VM), then send whatever the VM wants to send. Conceivably, QEMU-initiated packets can be handled the same way.

> IOW, the packets being forwarded > over the migration stream are guaranteed to be delivered "out > of order" wrt the sender. Should be safe for TCP, but may have > an impact on other protocols. Though apps should be aware of > that risk in general, they may not frequently encounter it, and > it could still cause service disruption > > > diff --git a/qapi/migration.json b/qapi/migration.json > > index f925e5541b..d637b22c80 100644 > > --- a/qapi/migration.json > > +++ b/qapi/migration.json > > @@ -520,6 +520,11 @@ > > # each RAM page. Requires a migration URI that supports seeking, > > # such as a file. (since 9.0) > > # > > +# @netpass: Collect packets received by network backedns after source > > +# VM is paused and send them to the destination once it resumes. > > +# This (almost) completely eliminates packet loss caused by > > +# switchover. (since 11.0) > > Should mention they will be deliver "out of order" > > > +# > > # Features: > > # > > # @unstable: Members @x-colo and @x-ignore-shared are experimental. 
> > @@ -536,7 +541,7 @@ > > { 'name': 'x-ignore-shared', 'features': [ 'unstable' ] }, > > 'validate-uuid', 'background-snapshot', > > 'zero-copy-send', 'postcopy-preempt', 'switchover-ack', > > - 'dirty-limit', 'mapped-ram'] } > > + 'dirty-limit', 'mapped-ram', 'netpass'] } > > > > ## > > # @MigrationCapabilityStatus: > > -- > > 2.52.0 > > > > > > With regards, > Daniel > -- > |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| > |: https://libvirt.org -o- https://fstop138.berrange.com :| > |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :| ^ permalink raw reply [flat|nested] 23+ messages in thread
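One way to read the suggestion in the message above is as a gate on the destination: anything that arrives directly before the forwarded packets have been processed is held, and it is released only after the last forwarded packet has been handed over. The sketch below is a toy model of that ordering rule only. It is not QEMU code, and order_gate, gate_direct, gate_forwarded and gate_forwarded_end are hypothetical names. Packets are represented by integer ids.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define GATE_MAX 64

/* Toy model: packets are just ids; "delivered" records the final order. */
struct order_gate {
    bool forwarded_done;            /* all source-side packets processed */
    int held[GATE_MAX];             /* direct packets waiting at the gate */
    size_t nheld;
    int delivered[4 * GATE_MAX];
    size_t ndelivered;
};

static void gate_deliver(struct order_gate *g, int id)
{
    assert(g->ndelivered < sizeof(g->delivered) / sizeof(g->delivered[0]));
    g->delivered[g->ndelivered++] = id;
}

/* A packet that reached the destination directly from the network. */
static bool gate_direct(struct order_gate *g, int id)
{
    if (g->forwarded_done) {
        gate_deliver(g, id);
        return true;
    }
    if (g->nheld == GATE_MAX) {
        return false;               /* hold queue full: packet is dropped */
    }
    g->held[g->nheld++] = id;
    return true;
}

/* A packet forwarded from the source over the migration channel. */
static void gate_forwarded(struct order_gate *g, int id)
{
    gate_deliver(g, id);
}

/* End of forwarded data: release held packets in arrival order. */
static void gate_forwarded_end(struct order_gate *g)
{
    g->forwarded_done = true;
    for (size_t i = 0; i < g->nheld; i++) {
        gate_deliver(g, g->held[i]);
    }
    g->nheld = 0;
}
```

A bounded hold queue is an assumption of the sketch. A real implementation would have to decide what to do when it fills, and whether guest transmissions need the same treatment.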
* Re: [PATCH 4/4] migration: Pass network packets received during switchover to dest VM 2026-01-27 14:25 ` Daniel P. Berrangé 2026-01-27 22:27 ` Michael S. Tsirkin @ 2026-01-28 12:23 ` Juraj Marcin 1 sibling, 0 replies; 23+ messages in thread From: Juraj Marcin @ 2026-01-28 12:23 UTC (permalink / raw) To: Daniel P. Berrangé Cc: qemu-devel, Fabiano Rosas, Michael S. Tsirkin, Peter Xu, Jason Wang, Vladimir Sementsov-Ogievskiy Hi Daniel, On 2026-01-27 14:25, Daniel P. Berrangé wrote: > On Tue, Jan 27, 2026 at 03:03:10PM +0100, Juraj Marcin wrote: > > From: Juraj Marcin <jmarcin@redhat.com> > > > > During migration switchover both the source and the destination machines > > are paused (compute downtime). During this period network still routes > > network packets to the source machine, as this is the last place where > > the recipient MAC address has been seen. Once the destination side > > starts and sends network announcement, all subsequent frames are routed > > correctly. However, frames delivered to the source machine are never > > processed and lost. This causes also a network downtime with roughly the > > same duration as compute downtime. > > > > This can cause problems not only for protocols that cannot handle packet > > loss, but can also introduce delays in protocols that can handle them. > > > > To resolve this, this feature instantiates a network filter for each > > network backend present during migration setup on both migration sides. > > On the source side, this filter caches all packets received from the > > backend during switchover. Once the destination machine starts, all > > cached packets are sent through the migration channel and the respective > > filter object on the destination side injects them to the NIC attached > > to the backend. > > If the dest QEMU has started, I presume this means that the ARP > announcement has been sent ? 
> IOW, the packets being forwarded > over the migration stream are guaranteed to be delivered "out > of order" wrt the sender. Should be safe for TCP, but may have > an impact on other protocols. Though apps should be aware of > that risk in general, they may not frequently encounter it, and > it could still cause service disruption

Yes, after the ARP announcement from the destination. Forwarded packets could get delivered out of order, although it would depend on the traffic rate; in my testing I encountered out-of-order packets only a couple of times. As is, this feature allows choosing between the risk of packet loss and the risk of out-of-order delivery, both of which could also happen outside the migration scope. I could also update it to defer the delivery of new packets on the destination until packets from the source side are processed, as Michael suggested; that should prevent out-of-order delivery.

> > > diff --git a/qapi/migration.json b/qapi/migration.json > > index f925e5541b..d637b22c80 100644 > > --- a/qapi/migration.json > > +++ b/qapi/migration.json > > @@ -520,6 +520,11 @@ > > # each RAM page. Requires a migration URI that supports seeking, > > # such as a file. (since 9.0) > > # > > +# @netpass: Collect packets received by network backedns after source > > +# VM is paused and send them to the destination once it resumes. > > +# This (almost) completely eliminates packet loss caused by > > +# switchover. (since 11.0) > > Should mention they will be deliver "out of order" > > > +# > > # Features: > > # > > # @unstable: Members @x-colo and @x-ignore-shared are experimental. 
> > @@ -536,7 +541,7 @@ > > { 'name': 'x-ignore-shared', 'features': [ 'unstable' ] }, > > 'validate-uuid', 'background-snapshot', > > 'zero-copy-send', 'postcopy-preempt', 'switchover-ack', > > - 'dirty-limit', 'mapped-ram'] } > > + 'dirty-limit', 'mapped-ram', 'netpass'] } > > > > ## > > # @MigrationCapabilityStatus: > > -- > > 2.52.0 > > > > > > With regards, > Daniel > -- > |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| > |: https://libvirt.org -o- https://fstop138.berrange.com :| > |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :| > ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 4/4] migration: Pass network packets received during switchover to dest VM 2026-01-27 14:03 ` [PATCH 4/4] migration: Pass network packets received during switchover to dest VM Juraj Marcin 2026-01-27 14:25 ` Daniel P. Berrangé @ 2026-01-28 2:55 ` Jason Wang 2026-01-28 2:56 ` Jason Wang 2026-01-28 13:49 ` Juraj Marcin 1 sibling, 2 replies; 23+ messages in thread From: Jason Wang @ 2026-01-28 2:55 UTC (permalink / raw) To: Juraj Marcin Cc: qemu-devel, Fabiano Rosas, Michael S. Tsirkin, Peter Xu, Vladimir Sementsov-Ogievskiy, Cindy Lu, Zhang Chen, eperezma On Tue, Jan 27, 2026 at 10:04 PM Juraj Marcin <jmarcin@redhat.com> wrote: > > From: Juraj Marcin <jmarcin@redhat.com> > > During migration switchover both the source and the destination machines > are paused (compute downtime). During this period network still routes > network packets to the source machine, as this is the last place where > the recipient MAC address has been seen. Once the destination side > starts and sends network announcement, all subsequent frames are routed > correctly. However, frames delivered to the source machine are never > processed and lost. This causes also a network downtime with roughly the > same duration as compute downtime. > > This can cause problems not only for protocols that cannot handle packet > loss, but can also introduce delays in protocols that can handle them. > > To resolve this, this feature instantiates a network filter for each > network backend present during migration setup on both migration sides. > On the source side, this filter caches all packets received from the > backend during switchover. Once the destination machine starts, all > cached packets are sent through the migration channel and the respective > filter object on the destination side injects them to the NIC attached > to the backend. 
> > Signed-off-by: Juraj Marcin <jmarcin@redhat.com> > --- > include/migration/vmstate.h | 6 + > include/net/net.h | 5 + > migration/meson.build | 1 + > migration/migration.c | 49 ++++++- > migration/migration.h | 2 + > migration/netpass.c | 246 ++++++++++++++++++++++++++++++++++++ > migration/netpass.h | 14 ++ > migration/options.c | 21 +++ > migration/options.h | 1 + > migration/savevm.c | 37 ++++++ > migration/savevm.h | 2 + > migration/trace-events | 9 ++ > net/net.c | 11 ++ > net/tap.c | 11 +- > qapi/migration.json | 7 +- > 15 files changed, 418 insertions(+), 4 deletions(-) > create mode 100644 migration/netpass.c > create mode 100644 migration/netpass.h > > diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h > index 62d7e9fe38..7987e6c85a 100644 > --- a/include/migration/vmstate.h > +++ b/include/migration/vmstate.h > @@ -200,6 +200,12 @@ typedef enum { > * save_setup() in VMSD structures. > */ > VMS_PHASE_EARLY_SETUP, > + /* > + * Specifies a netpass VMSD, these devices are copied right after the > + * destination is started regardless of precopy/postcopy. Failure in this > + * phase does not fail the migration in case of precopy. 
> +     */
> +    VMS_PHASE_NETPASS,
>  } VMStateSavePhase;
>
>  struct VMStateDescription {
> diff --git a/include/net/net.h b/include/net/net.h
> index 45bc86fc86..510908845b 100644
> --- a/include/net/net.h
> +++ b/include/net/net.h
> @@ -82,6 +82,7 @@ typedef void (NetAnnounce)(NetClientState *);
>  typedef bool (SetSteeringEBPF)(NetClientState *, int);
>  typedef bool (NetCheckPeerType)(NetClientState *, ObjectClass *, Error **);
>  typedef struct vhost_net *(GetVHostNet)(NetClientState *nc);
> +typedef void (NetpassEnabledNotify)(NetClientState *nc, void *opaque);
>
>  typedef struct NetClientInfo {
>      NetClientDriver type;
> @@ -130,6 +131,9 @@ struct NetClientState {
>      bool is_netdev;
>      bool do_not_pad; /* do not pad to the minimum ethernet frame length */
>      bool is_datapath;
> +    bool netpass_enabled;
> +    NetpassEnabledNotify *netpass_enabled_notify;
> +    void *netpass_enabled_notify_opaque;
>      QTAILQ_HEAD(, NetFilterState) filters;
>  };
>

Adding Cindy, Eugenio and Chen.

I think we can simply reuse the existing filters:

redirector: which can redirect traffic from the source to the
destination via chardev
buffer: which can hold the packets until the destination is released

And let libvirt install/uninstall those filters at the correct time.

Which means:

On the source: there would be a redirector that can be enabled when the
VM is paused, and it redirects the traffic to a socket/chardev
On the destination: there would be a redirector as well as the buffer;
the redirector receives packets from the socket and sends them to the
buffer, and the buffer holds those packets until the VM on the
destination is resumed.

The current filters need some tweaks (e.g. letting filters (redirector)
work when the VM is paused).
The advantages of this are:

1) reuse the existing filters
2) no need to care about vhost support on the source, as vhost is
   disabled; for vDPA we can reuse the shadow virtqueue
3) on the destination we can install a redirector to a packet socket to
   let vhost work, e.g. socket -> redirector -> buffer -> redirector ->
   packet socket.

Thanks
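A toy model of the hold-and-release behaviour proposed above for the
destination-side buffer may help make the ordering concrete. This sketch
is not QEMU code and does not use the netfilter API: HoldBuffer,
hold_receive(), hold_resume(), the Trace consumer and the HOLD_MAX and
PKT_MAX limits are all hypothetical names chosen for illustration. It
only shows that packets arriving while the VM is paused are queued, are
delivered in arrival order on resume, and pass straight through
afterwards.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define HOLD_MAX 64    /* hypothetical queue capacity */
#define PKT_MAX  1514  /* hypothetical per-packet size limit */

typedef struct {
    size_t len;
    unsigned char data[PKT_MAX];
} Packet;

/* Callback standing in for "inject this packet into the guest NIC". */
typedef void (*DeliverFn)(const Packet *pkt, void *opaque);

typedef struct {
    Packet queue[HOLD_MAX];
    size_t count;
    bool vm_running;
    DeliverFn deliver;
    void *opaque;
} HoldBuffer;

/* Start in the paused state, as on a migration destination. */
static void hold_init(HoldBuffer *hb, DeliverFn deliver, void *opaque)
{
    assert(hb != NULL && deliver != NULL);
    hb->count = 0;
    hb->vm_running = false;
    hb->deliver = deliver;
    hb->opaque = opaque;
}

/*
 * Accept one packet. While paused it is queued; once running it is
 * delivered immediately. Returns false if the packet had to be dropped
 * (oversized, or the queue is full).
 */
static bool hold_receive(HoldBuffer *hb, const void *buf, size_t len)
{
    Packet *slot;
    Packet direct;

    if (len > PKT_MAX) {
        return false;
    }
    if (hb->vm_running) {
        direct.len = len;
        memcpy(direct.data, buf, len);
        hb->deliver(&direct, hb->opaque);
        return true;
    }
    if (hb->count == HOLD_MAX) {
        return false;
    }
    slot = &hb->queue[hb->count++];
    slot->len = len;
    memcpy(slot->data, buf, len);
    return true;
}

/* VM resumed: flush held packets in arrival order, then pass through. */
static void hold_resume(HoldBuffer *hb)
{
    for (size_t i = 0; i < hb->count; i++) {
        hb->deliver(&hb->queue[i], hb->opaque);
    }
    hb->count = 0;
    hb->vm_running = true;
}

/* Example consumer: records the length of each delivered packet. */
typedef struct {
    size_t lens[HOLD_MAX];
    size_t n;
} Trace;

static void trace_deliver(const Packet *pkt, void *opaque)
{
    Trace *t = opaque;

    if (t->n < HOLD_MAX) {
        t->lens[t->n++] = pkt->len;
    }
}
```

A caller would initialise the buffer with hold_init(), feed it packets
read from the socket with hold_receive(), and call hold_resume() once
the destination VM starts. The fixed-size queue is a simplification; a
real filter would need a policy for what happens when it fills up.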
* Re: [PATCH 4/4] migration: Pass network packets received during switchover to dest VM 2026-01-28 2:55 ` Jason Wang @ 2026-01-28 2:56 ` Jason Wang 2026-01-28 9:07 ` Cindy Lu 2026-01-28 13:49 ` Juraj Marcin 1 sibling, 1 reply; 23+ messages in thread From: Jason Wang @ 2026-01-28 2:56 UTC (permalink / raw) To: Juraj Marcin Cc: qemu-devel, Fabiano Rosas, Michael S. Tsirkin, Peter Xu, Vladimir Sementsov-Ogievskiy, Cindy Lu, Zhang Chen, eperezma On Wed, Jan 28, 2026 at 10:55 AM Jason Wang <jasowang@redhat.com> wrote: > > On Tue, Jan 27, 2026 at 10:04 PM Juraj Marcin <jmarcin@redhat.com> wrote: > > > > From: Juraj Marcin <jmarcin@redhat.com> > > > > During migration switchover both the source and the destination machines > > are paused (compute downtime). During this period network still routes > > network packets to the source machine, as this is the last place where > > the recipient MAC address has been seen. Once the destination side > > starts and sends network announcement, all subsequent frames are routed > > correctly. However, frames delivered to the source machine are never > > processed and lost. This causes also a network downtime with roughly the > > same duration as compute downtime. > > > > This can cause problems not only for protocols that cannot handle packet > > loss, but can also introduce delays in protocols that can handle them. > > > > To resolve this, this feature instantiates a network filter for each > > network backend present during migration setup on both migration sides. > > On the source side, this filter caches all packets received from the > > backend during switchover. Once the destination machine starts, all > > cached packets are sent through the migration channel and the respective > > filter object on the destination side injects them to the NIC attached > > to the backend. 
> > > > Signed-off-by: Juraj Marcin <jmarcin@redhat.com> > > --- > > include/migration/vmstate.h | 6 + > > include/net/net.h | 5 + > > migration/meson.build | 1 + > > migration/migration.c | 49 ++++++- > > migration/migration.h | 2 + > > migration/netpass.c | 246 ++++++++++++++++++++++++++++++++++++ > > migration/netpass.h | 14 ++ > > migration/options.c | 21 +++ > > migration/options.h | 1 + > > migration/savevm.c | 37 ++++++ > > migration/savevm.h | 2 + > > migration/trace-events | 9 ++ > > net/net.c | 11 ++ > > net/tap.c | 11 +- > > qapi/migration.json | 7 +- > > 15 files changed, 418 insertions(+), 4 deletions(-) > > create mode 100644 migration/netpass.c > > create mode 100644 migration/netpass.h > > > > diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h > > index 62d7e9fe38..7987e6c85a 100644 > > --- a/include/migration/vmstate.h > > +++ b/include/migration/vmstate.h > > @@ -200,6 +200,12 @@ typedef enum { > > * save_setup() in VMSD structures. > > */ > > VMS_PHASE_EARLY_SETUP, > > + /* > > + * Specifies a netpass VMSD, these devices are copied right after the > > + * destination is started regardless of precopy/postcopy. Failure in this > > + * phase does not fail the migration in case of precopy. 
> > + */ > > + VMS_PHASE_NETPASS, > > } VMStateSavePhase; > > > > struct VMStateDescription { > > diff --git a/include/net/net.h b/include/net/net.h > > index 45bc86fc86..510908845b 100644 > > --- a/include/net/net.h > > +++ b/include/net/net.h > > @@ -82,6 +82,7 @@ typedef void (NetAnnounce)(NetClientState *); > > typedef bool (SetSteeringEBPF)(NetClientState *, int); > > typedef bool (NetCheckPeerType)(NetClientState *, ObjectClass *, Error **); > > typedef struct vhost_net *(GetVHostNet)(NetClientState *nc); > > +typedef void (NetpassEnabledNotify)(NetClientState *nc, void *opaque); > > > > typedef struct NetClientInfo { > > NetClientDriver type; > > @@ -130,6 +131,9 @@ struct NetClientState { > > bool is_netdev; > > bool do_not_pad; /* do not pad to the minimum ethernet frame length */ > > bool is_datapath; > > + bool netpass_enabled; > > + NetpassEnabledNotify *netpass_enabled_notify; > > + void *netpass_enabled_notify_opaque; > > QTAILQ_HEAD(, NetFilterState) filters; > > }; > > > > Adding Cindy, Eugenio can Chen. > > I think we can simple reuse the existing filters: > > redirector: which can redirect traffic from the source to the > destination via chardev > buffer: which can hold the packets until the destination is released > > And let the libvirt install/uninstall those filters at the correct time. > > Which means: > > On the source: there would be a redirector that can be enabled when vm > is paused, and it redirect the traffic to a socket/chardev > On the destination: there would be a redirector as well as the buffer, > redirector receives packets from the socket and send it to buffer, > buffer will hold those packets until VM in the destination is resumed. > > The current filters need some tweaks (e.g letting filters (redirector) > work when VM is paused). 
> The advantages of this are:
>
> 1) reuse the existing filters
> 2) don't need to care about the vhost support on the source as vhost
> is disabled, for vDPA we can reuse shadow virtqueue
> 3) for the destination we can install a redirector to packet socket to
> let vhost works like socket -> redirector -> buffer -> redirector ->
> packet socket.

and 4) there's no need to touch the migration code in QEMU.

Thanks

>
> Thanks
* Re: [PATCH 4/4] migration: Pass network packets received during switchover to dest VM 2026-01-28 2:56 ` Jason Wang @ 2026-01-28 9:07 ` Cindy Lu 0 siblings, 0 replies; 23+ messages in thread From: Cindy Lu @ 2026-01-28 9:07 UTC (permalink / raw) To: Jason Wang Cc: Juraj Marcin, qemu-devel, Fabiano Rosas, Michael S. Tsirkin, Peter Xu, Vladimir Sementsov-Ogievskiy, Zhang Chen, eperezma On Wed, Jan 28, 2026 at 10:57 AM Jason Wang <jasowang@redhat.com> wrote: > > On Wed, Jan 28, 2026 at 10:55 AM Jason Wang <jasowang@redhat.com> wrote: > > > > On Tue, Jan 27, 2026 at 10:04 PM Juraj Marcin <jmarcin@redhat.com> wrote: > > > > > > From: Juraj Marcin <jmarcin@redhat.com> > > > > > > During migration switchover both the source and the destination machines > > > are paused (compute downtime). During this period network still routes > > > network packets to the source machine, as this is the last place where > > > the recipient MAC address has been seen. Once the destination side > > > starts and sends network announcement, all subsequent frames are routed > > > correctly. However, frames delivered to the source machine are never > > > processed and lost. This causes also a network downtime with roughly the > > > same duration as compute downtime. > > > > > > This can cause problems not only for protocols that cannot handle packet > > > loss, but can also introduce delays in protocols that can handle them. > > > > > > To resolve this, this feature instantiates a network filter for each > > > network backend present during migration setup on both migration sides. > > > On the source side, this filter caches all packets received from the > > > backend during switchover. Once the destination machine starts, all > > > cached packets are sent through the migration channel and the respective > > > filter object on the destination side injects them to the NIC attached > > > to the backend. 
> > > > > > Signed-off-by: Juraj Marcin <jmarcin@redhat.com> > > > --- > > > include/migration/vmstate.h | 6 + > > > include/net/net.h | 5 + > > > migration/meson.build | 1 + > > > migration/migration.c | 49 ++++++- > > > migration/migration.h | 2 + > > > migration/netpass.c | 246 ++++++++++++++++++++++++++++++++++++ > > > migration/netpass.h | 14 ++ > > > migration/options.c | 21 +++ > > > migration/options.h | 1 + > > > migration/savevm.c | 37 ++++++ > > > migration/savevm.h | 2 + > > > migration/trace-events | 9 ++ > > > net/net.c | 11 ++ > > > net/tap.c | 11 +- > > > qapi/migration.json | 7 +- > > > 15 files changed, 418 insertions(+), 4 deletions(-) > > > create mode 100644 migration/netpass.c > > > create mode 100644 migration/netpass.h > > > > > > diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h > > > index 62d7e9fe38..7987e6c85a 100644 > > > --- a/include/migration/vmstate.h > > > +++ b/include/migration/vmstate.h > > > @@ -200,6 +200,12 @@ typedef enum { > > > * save_setup() in VMSD structures. > > > */ > > > VMS_PHASE_EARLY_SETUP, > > > + /* > > > + * Specifies a netpass VMSD, these devices are copied right after the > > > + * destination is started regardless of precopy/postcopy. Failure in this > > > + * phase does not fail the migration in case of precopy. 
> > > + */ > > > + VMS_PHASE_NETPASS, > > > } VMStateSavePhase; > > > > > > struct VMStateDescription { > > > diff --git a/include/net/net.h b/include/net/net.h > > > index 45bc86fc86..510908845b 100644 > > > --- a/include/net/net.h > > > +++ b/include/net/net.h > > > @@ -82,6 +82,7 @@ typedef void (NetAnnounce)(NetClientState *); > > > typedef bool (SetSteeringEBPF)(NetClientState *, int); > > > typedef bool (NetCheckPeerType)(NetClientState *, ObjectClass *, Error **); > > > typedef struct vhost_net *(GetVHostNet)(NetClientState *nc); > > > +typedef void (NetpassEnabledNotify)(NetClientState *nc, void *opaque); > > > > > > typedef struct NetClientInfo { > > > NetClientDriver type; > > > @@ -130,6 +131,9 @@ struct NetClientState { > > > bool is_netdev; > > > bool do_not_pad; /* do not pad to the minimum ethernet frame length */ > > > bool is_datapath; > > > + bool netpass_enabled; > > > + NetpassEnabledNotify *netpass_enabled_notify; > > > + void *netpass_enabled_notify_opaque; > > > QTAILQ_HEAD(, NetFilterState) filters; > > > }; > > > > > > > Adding Cindy, Eugenio can Chen. > > > > I think we can simple reuse the existing filters: > > > > redirector: which can redirect traffic from the source to the > > destination via chardev > > buffer: which can hold the packets until the destination is released > > > > And let the libvirt install/uninstall those filters at the correct time. > > > > Which means: > > > > On the source: there would be a redirector that can be enabled when vm > > is paused, and it redirect the traffic to a socket/chardev > > On the destination: there would be a redirector as well as the buffer, > > redirector receives packets from the socket and send it to buffer, > > buffer will hold those packets until VM in the destination is resumed. > > > > The current filters need some tweaks (e.g letting filters (redirector) > > work when VM is paused). 
> > The advantages of this are:
> >
> > 1) reuse the existing filters
> > 2) don't need to care about the vhost support on the source as vhost
> > is disabled, for vDPA we can reuse shadow virtqueue
> > 3) for the destination we can install a redirector to packet socket to
> > let vhost works like socket -> redirector -> buffer -> redirector ->
> > packet socket.

Actually, I've already started working on this to support vhost. It
will be based on the filters working as explained here, and the coding
is ongoing. I hope we can have a draft soon.

Thanks
Cindy

>
> and 4) there's no need to touch migration code in Qemu.
>
> Thanks
>
> >
> > Thanks
>
* Re: [PATCH 4/4] migration: Pass network packets received during switchover to dest VM 2026-01-28 2:55 ` Jason Wang 2026-01-28 2:56 ` Jason Wang @ 2026-01-28 13:49 ` Juraj Marcin 2026-01-29 1:05 ` Jason Wang 1 sibling, 1 reply; 23+ messages in thread From: Juraj Marcin @ 2026-01-28 13:49 UTC (permalink / raw) To: Jason Wang Cc: qemu-devel, Fabiano Rosas, Michael S. Tsirkin, Peter Xu, Vladimir Sementsov-Ogievskiy, Cindy Lu, Zhang Chen, eperezma Hi Jason, On 2026-01-28 10:55, Jason Wang wrote: > On Tue, Jan 27, 2026 at 10:04 PM Juraj Marcin <jmarcin@redhat.com> wrote: > > > > From: Juraj Marcin <jmarcin@redhat.com> > > > > During migration switchover both the source and the destination machines > > are paused (compute downtime). During this period network still routes > > network packets to the source machine, as this is the last place where > > the recipient MAC address has been seen. Once the destination side > > starts and sends network announcement, all subsequent frames are routed > > correctly. However, frames delivered to the source machine are never > > processed and lost. This causes also a network downtime with roughly the > > same duration as compute downtime. > > > > This can cause problems not only for protocols that cannot handle packet > > loss, but can also introduce delays in protocols that can handle them. > > > > To resolve this, this feature instantiates a network filter for each > > network backend present during migration setup on both migration sides. > > On the source side, this filter caches all packets received from the > > backend during switchover. Once the destination machine starts, all > > cached packets are sent through the migration channel and the respective > > filter object on the destination side injects them to the NIC attached > > to the backend. 
> > > > Signed-off-by: Juraj Marcin <jmarcin@redhat.com> > > --- > > include/migration/vmstate.h | 6 + > > include/net/net.h | 5 + > > migration/meson.build | 1 + > > migration/migration.c | 49 ++++++- > > migration/migration.h | 2 + > > migration/netpass.c | 246 ++++++++++++++++++++++++++++++++++++ > > migration/netpass.h | 14 ++ > > migration/options.c | 21 +++ > > migration/options.h | 1 + > > migration/savevm.c | 37 ++++++ > > migration/savevm.h | 2 + > > migration/trace-events | 9 ++ > > net/net.c | 11 ++ > > net/tap.c | 11 +- > > qapi/migration.json | 7 +- > > 15 files changed, 418 insertions(+), 4 deletions(-) > > create mode 100644 migration/netpass.c > > create mode 100644 migration/netpass.h > > > > diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h > > index 62d7e9fe38..7987e6c85a 100644 > > --- a/include/migration/vmstate.h > > +++ b/include/migration/vmstate.h > > @@ -200,6 +200,12 @@ typedef enum { > > * save_setup() in VMSD structures. > > */ > > VMS_PHASE_EARLY_SETUP, > > + /* > > + * Specifies a netpass VMSD, these devices are copied right after the > > + * destination is started regardless of precopy/postcopy. Failure in this > > + * phase does not fail the migration in case of precopy. 
> > + */ > > + VMS_PHASE_NETPASS, > > } VMStateSavePhase; > > > > struct VMStateDescription { > > diff --git a/include/net/net.h b/include/net/net.h > > index 45bc86fc86..510908845b 100644 > > --- a/include/net/net.h > > +++ b/include/net/net.h > > @@ -82,6 +82,7 @@ typedef void (NetAnnounce)(NetClientState *); > > typedef bool (SetSteeringEBPF)(NetClientState *, int); > > typedef bool (NetCheckPeerType)(NetClientState *, ObjectClass *, Error **); > > typedef struct vhost_net *(GetVHostNet)(NetClientState *nc); > > +typedef void (NetpassEnabledNotify)(NetClientState *nc, void *opaque); > > > > typedef struct NetClientInfo { > > NetClientDriver type; > > @@ -130,6 +131,9 @@ struct NetClientState { > > bool is_netdev; > > bool do_not_pad; /* do not pad to the minimum ethernet frame length */ > > bool is_datapath; > > + bool netpass_enabled; > > + NetpassEnabledNotify *netpass_enabled_notify; > > + void *netpass_enabled_notify_opaque; > > QTAILQ_HEAD(, NetFilterState) filters; > > }; > > > > Adding Cindy, Eugenio can Chen. > > I think we can simple reuse the existing filters: > > redirector: which can redirect traffic from the source to the > destination via chardev > buffer: which can hold the packets until the destination is released > > And let the libvirt install/uninstall those filters at the correct time. > > Which means: > > On the source: there would be a redirector that can be enabled when vm > is paused, and it redirect the traffic to a socket/chardev > On the destination: there would be a redirector as well as the buffer, > redirector receives packets from the socket and send it to buffer, > buffer will hold those packets until VM in the destination is resumed. > > The current filters need some tweaks (e.g letting filters (redirector) > work when VM is paused). 
> The advantages of this are:

I first tested the idea of forwarding with the existing filters, and it
does work with the mentioned tweaks. However, it requires an additional
channel between the chardevs attached to the filters.

In my opinion it is better to reuse the already existing migration
channel, so that no changes are necessary in higher layers. Furthermore,
by implementing this directly in QEMU, the feature can be used anywhere,
even if libvirt is not used.

>
> 1) reuse the existing filters
> 2) don't need to care about the vhost support on the source as vhost
> is disabled, for vDPA we can reuse shadow virtqueue

Can you elaborate more on vhost being disabled? IIUC QEMU net filters
don't support vhost=on, including the redirector filter.

> 3) for the destination we can install a redirector to packet socket to
> let vhost works like socket -> redirector -> buffer -> redirector ->
> packet socket.
>
> Thanks
>
* Re: [PATCH 4/4] migration: Pass network packets received during switchover to dest VM 2026-01-28 13:49 ` Juraj Marcin @ 2026-01-29 1:05 ` Jason Wang 2026-01-29 16:07 ` Zhang Chen 0 siblings, 1 reply; 23+ messages in thread From: Jason Wang @ 2026-01-29 1:05 UTC (permalink / raw) To: Juraj Marcin Cc: qemu-devel, Fabiano Rosas, Michael S. Tsirkin, Peter Xu, Vladimir Sementsov-Ogievskiy, Cindy Lu, Zhang Chen, eperezma On Wed, Jan 28, 2026 at 9:49 PM Juraj Marcin <jmarcin@redhat.com> wrote: > > Hi Jason, > > On 2026-01-28 10:55, Jason Wang wrote: > > On Tue, Jan 27, 2026 at 10:04 PM Juraj Marcin <jmarcin@redhat.com> wrote: > > > > > > From: Juraj Marcin <jmarcin@redhat.com> > > > > > > During migration switchover both the source and the destination machines > > > are paused (compute downtime). During this period network still routes > > > network packets to the source machine, as this is the last place where > > > the recipient MAC address has been seen. Once the destination side > > > starts and sends network announcement, all subsequent frames are routed > > > correctly. However, frames delivered to the source machine are never > > > processed and lost. This causes also a network downtime with roughly the > > > same duration as compute downtime. > > > > > > This can cause problems not only for protocols that cannot handle packet > > > loss, but can also introduce delays in protocols that can handle them. > > > > > > To resolve this, this feature instantiates a network filter for each > > > network backend present during migration setup on both migration sides. > > > On the source side, this filter caches all packets received from the > > > backend during switchover. Once the destination machine starts, all > > > cached packets are sent through the migration channel and the respective > > > filter object on the destination side injects them to the NIC attached > > > to the backend. 
> > > > > > Signed-off-by: Juraj Marcin <jmarcin@redhat.com> > > > --- > > > include/migration/vmstate.h | 6 + > > > include/net/net.h | 5 + > > > migration/meson.build | 1 + > > > migration/migration.c | 49 ++++++- > > > migration/migration.h | 2 + > > > migration/netpass.c | 246 ++++++++++++++++++++++++++++++++++++ > > > migration/netpass.h | 14 ++ > > > migration/options.c | 21 +++ > > > migration/options.h | 1 + > > > migration/savevm.c | 37 ++++++ > > > migration/savevm.h | 2 + > > > migration/trace-events | 9 ++ > > > net/net.c | 11 ++ > > > net/tap.c | 11 +- > > > qapi/migration.json | 7 +- > > > 15 files changed, 418 insertions(+), 4 deletions(-) > > > create mode 100644 migration/netpass.c > > > create mode 100644 migration/netpass.h > > > > > > diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h > > > index 62d7e9fe38..7987e6c85a 100644 > > > --- a/include/migration/vmstate.h > > > +++ b/include/migration/vmstate.h > > > @@ -200,6 +200,12 @@ typedef enum { > > > * save_setup() in VMSD structures. > > > */ > > > VMS_PHASE_EARLY_SETUP, > > > + /* > > > + * Specifies a netpass VMSD, these devices are copied right after the > > > + * destination is started regardless of precopy/postcopy. Failure in this > > > + * phase does not fail the migration in case of precopy. 
> > > + */ > > > + VMS_PHASE_NETPASS, > > > } VMStateSavePhase; > > > > > > struct VMStateDescription { > > > diff --git a/include/net/net.h b/include/net/net.h > > > index 45bc86fc86..510908845b 100644 > > > --- a/include/net/net.h > > > +++ b/include/net/net.h > > > @@ -82,6 +82,7 @@ typedef void (NetAnnounce)(NetClientState *); > > > typedef bool (SetSteeringEBPF)(NetClientState *, int); > > > typedef bool (NetCheckPeerType)(NetClientState *, ObjectClass *, Error **); > > > typedef struct vhost_net *(GetVHostNet)(NetClientState *nc); > > > +typedef void (NetpassEnabledNotify)(NetClientState *nc, void *opaque); > > > > > > typedef struct NetClientInfo { > > > NetClientDriver type; > > > @@ -130,6 +131,9 @@ struct NetClientState { > > > bool is_netdev; > > > bool do_not_pad; /* do not pad to the minimum ethernet frame length */ > > > bool is_datapath; > > > + bool netpass_enabled; > > > + NetpassEnabledNotify *netpass_enabled_notify; > > > + void *netpass_enabled_notify_opaque; > > > QTAILQ_HEAD(, NetFilterState) filters; > > > }; > > > > > > > Adding Cindy, Eugenio can Chen. > > > > I think we can simple reuse the existing filters: > > > > redirector: which can redirect traffic from the source to the > > destination via chardev > > buffer: which can hold the packets until the destination is released > > > > And let the libvirt install/uninstall those filters at the correct time. > > > > Which means: > > > > On the source: there would be a redirector that can be enabled when vm > > is paused, and it redirect the traffic to a socket/chardev > > On the destination: there would be a redirector as well as the buffer, > > redirector receives packets from the socket and send it to buffer, > > buffer will hold those packets until VM in the destination is resumed. > > > > The current filters need some tweaks (e.g letting filters (redirector) > > work when VM is paused). 
> > The advantages of this are:
>
> I tested the idea of filters for the forwarding using the existing
> filters first and it does work with mentioned tweaks, however this
> requires additional channel between chardevs attached to filters.

It requires some changes in the redirector. One of the major but
trivial changes is to make it work when the VM is paused.

>
> In my opinion it was better to reuse already existing migration channel,
> so there are no necessary changes in higher layers. Furthermore, by
> implementing this directly in QEMU, this feature can be used anywhere,
> even if libvirt is not used.

This requires more thought; leaving the policy to the upper layer may
give us flexibility.

>
> >
> > 1) reuse the existing filters
> > 2) don't need to care about the vhost support on the source as vhost
> > is disabled, for vDPA we can reuse shadow virtqueue
>
> Can you elaborate more on vhost being disabled? IUUC QEMU net filters
> don't support vhost=on, including the redirector filter.

We only need the filter to work when the VM is paused; in this case
vhost is disabled. We can use a packet socket to read packets from, or
inject packets into, the tap device.

>
> > 3) for the destination we can install a redirector to packet socket to
> > let vhost works like socket -> redirector -> buffer -> redirector ->
> > packet socket.
> >
> > Thanks
> >
>

Thanks
* Re: [PATCH 4/4] migration: Pass network packets received during switchover to dest VM 2026-01-29 1:05 ` Jason Wang @ 2026-01-29 16:07 ` Zhang Chen 0 siblings, 0 replies; 23+ messages in thread From: Zhang Chen @ 2026-01-29 16:07 UTC (permalink / raw) To: Jason Wang Cc: Juraj Marcin, qemu-devel, Fabiano Rosas, Michael S. Tsirkin, Peter Xu, Vladimir Sementsov-Ogievskiy, Cindy Lu, eperezma On Thu, Jan 29, 2026 at 9:06 AM Jason Wang <jasowang@redhat.com> wrote: > > On Wed, Jan 28, 2026 at 9:49 PM Juraj Marcin <jmarcin@redhat.com> wrote: > > > > Hi Jason, > > > > On 2026-01-28 10:55, Jason Wang wrote: > > > On Tue, Jan 27, 2026 at 10:04 PM Juraj Marcin <jmarcin@redhat.com> wrote: > > > > > > > > From: Juraj Marcin <jmarcin@redhat.com> > > > > > > > > During migration switchover both the source and the destination machines > > > > are paused (compute downtime). During this period network still routes > > > > network packets to the source machine, as this is the last place where > > > > the recipient MAC address has been seen. Once the destination side > > > > starts and sends network announcement, all subsequent frames are routed > > > > correctly. However, frames delivered to the source machine are never > > > > processed and lost. This causes also a network downtime with roughly the > > > > same duration as compute downtime. > > > > > > > > This can cause problems not only for protocols that cannot handle packet > > > > loss, but can also introduce delays in protocols that can handle them. > > > > > > > > To resolve this, this feature instantiates a network filter for each > > > > network backend present during migration setup on both migration sides. > > > > On the source side, this filter caches all packets received from the > > > > backend during switchover. 
Once the destination machine starts, all > > > > cached packets are sent through the migration channel and the respective > > > > filter object on the destination side injects them to the NIC attached > > > > to the backend. > > > > > > > > Signed-off-by: Juraj Marcin <jmarcin@redhat.com> > > > > --- > > > > include/migration/vmstate.h | 6 + > > > > include/net/net.h | 5 + > > > > migration/meson.build | 1 + > > > > migration/migration.c | 49 ++++++- > > > > migration/migration.h | 2 + > > > > migration/netpass.c | 246 ++++++++++++++++++++++++++++++++++++ > > > > migration/netpass.h | 14 ++ > > > > migration/options.c | 21 +++ > > > > migration/options.h | 1 + > > > > migration/savevm.c | 37 ++++++ > > > > migration/savevm.h | 2 + > > > > migration/trace-events | 9 ++ > > > > net/net.c | 11 ++ > > > > net/tap.c | 11 +- > > > > qapi/migration.json | 7 +- > > > > 15 files changed, 418 insertions(+), 4 deletions(-) > > > > create mode 100644 migration/netpass.c > > > > create mode 100644 migration/netpass.h > > > > > > > > diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h > > > > index 62d7e9fe38..7987e6c85a 100644 > > > > --- a/include/migration/vmstate.h > > > > +++ b/include/migration/vmstate.h > > > > @@ -200,6 +200,12 @@ typedef enum { > > > > * save_setup() in VMSD structures. > > > > */ > > > > VMS_PHASE_EARLY_SETUP, > > > > + /* > > > > + * Specifies a netpass VMSD, these devices are copied right after the > > > > + * destination is started regardless of precopy/postcopy. Failure in this > > > > + * phase does not fail the migration in case of precopy. 
> > > > + */ > > > > + VMS_PHASE_NETPASS, > > > > } VMStateSavePhase; > > > > > > > > struct VMStateDescription { > > > > diff --git a/include/net/net.h b/include/net/net.h > > > > index 45bc86fc86..510908845b 100644 > > > > --- a/include/net/net.h > > > > +++ b/include/net/net.h > > > > @@ -82,6 +82,7 @@ typedef void (NetAnnounce)(NetClientState *); > > > > typedef bool (SetSteeringEBPF)(NetClientState *, int); > > > > typedef bool (NetCheckPeerType)(NetClientState *, ObjectClass *, Error **); > > > > typedef struct vhost_net *(GetVHostNet)(NetClientState *nc); > > > > +typedef void (NetpassEnabledNotify)(NetClientState *nc, void *opaque); > > > > > > > > typedef struct NetClientInfo { > > > > NetClientDriver type; > > > > @@ -130,6 +131,9 @@ struct NetClientState { > > > > bool is_netdev; > > > > bool do_not_pad; /* do not pad to the minimum ethernet frame length */ > > > > bool is_datapath; > > > > + bool netpass_enabled; > > > > + NetpassEnabledNotify *netpass_enabled_notify; > > > > + void *netpass_enabled_notify_opaque; > > > > QTAILQ_HEAD(, NetFilterState) filters; > > > > }; > > > > > > > > > > Adding Cindy, Eugenio can Chen. > > > > > > I think we can simple reuse the existing filters: > > > > > > redirector: which can redirect traffic from the source to the > > > destination via chardev > > > buffer: which can hold the packets until the destination is released > > > > > > And let the libvirt install/uninstall those filters at the correct time. > > > > > > Which means: > > > > > > On the source: there would be a redirector that can be enabled when vm > > > is paused, and it redirect the traffic to a socket/chardev > > > On the destination: there would be a redirector as well as the buffer, > > > redirector receives packets from the socket and send it to buffer, > > > buffer will hold those packets until VM in the destination is resumed. > > > > > > The current filters need some tweaks (e.g letting filters (redirector) > > > work when VM is paused). 
> > > The advantages of this are:
> >
> > I tested the idea of filters for the forwarding using the existing
> > filters first and it does work with mentioned tweaks, however this
> > requires additional channel between chardevs attached to filters.
>
> It requires some changes in the redirector. One of the major but
> trivial changes is to make it work when the VM is paused.

Actually, the current netfilter doesn't depend on the running or paused
state of the virtual machine. It just handles packets on the QEMU side.

>
> >
> > In my opinion it was better to reuse already existing migration channel,
> > so there are no necessary changes in higher layers. Furthermore, by
> > implementing this directly in QEMU, this feature can be used anywhere,
> > even if libvirt is not used.
>
> This requires more thought, leaving the policy to the upper may give
> us flexibility.

Agreed, the use case for this series is very similar to the COLO project.
https://wiki.qemu.org/Features/COLO
We handled network-related issues by introducing the QEMU network
filters. It's even possible to transparently modify TCP packet headers
without the virtual machine being aware of it.
https://github.com/qemu/qemu/blob/master/docs/colo-proxy.txt

>
> >
> > >
> > > 1) reuse the existing filters
> > > 2) don't need to care about the vhost support on the source as vhost
> > > is disabled, for vDPA we can reuse shadow virtqueue
> >
> > Can you elaborate more on vhost being disabled? IUUC QEMU net filters
> > don't support vhost=on, including the redirector filter.
>
> We only need the filter work when vm is paused, in this case vhost is
> disabled. We can use packet socket to read or inject packet to tap.

Agreed, the COLO proxy uses the filter-redirector to inject network
packets into the destination VM.
By the way, COLO is an HA/FT project in QEMU. Based on our previous
practical experience, the VM stop time is handled by the netfilter and
the TCP/IP retransmission mechanism; for stateless protocols like UDP,
the application handles it by default.
The only difference with COLO is the normal live migration need filter-buffer in dest side. And the filter-buffer and VM paused time can also cause problems with network transmission. Another issue is when VM paused time became longer, filter-redirector still running, but the source side QEMU read packet from tap device path need to be double checked. You can know how the multiple network filters co-work in the COLO-proxy docs. Thanks Chen > > > > > > 3) for the destination we can install a redirector to packet socket to > > > let vhost works like socket -> redirector -> buffer -> redirector -> > > > packet socket. > > > > > > Thanks > > > > > > > Thanks > ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 0/4] migration: Pass network packets received during switchover to dest VM 2026-01-27 14:03 [PATCH 0/4] migration: Pass network packets received during switchover to dest VM Juraj Marcin ` (3 preceding siblings ...) 2026-01-27 14:03 ` [PATCH 4/4] migration: Pass network packets received during switchover to dest VM Juraj Marcin @ 2026-01-27 18:21 ` Stefano Brivio 2026-01-28 13:06 ` Juraj Marcin 2026-02-03 12:03 ` Laurent Vivier 4 siblings, 2 replies; 23+ messages in thread From: Stefano Brivio @ 2026-01-27 18:21 UTC (permalink / raw) To: Juraj Marcin Cc: qemu-devel, Fabiano Rosas, Michael S. Tsirkin, Peter Xu, Jason Wang, Vladimir Sementsov-Ogievskiy, Laurent Vivier, David Gibson [Cc'ing Laurent and David] On Tue, 27 Jan 2026 15:03:06 +0100 Juraj Marcin <jmarcin@redhat.com> wrote: > During switchover there is a period during which both source and > destination side VMs are paused. During this period, all network packets > are still routed to the source side, but it will never process them. > Once the destination resumes, it is not aware of these packets and they > are lost. This can cause packet loss in unreliable protocols and > extended delays due to retransmission in reliable protocols. > > This series resolves this problem by caching packets received once the > source VM pauses and then passing and injecting them on the destination > side. This feature is implemented in the last patch. The caching and > injecting is implemented using network filter interface and should work > with any backend with vhost=off, but only TAP network backend was > explicitly tested. I haven't had a chance to try this change with passt(1) yet (the backend can be enabled using "-net passt" or by starting it separately). Given that passt implements migration on its own (in deeper detail in some sense, as TCP connections are preserved if IP addresses match), I wonder if it this might affect or break it somehow. Did you perhaps have some thoughts about that already? 
For context, we didn't really write comprehensive documentation about
it yet, but:

- KubeVirt's enhancement repository has a detailed description at:
  https://github.com/kubevirt/enhancements/blob/main/veps/sig-network/passt/passt-migration-proposal.md#live-migration-with-passt

- the QEMU-facing details are outlined in:
  https://archives.passt.top/passt-dev/20241219111400.2352110-1-lvivier@redhat.com/

- usage of TCP_REPAIR is briefly described in passt-repair(1)

--
Stefano

^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 0/4] migration: Pass network packets received during switchover to dest VM 2026-01-27 18:21 ` [PATCH 0/4] " Stefano Brivio @ 2026-01-28 13:06 ` Juraj Marcin 2026-01-28 17:27 ` Stefano Brivio 2026-02-03 12:03 ` Laurent Vivier 1 sibling, 1 reply; 23+ messages in thread From: Juraj Marcin @ 2026-01-28 13:06 UTC (permalink / raw) To: Stefano Brivio Cc: qemu-devel, Fabiano Rosas, Michael S. Tsirkin, Peter Xu, Jason Wang, Vladimir Sementsov-Ogievskiy, Laurent Vivier, David Gibson Hi Stefano, On 2026-01-27 19:21, Stefano Brivio wrote: > [Cc'ing Laurent and David] > > On Tue, 27 Jan 2026 15:03:06 +0100 > Juraj Marcin <jmarcin@redhat.com> wrote: > > > During switchover there is a period during which both source and > > destination side VMs are paused. During this period, all network packets > > are still routed to the source side, but it will never process them. > > Once the destination resumes, it is not aware of these packets and they > > are lost. This can cause packet loss in unreliable protocols and > > extended delays due to retransmission in reliable protocols. > > > > This series resolves this problem by caching packets received once the > > source VM pauses and then passing and injecting them on the destination > > side. This feature is implemented in the last patch. The caching and > > injecting is implemented using network filter interface and should work > > with any backend with vhost=off, but only TAP network backend was > > explicitly tested. > > I haven't had a chance to try this change with passt(1) yet (the > backend can be enabled using "-net passt" or by starting it > separately). > > Given that passt implements migration on its own (in deeper detail in > some sense, as TCP connections are preserved if IP addresses match), I > wonder if it this might affect or break it somehow. > > Did you perhaps have some thoughts about that already? 
I'm aware of passt migrating its state and passt-repair, but I also
haven't tested it as I couldn't get passt-repair to work.

Does it also handle other protocols, or does it just preserve TCP
connections?

The main focus of this feature is protocols that cannot handle packet
loss on their own in environments where the IP address is preserved (and
thus also TCP connections). So, mainly tap/bridge, with the idea that
other network backends could also benefit from it. However, if it causes
problems with other backends, I could limit it just to tap.

>
> For context, we didn't really write comprehensive documentation about
> it yet, but:
>
> - KubeVirt's enhancement repository has a detailed description at:
>   https://github.com/kubevirt/enhancements/blob/main/veps/sig-network/passt/passt-migration-proposal.md#live-migration-with-passt
>
> - the QEMU-facing details are outlined in:
>   https://archives.passt.top/passt-dev/20241219111400.2352110-1-lvivier@redhat.com/
>
> - usage of TCP_REPAIR is briefly described in passt-repair(1)
>
> --
> Stefano
>

^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 0/4] migration: Pass network packets received during switchover to dest VM 2026-01-28 13:06 ` Juraj Marcin @ 2026-01-28 17:27 ` Stefano Brivio 2026-01-30 14:40 ` Juraj Marcin 0 siblings, 1 reply; 23+ messages in thread From: Stefano Brivio @ 2026-01-28 17:27 UTC (permalink / raw) To: Juraj Marcin Cc: qemu-devel, Fabiano Rosas, Michael S. Tsirkin, Peter Xu, Jason Wang, Vladimir Sementsov-Ogievskiy, Laurent Vivier, David Gibson On Wed, 28 Jan 2026 14:06:11 +0100 Juraj Marcin <jmarcin@redhat.com> wrote: > Hi Stefano, > > On 2026-01-27 19:21, Stefano Brivio wrote: > > [Cc'ing Laurent and David] > > > > On Tue, 27 Jan 2026 15:03:06 +0100 > > Juraj Marcin <jmarcin@redhat.com> wrote: > > > > > During switchover there is a period during which both source and > > > destination side VMs are paused. During this period, all network packets > > > are still routed to the source side, but it will never process them. > > > Once the destination resumes, it is not aware of these packets and they > > > are lost. This can cause packet loss in unreliable protocols and > > > extended delays due to retransmission in reliable protocols. > > > > > > This series resolves this problem by caching packets received once the > > > source VM pauses and then passing and injecting them on the destination > > > side. This feature is implemented in the last patch. The caching and > > > injecting is implemented using network filter interface and should work > > > with any backend with vhost=off, but only TAP network backend was > > > explicitly tested. > > > > I haven't had a chance to try this change with passt(1) yet (the > > backend can be enabled using "-net passt" or by starting it > > separately). > > > > Given that passt implements migration on its own (in deeper detail in > > some sense, as TCP connections are preserved if IP addresses match), I > > wonder if it this might affect or break it somehow. > > > > Did you perhaps have some thoughts about that already? 
>
> I'm aware of passt migrating its state and passt-repair, but I also
> haven't tested it as I couldn't get passt-repair to work.

Oops. Let me know if you're hitting any specific error I could look
into.

I plan anyway to try out your changes but I might need a couple of days
before I find the time.

> Does it also handle other protocols, or just preserves TCP connections?

Layer-4-wise, we have an internal representation of UDP "flows"
(observed flows of packets for which we preserve the same source port
mapping, with timeouts) and we had a vague idea of migrating those as
well, but it's debatable whether there's any benefit from it.

At Layer 2 and 3, we migrate IP and MAC addresses we observed from the
guest:

  https://passt.top/passt/tree/migrate.c?id=e3f70c05bad90368a1a89bf31a9015125232b9ae#n31

so that we have ARP and NDP resolution, as well as any NAT
mapping working right away as needed.

For completeness, this is the TCP context we migrate instead:

  https://passt.top/passt/tree/tcp_conn.h?id=e3f70c05bad90368a1a89bf31a9015125232b9ae#n108
  https://passt.top/passt/tree/tcp_conn.h?id=e3f70c05bad90368a1a89bf31a9015125232b9ae#n154

> The main focus of this feature are protocols that cannot handle packet
> loss on their own in environments where IP address is preserved (and
> thus also TCP connections).

Well, strictly speaking, TCP handles packet loss, that's actually the
main reason behind it. I guess this is to improve throughput and avoid
latency spikes or retransmissions that could be avoided?

> So, mainly tap/bridge, with the idea that
> other network backends could also benefit from it. However, if it causes
> problems with other backends, I could limit it just to tap.

I couldn't quite figure out yet if it's beneficial, useless, or
harmless for passt. With passt, what happens without your
implementation is:

1. guest pauses

2. the source instance of passt starts migrating, meaning that sockets
   are frozen one by one, their receiving and sending queues dumped

3. pending queues are sent to the target instance of passt, which opens
   sockets and refills queues as needed

4. target guest resumes and will get any traffic that was received by
   the source instance of passt between 1. and 2.

Right now there's still a Linux kernel issue we observed (see also
https://pad.passt.top/p/TcpRepairTodo, that's line 4 there) which might
cause segments to be received (and acknowledged!) on sockets of the
source instance of passt for a small time period *after* we freeze them
with TCP_REPAIR (that is, TCP_REPAIR doesn't really freeze the queue).

I'm currently working on a proper fix for that. Until then, point 2.
above isn't entirely accurate (but it only happens if you hammer it
with traffic generators, it's not really visible otherwise).

With your implementation, I guess:

1. guest pauses

2. the source instance of passt starts migrating, meaning that sockets
   are frozen one by one, their receiving and sending queues dumped

2a. any data received by QEMU after 1. will be stored and forwarded to
    the target later. But passt at this point prevents the guest from
    getting any data, so there should be no data involved

3. pending queues are sent to the target instance of passt, which opens
   sockets and refills queues as needed

3a. the target guest gets the data from 2a. As long as there's no data
    (as I'm assuming), there should be no change. If there's data coming
    in at this point, we risk that sequences don't match anymore? I'm not
    sure

4. target guest resumes and will *also* get any traffic that was received
   by the source instance of passt between 1. and 2.

So if my assumption from 2a. above holds, it should be useless, but
harmless.

Would your implementation help with the kernel glitch we're currently
observing? I don't think so, because your implementation would only play
a role between passt and QEMU, and we don't have issues there.

Well, it would be good to try things out.
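
To make the QEMU side of the sequence above (steps 1, 2a and 3a) a bit
more concrete, here is a toy model in Python. It is only a sketch of my
understanding of the cover letter, not QEMU code: Guest, Source,
on_packet() and on_vm_started() are made-up names, and the real series
works with net filters and the migration channel rather than with plain
function calls.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Guest:
    """Toy VM that only accepts packets while it is running."""
    running: bool = False
    received: list = field(default_factory=list)

    def deliver(self, packet: str) -> bool:
        if not self.running:
            return False
        self.received.append(packet)
        return True


@dataclass
class Source:
    """Source side; caches packets for a paused guest if netpass is on."""
    guest: Guest
    netpass: bool
    cache: deque = field(default_factory=deque)
    lost: list = field(default_factory=list)

    def on_packet(self, packet: str) -> None:
        if self.guest.deliver(packet):
            return
        if self.netpass:
            self.cache.append(packet)   # step 2a: store for later
        else:
            self.lost.append(packet)    # nobody will ever process it

    def on_vm_started(self, dest: Guest) -> None:
        """Hypothetical hook: destination reported that it is running."""
        while self.cache:
            dest.deliver(self.cache.popleft())  # step 3a: inject in order


def switchover(netpass: bool):
    src_guest, dst_guest = Guest(running=True), Guest()
    source = Source(src_guest, netpass)
    source.on_packet("p1")       # guest still runs on the source
    src_guest.running = False    # step 1: guest pauses
    source.on_packet("p2")       # arrives while both sides are paused
    source.on_packet("p3")
    dst_guest.running = True     # destination resumes
    source.on_vm_started(dst_guest)
    return src_guest.received, dst_guest.received, source.lost
```

With netpass=True the destination guest ends up with p2 and p3, in
order; with netpass=False they land in the lost list. If passt really
holds back all data after step 1, the cache simply stays empty and
on_vm_started() is a no-op, which is the "useless, but harmless" case.
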
Other than that, unless I'm missing something, your implementation
should probably be skipped for passt for simplicity, and also to avoid
negatively affecting downtime.

Note that you can also use passt without "-net passt" (that's actually
quite recent) but with a tap back-end. Migration is only supported with
vhost-user enabled though, and as far as I understand your implementation
is disabled in that case?

--
Stefano

^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 0/4] migration: Pass network packets received during switchover to dest VM 2026-01-28 17:27 ` Stefano Brivio @ 2026-01-30 14:40 ` Juraj Marcin 2026-01-31 2:27 ` Stefano Brivio 0 siblings, 1 reply; 23+ messages in thread From: Juraj Marcin @ 2026-01-30 14:40 UTC (permalink / raw) To: Stefano Brivio Cc: qemu-devel, Fabiano Rosas, Michael S. Tsirkin, Peter Xu, Jason Wang, Vladimir Sementsov-Ogievskiy, Laurent Vivier, David Gibson Hi Stefano, thanks for the answer! On 2026-01-28 18:27, Stefano Brivio wrote: > On Wed, 28 Jan 2026 14:06:11 +0100 > Juraj Marcin <jmarcin@redhat.com> wrote: > > > Hi Stefano, > > > > On 2026-01-27 19:21, Stefano Brivio wrote: > > > [Cc'ing Laurent and David] > > > > > > On Tue, 27 Jan 2026 15:03:06 +0100 > > > Juraj Marcin <jmarcin@redhat.com> wrote: > > > > > > > During switchover there is a period during which both source and > > > > destination side VMs are paused. During this period, all network packets > > > > are still routed to the source side, but it will never process them. > > > > Once the destination resumes, it is not aware of these packets and they > > > > are lost. This can cause packet loss in unreliable protocols and > > > > extended delays due to retransmission in reliable protocols. > > > > > > > > This series resolves this problem by caching packets received once the > > > > source VM pauses and then passing and injecting them on the destination > > > > side. This feature is implemented in the last patch. The caching and > > > > injecting is implemented using network filter interface and should work > > > > with any backend with vhost=off, but only TAP network backend was > > > > explicitly tested. > > > > > > I haven't had a chance to try this change with passt(1) yet (the > > > backend can be enabled using "-net passt" or by starting it > > > separately). 
> > >
> > > Given that passt implements migration on its own (in deeper detail in
> > > some sense, as TCP connections are preserved if IP addresses match), I
> > > wonder if it this might affect or break it somehow.
> > >
> > > Did you perhaps have some thoughts about that already?
> >
> > I'm aware of passt migrating its state and passt-repair, but I also
> > haven't tested it as I couldn't get passt-repair to work.
>
> Oops. Let me know if you're hitting any specific error I could look
> into.

I tried it using this documentation [1] I found earlier, however, it
wouldn't work when migrating on the same host as I expected from it. The
destination passt process fails to get the port the outside TCP server
is communicating with and I see the connection still as established with
the source passt process. This is the specific error message from the
destination passt process:

  Flow 0 (TCP connection): Failed to connect migrated socket: Cannot assign requested address

[1]: https://www.qemu.org/docs/master/system/devices/net.html#example-of-migration-of-a-guest-on-the-same-host

>
> I plan anyway to try out your changes but I might need a couple of days
> before I find the time.
>
> > Does it also handle other protocols, or just preserves TCP connections?
>
> Layer-4-wise, we have an internal representation of UDP "flows"
> (observed flows of packets for which we preserve the same source port
> mapping, with timeouts) and we had a vague idea of migrating those as
> well, but it's debatable where there's any benefit from it.
>
> At Layer 2 and 3, we migrate IP and MAC addresses we observed from the
> guest:
>
>   https://passt.top/passt/tree/migrate.c?id=e3f70c05bad90368a1a89bf31a9015125232b9ae#n31
>
> so that we have ARP and NDP resolution, as well as any NAT
> mapping working right away as needed.
> > For completeness, this is the TCP context we migrate instead: > > https://passt.top/passt/tree/tcp_conn.h?id=e3f70c05bad90368a1a89bf31a9015125232b9ae#n108 > https://passt.top/passt/tree/tcp_conn.h?id=e3f70c05bad90368a1a89bf31a9015125232b9ae#n154 > > > The main focus of this feature are protocols that cannot handle packet > > loss on their own in environments where IP address is preserved (and > > thus also TCP connections). > > Well, strictly speaking, TCP handles packet loss, that's actually the > main reason behind it. I guess this is to improve throughput and avoid > latency spikes or retransmissions that could be avoided? Sorry, I actually meant that all connections are preserved. The main goal is to prevent losses with protocols other than TCP when possible, which was requested by our Solution Architects. Possible improved TCP throughput due to avoided retransmissions is just a side effect of that. > > > So, mainly tap/bridge, with the idea that > > other network backends could also benefit from it. However, if it causes > > problems with other backends, I could limit it just to tap. > > I couldn't quite figure out yet if it's beneficial, useless, or > harmless for passt. With passt, what happens without your > implementation is: > > 1. guest pauses > > 2. the source instance of passt starts migrating, meaning that sockets > are frozen one by one, their receiving and sending queues dumped > > 3. pending queues are sent to the target instance of passt, which opens > sockets as refills queues as needed > > 4. target guest resumes and will get any traffic that was received by > the source instance of passt between 1. and 2. > > Right now there's still a Linux kernel issue we observed (see also > https://pad.passt.top/p/TcpRepairTodo, that's line 4 there) which might > cause segments to be received (and acknowledged!) 
on sockets of the > source instance of passt for a small time period *after* we freeze them > with TCP_REPAIR (that is, TCP_REPAIR doesn't really freeze the queue). > > I'm currently working on a proper fix for that. Until then, point 2. > above isn't entirely accurate (but it only happens if you hammer it > with traffic generators, it's not really visible otherwise). > > With your implementation, I guess: > > 1. guest pauses > > 2. the source instance of passt starts migrating, meaning that sockets > are frozen one by one, their receiving and sending queues dumped > > 2a. any data received by QEMU after 1. will be stored and forwarded to > the target later. But passt at this point prevents the guest from > getting any data, so there should be no data involved > > 3. pending queues are sent to the target instance of passt, which opens > sockets as refills queues as needed > > 3a. the target guest gets the data from 2a. As long as there's no data > (as I'm assuming), there should be no change. If there's data coming > in at this point, we risk that sequences don't match anymore? I'm not > sure > > 4. target guest resumes and will *also* get any traffic that was received > by the source instance of passt between 1. and 2. > > So if my assumption from 2a. above holds, it should be useless, but > harmless. > > Would your implementation help with the kernel glitch we're currently > observing? I don't think so, because your implementation would only play > a role between passt and QEMU, and we don't have issues there. > > Well, it would be good to try things out. Other than that, unless I'm > missing something, your implementation should probably be skipped for > passt for simplicity, and also to avoid negatively affecting downtime. I agree with skipping passt in such case, although, I haven't perceived any effect on downtime. 
Cached network packets are sent after the destination resumes, so that
the network knows about the new location of the VM and the source
shouldn't receive any more packets intended for it.

>
> Note that you can also use passt without "-net passt" (that's actually
> quite recent) but with a tap back-end. Migration is only supported with
> vhost-user enabled though, and as far as I understand your implementation
> is disabled in that case?

As of now it is disabled in that case as network filters don't support
vhost.

>
> --
> Stefano

--
Juraj Marcin

^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 0/4] migration: Pass network packets received during switchover to dest VM 2026-01-30 14:40 ` Juraj Marcin @ 2026-01-31 2:27 ` Stefano Brivio 2026-02-04 11:23 ` Juraj Marcin 0 siblings, 1 reply; 23+ messages in thread From: Stefano Brivio @ 2026-01-31 2:27 UTC (permalink / raw) To: Juraj Marcin Cc: qemu-devel, Fabiano Rosas, Michael S. Tsirkin, Peter Xu, Jason Wang, Vladimir Sementsov-Ogievskiy, Laurent Vivier, David Gibson On Fri, 30 Jan 2026 15:40:01 +0100 Juraj Marcin <jmarcin@redhat.com> wrote: > Hi Stefano, > > thanks for the answer! > > On 2026-01-28 18:27, Stefano Brivio wrote: > > On Wed, 28 Jan 2026 14:06:11 +0100 > > Juraj Marcin <jmarcin@redhat.com> wrote: > > > > > Hi Stefano, > > > > > > On 2026-01-27 19:21, Stefano Brivio wrote: > > > > [Cc'ing Laurent and David] > > > > > > > > On Tue, 27 Jan 2026 15:03:06 +0100 > > > > Juraj Marcin <jmarcin@redhat.com> wrote: > > > > > > > > > During switchover there is a period during which both source and > > > > > destination side VMs are paused. During this period, all network packets > > > > > are still routed to the source side, but it will never process them. > > > > > Once the destination resumes, it is not aware of these packets and they > > > > > are lost. This can cause packet loss in unreliable protocols and > > > > > extended delays due to retransmission in reliable protocols. > > > > > > > > > > This series resolves this problem by caching packets received once the > > > > > source VM pauses and then passing and injecting them on the destination > > > > > side. This feature is implemented in the last patch. The caching and > > > > > injecting is implemented using network filter interface and should work > > > > > with any backend with vhost=off, but only TAP network backend was > > > > > explicitly tested. > > > > > > > > I haven't had a chance to try this change with passt(1) yet (the > > > > backend can be enabled using "-net passt" or by starting it > > > > separately). 
> > > >
> > > > Given that passt implements migration on its own (in deeper detail in
> > > > some sense, as TCP connections are preserved if IP addresses match), I
> > > > wonder if it this might affect or break it somehow.
> > > >
> > > > Did you perhaps have some thoughts about that already?
> > >
> > > I'm aware of passt migrating its state and passt-repair, but I also
> > > haven't tested it as I couldn't get passt-repair to work.
> >
> > Oops. Let me know if you're hitting any specific error I could look
> > into.
>
> I tried it using this documentation [1] I found earlier, however, it
> wouldn't work when migrating on the same host as I expected from it. The
> destination passt process fails to get the port the outside TCP server
> is communicating with and I see the connection still as established with
> the source passt process. This is the specific error message from the
> destination passt process:
>
>   Flow 0 (TCP connection): Failed to connect migrated socket: Cannot assign requested address
>
> [1]: https://www.qemu.org/docs/master/system/devices/net.html#example-of-migration-of-a-guest-on-the-same-host

Ouch, I see. Laurent wrote this part of the documentation showing the
QEMU-related bits of the migration workflow, but we should have updated
it with an example with real TCP flows, because in that case you can't
have the two instances of QEMU and passt running in the same namespace
of the same machine: ports and addresses will conflict.

There are two alternatives to test migration of actual flows.

1. two namespaces, same machine, with one instance of passt and one
   instance of QEMU in each. Testing a connection from guest to host can
   be done with a simple client/server pair, whereas, the other way
   around, you need some form of proxying (see the 'bidirectional'
   example below).

   This is what we do in passt's upstream tests (for a sample run, see
   https://passt.top/#continuous-integration, skip to 'migrate/basic'
   using the links at the bottom).

   The setup function is here:

     https://passt.top/passt/tree/test/lib/setup?id=e3f70c05bad90368a1a89bf31a9015125232b9ae#n308

   and these are the test directives themselves:

     https://passt.top/passt/tree/test/migrate/basic
     https://passt.top/passt/tree/test/migrate/bidirectional

   ...if you want to try and run these tests, see
   https://passt.top/passt/tree/test/README.md. The test suite has quite
   a few dependencies and it might take a bit of effort to run the whole
   thing, but if you 'make assets' under test/ and then select one single
   test instead, with './run migrate/basic', it should be practical.

   I can write up "stand-alone" instructions based on that if needed.

2. two virtual machines, bridged (no need for root if you detach a
   network namespace on the host), migrating nested guests.

   Assuming three terminals (host, source, target), and a libvirt domain
   named "alpine" *inside* L1 guests (no need for libvirt, it just makes
   the write-up a bit more terse):

---
[host]
$ unshare -rUn
# echo $$    # let's call this TARGET_PID
# ip link set dev lo up

[source]
$ nsenter --preserve-credentials -U -n -t $TARGET_PID
# qemu-system-x86_64 -machine accel=kvm -cpu host ... -nographic -serial mon:stdio -nodefaults -m 4G -netdev tap,id=n,script=no -device virtio-net,netdev=n

  ...in the guest, once it starts:
  # service NetworkManager stop    # if you have it
  # ip link set dev eth0 up
  # ip addr add dev eth0 10.0.0.1/24
  # ip route add default dev eth0
  # ip link set dev eth0 addr 52:54:00:12:34:57

[target]
$ nsenter --preserve-credentials -U -n -t $TARGET_PID
# qemu-system-x86_64 -machine accel=kvm -cpu host ... -nographic -serial mon:stdio -nodefaults -m 4G -netdev tap,id=n,script=no -device virtio-net,netdev=n

  ...in the guest, once it starts:
  # service NetworkManager stop    # if you have it
  # ip link set dev eth0 up
  # ip addr add dev eth0 10.0.0.2/24
  # ip route add default dev eth0

[host]
# ip link set dev tap0 up
# ip link set dev tap1 up
# ip link add dev br0 type bridge
# ip link set dev tap0 master br0
# ip link set dev tap1 master br0
# ip addr add dev br0 10.0.0.3/24

  check that we can reach the target
# ping 10.0.0.2

  start the test server
# ip addr add dev br0 172.16.0.3/24
# nc -l -p 8080

[*both* source and target]
# ip addr add dev eth0 172.16.0.1/25    # make sure passt picks this address for the guests, as it's more specific than a 10.0.0.0/24, it's /25
# ip route add default via 172.16.0.100    # and add a default route just so that the guest has one, but we don't need this

[source (use another terminal, or run passt-repair in background)]
# mkdir /run/user/1001/libvirt/qemu/run/passt/
# passt-repair /run/user/1001/libvirt/qemu/run/passt/

[source]
$ virsh start --console alpine

  ...in the guest, once it starts:
  $ nc 172.16.0.3 8080
  start typing, before migration

[source (use another terminal, or escape console while keeping nc running)]
$ virsh migrate --verbose --p2p --live --unsafe alpine --tunneled qemu+ssh://10.0.0.2/session
---

if you reverse the direction of the connection, you'll need two bridges,
one for migration data and one for the test connection itself.

Otherwise, the kernel on the source L1 guest will manage to send a RST
to your client (on L0) as soon as the connection continues on the
target, because ACK segments from the target will reach the source
(they're bridged), but the source has no open socket at this point.

With two bridges, you can "unplug" the source target (test connection /
tap interface only) before migrating. As an alternative, you could drop
RST segments using nftables.
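
The RST part is ordinary kernel behaviour: a TCP segment that matches no
open socket is answered with a reset. A minimal sketch of that on
loopback follows, assuming Linux-like TCP semantics. Note that the
unmatched segment here is a SYN rather than an ACK as in the migration
case, and that the function name is made up for illustration.

```python
import socket


def connect_without_listener() -> bool:
    """Return True if connecting to a port nobody listens on is refused."""
    # Reserve a port by binding to it, but never call listen(): the
    # kernel then has no socket that can accept segments for this port.
    holder = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    holder.bind(("127.0.0.1", 0))
    port = holder.getsockname()[1]

    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.settimeout(2)
    try:
        client.connect(("127.0.0.1", port))
    except ConnectionRefusedError:
        # The kernel answered our SYN with RST.
        return True
    else:
        return False
    finally:
        client.close()
        holder.close()
```

On the source L1 guest after migration, the same thing happens to
segments of the migrated connection, which is why either unplugging the
source or dropping its RST segments keeps the client from seeing the
reset.
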
I can clean up my notes for these additional steps if anybody is
interested. Eventually, I guess, it should all become part of QEMU's
documentation.

> > I plan anyway to try out your changes but I might need a couple of days
> > before I find the time.
> >
> > > Does it also handle other protocols, or just preserves TCP connections?
> >
> > Layer-4-wise, we have an internal representation of UDP "flows"
> > (observed flows of packets for which we preserve the same source port
> > mapping, with timeouts) and we had a vague idea of migrating those as
> > well, but it's debatable where there's any benefit from it.
> >
> > At Layer 2 and 3, we migrate IP and MAC addresses we observed from the
> > guest:
> >
> > https://passt.top/passt/tree/migrate.c?id=e3f70c05bad90368a1a89bf31a9015125232b9ae#n31
> >
> > so that we have ARP and NDP resolution, as well as any NAT
> > mapping working right away as needed.
> >
> > For completeness, this is the TCP context we migrate instead:
> >
> > https://passt.top/passt/tree/tcp_conn.h?id=e3f70c05bad90368a1a89bf31a9015125232b9ae#n108
> > https://passt.top/passt/tree/tcp_conn.h?id=e3f70c05bad90368a1a89bf31a9015125232b9ae#n154
> >
> > > The main focus of this feature are protocols that cannot handle packet
> > > loss on their own in environments where IP address is preserved (and
> > > thus also TCP connections).
> >
> > Well, strictly speaking, TCP handles packet loss, that's actually the
> > main reason behind it. I guess this is to improve throughput and avoid
> > latency spikes or retransmissions that could be avoided?
>
> Sorry, I actually meant that all connections are preserved. The main
> goal is to prevent losses with protocols other than TCP when possible,
> which was requested by our Solution Architects. Possible improved TCP
> throughput due to avoided retransmissions is just a side effect of that.

Interesting... you mean UDP? Or non-IP protocols?

For some typical UDP applications (realtime audio/video streams) I
generally expect delayed datagrams (more of them) to be worse than some
lost datagrams (fewer of them). This is part of the reason why I didn't
particularly care about that in passt. Well, as long as there's a way
to disable this mechanism, one could tune the configuration to their
needs.

> > > So, mainly tap/bridge, with the idea that
> > > other network backends could also benefit from it. However, if it causes
> > > problems with other backends, I could limit it just to tap.
> >
> > I couldn't quite figure out yet if it's beneficial, useless, or
> > harmless for passt. With passt, what happens without your
> > implementation is:
> >
> > 1. guest pauses
> >
> > 2. the source instance of passt starts migrating, meaning that sockets
> >    are frozen one by one, their receiving and sending queues dumped
> >
> > 3. pending queues are sent to the target instance of passt, which opens
> >    sockets as refills queues as needed
> >
> > 4. target guest resumes and will get any traffic that was received by
> >    the source instance of passt between 1. and 2.
> >
> > Right now there's still a Linux kernel issue we observed (see also
> > https://pad.passt.top/p/TcpRepairTodo, that's line 4 there) which might
> > cause segments to be received (and acknowledged!) on sockets of the
> > source instance of passt for a small time period *after* we freeze them
> > with TCP_REPAIR (that is, TCP_REPAIR doesn't really freeze the queue).
> >
> > I'm currently working on a proper fix for that. Until then, point 2.
> > above isn't entirely accurate (but it only happens if you hammer it
> > with traffic generators, it's not really visible otherwise).
> >
> > With your implementation, I guess:
> >
> > 1. guest pauses
> >
> > 2. the source instance of passt starts migrating, meaning that sockets
> >    are frozen one by one, their receiving and sending queues dumped
> >
> > 2a. any data received by QEMU after 1. will be stored and forwarded to
> >     the target later. But passt at this point prevents the guest from
> >     getting any data, so there should be no data involved
> >
> > 3. pending queues are sent to the target instance of passt, which opens
> >    sockets as refills queues as needed
> >
> > 3a. the target guest gets the data from 2a. As long as there's no data
> >     (as I'm assuming), there should be no change. If there's data coming
> >     in at this point, we risk that sequences don't match anymore? I'm not
> >     sure
> >
> > 4. target guest resumes and will *also* get any traffic that was received
> >    by the source instance of passt between 1. and 2.
> >
> > So if my assumption from 2a. above holds, it should be useless, but
> > harmless.
> >
> > Would your implementation help with the kernel glitch we're currently
> > observing? I don't think so, because your implementation would only play
> > a role between passt and QEMU, and we don't have issues there.
> >
> > Well, it would be good to try things out. Other than that, unless I'm
> > missing something, your implementation should probably be skipped for
> > passt for simplicity, and also to avoid negatively affecting downtime.
>
> I agree with skipping passt in such case, although, I haven't perceived
> any effect on downtime. Cached network packets are sent after the
> destination resumes, so that the network knows about new location of the
> VM and the source shouldn't receive any more packets intended for it.
>
> > Note that you can also use passt without "-net passt" (that's actually
> > quite recent) but with a tap back-end. Migration is only supported with
> > vhost-user enabled though, and as far as I understand your implementation
> > is disabled in that case?
>
> As of now it is disabled in that case as network filters don't support
> vhost.

Is that something you plan to fix / change in the future, though? In
that case, I would try to check how this works with passt in a bit more
detail (now or later).

--
Stefano
* Re: [PATCH 0/4] migration: Pass network packets received during switchover to dest VM
2026-01-31 2:27 ` Stefano Brivio
@ 2026-02-04 11:23 ` Juraj Marcin
0 siblings, 0 replies; 23+ messages in thread
From: Juraj Marcin @ 2026-02-04 11:23 UTC (permalink / raw)
To: Stefano Brivio
Cc: qemu-devel, Fabiano Rosas, Michael S. Tsirkin, Peter Xu,
Jason Wang, Vladimir Sementsov-Ogievskiy, Laurent Vivier,
David Gibson

Hi Stefano,

On 2026-01-31 03:27, Stefano Brivio wrote:
> On Fri, 30 Jan 2026 15:40:01 +0100
> Juraj Marcin <jmarcin@redhat.com> wrote:
>
> > Hi Stefano,
> >
> > thanks for the answer!
> >
> > On 2026-01-28 18:27, Stefano Brivio wrote:
> > > On Wed, 28 Jan 2026 14:06:11 +0100
> > > Juraj Marcin <jmarcin@redhat.com> wrote:
> > >
> > > > Hi Stefano,
> > > >
> > > > On 2026-01-27 19:21, Stefano Brivio wrote:
> > > > > [Cc'ing Laurent and David]
> > > > >
> > > > > On Tue, 27 Jan 2026 15:03:06 +0100
> > > > > Juraj Marcin <jmarcin@redhat.com> wrote:
> > > > >
> > > > > > During switchover there is a period during which both source and
> > > > > > destination side VMs are paused. During this period, all network packets
> > > > > > are still routed to the source side, but it will never process them.
> > > > > > Once the destination resumes, it is not aware of these packets and they
> > > > > > are lost. This can cause packet loss in unreliable protocols and
> > > > > > extended delays due to retransmission in reliable protocols.
> > > > > >
> > > > > > This series resolves this problem by caching packets received once the
> > > > > > source VM pauses and then passing and injecting them on the destination
> > > > > > side. This feature is implemented in the last patch. The caching and
> > > > > > injecting is implemented using network filter interface and should work
> > > > > > with any backend with vhost=off, but only TAP network backend was
> > > > > > explicitly tested.
> > > > >
> > > > > I haven't had a chance to try this change with passt(1) yet (the
> > > > > backend can be enabled using "-net passt" or by starting it
> > > > > separately).
> > > > >
> > > > > Given that passt implements migration on its own (in deeper detail in
> > > > > some sense, as TCP connections are preserved if IP addresses match), I
> > > > > wonder if it this might affect or break it somehow.
> > > > >
> > > > > Did you perhaps have some thoughts about that already?
> > > >
> > > > I'm aware of passt migrating its state and passt-repair, but I also
> > > > haven't tested it as I couldn't get passt-repair to work.
> > >
> > > Oops. Let me know if you're hitting any specific error I could look
> > > into.
> >
> > I tried it using this documentation [1] I found earlier, however, it
> > wouldn't work when migrating on the same host as I expected from it. The
> > destination passt process fails to get the port the outside TCP server
> > is communicating with and I see the connection still as established with
> > the source passt process. This is the specific error message from the
> > destination passt process:
> >
> > Flow 0 (TCP connection): Failed to connect migrated socket: Cannot assign requested address
> >
> > [1]: https://www.qemu.org/docs/master/system/devices/net.html#example-of-migration-of-a-guest-on-the-same-host
>
> Ouch, I see.
>
> Laurent wrote this part of documentation showing the QEMU-related bits
> of the migration workflow, but we should have updated it with an
> example with real TCP flows, because in that case you can't have the
> two instances of QEMU and passt running in the same namespace of the
> same machine: ports and addresses will conflict.
>
> There are two alternatives to test migration of actual flows.
>
> 1. two namespaces, same machine, with one instance of passt and one
>    instance of QEMU in each.
>
>    Testing a connection from guest to host can be done with a simple
>    client/server pair, whereas, the other way around, you need some
>    form of proxying (see the 'bidirectional' example below).
>
>    This is what we do in passt's upstream tests (for a sample run, see
>    https://passt.top/#continuous-integration, skip to 'migrate/basic'
>    using the links on the bottom). The setup function is here:
>
>    https://passt.top/passt/tree/test/lib/setup?id=e3f70c05bad90368a1a89bf31a9015125232b9ae#n308
>
>    and these are the test directives themselves:
>
>    https://passt.top/passt/tree/test/migrate/basic
>    https://passt.top/passt/tree/test/migrate/bidirectional
>
>    ...if you want to try and run these tests, see
>    https://passt.top/passt/tree/test/README.md. The test suite has quite
>    a few dependencies and it might take a bit of effort to run the
>    whole thing, but if you 'make assets' under test/ and then select
>    one single test instead, with './run migrate/basic', it should be
>    practical.
>
>    I can write up "stand-alone" instructions based on that if needed.
>
> 2. two virtual machines, bridged (no need for root if you detach a
>    network namespace on the host), migrating nested guests. Assuming
>    three terminals (host, source, target), and a libvirt domain named
>    "alpine" *inside* L1 guests (no need for libvirt, it just makes the
>    write-up a bit more terse):
>
> ---
> [host]
> $ unshare -rUn
> # echo $$ # let's call this TARGET_PID
> # ip link set dev lo up
>
>
> [source]
> $ nsenter --preserve-credentials -U -n -t $TARGET_PID
> # qemu-system-x86_64 -machine accel=kvm -cpu host ... -nographic -serial mon:stdio -nodefaults -m 4G -netdev tap,id=n,script=no -device virtio-net,netdev=n
>
> ...in the guest, once it starts:
>
> # service NetworkManager stop # if you have it
> # ip link set dev eth0 up
> # ip addr add dev eth0 10.0.0.1/24
> # ip route add default dev eth0
> # ip link set dev eth0 addr 52:54:00:12:34:57
>
>
> [target]
> $ nsenter --preserve-credentials -U -n -t $TARGET_PID
> # qemu-system-x86_64 -machine accel=kvm -cpu host ... -nographic -serial mon:stdio -nodefaults -m 4G -netdev tap,id=n,script=no -device virtio-net,netdev=n
>
> ...in the guest, once it starts:
>
> # service NetworkManager stop # if you have it
> # ip link set dev eth0 up
> # ip addr add dev eth0 10.0.0.2/24
> # ip route add default dev eth0
>
>
> [host]
> # ip link set dev tap0 up
> # ip link set dev tap1 up
> # ip link add dev br0 type bridge
> # ip link set dev tap0 master br0
> # ip link set dev tap1 master br0
> # ip addr add dev br0 10.0.0.3/24
>
> check that we can reach the target
> # ping 10.0.0.2
>
> start the test server
> # ip addr add dev br0 172.16.0.3/24
> # nc -l -p 8080
>
>
> [*both* source and target]
> # ip addr add dev eth0 172.16.0.1/25 # make sure passt picks this address for the guests, as it's more specific than a 10.0.0.0/24, it's /25
> # ip route add default via 172.16.0.100 # and add a default route just so that the guest has one, but we don't need this
>
>
> [source (use another terminal, or run passt-repair in background)]
> # mkdir /run/user/1001/libvirt/qemu/run/passt/
> # passt-repair /run/user/1001/libvirt/qemu/run/passt/
>
>
> [source]
> $ virsh start --console alpine
>
> ...in the guest, once it starts:
>
> $ nc 172.16.0.3 8080
> start typing, before migration
>
>
> [source (use another terminal, or escape console while keeping nc running)]
> $ virsh migrate --verbose --p2p --live --unsafe alpine --tunneled qemu+ssh://10.0.0.2/session
> ---
>
> if you reverse the direction of the connection, you'll need two
> bridges, one for migration data and one for the test connection
> itself.
>
> Otherwise, the kernel on the source L1 guest will manage to send a
> RST to your client (on L0) as soon as the connection continues on
> the target, because ACK segments from the target will reach the
> source (they're bridged), but the source has no open socket at this
> point.
>
> With two bridges, you can "unplug" the source target (test
> connection / tap interface only) before migrating. As an alternative,
> you could drop RST segments using nftables.
>
> I can clean up my notes for these additional steps if anybody is
> interested. Eventually, I guess, it should all become part of QEMU's
> documentation.
>

Thank you very much, I will try that.

> > > I plan anyway to try out your changes but I might need a couple of days
> > > before I find the time.
> > >
> > > > Does it also handle other protocols, or just preserves TCP connections?
> > >
> > > Layer-4-wise, we have an internal representation of UDP "flows"
> > > (observed flows of packets for which we preserve the same source port
> > > mapping, with timeouts) and we had a vague idea of migrating those as
> > > well, but it's debatable where there's any benefit from it.
> > >
> > > At Layer 2 and 3, we migrate IP and MAC addresses we observed from the
> > > guest:
> > >
> > > https://passt.top/passt/tree/migrate.c?id=e3f70c05bad90368a1a89bf31a9015125232b9ae#n31
> > >
> > > so that we have ARP and NDP resolution, as well as any NAT
> > > mapping working right away as needed.
> > >
> > > For completeness, this is the TCP context we migrate instead:
> > >
> > > https://passt.top/passt/tree/tcp_conn.h?id=e3f70c05bad90368a1a89bf31a9015125232b9ae#n108
> > > https://passt.top/passt/tree/tcp_conn.h?id=e3f70c05bad90368a1a89bf31a9015125232b9ae#n154
> > >
> > > > The main focus of this feature are protocols that cannot handle packet
> > > > loss on their own in environments where IP address is preserved (and
> > > > thus also TCP connections).
> > >
> > > Well, strictly speaking, TCP handles packet loss, that's actually the
> > > main reason behind it. I guess this is to improve throughput and avoid
> > > latency spikes or retransmissions that could be avoided?
> >
> > Sorry, I actually meant that all connections are preserved. The main
> > goal is to prevent losses with protocols other than TCP when possible,
> > which was requested by our Solution Architects. Possible improved TCP
> > throughput due to avoided retransmissions is just a side effect of that.
>
> Interesting... you mean UDP? Or non-IP protocols?
>
> For some typical UDP applications (realtime audio/video streams) I
> generally expect delayed datagrams (more of them) to be worse than some
> lost datagrams (fewer of them). This is part of the reason why I didn't
> particularly care about that in passt. Well, as long as there's a way
> to disable this mechanism, one could tune the configuration to their
> needs.

The original request from Solution Architects was demonstrated using
ICMP Echo Requests, but it should work for any protocol. While, yes, it
will add a certain delay to the packets, depending on the user's use
case it might be better than losing them altogether. Users can decide
and configure it to their needs. However, in the case of strict realtime
workloads, migration might be off the table altogether, as the
switchover during which all CPUs are paused is unavoidable.

>
> > > > So, mainly tap/bridge, with the idea that
> > > > other network backends could also benefit from it. However, if it causes
> > > > problems with other backends, I could limit it just to tap.
> > >
> > > I couldn't quite figure out yet if it's beneficial, useless, or
> > > harmless for passt. With passt, what happens without your
> > > implementation is:
> > >
> > > 1. guest pauses
> > >
> > > 2. the source instance of passt starts migrating, meaning that sockets
> > >    are frozen one by one, their receiving and sending queues dumped
> > >
> > > 3. pending queues are sent to the target instance of passt, which opens
> > >    sockets as refills queues as needed
> > >
> > > 4. target guest resumes and will get any traffic that was received by
> > >    the source instance of passt between 1. and 2.
> > >
> > > Right now there's still a Linux kernel issue we observed (see also
> > > https://pad.passt.top/p/TcpRepairTodo, that's line 4 there) which might
> > > cause segments to be received (and acknowledged!) on sockets of the
> > > source instance of passt for a small time period *after* we freeze them
> > > with TCP_REPAIR (that is, TCP_REPAIR doesn't really freeze the queue).
> > >
> > > I'm currently working on a proper fix for that. Until then, point 2.
> > > above isn't entirely accurate (but it only happens if you hammer it
> > > with traffic generators, it's not really visible otherwise).
> > >
> > > With your implementation, I guess:
> > >
> > > 1. guest pauses
> > >
> > > 2. the source instance of passt starts migrating, meaning that sockets
> > >    are frozen one by one, their receiving and sending queues dumped
> > >
> > > 2a. any data received by QEMU after 1. will be stored and forwarded to
> > >     the target later. But passt at this point prevents the guest from
> > >     getting any data, so there should be no data involved
> > >
> > > 3. pending queues are sent to the target instance of passt, which opens
> > >    sockets as refills queues as needed
> > >
> > > 3a. the target guest gets the data from 2a. As long as there's no data
> > >     (as I'm assuming), there should be no change. If there's data coming
> > >     in at this point, we risk that sequences don't match anymore? I'm not
> > >     sure
> > >
> > > 4. target guest resumes and will *also* get any traffic that was received
> > >    by the source instance of passt between 1. and 2.
> > >
> > > So if my assumption from 2a. above holds, it should be useless, but
> > > harmless.
> > >
> > > Would your implementation help with the kernel glitch we're currently
> > > observing? I don't think so, because your implementation would only play
> > > a role between passt and QEMU, and we don't have issues there.
> > >
> > > Well, it would be good to try things out. Other than that, unless I'm
> > > missing something, your implementation should probably be skipped for
> > > passt for simplicity, and also to avoid negatively affecting downtime.
> >
> > I agree with skipping passt in such case, although, I haven't perceived
> > any effect on downtime. Cached network packets are sent after the
> > destination resumes, so that the network knows about new location of the
> > VM and the source shouldn't receive any more packets intended for it.
> >
> > > Note that you can also use passt without "-net passt" (that's actually
> > > quite recent) but with a tap back-end. Migration is only supported with
> > > vhost-user enabled though, and as far as I understand your implementation
> > > is disabled in that case?
> >
> > As of now it is disabled in that case as network filters don't support
> > vhost.
>
> Is that something you plan to fix / change in the future, though? In
> that case, I would try to check how this works with passt in a bit more
> detail (now or later).

Yes, we are planning to implement such a feature with vhost as well.

>
> --
> Stefano
>
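[Editorial sketch, not part of the thread.] The ordering described in
the message above (packets are cached once the source VM pauses and are
only sent on after the destination has resumed) can be modelled roughly
as follows. This is a toy model under stated assumptions, not the
series' implementation: SourceFilter, DestFilter, on_vm_started() and
simulate_switchover() are hypothetical names, and the VM_STARTED
return-path message is reduced to a plain method call.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class DestFilter:
    """Hypothetical stand-in for the destination-side filter."""
    vm_running: bool = False
    injected: list = field(default_factory=list)

    def inject(self, packet: bytes) -> None:
        # In this model, packets are only injected into a running guest.
        if not self.vm_running:
            raise RuntimeError("destination VM has not started yet")
        self.injected.append(packet)


@dataclass
class SourceFilter:
    """Hypothetical stand-in for the source-side filter."""
    vm_running: bool = True
    delivered: list = field(default_factory=list)
    cache: deque = field(default_factory=deque)

    def receive(self, packet: bytes) -> None:
        if self.vm_running:
            self.delivered.append(packet)  # normal path, guest sees it
        else:
            self.cache.append(packet)      # VM paused: keep for later

    def on_vm_started(self, dest: DestFilter) -> None:
        # Models the arrival of VM_STARTED: forward the cache in order.
        while self.cache:
            dest.inject(self.cache.popleft())


def simulate_switchover(before, during):
    """Return (packets seen by source guest, packets injected on dest)."""
    src, dst = SourceFilter(), DestFilter()
    for p in before:
        src.receive(p)
    src.vm_running = False   # switchover begins, source VM pauses
    for p in during:
        src.receive(p)       # still routed to the source side
    dst.vm_running = True    # destination resumes
    src.on_vm_started(dst)   # trigger for forwarding the cached packets
    return src.delivered, dst.injected
```

For example, simulate_switchover([b"a"], [b"b", b"c"]) keeps b"a" on
the source and hands b"b" and b"c" to the destination in arrival order.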
* Re: [PATCH 0/4] migration: Pass network packets received during switchover to dest VM
2026-01-27 18:21 ` [PATCH 0/4] " Stefano Brivio
2026-01-28 13:06 ` Juraj Marcin
@ 2026-02-03 12:03 ` Laurent Vivier
2026-02-03 13:29 ` Stefano Brivio
1 sibling, 1 reply; 23+ messages in thread
From: Laurent Vivier @ 2026-02-03 12:03 UTC (permalink / raw)
To: Stefano Brivio, Juraj Marcin
Cc: qemu-devel, Fabiano Rosas, Michael S. Tsirkin, Peter Xu,
Jason Wang, Vladimir Sementsov-Ogievskiy, Laurent Vivier,
David Gibson

On 1/27/26 19:21, Stefano Brivio wrote:
> [Cc'ing Laurent and David]
>
> On Tue, 27 Jan 2026 15:03:06 +0100
> Juraj Marcin <jmarcin@redhat.com> wrote:
>
>> During switchover there is a period during which both source and
>> destination side VMs are paused. During this period, all network packets
>> are still routed to the source side, but it will never process them.
>> Once the destination resumes, it is not aware of these packets and they
>> are lost. This can cause packet loss in unreliable protocols and
>> extended delays due to retransmission in reliable protocols.
>>
>> This series resolves this problem by caching packets received once the
>> source VM pauses and then passing and injecting them on the destination
>> side. This feature is implemented in the last patch. The caching and
>> injecting is implemented using network filter interface and should work
>> with any backend with vhost=off, but only TAP network backend was
>> explicitly tested.
>
> I haven't had a chance to try this change with passt(1) yet (the
> backend can be enabled using "-net passt" or by starting it
> separately).
>
> Given that passt implements migration on its own (in deeper detail in
> some sense, as TCP connections are preserved if IP addresses match), I
> wonder if it this might affect or break it somehow.
>

passt implements migration only with the vhost-user backend ("-netdev vhost-user") that is
not supported by netpass. All the vhost-* cannot be supported because netpass cannot catch
packets on the virtio queues.

passt with "-netdev stream" doesn't implement migration, but QEMU can be migrated with it
and all the connections are lost. So netpass will forward packets for connections that
will be broken.

"-netdev passt" is only some kind of wrapper on top of "-netdev stream" and "-netdev
vhost-user" that starts the passt backend by itself (rather than expecting it has been
started by the user).

Thanks,
Laurent
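[Editorial sketch, not part of the thread.] The backend cases listed in
the message above, together with the cover letter's "any backend with
vhost=off, only TAP tested", can be summarised as a small lookup. This
is only a reading aid: netpass_outcome(), resolve_passt() and the
returned labels are hypothetical, and the vhost_user argument is a
model flag, not a claim about QEMU's option names.

```python
def resolve_passt(vhost_user: bool) -> str:
    """Map the "-netdev passt" wrapper to the netdev type it sits on.

    The thread describes it as a wrapper on top of "-netdev stream" and
    "-netdev vhost-user"; vhost_user here is a hypothetical model flag.
    """
    return "vhost-user" if vhost_user else "stream"


def netpass_outcome(netdev: str, vhost: bool = False) -> str:
    """Summarise what the thread says netpass can do for a netdev type."""
    if vhost or netdev.startswith("vhost"):
        # netpass cannot catch packets on the virtio queues.
        return "unsupported"
    if netdev == "stream":
        # With passt behind it: packets are forwarded, but passt does
        # not migrate its connections in this mode, so they are lost.
        return "forwards-but-connections-lost"
    if netdev == "tap":
        # The only backend explicitly tested, with vhost=off.
        return "supported-tested"
    # "Should work with any backend with vhost=off", but untested.
    return "expected-untested"
```

For example, netpass_outcome("tap", vhost=True) and
netpass_outcome(resolve_passt(True)) both yield "unsupported".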
* Re: [PATCH 0/4] migration: Pass network packets received during switchover to dest VM
2026-02-03 12:03 ` Laurent Vivier
@ 2026-02-03 13:29 ` Stefano Brivio
0 siblings, 0 replies; 23+ messages in thread
From: Stefano Brivio @ 2026-02-03 13:29 UTC (permalink / raw)
To: Laurent Vivier
Cc: Juraj Marcin, qemu-devel, Fabiano Rosas, Michael S. Tsirkin,
Peter Xu, Jason Wang, Vladimir Sementsov-Ogievskiy, Laurent Vivier,
David Gibson, Cindy Lu

On Tue, 3 Feb 2026 13:03:26 +0100
Laurent Vivier <lvivier@redhat.com> wrote:

> On 1/27/26 19:21, Stefano Brivio wrote:
> > [Cc'ing Laurent and David]
> >
> > On Tue, 27 Jan 2026 15:03:06 +0100
> > Juraj Marcin <jmarcin@redhat.com> wrote:
> >
> >> During switchover there is a period during which both source and
> >> destination side VMs are paused. During this period, all network packets
> >> are still routed to the source side, but it will never process them.
> >> Once the destination resumes, it is not aware of these packets and they
> >> are lost. This can cause packet loss in unreliable protocols and
> >> extended delays due to retransmission in reliable protocols.
> >>
> >> This series resolves this problem by caching packets received once the
> >> source VM pauses and then passing and injecting them on the destination
> >> side. This feature is implemented in the last patch. The caching and
> >> injecting is implemented using network filter interface and should work
> >> with any backend with vhost=off, but only TAP network backend was
> >> explicitly tested.
> >
> > I haven't had a chance to try this change with passt(1) yet (the
> > backend can be enabled using "-net passt" or by starting it
> > separately).
> >
> > Given that passt implements migration on its own (in deeper detail in
> > some sense, as TCP connections are preserved if IP addresses match), I
> > wonder if it this might affect or break it somehow.
> >
>
> passt implements migration only with the vhost-user backend ("-netdev vhost-user") that is
> not supported by netpass. All the vhost-* cannot be supported because netpass cannot catch
> packets on the virtio queues.

Thanks for having a look! On this point... right, hence my question in:

https://lore.kernel.org/qemu-devel/20260131032700.12f27487@elisabeth/

that is, is there a plan to add vhost support *for netpass*, eventually?

It looks like yes:

https://lore.kernel.org/qemu-devel/CACLfguUZpT-3sj4C8G8e+LB5GHpBfE_HKLOhyZ9qYR8bgkTOCw@mail.gmail.com/

but I'm not sure I got it right (Cindy? Jason?).

> passt with "-netdev stream" doesn't implement migration, but QEMU can be migrated with it
> and all the connections are lost. So netpass will forward packets for connections that
> will be broken.

Realistically, I don't think anybody will ever try to migrate VMs using
-netdev stream with passt, so I guess we don't really have to care
about this (it might help with some protocols, probably make UDP usage
a bit worse, waste a bit of bandwidth with TCP... but that's it).

The only existing (known) user of passt's migration feature is
KubeVirt, which switched to passt's vhost-user interface entirely.

> "-netdev passt" is only some kind of wrapper on top of "-netdev stream" and "-netdev
> vhost-user" that starts the passt backend by itself (rather than expecting it has been
> started by the user).

--
Stefano
end of thread, other threads:[~2026-02-04 11:23 UTC | newest]

Thread overview: 23+ messages
2026-01-27 14:03 [PATCH 0/4] migration: Pass network packets received during switchover to dest VM Juraj Marcin
2026-01-27 14:03 ` [PATCH 1/4] migration/qemu-file: Add ability to clear error Juraj Marcin
2026-01-27 14:03 ` [PATCH 2/4] migration: Introduce VM_STARTED return-path message Juraj Marcin
2026-01-27 22:29 ` Michael S. Tsirkin
2026-01-27 14:03 ` [PATCH 3/4] migration: Convert VMSD early_setup into VMStateSavePhase enum Juraj Marcin
2026-01-27 14:03 ` [PATCH 4/4] migration: Pass network packets received during switchover to dest VM Juraj Marcin
2026-01-27 14:25 ` Daniel P. Berrangé
2026-01-27 22:27 ` Michael S. Tsirkin
2026-01-28 12:23 ` Juraj Marcin
2026-01-28 2:55 ` Jason Wang
2026-01-28 2:56 ` Jason Wang
2026-01-28 9:07 ` Cindy Lu
2026-01-28 13:49 ` Juraj Marcin
2026-01-29 1:05 ` Jason Wang
2026-01-29 16:07 ` Zhang Chen
2026-01-27 18:21 ` [PATCH 0/4] " Stefano Brivio
2026-01-28 13:06 ` Juraj Marcin
2026-01-28 17:27 ` Stefano Brivio
2026-01-30 14:40 ` Juraj Marcin
2026-01-31 2:27 ` Stefano Brivio
2026-02-04 11:23 ` Juraj Marcin
2026-02-03 12:03 ` Laurent Vivier
2026-02-03 13:29 ` Stefano Brivio