- * Re: [Question] VFIO migration will not be aborted in a corner scenario
  2025-08-12 14:08   ` Avihai Horon
@ 2025-08-12 14:34     ` Cédric Le Goater
  2025-08-12 14:58       ` Peter Xu
  2025-08-12 14:56     ` Cédric Le Goater
  2025-08-13 12:18     ` Kunkun Jiang via
  2 siblings, 1 reply; 11+ messages in thread
From: Cédric Le Goater @ 2025-08-12 14:34 UTC (permalink / raw)
  To: Avihai Horon, Kunkun Jiang, Alex Williamson, Yishai Hadas
  Cc: open list:All patches CC here, wanghaibin.wang, Zenghui Yu,
	'Peter Xu', Fabiano Rosas
+peter
+fabiano
On 8/12/25 16:08, Avihai Horon wrote:
> 
> On 11/08/2025 19:34, Cédric Le Goater wrote:
>> External email: Use caution opening links or attachments
>>
>>
>> Hello,
>>
>> + Avihai
>>
>> On 8/11/25 18:02, Kunkun Jiang wrote:
>>> Hi all,
>>>
>>> While testing VFIO migration, I encountered an corner scenario case:
>>> VFIO migration will not be aborted when the vfio device of dst-vm fails to transition from RESUMING to RUNNING state in vfio_vmstate_change.
>>>
>>> I saw the comments in the vfio_vmstate_change but I don't understand why no action is taken for this situation.
>>
>> There is error handling in vfio_vmstate_change() :
>>
>>         /*
>>          * Migration should be aborted in this case, but vm_state_notify()
>>          * currently does not support reporting failures.
>>          */
>>         migration_file_set_error(ret, local_err);
> 
> Hmm, I think this only sets the error on src. On dst we don't have MigrationState->to_dst_file, so we end up just reporting the error.
> But even if we did set it, no one is checking if there is a migration error after vm_start() is called in process_incoming_migration_bh().
> 
>>
>>> Allowing the live migration process to continue could cause unrecoverable damage to the VM.
> 
> What do you mean by unrecoverable damage to the VM?
> If RESUMING->RUNNING transition fails, would a VFIO reset recover the device and allow the VM to continue operation with damage limited only to the VFIO device?
> 
>>> In this case, can we directly exit the dst-vm? Through the return-path mechanism, the src-vm can continue to run.
>>>
>>> Looking forward to your reply.
>>
> The straightforward solution, as you suggested, is to exit dst upon error in RESUMING->RUNNING transition and notify about it to src through the return-path.
> However, I am not sure if failing the migration after vm_start() on dst is a bit late (as we start vCPUs and do migration_block_activate, etc.).
> 
> But I can think of another way to solve this, hopefully simpler.
> According to VFIO migration uAPI [1]:
>   * RESUMING -> STOP
>   *   Leaving RESUMING terminates a data transfer session and indicates the
>   *   device should complete processing of the data delivered by write(). The
>   *   kernel migration driver should complete the incorporation of data written
>   *   to the data transfer FD into the device internal state and perform
>   *   final validity and consistency checking of the new device state. If the
>   *   user provided data is found to be incomplete, inconsistent, or otherwise
>   *   invalid, the migration driver must fail the SET_STATE ioctl and
>   *   optionally go to the ERROR state as described below.
> 
> So, IIUC, we can add an explicit RESUMING->STOP transition [2] after the device config is loaded (which is the last data the device is expected to receive).
> If this transition fails, it means something was wrong with migration, and we can send src an error msg via return-path (and not continue to vm_start()).
> 
> Maybe this approach is less complicated than the first one, and it will also work if src VM was paused prior migration.
> I already tested some POC and it seems to be working (at least with an artificial error i injected in RESUMING->STOP transition).
> Kunkun, can you apply the following diff [3] and check if this solves the issue?
> 
> And in general, what do you think? Should we go with this approach or do you have other ideas?
> 
> Thanks.
> 
> [1] https://elixir.bootlin.com/linux/v6.16/source/include/uapi/linux/vfio.h#L1099
> [2] Today RESUMING->STOP is done implicitly by the VFIO driver as part of RESUMING->RUNNING transition.
> [3]
> 
> diff --git a/hw/vfio/migration-multifd.c b/hw/vfio/migration-multifd.c
> index e4785031a7..66f8461f02 100644
> --- a/hw/vfio/migration-multifd.c
> +++ b/hw/vfio/migration-multifd.c
> @@ -267,6 +267,12 @@ static bool vfio_load_bufs_thread_load_config(VFIODevice *vbasedev,
>       ret = vfio_load_device_config_state(f_in, vbasedev);
>       bql_unlock();
> 
> +    ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP,
> +                                   VFIO_DEVICE_STATE_ERROR, errp);
> +    if (ret) {
> +        return false;
> +    }
> +
>       if (ret < 0) {
>           error_setg(errp, "%s: vfio_load_device_config_state() failed: %d",
>                      vbasedev->name, ret);
> 
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 4c06e3db93..a707d17a5b 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -737,6 +737,8 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
>           switch (data) {
>           case VFIO_MIG_FLAG_DEV_CONFIG_STATE:
>           {
> +            Error *local_err = NULL;
> +
>               if (vfio_multifd_transfer_enabled(vbasedev)) {
>                   error_report("%s: got DEV_CONFIG_STATE in main migration "
>                                "channel but doing multifd transfer",
> @@ -744,7 +746,19 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
>                   return -EINVAL;
>               }
> 
> -            return vfio_load_device_config_state(f, opaque);
> +            ret = vfio_load_device_config_state(f, opaque);
> +            if (ret) {
> +                return ret;
> +            }
> +
> +            ret = vfio_migration_set_state_or_reset(
> +                vbasedev, VFIO_DEVICE_STATE_STOP, &local_err);
> +            if (ret) {
> +                error_report_err(local_err);
> +                return ret;
> +            }
> +
> +            return 0;
>           }
>           case VFIO_MIG_FLAG_DEV_SETUP_STATE:
>           {
> 
> diff --git a/migration/migration.c b/migration/migration.c
> index 10c216d25d..fd498c864d 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -91,6 +91,7 @@ enum mig_rp_message_type {
>       MIG_RP_MSG_RECV_BITMAP,  /* send recved_bitmap back to source */
>       MIG_RP_MSG_RESUME_ACK,   /* tell source that we are ready to resume */
>       MIG_RP_MSG_SWITCHOVER_ACK, /* Tell source it's OK to do switchover */
> +    MIG_RP_MSG_ERROR, /* Tell source that destination encountered an error */
> 
>       MIG_RP_MSG_MAX
>   };
> @@ -884,6 +885,11 @@ process_incoming_migration_co(void *opaque)
>       ret = qemu_loadvm_state(mis->from_src_file);
>       mis->loadvm_co = NULL;
> 
> +    if (ret) {
> +        migrate_send_rp_error(mis);
> +        error_report("SENT RP ERROR");
> +    }
> +
>   trace_vmstate_downtime_checkpoint("dst-precopy-loadvm-completed");
> 
>       ps = postcopy_state_get();
> @@ -1126,6 +1132,11 @@ bool migration_has_all_channels(void)
>       return true;
>   }
> +int migrate_send_rp_error(MigrationIncomingState *mis)
> +{
> +    return migrate_send_rp_message(mis, MIG_RP_MSG_ERROR, 0, NULL);
> +}
> +
>   int migrate_send_rp_switchover_ack(MigrationIncomingState *mis)
>   {
>       return migrate_send_rp_message(mis, MIG_RP_MSG_SWITCHOVER_ACK, 0, NULL);
> @@ -2614,6 +2625,10 @@ static void *source_return_path_thread(void *opaque)
>               trace_source_return_path_thread_switchover_acked();
>               break;
> 
> +        case MIG_RP_MSG_ERROR:
> +            error_setg(&err, "DST indicated error");
> +            goto out;
> +
>           default:
>               break;
>           }
> diff --git a/migration/migration.h b/migration/migration.h
> index 01329bf824..f11ff7a199 100644
> --- a/migration/migration.h
> +++ b/migration/migration.h
> @@ -553,6 +553,7 @@ void migrate_send_rp_recv_bitmap(MigrationIncomingState *mis,
>                                    char *block_name);
>   void migrate_send_rp_resume_ack(MigrationIncomingState *mis, uint32_t value);
>   int migrate_send_rp_switchover_ack(MigrationIncomingState *mis);
> +int migrate_send_rp_error(MigrationIncomingState *mis);
> 
>   void dirty_bitmap_mig_before_vm_start(void);
>   void dirty_bitmap_mig_cancel_outgoing(void);
> 
>> I suggest you open an issue on :
>>
>>   https://gitlab.com/qemu-project/qemu/-/issues/
>>
>> with a detailed description of your environment :
>>
>>   Host HW, Host OS, QEMU version, QEMU command line, Guest OS, etc.
>>
>> A template is provided when a new issue is created.
>>
>>
>> Thanks,
>>
>> C.
>>
>>
>>
> 
^ permalink raw reply	[flat|nested] 11+ messages in thread
- * Re: [Question] VFIO migration will not be aborted in a corner scenario
  2025-08-12 14:34     ` Cédric Le Goater
@ 2025-08-12 14:58       ` Peter Xu
  2025-08-18  6:10         ` Avihai Horon
  0 siblings, 1 reply; 11+ messages in thread
From: Peter Xu @ 2025-08-12 14:58 UTC (permalink / raw)
  To: Cédric Le Goater
  Cc: Avihai Horon, Kunkun Jiang, Alex Williamson, Yishai Hadas,
	open list:All patches CC here, wanghaibin.wang, Zenghui Yu,
	Fabiano Rosas
On Tue, Aug 12, 2025 at 04:34:34PM +0200, Cédric Le Goater wrote:
> +peter
> +fabiano
> 
> On 8/12/25 16:08, Avihai Horon wrote:
> > 
> > On 11/08/2025 19:34, Cédric Le Goater wrote:
> > > External email: Use caution opening links or attachments
> > > 
> > > 
> > > Hello,
> > > 
> > > + Avihai
> > > 
> > > On 8/11/25 18:02, Kunkun Jiang wrote:
> > > > Hi all,
> > > > 
> > > > While testing VFIO migration, I encountered an corner scenario case:
> > > > VFIO migration will not be aborted when the vfio device of dst-vm fails to transition from RESUMING to RUNNING state in vfio_vmstate_change.
> > > > 
> > > > I saw the comments in the vfio_vmstate_change but I don't understand why no action is taken for this situation.
> > > 
> > > There is error handling in vfio_vmstate_change() :
> > > 
> > >         /*
> > >          * Migration should be aborted in this case, but vm_state_notify()
> > >          * currently does not support reporting failures.
> > >          */
> > >         migration_file_set_error(ret, local_err);
> > 
> > Hmm, I think this only sets the error on src. On dst we don't have MigrationState->to_dst_file, so we end up just reporting the error.
> > But even if we did set it, no one is checking if there is a migration error after vm_start() is called in process_incoming_migration_bh().
> > 
> > > 
> > > > Allowing the live migration process to continue could cause unrecoverable damage to the VM.
> > 
> > What do you mean by unrecoverable damage to the VM?
> > If RESUMING->RUNNING transition fails, would a VFIO reset recover the device and allow the VM to continue operation with damage limited only to the VFIO device?
> > 
> > > > In this case, can we directly exit the dst-vm? Through the return-path mechanism, the src-vm can continue to run.
> > > > 
> > > > Looking forward to your reply.
> > > 
> > The straightforward solution, as you suggested, is to exit dst upon error in RESUMING->RUNNING transition and notify about it to src through the return-path.
> > However, I am not sure if failing the migration after vm_start() on dst is a bit late (as we start vCPUs and do migration_block_activate, etc.).
> > 
> > But I can think of another way to solve this, hopefully simpler.
> > According to VFIO migration uAPI [1]:
> >   * RESUMING -> STOP
> >   *   Leaving RESUMING terminates a data transfer session and indicates the
> >   *   device should complete processing of the data delivered by write(). The
> >   *   kernel migration driver should complete the incorporation of data written
> >   *   to the data transfer FD into the device internal state and perform
> >   *   final validity and consistency checking of the new device state. If the
> >   *   user provided data is found to be incomplete, inconsistent, or otherwise
> >   *   invalid, the migration driver must fail the SET_STATE ioctl and
> >   *   optionally go to the ERROR state as described below.
> > 
> > So, IIUC, we can add an explicit RESUMING->STOP transition [2] after the device config is loaded (which is the last data the device is expected to receive).
> > If this transition fails, it means something was wrong with migration, and we can send src an error msg via return-path (and not continue to vm_start()).
> > 
> > Maybe this approach is less complicated than the first one, and it will also work if src VM was paused prior migration.
> > I already tested some POC and it seems to be working (at least with an artificial error i injected in RESUMING->STOP transition).
> > Kunkun, can you apply the following diff [3] and check if this solves the issue?
> > 
> > And in general, what do you think? Should we go with this approach or do you have other ideas?
> > 
> > Thanks.
> > 
> > [1] https://elixir.bootlin.com/linux/v6.16/source/include/uapi/linux/vfio.h#L1099
> > [2] Today RESUMING->STOP is done implicitly by the VFIO driver as part of RESUMING->RUNNING transition.
> > [3]
> > 
> > diff --git a/hw/vfio/migration-multifd.c b/hw/vfio/migration-multifd.c
> > index e4785031a7..66f8461f02 100644
> > --- a/hw/vfio/migration-multifd.c
> > +++ b/hw/vfio/migration-multifd.c
> > @@ -267,6 +267,12 @@ static bool vfio_load_bufs_thread_load_config(VFIODevice *vbasedev,
> >       ret = vfio_load_device_config_state(f_in, vbasedev);
> >       bql_unlock();
> > 
> > +    ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP,
> > +                                   VFIO_DEVICE_STATE_ERROR, errp);
> > +    if (ret) {
> > +        return false;
> > +    }
> > +
> >       if (ret < 0) {
> >           error_setg(errp, "%s: vfio_load_device_config_state() failed: %d",
> >                      vbasedev->name, ret);
> > 
> > diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> > index 4c06e3db93..a707d17a5b 100644
> > --- a/hw/vfio/migration.c
> > +++ b/hw/vfio/migration.c
> > @@ -737,6 +737,8 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
> >           switch (data) {
> >           case VFIO_MIG_FLAG_DEV_CONFIG_STATE:
> >           {
> > +            Error *local_err = NULL;
> > +
> >               if (vfio_multifd_transfer_enabled(vbasedev)) {
> >                   error_report("%s: got DEV_CONFIG_STATE in main migration "
> >                                "channel but doing multifd transfer",
> > @@ -744,7 +746,19 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
> >                   return -EINVAL;
> >               }
> > 
> > -            return vfio_load_device_config_state(f, opaque);
> > +            ret = vfio_load_device_config_state(f, opaque);
> > +            if (ret) {
> > +                return ret;
> > +            }
> > +
> > +            ret = vfio_migration_set_state_or_reset(
> > +                vbasedev, VFIO_DEVICE_STATE_STOP, &local_err);
> > +            if (ret) {
> > +                error_report_err(local_err);
> > +                return ret;
> > +            }
> > +
> > +            return 0;
> >           }
> >           case VFIO_MIG_FLAG_DEV_SETUP_STATE:
> >           {
> > 
> > diff --git a/migration/migration.c b/migration/migration.c
> > index 10c216d25d..fd498c864d 100644
> > --- a/migration/migration.c
> > +++ b/migration/migration.c
> > @@ -91,6 +91,7 @@ enum mig_rp_message_type {
> >       MIG_RP_MSG_RECV_BITMAP,  /* send recved_bitmap back to source */
> >       MIG_RP_MSG_RESUME_ACK,   /* tell source that we are ready to resume */
> >       MIG_RP_MSG_SWITCHOVER_ACK, /* Tell source it's OK to do switchover */
> > +    MIG_RP_MSG_ERROR, /* Tell source that destination encountered an error */
> > 
> >       MIG_RP_MSG_MAX
> >   };
> > @@ -884,6 +885,11 @@ process_incoming_migration_co(void *opaque)
> >       ret = qemu_loadvm_state(mis->from_src_file);
> >       mis->loadvm_co = NULL;
> > 
> > +    if (ret) {
> > +        migrate_send_rp_error(mis);
> > +        error_report("SENT RP ERROR");
> > +    }
> > +
> >   trace_vmstate_downtime_checkpoint("dst-precopy-loadvm-completed");
> > 
> >       ps = postcopy_state_get();
> > @@ -1126,6 +1132,11 @@ bool migration_has_all_channels(void)
> >       return true;
> >   }
> > +int migrate_send_rp_error(MigrationIncomingState *mis)
> > +{
> > +    return migrate_send_rp_message(mis, MIG_RP_MSG_ERROR, 0, NULL);
> > +}
> > +
> >   int migrate_send_rp_switchover_ack(MigrationIncomingState *mis)
> >   {
> >       return migrate_send_rp_message(mis, MIG_RP_MSG_SWITCHOVER_ACK, 0, NULL);
> > @@ -2614,6 +2625,10 @@ static void *source_return_path_thread(void *opaque)
> >               trace_source_return_path_thread_switchover_acked();
> >               break;
> > 
> > +        case MIG_RP_MSG_ERROR:
> > +            error_setg(&err, "DST indicated error");
> > +            goto out;
If this is only a boolean, we can reuse RP_SHUT.  Likely we could pass in
an error to migration_incoming_state_destroy():
diff --git a/migration/migration.c b/migration/migration.c
index 42a2a6e8f2..2ebba7838a 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -441,7 +441,7 @@ void migration_incoming_transport_cleanup(MigrationIncomingState *mis)
     }
 }
 
-void migration_incoming_state_destroy(void)
+void migration_incoming_state_destroy(bool has_error)
 {
     struct MigrationIncomingState *mis = migration_incoming_get_current();
 
@@ -466,8 +466,11 @@ void migration_incoming_state_destroy(void)
     qemu_loadvm_state_cleanup(mis);
 
     if (mis->to_src_file) {
-        /* Tell source that we are done */
-        migrate_send_rp_shut(mis, qemu_file_get_error(mis->from_src_file) != 0);
+        /* Tell source whether load succeeded */
+        if (!has_error) {
+            has_error = qemu_file_get_error(mis->from_src_file) != 0;
+        }
+        migrate_send_rp_shut(mis, has_error);
         qemu_fclose(mis->to_src_file);
         mis->to_src_file = NULL;
     }
Maybe it'll even work as late as process_incoming_migration_bh(), where
vm_start() could fail - right now it couldn't, but if there'll be an error
message reported upward then logically it can also set has_error=1 for the
RP_SHUT message.  Src QEMU relies on RP_SHUT message and retval=0 to quit
src QEMU, otherwise QEMU should fail the migration and restart VM on src.
> > +
> >           default:
> >               break;
> >           }
> > diff --git a/migration/migration.h b/migration/migration.h
> > index 01329bf824..f11ff7a199 100644
> > --- a/migration/migration.h
> > +++ b/migration/migration.h
> > @@ -553,6 +553,7 @@ void migrate_send_rp_recv_bitmap(MigrationIncomingState *mis,
> >                                    char *block_name);
> >   void migrate_send_rp_resume_ack(MigrationIncomingState *mis, uint32_t value);
> >   int migrate_send_rp_switchover_ack(MigrationIncomingState *mis);
> > +int migrate_send_rp_error(MigrationIncomingState *mis);
> > 
> >   void dirty_bitmap_mig_before_vm_start(void);
> >   void dirty_bitmap_mig_cancel_outgoing(void);
> > 
> > > I suggest you open an issue on :
> > > 
> > >   https://gitlab.com/qemu-project/qemu/-/issues/
> > > 
> > > with a detailed description of your environment :
> > > 
> > >   Host HW, Host OS, QEMU version, QEMU command line, Guest OS, etc.
> > > 
> > > A template is provided when a new issue is created.
> > > 
> > > 
> > > Thanks,
> > > 
> > > C.
> > > 
> > > 
> > > 
> > 
> 
-- 
Peter Xu
^ permalink raw reply related	[flat|nested] 11+ messages in thread
- * Re: [Question] VFIO migration will not be aborted in a corner scenario
  2025-08-12 14:58       ` Peter Xu
@ 2025-08-18  6:10         ` Avihai Horon
  0 siblings, 0 replies; 11+ messages in thread
From: Avihai Horon @ 2025-08-18  6:10 UTC (permalink / raw)
  To: Peter Xu, Cédric Le Goater
  Cc: Kunkun Jiang, Alex Williamson, Yishai Hadas,
	open list:All patches CC here, wanghaibin.wang, Zenghui Yu,
	Fabiano Rosas
On 12/08/2025 17:58, Peter Xu wrote:
> External email: Use caution opening links or attachments
>
>
> On Tue, Aug 12, 2025 at 04:34:34PM +0200, Cédric Le Goater wrote:
>> +peter
>> +fabiano
>>
>> On 8/12/25 16:08, Avihai Horon wrote:
>>> On 11/08/2025 19:34, Cédric Le Goater wrote:
>>>> External email: Use caution opening links or attachments
>>>>
>>>>
>>>> Hello,
>>>>
>>>> + Avihai
>>>>
>>>> On 8/11/25 18:02, Kunkun Jiang wrote:
>>>>> Hi all,
>>>>>
>>>>> While testing VFIO migration, I encountered an corner scenario case:
>>>>> VFIO migration will not be aborted when the vfio device of dst-vm fails to transition from RESUMING to RUNNING state in vfio_vmstate_change.
>>>>>
>>>>> I saw the comments in the vfio_vmstate_change but I don't understand why no action is taken for this situation.
>>>> There is error handling in vfio_vmstate_change() :
>>>>
>>>>          /*
>>>>           * Migration should be aborted in this case, but vm_state_notify()
>>>>           * currently does not support reporting failures.
>>>>           */
>>>>          migration_file_set_error(ret, local_err);
>>> Hmm, I think this only sets the error on src. On dst we don't have MigrationState->to_dst_file, so we end up just reporting the error.
>>> But even if we did set it, no one is checking if there is a migration error after vm_start() is called in process_incoming_migration_bh().
>>>
>>>>> Allowing the live migration process to continue could cause unrecoverable damage to the VM.
>>> What do you mean by unrecoverable damage to the VM?
>>> If RESUMING->RUNNING transition fails, would a VFIO reset recover the device and allow the VM to continue operation with damage limited only to the VFIO device?
>>>
>>>>> In this case, can we directly exit the dst-vm? Through the return-path mechanism, the src-vm can continue to run.
>>>>>
>>>>> Looking forward to your reply.
>>> The straightforward solution, as you suggested, is to exit dst upon error in RESUMING->RUNNING transition and notify about it to src through the return-path.
>>> However, I am not sure if failing the migration after vm_start() on dst is a bit late (as we start vCPUs and do migration_block_activate, etc.).
>>>
>>> But I can think of another way to solve this, hopefully simpler.
>>> According to VFIO migration uAPI [1]:
>>>    * RESUMING -> STOP
>>>    *   Leaving RESUMING terminates a data transfer session and indicates the
>>>    *   device should complete processing of the data delivered by write(). The
>>>    *   kernel migration driver should complete the incorporation of data written
>>>    *   to the data transfer FD into the device internal state and perform
>>>    *   final validity and consistency checking of the new device state. If the
>>>    *   user provided data is found to be incomplete, inconsistent, or otherwise
>>>    *   invalid, the migration driver must fail the SET_STATE ioctl and
>>>    *   optionally go to the ERROR state as described below.
>>>
>>> So, IIUC, we can add an explicit RESUMING->STOP transition [2] after the device config is loaded (which is the last data the device is expected to receive).
>>> If this transition fails, it means something was wrong with migration, and we can send src an error msg via return-path (and not continue to vm_start()).
>>>
>>> Maybe this approach is less complicated than the first one, and it will also work if src VM was paused prior migration.
>>> I already tested some POC and it seems to be working (at least with an artificial error i injected in RESUMING->STOP transition).
>>> Kunkun, can you apply the following diff [3] and check if this solves the issue?
>>>
>>> And in general, what do you think? Should we go with this approach or do you have other ideas?
>>>
>>> Thanks.
>>>
>>> [1] https://elixir.bootlin.com/linux/v6.16/source/include/uapi/linux/vfio.h#L1099
>>> [2] Today RESUMING->STOP is done implicitly by the VFIO driver as part of RESUMING->RUNNING transition.
>>> [3]
>>>
>>> diff --git a/hw/vfio/migration-multifd.c b/hw/vfio/migration-multifd.c
>>> index e4785031a7..66f8461f02 100644
>>> --- a/hw/vfio/migration-multifd.c
>>> +++ b/hw/vfio/migration-multifd.c
>>> @@ -267,6 +267,12 @@ static bool vfio_load_bufs_thread_load_config(VFIODevice *vbasedev,
>>>        ret = vfio_load_device_config_state(f_in, vbasedev);
>>>        bql_unlock();
>>>
>>> +    ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP,
>>> +                                   VFIO_DEVICE_STATE_ERROR, errp);
>>> +    if (ret) {
>>> +        return false;
>>> +    }
>>> +
>>>        if (ret < 0) {
>>>            error_setg(errp, "%s: vfio_load_device_config_state() failed: %d",
>>>                       vbasedev->name, ret);
>>>
>>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>>> index 4c06e3db93..a707d17a5b 100644
>>> --- a/hw/vfio/migration.c
>>> +++ b/hw/vfio/migration.c
>>> @@ -737,6 +737,8 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
>>>            switch (data) {
>>>            case VFIO_MIG_FLAG_DEV_CONFIG_STATE:
>>>            {
>>> +            Error *local_err = NULL;
>>> +
>>>                if (vfio_multifd_transfer_enabled(vbasedev)) {
>>>                    error_report("%s: got DEV_CONFIG_STATE in main migration "
>>>                                 "channel but doing multifd transfer",
>>> @@ -744,7 +746,19 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
>>>                    return -EINVAL;
>>>                }
>>>
>>> -            return vfio_load_device_config_state(f, opaque);
>>> +            ret = vfio_load_device_config_state(f, opaque);
>>> +            if (ret) {
>>> +                return ret;
>>> +            }
>>> +
>>> +            ret = vfio_migration_set_state_or_reset(
>>> +                vbasedev, VFIO_DEVICE_STATE_STOP, &local_err);
>>> +            if (ret) {
>>> +                error_report_err(local_err);
>>> +                return ret;
>>> +            }
>>> +
>>> +            return 0;
>>>            }
>>>            case VFIO_MIG_FLAG_DEV_SETUP_STATE:
>>>            {
>>>
>>> diff --git a/migration/migration.c b/migration/migration.c
>>> index 10c216d25d..fd498c864d 100644
>>> --- a/migration/migration.c
>>> +++ b/migration/migration.c
>>> @@ -91,6 +91,7 @@ enum mig_rp_message_type {
>>>        MIG_RP_MSG_RECV_BITMAP,  /* send recved_bitmap back to source */
>>>        MIG_RP_MSG_RESUME_ACK,   /* tell source that we are ready to resume */
>>>        MIG_RP_MSG_SWITCHOVER_ACK, /* Tell source it's OK to do switchover */
>>> +    MIG_RP_MSG_ERROR, /* Tell source that destination encountered an error */
>>>
>>>        MIG_RP_MSG_MAX
>>>    };
>>> @@ -884,6 +885,11 @@ process_incoming_migration_co(void *opaque)
>>>        ret = qemu_loadvm_state(mis->from_src_file);
>>>        mis->loadvm_co = NULL;
>>>
>>> +    if (ret) {
>>> +        migrate_send_rp_error(mis);
>>> +        error_report("SENT RP ERROR");
>>> +    }
>>> +
>>>    trace_vmstate_downtime_checkpoint("dst-precopy-loadvm-completed");
>>>
>>>        ps = postcopy_state_get();
>>> @@ -1126,6 +1132,11 @@ bool migration_has_all_channels(void)
>>>        return true;
>>>    }
>>> +int migrate_send_rp_error(MigrationIncomingState *mis)
>>> +{
>>> +    return migrate_send_rp_message(mis, MIG_RP_MSG_ERROR, 0, NULL);
>>> +}
>>> +
>>>    int migrate_send_rp_switchover_ack(MigrationIncomingState *mis)
>>>    {
>>>        return migrate_send_rp_message(mis, MIG_RP_MSG_SWITCHOVER_ACK, 0, NULL);
>>> @@ -2614,6 +2625,10 @@ static void *source_return_path_thread(void *opaque)
>>>                trace_source_return_path_thread_switchover_acked();
>>>                break;
>>>
>>> +        case MIG_RP_MSG_ERROR:
>>> +            error_setg(&err, "DST indicated error");
>>> +            goto out;
> If this is only a boolean, we can reuse RP_SHUT.  Likely we could pass in
> an error to migration_incoming_state_destroy():
>
> diff --git a/migration/migration.c b/migration/migration.c
> index 42a2a6e8f2..2ebba7838a 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -441,7 +441,7 @@ void migration_incoming_transport_cleanup(MigrationIncomingState *mis)
>       }
>   }
>
> -void migration_incoming_state_destroy(void)
> +void migration_incoming_state_destroy(bool has_error)
>   {
>       struct MigrationIncomingState *mis = migration_incoming_get_current();
>
> @@ -466,8 +466,11 @@ void migration_incoming_state_destroy(void)
>       qemu_loadvm_state_cleanup(mis);
>
>       if (mis->to_src_file) {
> -        /* Tell source that we are done */
> -        migrate_send_rp_shut(mis, qemu_file_get_error(mis->from_src_file) != 0);
> +        /* Tell source whether load succeeded */
> +        if (!has_error) {
> +            has_error = qemu_file_get_error(mis->from_src_file) != 0;
> +        }
> +        migrate_send_rp_shut(mis, has_error);
>           qemu_fclose(mis->to_src_file);
>           mis->to_src_file = NULL;
>       }
>
> Maybe it'll even work as late as process_incoming_migration_bh(), where
> vm_start() could fail - right now it couldn't, but if there'll be an error
> message reported upward then logically it can also set has_error=1 for the
> RP_SHUT message.  Src QEMU relies on RP_SHUT message and retval=0 to quit
> src QEMU, otherwise QEMU should fail the migration and restart VM on src.
Ah, nice! I didn't notice RP_SHUT can take an error value.
Will try that, thanks!
>
>>> +
>>>            default:
>>>                break;
>>>            }
>>> diff --git a/migration/migration.h b/migration/migration.h
>>> index 01329bf824..f11ff7a199 100644
>>> --- a/migration/migration.h
>>> +++ b/migration/migration.h
>>> @@ -553,6 +553,7 @@ void migrate_send_rp_recv_bitmap(MigrationIncomingState *mis,
>>>                                     char *block_name);
>>>    void migrate_send_rp_resume_ack(MigrationIncomingState *mis, uint32_t value);
>>>    int migrate_send_rp_switchover_ack(MigrationIncomingState *mis);
>>> +int migrate_send_rp_error(MigrationIncomingState *mis);
>>>
>>>    void dirty_bitmap_mig_before_vm_start(void);
>>>    void dirty_bitmap_mig_cancel_outgoing(void);
>>>
>>>> I suggest you open an issue on :
>>>>
>>>>    https://gitlab.com/qemu-project/qemu/-/issues/
>>>>
>>>> with a detailed description of your environment :
>>>>
>>>>    Host HW, Host OS, QEMU version, QEMU command line, Guest OS, etc.
>>>>
>>>> A template is provided when a new issue is created.
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> C.
>>>>
>>>>
>>>>
> --
> Peter Xu
>
^ permalink raw reply	[flat|nested] 11+ messages in thread
 
 
- * Re: [Question] VFIO migration will not be aborted in a corner scenario
  2025-08-12 14:08   ` Avihai Horon
  2025-08-12 14:34     ` Cédric Le Goater
@ 2025-08-12 14:56     ` Cédric Le Goater
  2025-08-13 12:18       ` Kunkun Jiang via
  2025-08-18  6:13       ` Avihai Horon
  2025-08-13 12:18     ` Kunkun Jiang via
  2 siblings, 2 replies; 11+ messages in thread
From: Cédric Le Goater @ 2025-08-12 14:56 UTC (permalink / raw)
  To: Avihai Horon, Kunkun Jiang, Alex Williamson, Yishai Hadas
  Cc: open list:All patches CC here, wanghaibin.wang, Zenghui Yu
On 8/12/25 16:08, Avihai Horon wrote:
> 
> On 11/08/2025 19:34, Cédric Le Goater wrote:
>> External email: Use caution opening links or attachments
>>
>>
>> Hello,
>>
>> + Avihai
>>
>> On 8/11/25 18:02, Kunkun Jiang wrote:
>>> Hi all,
>>>
>>> While testing VFIO migration, I encountered an corner scenario case:
>>> VFIO migration will not be aborted when the vfio device of dst-vm fails to transition from RESUMING to RUNNING state in vfio_vmstate_change.
>>>
>>> I saw the comments in the vfio_vmstate_change but I don't understand why no action is taken for this situation.
>>
>> There is error handling in vfio_vmstate_change() :
>>
>>         /*
>>          * Migration should be aborted in this case, but vm_state_notify()
>>          * currently does not support reporting failures.
>>          */
>>         migration_file_set_error(ret, local_err);
> 
> Hmm, I think this only sets the error on src. On dst we don't have MigrationState->to_dst_file, so we end up just reporting the error.
> But even if we did set it, no one is checking if there is a migration error after vm_start() is called in process_incoming_migration_bh().
> 
>>
>>> Allowing the live migration process to continue could cause unrecoverable damage to the VM.
> 
> What do you mean by unrecoverable damage to the VM?
> If RESUMING->RUNNING transition fails, would a VFIO reset recover the device and allow the VM to continue operation with damage limited only to the VFIO device?
> 
>>> In this case, can we directly exit the dst-vm? Through the return-path mechanism, the src-vm can continue to run.
>>>
>>> Looking forward to your reply.
>>
> The straightforward solution, as you suggested, is to exit dst upon error in RESUMING->RUNNING transition and notify about it to src through the return-path.
> However, I am not sure if failing the migration after vm_start() on dst is a bit late (as we start vCPUs and do migration_block_activate, etc.).
> 
> But I can think of another way to solve this, hopefully simpler.
> According to VFIO migration uAPI [1]:
>   * RESUMING -> STOP
>   *   Leaving RESUMING terminates a data transfer session and indicates the
>   *   device should complete processing of the data delivered by write(). The
>   *   kernel migration driver should complete the incorporation of data written
>   *   to the data transfer FD into the device internal state and perform
>   *   final validity and consistency checking of the new device state. If the
>   *   user provided data is found to be incomplete, inconsistent, or otherwise
>   *   invalid, the migration driver must fail the SET_STATE ioctl and
>   *   optionally go to the ERROR state as described below.
> 
> So, IIUC, we can add an explicit RESUMING->STOP transition [2] after the device config is loaded (which is the last data the device is expected to receive).
> If this transition fails, it means something was wrong with migration, and we can send src an error msg via return-path (and not continue to vm_start()).
> 
> Maybe this approach is less complicated than the first one, and it will also work if src VM was paused prior migration.
> I already tested some POC and it seems to be working (at least with an artificial error i injected in RESUMING->STOP transition).
> Kunkun, can you apply the following diff [3] and check if this solves the issue?
> 
> And in general, what do you think? Should we go with this approach or do you have other ideas?
> 
> Thanks.
> 
> [1] https://elixir.bootlin.com/linux/v6.16/source/include/uapi/linux/vfio.h#L1099
> [2] Today RESUMING->STOP is done implicitly by the VFIO driver as part of RESUMING->RUNNING transition.
> [3]
Avihai,
Could you please send an RFC patch with Peter and Fabiano in cc: ?
This will help to discuss the proposal and keep track of the issue.
Kunkun Jiang,
Could you please share details on your environment ?
Thanks,
C.
^ permalink raw reply	[flat|nested] 11+ messages in thread 
- * Re: [Question] VFIO migration will not be aborted in a corner scenario
  2025-08-12 14:56     ` Cédric Le Goater
@ 2025-08-13 12:18       ` Kunkun Jiang via
  2025-08-18  6:13       ` Avihai Horon
  1 sibling, 0 replies; 11+ messages in thread
From: Kunkun Jiang via @ 2025-08-13 12:18 UTC (permalink / raw)
  To: Cédric Le Goater, Avihai Horon, Alex Williamson,
	Yishai Hadas
  Cc: open list:All patches CC here, wanghaibin.wang, Zenghui Yu
Hi Cédric,
On 2025/8/12 22:56, Cédric Le Goater wrote:
> On 8/12/25 16:08, Avihai Horon wrote:
>>
>> On 11/08/2025 19:34, Cédric Le Goater wrote:
>>> External email: Use caution opening links or attachments
>>>
>>>
>>> Hello,
>>>
>>> + Avihai
>>>
>>> On 8/11/25 18:02, Kunkun Jiang wrote:
>>>> Hi all,
>>>>
>>>> While testing VFIO migration, I encountered an corner scenario case:
>>>> VFIO migration will not be aborted when the vfio device of dst-vm 
>>>> fails to transition from RESUMING to RUNNING state in 
>>>> vfio_vmstate_change.
>>>>
>>>> I saw the comments in the vfio_vmstate_change but I don't understand 
>>>> why no action is taken for this situation.
>>>
>>> There is error handling in vfio_vmstate_change() :
>>>
>>>         /*
>>>          * Migration should be aborted in this case, but 
>>> vm_state_notify()
>>>          * currently does not support reporting failures.
>>>          */
>>>         migration_file_set_error(ret, local_err);
>>
>> Hmm, I think this only sets the error on src. On dst we don't have 
>> MigrationState->to_dst_file, so we end up just reporting the error.
>> But even if we did set it, no one is checking if there is a migration 
>> error after vm_start() is called in process_incoming_migration_bh().
>>
>>>
>>>> Allowing the live migration process to continue could cause 
>>>> unrecoverable damage to the VM.
>>
>> What do you mean by unrecoverable damage to the VM?
>> If RESUMING->RUNNING transition fails, would a VFIO reset recover the 
>> device and allow the VM to continue operation with damage limited only 
>> to the VFIO device?
>>
>>>> In this case, can we directly exit the dst-vm? Through the 
>>>> return-path mechanism, the src-vm can continue to run.
>>>>
>>>> Looking forward to your reply.
>>>
>> The straightforward solution, as you suggested, is to exit dst upon 
>> error in RESUMING->RUNNING transition and notify about it to src 
>> through the return-path.
>> However, I am not sure if failing the migration after vm_start() on 
>> dst is a bit late (as we start vCPUs and do migration_block_activate, 
>> etc.).
>>
>> But I can think of another way to solve this, hopefully simpler.
>> According to VFIO migration uAPI [1]:
>>   * RESUMING -> STOP
>>   *   Leaving RESUMING terminates a data transfer session and 
>> indicates the
>>   *   device should complete processing of the data delivered by 
>> write(). The
>>   *   kernel migration driver should complete the incorporation of 
>> data written
>>   *   to the data transfer FD into the device internal state and perform
>>   *   final validity and consistency checking of the new device state. 
>> If the
>>   *   user provided data is found to be incomplete, inconsistent, or 
>> otherwise
>>   *   invalid, the migration driver must fail the SET_STATE ioctl and
>>   *   optionally go to the ERROR state as described below.
>>
>> So, IIUC, we can add an explicit RESUMING->STOP transition [2] after 
>> the device config is loaded (which is the last data the device is 
>> expected to receive).
>> If this transition fails, it means something was wrong with migration, 
>> and we can send src an error msg via return-path (and not continue to 
>> vm_start()).
>>
>> Maybe this approach is less complicated than the first one, and it 
>> will also work if src VM was paused prior migration.
>> I already tested some POC and it seems to be working (at least with an 
>> artificial error i injected in RESUMING->STOP transition).
>> Kunkun, can you apply the following diff [3] and check if this solves 
>> the issue?
>>
>> And in general, what do you think? Should we go with this approach or 
>> do you have other ideas?
>>
>> Thanks.
>>
>> [1] 
>> https://elixir.bootlin.com/linux/v6.16/source/include/uapi/linux/vfio.h#L1099 
>>
>> [2] Today RESUMING->STOP is done implicitly by the VFIO driver as part 
>> of RESUMING->RUNNING transition.
>> [3]
> 
> Avihai,
> 
> Could you please send an RFC patch with Peter and Fabiano in cc: ?
> This will help to discuss the proposal and keep track of the issue.
> 
> 
> Kunkun Jiang,
> 
> Could you please share details on your environment ?
Sorry for the late reply.My test scenario is this:
Server:		Kunpeng-920
VFUI device:	SEC, a Kunpeng hardware accelerator
Host OS:	openEuler 24.03, kernel 6.6, qemu 8.2
Guest OS:	openEuler 24.03
Test steps:
1. Start a VM and run a task using SEC inside the VM
2. Start migration and inject an error into the SEC used by dst-VM
3. The SEC used by dst-VM will be reset. So RESUMING->RUNNING transition 
will fail with probability.(The timing has to be just right)
> 
> Thanks,
> 
> C.
> 
> 
> .
^ permalink raw reply	[flat|nested] 11+ messages in thread 
- * Re: [Question] VFIO migration will not be aborted in a corner scenario
  2025-08-12 14:56     ` Cédric Le Goater
  2025-08-13 12:18       ` Kunkun Jiang via
@ 2025-08-18  6:13       ` Avihai Horon
  1 sibling, 0 replies; 11+ messages in thread
From: Avihai Horon @ 2025-08-18  6:13 UTC (permalink / raw)
  To: Cédric Le Goater, Kunkun Jiang, Alex Williamson,
	Yishai Hadas
  Cc: open list:All patches CC here, wanghaibin.wang, Zenghui Yu
On 12/08/2025 17:56, Cédric Le Goater wrote:
> External email: Use caution opening links or attachments
>
>
> On 8/12/25 16:08, Avihai Horon wrote:
>>
>> On 11/08/2025 19:34, Cédric Le Goater wrote:
>>> External email: Use caution opening links or attachments
>>>
>>>
>>> Hello,
>>>
>>> + Avihai
>>>
>>> On 8/11/25 18:02, Kunkun Jiang wrote:
>>>> Hi all,
>>>>
>>>> While testing VFIO migration, I encountered an corner scenario case:
>>>> VFIO migration will not be aborted when the vfio device of dst-vm 
>>>> fails to transition from RESUMING to RUNNING state in 
>>>> vfio_vmstate_change.
>>>>
>>>> I saw the comments in the vfio_vmstate_change but I don't 
>>>> understand why no action is taken for this situation.
>>>
>>> There is error handling in vfio_vmstate_change() :
>>>
>>>         /*
>>>          * Migration should be aborted in this case, but 
>>> vm_state_notify()
>>>          * currently does not support reporting failures.
>>>          */
>>>         migration_file_set_error(ret, local_err);
>>
>> Hmm, I think this only sets the error on src. On dst we don't have 
>> MigrationState->to_dst_file, so we end up just reporting the error.
>> But even if we did set it, no one is checking if there is a migration 
>> error after vm_start() is called in process_incoming_migration_bh().
>>
>>>
>>>> Allowing the live migration process to continue could cause 
>>>> unrecoverable damage to the VM.
>>
>> What do you mean by unrecoverable damage to the VM?
>> If RESUMING->RUNNING transition fails, would a VFIO reset recover the 
>> device and allow the VM to continue operation with damage limited 
>> only to the VFIO device?
>>
>>>> In this case, can we directly exit the dst-vm? Through the 
>>>> return-path mechanism, the src-vm can continue to run.
>>>>
>>>> Looking forward to your reply.
>>>
>> The straightforward solution, as you suggested, is to exit dst upon 
>> error in RESUMING->RUNNING transition and notify about it to src 
>> through the return-path.
>> However, I am not sure if failing the migration after vm_start() on 
>> dst is a bit late (as we start vCPUs and do migration_block_activate, 
>> etc.).
>>
>> But I can think of another way to solve this, hopefully simpler.
>> According to VFIO migration uAPI [1]:
>>   * RESUMING -> STOP
>>   *   Leaving RESUMING terminates a data transfer session and 
>> indicates the
>>   *   device should complete processing of the data delivered by 
>> write(). The
>>   *   kernel migration driver should complete the incorporation of 
>> data written
>>   *   to the data transfer FD into the device internal state and perform
>>   *   final validity and consistency checking of the new device 
>> state. If the
>>   *   user provided data is found to be incomplete, inconsistent, or 
>> otherwise
>>   *   invalid, the migration driver must fail the SET_STATE ioctl and
>>   *   optionally go to the ERROR state as described below.
>>
>> So, IIUC, we can add an explicit RESUMING->STOP transition [2] after 
>> the device config is loaded (which is the last data the device is 
>> expected to receive).
>> If this transition fails, it means something was wrong with 
>> migration, and we can send src an error msg via return-path (and not 
>> continue to vm_start()).
>>
>> Maybe this approach is less complicated than the first one, and it 
>> will also work if src VM was paused prior migration.
>> I already tested some POC and it seems to be working (at least with 
>> an artificial error i injected in RESUMING->STOP transition).
>> Kunkun, can you apply the following diff [3] and check if this solves 
>> the issue?
>>
>> And in general, what do you think? Should we go with this approach or 
>> do you have other ideas?
>>
>> Thanks.
>>
>> [1] 
>> https://elixir.bootlin.com/linux/v6.16/source/include/uapi/linux/vfio.h#L1099
>> [2] Today RESUMING->STOP is done implicitly by the VFIO driver as 
>> part of RESUMING->RUNNING transition.
>> [3]
>
> Avihai,
>
> Could you please send an RFC patch with Peter and Fabiano in cc: ?
> This will help to discuss the proposal and keep track of the issue.
>
Yes, of course. But I am a bit busy with other stuff, so it may take me 
a few days to do that.
Of course, If anyone would like, feel free to send your own proposal.
Thanks.
>
> Kunkun Jiang,
>
> Could you please share details on your environment ?
>
> Thanks,
>
> C.
>
^ permalink raw reply	[flat|nested] 11+ messages in thread 
 
- * Re: [Question] VFIO migration will not be aborted in a corner scenario
  2025-08-12 14:08   ` Avihai Horon
  2025-08-12 14:34     ` Cédric Le Goater
  2025-08-12 14:56     ` Cédric Le Goater
@ 2025-08-13 12:18     ` Kunkun Jiang via
  2025-08-18  6:44       ` Avihai Horon
  2 siblings, 1 reply; 11+ messages in thread
From: Kunkun Jiang via @ 2025-08-13 12:18 UTC (permalink / raw)
  To: Avihai Horon, Cédric Le Goater, Alex Williamson,
	Yishai Hadas
  Cc: open list:All patches CC here, wanghaibin.wang, Zenghui Yu
Hi Avihai,
On 2025/8/12 22:08, Avihai Horon wrote:
> 
> On 11/08/2025 19:34, Cédric Le Goater wrote:
>> External email: Use caution opening links or attachments
>>
>>
>> Hello,
>>
>> + Avihai
>>
>> On 8/11/25 18:02, Kunkun Jiang wrote:
>>> Hi all,
>>>
>>> While testing VFIO migration, I encountered an corner scenario case:
>>> VFIO migration will not be aborted when the vfio device of dst-vm 
>>> fails to transition from RESUMING to RUNNING state in 
>>> vfio_vmstate_change.
>>>
>>> I saw the comments in the vfio_vmstate_change but I don't understand 
>>> why no action is taken for this situation.
>>
>> There is error handling in vfio_vmstate_change() :
>>
>>         /*
>>          * Migration should be aborted in this case, but 
>> vm_state_notify()
>>          * currently does not support reporting failures.
>>          */
>>         migration_file_set_error(ret, local_err);
> 
> Hmm, I think this only sets the error on src. On dst we don't have 
> MigrationState->to_dst_file, so we end up just reporting the erro > But even if we did set it, no one is checking if there is a migration
> error after vm_start() is called in process_incoming_migration_bh().
Yes, that's what I mean.
> 
>>
>>> Allowing the live migration process to continue could cause 
>>> unrecoverable damage to the VM.
> 
> What do you mean by unrecoverable damage to the VM?
> If RESUMING->RUNNING transition fails, would a VFIO reset recover the 
> device and allow the VM to continue operation with damage limited only 
> to the VFIO device?
I didn't express myself clearly, let me explain again.
In my test, I ran a task inside the VM that was using this vfio device.
In this corner scenario, the task is aborted immediately and the VM is 
still running normally. I mean it is unacceptable that the task is 
aborted directly.
> 
>>> In this case, can we directly exit the dst-vm? Through the 
>>> return-path mechanism, the src-vm can continue to run.
>>>
>>> Looking forward to your reply.
>>
> The straightforward solution, as you suggested, is to exit dst upon 
> error in RESUMING->RUNNING transition and notify about it to src through 
> the return-path.
> However, I am not sure if failing the migration after vm_start() on dst 
> is a bit late (as we start vCPUs and do migration_block_activate, etc.).
I think vCPUs have not started yet. vCPUs will be started only after the 
all callbacks of vm_state_notify are executed.
If there are any problems with my analysis, please point them out.
> 
> But I can think of another way to solve this, hopefully simpler.
> According to VFIO migration uAPI [1]:
>   * RESUMING -> STOP
>   *   Leaving RESUMING terminates a data transfer session and indicates the
>   *   device should complete processing of the data delivered by 
> write(). The
>   *   kernel migration driver should complete the incorporation of data 
> written
>   *   to the data transfer FD into the device internal state and perform
>   *   final validity and consistency checking of the new device state. 
> If the
>   *   user provided data is found to be incomplete, inconsistent, or 
> otherwise
>   *   invalid, the migration driver must fail the SET_STATE ioctl and
>   *   optionally go to the ERROR state as described below.
> 
> So, IIUC, we can add an explicit RESUMING->STOP transition [2] after the 
> device config is loaded (which is the last data the device is expected 
> to receive).
> If this transition fails, it means something was wrong with migration, 
> and we can send src an error msg via return-path (and not continue to 
> vm_start()).
> 
> Maybe this approach is less complicated than the first one, and it will 
> also work if src VM was paused prior migration.
> I already tested some POC and it seems to be working (at least with an 
> artificial error i injected in RESUMING->STOP transition).
> Kunkun, can you apply the following diff [3] and check if this solves 
> the issue?
Ok, I'll try it.
> 
> And in general, what do you think? Should we go with this approach or do 
> you have other ideas?
My current approach is rather crude. As I said above, I think vcpu has 
not started yet, so I changed it like this.
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 4c06e3db93..70ccb706c6 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -878,7 +878,7 @@ static void vfio_vmstate_change_prepare(void 
*opaque, bool running,
  static void vfio_vmstate_change(void *opaque, bool running, RunState 
state)
  {
      VFIODevice *vbasedev = opaque;
-    enum vfio_device_mig_state new_state;
+    enum vfio_device_mig_state new_state, pre_state;
      Error *local_err = NULL;
      int ret;
@@ -899,6 +899,10 @@ static void vfio_vmstate_change(void *opaque, bool 
running, RunState state)
           * currently does not support reporting failures.
           */
          migration_file_set_error(ret, local_err);
+
+        if (pre_state == VFIO_DEVICE_STATE_RESUMING) {
+            exit(EXIT_FAILURE);
+        }
      }
> 
> Thanks.
> 
> [1] 
> https://elixir.bootlin.com/linux/v6.16/source/include/uapi/linux/vfio.h#L1099 
> 
> [2] Today RESUMING->STOP is done implicitly by the VFIO driver as part 
> of RESUMING->RUNNING transition.
> [3]
> 
> diff --git a/hw/vfio/migration-multifd.c b/hw/vfio/migration-multifd.c
> index e4785031a7..66f8461f02 100644
> --- a/hw/vfio/migration-multifd.c
> +++ b/hw/vfio/migration-multifd.c
> @@ -267,6 +267,12 @@ static bool 
> vfio_load_bufs_thread_load_config(VFIODevice *vbasedev,
>       ret = vfio_load_device_config_state(f_in, vbasedev);
>       bql_unlock();
> 
> +    ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP,
> +                                   VFIO_DEVICE_STATE_ERROR, errp);
> +    if (ret) {
> +        return false;
> +    }
> +
>       if (ret < 0) {
>           error_setg(errp, "%s: vfio_load_device_config_state() failed: 
> %d",
>                      vbasedev->name, ret);
> 
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 4c06e3db93..a707d17a5b 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -737,6 +737,8 @@ static int vfio_load_state(QEMUFile *f, void 
> *opaque, int version_id)
>           switch (data) {
>           case VFIO_MIG_FLAG_DEV_CONFIG_STATE:
>           {
> +            Error *local_err = NULL;
> +
>               if (vfio_multifd_transfer_enabled(vbasedev)) {
>                   error_report("%s: got DEV_CONFIG_STATE in main 
> migration "
>                                "channel but doing multifd transfer",
> @@ -744,7 +746,19 @@ static int vfio_load_state(QEMUFile *f, void 
> *opaque, int version_id)
>                   return -EINVAL;
>               }
> 
> -            return vfio_load_device_config_state(f, opaque);
> +            ret = vfio_load_device_config_state(f, opaque);
> +            if (ret) {
> +                return ret;
> +            }
> +
> +            ret = vfio_migration_set_state_or_reset(
> +                vbasedev, VFIO_DEVICE_STATE_STOP, &local_err);
> +            if (ret) {
> +                error_report_err(local_err);
> +                return ret;
> +            }
> +
> +            return 0;
>           }
>           case VFIO_MIG_FLAG_DEV_SETUP_STATE:
>           {
> 
> diff --git a/migration/migration.c b/migration/migration.c
> index 10c216d25d..fd498c864d 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -91,6 +91,7 @@ enum mig_rp_message_type {
>       MIG_RP_MSG_RECV_BITMAP,  /* send recved_bitmap back to source */
>       MIG_RP_MSG_RESUME_ACK,   /* tell source that we are ready to 
> resume */
>       MIG_RP_MSG_SWITCHOVER_ACK, /* Tell source it's OK to do switchover */
> +    MIG_RP_MSG_ERROR, /* Tell source that destination encountered an 
> error */
> 
>       MIG_RP_MSG_MAX
>   };
> @@ -884,6 +885,11 @@ process_incoming_migration_co(void *opaque)
>       ret = qemu_loadvm_state(mis->from_src_file);
>       mis->loadvm_co = NULL;
> 
> +    if (ret) {
> +        migrate_send_rp_error(mis);
> +        error_report("SENT RP ERROR");
> +    }
> +
>   trace_vmstate_downtime_checkpoint("dst-precopy-loadvm-completed");
> 
>       ps = postcopy_state_get();
> @@ -1126,6 +1132,11 @@ bool migration_has_all_channels(void)
>       return true;
>   }
> +int migrate_send_rp_error(MigrationIncomingState *mis)
> +{
> +    return migrate_send_rp_message(mis, MIG_RP_MSG_ERROR, 0, NULL);
> +}
> +
>   int migrate_send_rp_switchover_ack(MigrationIncomingState *mis)
>   {
>       return migrate_send_rp_message(mis, MIG_RP_MSG_SWITCHOVER_ACK, 0, 
> NULL);
> @@ -2614,6 +2625,10 @@ static void *source_return_path_thread(void *opaque)
>               trace_source_return_path_thread_switchover_acked();
>               break;
> 
> +        case MIG_RP_MSG_ERROR:
> +            error_setg(&err, "DST indicated error");
> +            goto out;
> +
>           default:
>               break;
>           }
> diff --git a/migration/migration.h b/migration/migration.h
> index 01329bf824..f11ff7a199 100644
> --- a/migration/migration.h
> +++ b/migration/migration.h
> @@ -553,6 +553,7 @@ void 
> migrate_send_rp_recv_bitmap(MigrationIncomingState *mis,
>                                    char *block_name);
>   void migrate_send_rp_resume_ack(MigrationIncomingState *mis, uint32_t 
> value);
>   int migrate_send_rp_switchover_ack(MigrationIncomingState *mis);
> +int migrate_send_rp_error(MigrationIncomingState *mis);
> 
>   void dirty_bitmap_mig_before_vm_start(void);
>   void dirty_bitmap_mig_cancel_outgoing(void);
> 
>> I suggest you open an issue on :
>>
>>   https://gitlab.com/qemu-project/qemu/-/issues/
>>
>> with a detailed description of your environment :
>>
>>   Host HW, Host OS, QEMU version, QEMU command line, Guest OS, etc.
>>
>> A template is provided when a new issue is created.
>>
>>
>> Thanks,
>>
>> C.
>>
>>
>>
> .
^ permalink raw reply related	[flat|nested] 11+ messages in thread
- * Re: [Question] VFIO migration will not be aborted in a corner scenario
  2025-08-13 12:18     ` Kunkun Jiang via
@ 2025-08-18  6:44       ` Avihai Horon
  0 siblings, 0 replies; 11+ messages in thread
From: Avihai Horon @ 2025-08-18  6:44 UTC (permalink / raw)
  To: Kunkun Jiang, Cédric Le Goater, Alex Williamson,
	Yishai Hadas
  Cc: open list:All patches CC here, wanghaibin.wang, Zenghui Yu
On 13/08/2025 15:18, Kunkun Jiang wrote:
> External email: Use caution opening links or attachments
>
>
> Hi Avihai,
>
> On 2025/8/12 22:08, Avihai Horon wrote:
>>
>> On 11/08/2025 19:34, Cédric Le Goater wrote:
>>> External email: Use caution opening links or attachments
>>>
>>>
>>> Hello,
>>>
>>> + Avihai
>>>
>>> On 8/11/25 18:02, Kunkun Jiang wrote:
>>>> Hi all,
>>>>
>>>> While testing VFIO migration, I encountered an corner scenario case:
>>>> VFIO migration will not be aborted when the vfio device of dst-vm
>>>> fails to transition from RESUMING to RUNNING state in
>>>> vfio_vmstate_change.
>>>>
>>>> I saw the comments in the vfio_vmstate_change but I don't understand
>>>> why no action is taken for this situation.
>>>
>>> There is error handling in vfio_vmstate_change() :
>>>
>>>         /*
>>>          * Migration should be aborted in this case, but
>>> vm_state_notify()
>>>          * currently does not support reporting failures.
>>>          */
>>>         migration_file_set_error(ret, local_err);
>>
>> Hmm, I think this only sets the error on src. On dst we don't have
>> MigrationState->to_dst_file, so we end up just reporting the erro > 
>> But even if we did set it, no one is checking if there is a migration
>> error after vm_start() is called in process_incoming_migration_bh().
> Yes, that's what I mean.
>>
>>>
>>>> Allowing the live migration process to continue could cause
>>>> unrecoverable damage to the VM.
>>
>> What do you mean by unrecoverable damage to the VM?
>> If RESUMING->RUNNING transition fails, would a VFIO reset recover the
>> device and allow the VM to continue operation with damage limited only
>> to the VFIO device?
> I didn't express myself clearly, let me explain again.
> In my test, I ran a task inside the VM that was using this vfio device.
> In this corner scenario, the task is aborted immediately and the VM is
> still running normally. I mean it is unacceptable that the task is
> aborted directly.
Ah I understand.
>>
>>>> In this case, can we directly exit the dst-vm? Through the
>>>> return-path mechanism, the src-vm can continue to run.
>>>>
>>>> Looking forward to your reply.
>>>
>> The straightforward solution, as you suggested, is to exit dst upon
>> error in RESUMING->RUNNING transition and notify about it to src through
>> the return-path.
>> However, I am not sure if failing the migration after vm_start() on dst
>> is a bit late (as we start vCPUs and do migration_block_activate, etc.).
> I think vCPUs have not started yet. vCPUs will be started only after the
> all callbacks of vm_state_notify are executed.
Yes, in that case need to refactor vm_prepare_start/vm_state_notify to 
handle errors. May be a bit involved, need to further look into that.
Thanks.
> If there are any problems with my analysis, please point them out.
>>
>> But I can think of another way to solve this, hopefully simpler.
>> According to VFIO migration uAPI [1]:
>>   * RESUMING -> STOP
>>   *   Leaving RESUMING terminates a data transfer session and 
>> indicates the
>>   *   device should complete processing of the data delivered by
>> write(). The
>>   *   kernel migration driver should complete the incorporation of data
>> written
>>   *   to the data transfer FD into the device internal state and perform
>>   *   final validity and consistency checking of the new device state.
>> If the
>>   *   user provided data is found to be incomplete, inconsistent, or
>> otherwise
>>   *   invalid, the migration driver must fail the SET_STATE ioctl and
>>   *   optionally go to the ERROR state as described below.
>>
>> So, IIUC, we can add an explicit RESUMING->STOP transition [2] after the
>> device config is loaded (which is the last data the device is expected
>> to receive).
>> If this transition fails, it means something was wrong with migration,
>> and we can send src an error msg via return-path (and not continue to
>> vm_start()).
>>
>> Maybe this approach is less complicated than the first one, and it will
>> also work if src VM was paused prior migration.
>> I already tested some POC and it seems to be working (at least with an
>> artificial error i injected in RESUMING->STOP transition).
>> Kunkun, can you apply the following diff [3] and check if this solves
>> the issue?
> Ok, I'll try it.
>>
>> And in general, what do you think? Should we go with this approach or do
>> you have other ideas?
> My current approach is rather crude. As I said above, I think vcpu has
> not started yet, so I changed it like this.
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 4c06e3db93..70ccb706c6 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -878,7 +878,7 @@ static void vfio_vmstate_change_prepare(void
> *opaque, bool running,
>  static void vfio_vmstate_change(void *opaque, bool running, RunState
> state)
>  {
>      VFIODevice *vbasedev = opaque;
> -    enum vfio_device_mig_state new_state;
> +    enum vfio_device_mig_state new_state, pre_state;
>      Error *local_err = NULL;
>      int ret;
>
> @@ -899,6 +899,10 @@ static void vfio_vmstate_change(void *opaque, bool
> running, RunState state)
>           * currently does not support reporting failures.
>           */
>          migration_file_set_error(ret, local_err);
> +
> +        if (pre_state == VFIO_DEVICE_STATE_RESUMING) {
> +            exit(EXIT_FAILURE);
> +        }
>      }
>>
>> Thanks.
>>
>> [1]
>> https://elixir.bootlin.com/linux/v6.16/source/include/uapi/linux/vfio.h#L1099 
>>
>>
>> [2] Today RESUMING->STOP is done implicitly by the VFIO driver as part
>> of RESUMING->RUNNING transition.
>> [3]
>>
>> diff --git a/hw/vfio/migration-multifd.c b/hw/vfio/migration-multifd.c
>> index e4785031a7..66f8461f02 100644
>> --- a/hw/vfio/migration-multifd.c
>> +++ b/hw/vfio/migration-multifd.c
>> @@ -267,6 +267,12 @@ static bool
>> vfio_load_bufs_thread_load_config(VFIODevice *vbasedev,
>>       ret = vfio_load_device_config_state(f_in, vbasedev);
>>       bql_unlock();
>>
>> +    ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP,
>> +                                   VFIO_DEVICE_STATE_ERROR, errp);
>> +    if (ret) {
>> +        return false;
>> +    }
>> +
>>       if (ret < 0) {
>>           error_setg(errp, "%s: vfio_load_device_config_state() failed:
>> %d",
>>                      vbasedev->name, ret);
>>
>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>> index 4c06e3db93..a707d17a5b 100644
>> --- a/hw/vfio/migration.c
>> +++ b/hw/vfio/migration.c
>> @@ -737,6 +737,8 @@ static int vfio_load_state(QEMUFile *f, void
>> *opaque, int version_id)
>>           switch (data) {
>>           case VFIO_MIG_FLAG_DEV_CONFIG_STATE:
>>           {
>> +            Error *local_err = NULL;
>> +
>>               if (vfio_multifd_transfer_enabled(vbasedev)) {
>>                   error_report("%s: got DEV_CONFIG_STATE in main
>> migration "
>>                                "channel but doing multifd transfer",
>> @@ -744,7 +746,19 @@ static int vfio_load_state(QEMUFile *f, void
>> *opaque, int version_id)
>>                   return -EINVAL;
>>               }
>>
>> -            return vfio_load_device_config_state(f, opaque);
>> +            ret = vfio_load_device_config_state(f, opaque);
>> +            if (ret) {
>> +                return ret;
>> +            }
>> +
>> +            ret = vfio_migration_set_state_or_reset(
>> +                vbasedev, VFIO_DEVICE_STATE_STOP, &local_err);
>> +            if (ret) {
>> +                error_report_err(local_err);
>> +                return ret;
>> +            }
>> +
>> +            return 0;
>>           }
>>           case VFIO_MIG_FLAG_DEV_SETUP_STATE:
>>           {
>>
>> diff --git a/migration/migration.c b/migration/migration.c
>> index 10c216d25d..fd498c864d 100644
>> --- a/migration/migration.c
>> +++ b/migration/migration.c
>> @@ -91,6 +91,7 @@ enum mig_rp_message_type {
>>       MIG_RP_MSG_RECV_BITMAP,  /* send recved_bitmap back to source */
>>       MIG_RP_MSG_RESUME_ACK,   /* tell source that we are ready to
>> resume */
>>       MIG_RP_MSG_SWITCHOVER_ACK, /* Tell source it's OK to do 
>> switchover */
>> +    MIG_RP_MSG_ERROR, /* Tell source that destination encountered an
>> error */
>>
>>       MIG_RP_MSG_MAX
>>   };
>> @@ -884,6 +885,11 @@ process_incoming_migration_co(void *opaque)
>>       ret = qemu_loadvm_state(mis->from_src_file);
>>       mis->loadvm_co = NULL;
>>
>> +    if (ret) {
>> +        migrate_send_rp_error(mis);
>> +        error_report("SENT RP ERROR");
>> +    }
>> +
>> trace_vmstate_downtime_checkpoint("dst-precopy-loadvm-completed");
>>
>>       ps = postcopy_state_get();
>> @@ -1126,6 +1132,11 @@ bool migration_has_all_channels(void)
>>       return true;
>>   }
>> +int migrate_send_rp_error(MigrationIncomingState *mis)
>> +{
>> +    return migrate_send_rp_message(mis, MIG_RP_MSG_ERROR, 0, NULL);
>> +}
>> +
>>   int migrate_send_rp_switchover_ack(MigrationIncomingState *mis)
>>   {
>>       return migrate_send_rp_message(mis, MIG_RP_MSG_SWITCHOVER_ACK, 0,
>> NULL);
>> @@ -2614,6 +2625,10 @@ static void *source_return_path_thread(void 
>> *opaque)
>> trace_source_return_path_thread_switchover_acked();
>>               break;
>>
>> +        case MIG_RP_MSG_ERROR:
>> +            error_setg(&err, "DST indicated error");
>> +            goto out;
>> +
>>           default:
>>               break;
>>           }
>> diff --git a/migration/migration.h b/migration/migration.h
>> index 01329bf824..f11ff7a199 100644
>> --- a/migration/migration.h
>> +++ b/migration/migration.h
>> @@ -553,6 +553,7 @@ void
>> migrate_send_rp_recv_bitmap(MigrationIncomingState *mis,
>>                                    char *block_name);
>>   void migrate_send_rp_resume_ack(MigrationIncomingState *mis, uint32_t
>> value);
>>   int migrate_send_rp_switchover_ack(MigrationIncomingState *mis);
>> +int migrate_send_rp_error(MigrationIncomingState *mis);
>>
>>   void dirty_bitmap_mig_before_vm_start(void);
>>   void dirty_bitmap_mig_cancel_outgoing(void);
>>
>>> I suggest you open an issue on :
>>>
>>> https://gitlab.com/qemu-project/qemu/-/issues/
>>>
>>> with a detailed description of your environment :
>>>
>>>   Host HW, Host OS, QEMU version, QEMU command line, Guest OS, etc.
>>>
>>> A template is provided when a new issue is created.
>>>
>>>
>>> Thanks,
>>>
>>> C.
>>>
>>>
>>>
>> .
^ permalink raw reply	[flat|nested] 11+ messages in thread