qemu-devel.nongnu.org archive mirror
* [PATCH] migration: Fix state transition in postcopy_start() error handling
@ 2025-08-26 11:51 Juraj Marcin
  2025-08-26 18:23 ` Peter Xu
  2025-09-27 14:01 ` Michael Tokarev
  0 siblings, 2 replies; 5+ messages in thread
From: Juraj Marcin @ 2025-08-26 11:51 UTC (permalink / raw)
  To: qemu-devel; +Cc: Juraj Marcin, Fabiano Rosas, Peter Xu, qemu-stable

From: Juraj Marcin <jmarcin@redhat.com>

Commit 48814111366b ("migration: Always set DEVICE state") introduced
DEVICE state to postcopy, which moved the actual state transition that
leads to POSTCOPY_ACTIVE.

However, the error handling part of the postcopy_start() function still
expects the state POSTCOPY_ACTIVE. Depending on where an error happens,
the state can now be ACTIVE, DEVICE or CANCELLING, but never
POSTCOPY_ACTIVE, as this transition now happens just before a
successful return from the function.

Instead, accept any state except CANCELLING when transitioning to FAILED
state.

Cc: qemu-stable@nongnu.org
Fixes: 48814111366b ("migration: Always set DEVICE state")
Signed-off-by: Juraj Marcin <jmarcin@redhat.com>

---
In the RFC[1] where this patch was discussed, there was also a
suggestion for a helper function migrate_set_failure() that would check
if the state is not CANCELLING and then set migration error and FAILED
state. I discussed the implementation with Peter, and we came to a
conclusion that instead of patching such clean-up on top of the current
error handling code, it might be more useful to do a larger refactor and
clean-up of all error handling in the migration code.

Such clean-up should reduce the number of places where we need to
explicitly transition to a FAILED state (ideally to one, or only a
couple of places), and instead only set an appropriate migration error
using migrate_set_error(). Additionally, it would also refactor
inappropriate uses of QEMUFile errors where the error is not really an
error of the underlying channel and migrate_set_error() should be used
instead.

[1]: https://lore.kernel.org/all/20250807114922.1013286-3-jmarcin@redhat.com/
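
For illustration only, the helper suggested in the RFC could look
roughly like the standalone model below. It is a sketch under
assumptions, not the proposed implementation: ModelMigration, the ST_*
values and model_set_failure() are hypothetical stand-ins for
MigrationState, the MIGRATION_STATUS_* values and migrate_set_failure(),
the error is a plain string instead of an Error object, only the first
error is kept, and no locking or atomics are modelled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the migration states named above. */
typedef enum { ST_ACTIVE, ST_DEVICE, ST_CANCELLING, ST_FAILED } Status;

/* Hypothetical, heavily reduced stand-in for MigrationState. */
typedef struct {
    Status state;
    const char *error;   /* NULL until an error has been recorded */
} ModelMigration;

/*
 * Record a failure unless a cancel is in progress.
 * Returns 1 when the error and FAILED state were set, 0 when the call
 * was skipped because the state is CANCELLING.
 */
int model_set_failure(ModelMigration *ms, const char *err)
{
    assert(ms != NULL);

    if (ms->state == ST_CANCELLING) {
        return 0;            /* leave the cancel path in charge */
    }
    if (ms->error == NULL) {
        ms->error = err;     /* assumption: keep only the first error */
    }
    ms->state = ST_FAILED;
    return 1;
}
```

With such a helper, a fail path would call it once with the error at
hand instead of open-coding the CANCELLING check next to every
transition to FAILED.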
---
 migration/migration.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/migration/migration.c b/migration/migration.c
index 10c216d25d..32b8ce5613 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -2872,8 +2872,9 @@ static int postcopy_start(MigrationState *ms, Error **errp)
 fail_closefb:
     qemu_fclose(fb);
 fail:
-    migrate_set_state(&ms->state, MIGRATION_STATUS_POSTCOPY_ACTIVE,
-                          MIGRATION_STATUS_FAILED);
+    if (ms->state != MIGRATION_STATUS_CANCELLING) {
+        migrate_set_state(&ms->state, ms->state, MIGRATION_STATUS_FAILED);
+    }
     migration_block_activate(NULL);
     migration_call_notifiers(ms, MIG_EVENT_PRECOPY_FAILED, NULL);
     bql_unlock();
-- 
2.50.1
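
To illustrate why the old fail path had no effect, here is a minimal
standalone model. It assumes migrate_set_state() behaves like a
compare-and-swap, changing the state only when the current value equals
the expected old state. The ST_* values and the functions
model_set_state(), fail_path_old() and fail_path_new() are hypothetical
stand-ins written for this sketch, not QEMU code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the MIGRATION_STATUS_* values in the patch. */
typedef enum {
    ST_ACTIVE,
    ST_DEVICE,
    ST_POSTCOPY_ACTIVE,
    ST_CANCELLING,
    ST_FAILED,
    ST__MAX
} Status;

/*
 * Assumed model of migrate_set_state(): move *state to new_state only
 * if it currently equals old_state. Returns true if the state changed.
 */
bool model_set_state(_Atomic int *state, Status old_state, Status new_state)
{
    int expected = old_state;

    assert(new_state < ST__MAX);
    return atomic_compare_exchange_strong(state, &expected, new_state);
}

/* Pre-patch fail path: only transitions if the state is POSTCOPY_ACTIVE. */
Status fail_path_old(Status at_error)
{
    _Atomic int state;

    atomic_init(&state, at_error);
    model_set_state(&state, ST_POSTCOPY_ACTIVE, ST_FAILED);
    return (Status)atomic_load(&state);
}

/* Patched fail path: transitions from any state except CANCELLING. */
Status fail_path_new(Status at_error)
{
    _Atomic int state;

    atomic_init(&state, at_error);
    if (atomic_load(&state) != ST_CANCELLING) {
        model_set_state(&state, (Status)atomic_load(&state), ST_FAILED);
    }
    return (Status)atomic_load(&state);
}
```

In this model, an error that hits while the state is ACTIVE or DEVICE
leaves fail_path_old() in that same state, while fail_path_new() ends in
FAILED. A CANCELLING state is left untouched by both.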




* Re: [PATCH] migration: Fix state transition in postcopy_start() error handling
  2025-08-26 11:51 [PATCH] migration: Fix state transition in postcopy_start() error handling Juraj Marcin
@ 2025-08-26 18:23 ` Peter Xu
  2025-08-26 19:00   ` Fabiano Rosas
  2025-09-27 14:01 ` Michael Tokarev
  1 sibling, 1 reply; 5+ messages in thread
From: Peter Xu @ 2025-08-26 18:23 UTC (permalink / raw)
  To: Juraj Marcin; +Cc: qemu-devel, Fabiano Rosas, qemu-stable

On Tue, Aug 26, 2025 at 01:51:40PM +0200, Juraj Marcin wrote:
> From: Juraj Marcin <jmarcin@redhat.com>
> 
> Commit 48814111366b ("migration: Always set DEVICE state") introduced
> DEVICE state to postcopy, which moved the actual state transition that
> leads to POSTCOPY_ACTIVE.
> 
> However, the error handling part of the postcopy_start() function still
> expects the state POSTCOPY_ACTIVE, but depending on where an error
> happens, now the state can be either ACTIVE, DEVICE or CANCELLING, but
> never POSTCOPY_ACTIVE, as this transition now happens just before a
> successful return from the function.
> 
> Instead, accept any state except CANCELLING when transitioning to FAILED
> state.
> 
> Cc: qemu-stable@nongnu.org
> Fixes: 48814111366b ("migration: Always set DEVICE state")
> Signed-off-by: Juraj Marcin <jmarcin@redhat.com>

Thanks, Juraj!

Reviewed-by: Peter Xu <peterx@redhat.com>

> 
> ---
> In the RFC[1] where this patch was discussed, there was also a
> suggestion for a helper function migrate_set_failure() that would check
> if the state is not CANCELLING and then set migration error and FAILED
> state. I discussed the implementation with Peter, and we came to a
> conclusion that instead of patching such clean-up on top of the current
> error handling code, it might be more useful to do a larger refactor and
> clean-up of all error handling in the migration code.
> 
> Such clean-up should reduce the number of places where we need to
> explicitly transition to a FAILED state (ideally to one, or only a
> couple of places), and instead only set an appropriate migration error
> using migrate_set_error(). Additionally, it would also refactor
> inappropriate uses of QEMUFile errors where the error is not really an
> error of the underlying channel and migrate_set_error() should be used
> instead.

Fabiano: we discussed something around the FAILED status before as well.
If you started working on something in this area, please shoot!


-- 
Peter Xu




* Re: [PATCH] migration: Fix state transition in postcopy_start() error handling
  2025-08-26 18:23 ` Peter Xu
@ 2025-08-26 19:00   ` Fabiano Rosas
  0 siblings, 0 replies; 5+ messages in thread
From: Fabiano Rosas @ 2025-08-26 19:00 UTC (permalink / raw)
  To: Peter Xu, Juraj Marcin; +Cc: qemu-devel, qemu-stable

Peter Xu <peterx@redhat.com> writes:

> On Tue, Aug 26, 2025 at 01:51:40PM +0200, Juraj Marcin wrote:
>> From: Juraj Marcin <jmarcin@redhat.com>
>> 
>> Commit 48814111366b ("migration: Always set DEVICE state") introduced
>> DEVICE state to postcopy, which moved the actual state transition that
>> leads to POSTCOPY_ACTIVE.
>> 
>> However, the error handling part of the postcopy_start() function still
>> expects the state POSTCOPY_ACTIVE, but depending on where an error
>> happens, now the state can be either ACTIVE, DEVICE or CANCELLING, but
>> never POSTCOPY_ACTIVE, as this transition now happens just before a
>> successful return from the function.
>> 
>> Instead, accept any state except CANCELLING when transitioning to FAILED
>> state.
>> 
>> Cc: qemu-stable@nongnu.org
>> Fixes: 48814111366b ("migration: Always set DEVICE state")
>> Signed-off-by: Juraj Marcin <jmarcin@redhat.com>
>
> Thanks, Juraj!
>
> Reviewed-by: Peter Xu <peterx@redhat.com>
>

Reviewed-by: Fabiano Rosas <farosas@suse.de>

>> 
>> ---
>> In the RFC[1] where this patch was discussed, there was also a
>> suggestion for a helper function migrate_set_failure() that would check
>> if the state is not CANCELLING and then set migration error and FAILED
>> state. I discussed the implementation with Peter, and we came to a
>> conclusion that instead of patching such clean-up on top of the current
>> error handling code, it might be more useful to do a larger refactor and
>> clean-up of all error handling in the migration code.
>> 
>> Such clean-up should reduce the number of places where we need to
>> explicitly transition to a FAILED state (ideally to one, or only a
>> couple of places), and instead only set an appropriate migration error
>> using migrate_set_error(). Additionally, it would also refactor
>> inappropriate uses of QEMUFile errors where the error is not really an
>> error of the underlying channel and migrate_set_error() should be used
>> instead.
>
> Fabiano: we discussed something around the FAILED status before as well.
> If you started working on something in this area, please shoot!
>

I don't have anything planned, it's just the thread that I already
linked in the previous version of this patch. Juraj is aware.




* Re: [PATCH] migration: Fix state transition in postcopy_start() error handling
  2025-08-26 11:51 [PATCH] migration: Fix state transition in postcopy_start() error handling Juraj Marcin
  2025-08-26 18:23 ` Peter Xu
@ 2025-09-27 14:01 ` Michael Tokarev
  2025-09-29 15:47   ` Peter Xu
  1 sibling, 1 reply; 5+ messages in thread
From: Michael Tokarev @ 2025-09-27 14:01 UTC (permalink / raw)
  To: Juraj Marcin, qemu-devel; +Cc: Fabiano Rosas, Peter Xu, qemu-stable

On 26.08.2025 14:51, Juraj Marcin wrote:
> From: Juraj Marcin <jmarcin@redhat.com>
> 
> Commit 48814111366b ("migration: Always set DEVICE state") introduced
> DEVICE state to postcopy, which moved the actual state transition that
> leads to POSTCOPY_ACTIVE.
> 
> However, the error handling part of the postcopy_start() function still
> expects the state POSTCOPY_ACTIVE, but depending on where an error
> happens, now the state can be either ACTIVE, DEVICE or CANCELLING, but
> never POSTCOPY_ACTIVE, as this transition now happens just before a
> successful return from the function.
> 
> Instead, accept any state except CANCELLING when transitioning to FAILED
> state.
> 
> Cc: qemu-stable@nongnu.org
> Fixes: 48814111366b ("migration: Always set DEVICE state")
> Signed-off-by: Juraj Marcin <jmarcin@redhat.com>

Ping?  Can we apply this to the master branch, so I can pick it up for
the stable series?

Thanks,

/mjt



* Re: [PATCH] migration: Fix state transition in postcopy_start() error handling
  2025-09-27 14:01 ` Michael Tokarev
@ 2025-09-29 15:47   ` Peter Xu
  0 siblings, 0 replies; 5+ messages in thread
From: Peter Xu @ 2025-09-29 15:47 UTC (permalink / raw)
  To: Michael Tokarev; +Cc: Juraj Marcin, qemu-devel, Fabiano Rosas, qemu-stable

On Sat, Sep 27, 2025 at 05:01:11PM +0300, Michael Tokarev wrote:
> 
> Ping?  Can we apply this to the master branch, so I can pick it up for
> the stable series?

Apologies for the delay, queued.  Will send the PR this week.

-- 
Peter Xu



