* [PATCH] Improve error propagation via return path
@ 2025-10-21 7:52 Dhruv Choudhary
2025-10-21 14:34 ` Peter Xu
0 siblings, 1 reply; 7+ messages in thread
From: Dhruv Choudhary @ 2025-10-21 7:52 UTC (permalink / raw)
To: Peter Xu, Fabiano Rosas, qemu-devel; +Cc: Dhruv Choudhary
Use the return-path thread to send error details from the
destination to the source on a migration failure. Management
applications can then query the source QEMU for errors, as
the single source of truth, making failures easy to trace.
Signed-off-by: Dhruv Choudhary <dhruv.choudhary@nutanix.com>
---
migration/migration.c | 25 +++++++++++++++++++++++--
1 file changed, 23 insertions(+), 2 deletions(-)
diff --git a/migration/migration.c b/migration/migration.c
index a63b46bbef..123cffb286 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -87,6 +87,7 @@ enum mig_rp_message_type {
MIG_RP_MSG_RECV_BITMAP, /* send recved_bitmap back to source */
MIG_RP_MSG_RESUME_ACK, /* tell source that we are ready to resume */
MIG_RP_MSG_SWITCHOVER_ACK, /* Tell source it's OK to do switchover */
+ MIG_RP_MSG_ERROR, /* propogate error to source */
MIG_RP_MSG_MAX
};
@@ -608,6 +609,17 @@ int migrate_send_rp_req_pages(MigrationIncomingState *mis,
return migrate_send_rp_message_req_pages(mis, rb, start);
}
+static void migrate_send_rp_error(MigrationIncomingState *mis, Error *errp)
+{
+ const char *rpmsg = error_get_pretty(errp);
+ if (!mis->to_src_file) {
+ mis->to_src_file = qemu_file_get_return_path(mis->from_src_file);
+ }
+ migrate_send_rp_message(mis, MIG_RP_MSG_ERROR,
+ (uint16_t)(strlen(rpmsg) + 1),
+ (char *)rpmsg);
+}
+
static bool migration_colo_enabled;
bool migration_incoming_colo_enabled(void)
{
@@ -905,8 +917,12 @@ process_incoming_migration_co(void *opaque)
}
if (ret < 0) {
- error_prepend(&local_err, "load of migration failed: %s: ",
- strerror(-ret));
+ error_prepend(&local_err, "destination error : load of migration failed:
+ %s: ", strerror(-ret));
+ /* Check if return path is enabled and then send error to source */
+ if (migrate_postcopy_ram() || migrate_return_path()) {
+ migrate_send_rp_error(mis, local_err);
+ }
goto fail;
}
@@ -2437,6 +2453,7 @@ static struct rp_cmd_args {
[MIG_RP_MSG_RECV_BITMAP] = { .len = -1, .name = "RECV_BITMAP" },
[MIG_RP_MSG_RESUME_ACK] = { .len = 4, .name = "RESUME_ACK" },
[MIG_RP_MSG_SWITCHOVER_ACK] = { .len = 0, .name = "SWITCHOVER_ACK" },
+ [MIG_RP_MSG_ERROR] = { .len = -1, .name = "ERROR"},
[MIG_RP_MSG_MAX] = { .len = -1, .name = "MAX" },
};
@@ -2667,6 +2684,10 @@ static void *source_return_path_thread(void *opaque)
trace_source_return_path_thread_switchover_acked();
break;
+ case MIG_RP_MSG_ERROR:
+ error_setg(&err, "%s", (char *)buf);
+ goto out;
+
default:
break;
}
--
2.39.3
^ permalink raw reply related [flat|nested] 7+ messages in thread
* Re: [PATCH] Improve error propagation via return path
2025-10-21 7:52 [PATCH] Improve error propagation via return path Dhruv Choudhary
@ 2025-10-21 14:34 ` Peter Xu
2025-10-21 14:54 ` Vladimir Sementsov-Ogievskiy
0 siblings, 1 reply; 7+ messages in thread
From: Peter Xu @ 2025-10-21 14:34 UTC (permalink / raw)
To: Dhruv Choudhary
Cc: Fabiano Rosas, qemu-devel, Vladimir Sementsov-Ogievskiy,
Daniel P. Berrangé
On Tue, Oct 21, 2025 at 07:52:53AM +0000, Dhruv Choudhary wrote:
> Use the return-path thread to send error details from the
> destination to the source on a migration failure. Management
> applications can then query the source QEMU for errors, as
> the single source of truth, making failures easy to trace.
>
> Signed-off-by: Dhruv Choudhary <dhruv.choudhary@nutanix.com>
+Vladimir, Dan
IIUC we may still need to know whether the src QEMU supports this message
or not.
OTOH, we have introduced exit-on-error since 9.1:
# @exit-on-error: Exit on incoming migration failure. Default true.
# When set to false, the failure triggers a :qapi:event:`MIGRATION`
# event, and error details could be retrieved with `query-migrate`.
# (since 9.1)
This patch is going the other way. That feature suggests the mgmt query
the error from dest directly.
We should stick with one plan rather than doing both.
Thanks,
> ---
> migration/migration.c | 25 +++++++++++++++++++++++--
> 1 file changed, 23 insertions(+), 2 deletions(-)
>
> diff --git a/migration/migration.c b/migration/migration.c
> index a63b46bbef..123cffb286 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -87,6 +87,7 @@ enum mig_rp_message_type {
> MIG_RP_MSG_RECV_BITMAP, /* send recved_bitmap back to source */
> MIG_RP_MSG_RESUME_ACK, /* tell source that we are ready to resume */
> MIG_RP_MSG_SWITCHOVER_ACK, /* Tell source it's OK to do switchover */
> + MIG_RP_MSG_ERROR, /* propogate error to source */
>
> MIG_RP_MSG_MAX
> };
> @@ -608,6 +609,17 @@ int migrate_send_rp_req_pages(MigrationIncomingState *mis,
> return migrate_send_rp_message_req_pages(mis, rb, start);
> }
>
> +static void migrate_send_rp_error(MigrationIncomingState *mis, Error *errp)
> +{
> + const char *rpmsg = error_get_pretty(errp);
> + if (!mis->to_src_file) {
> + mis->to_src_file = qemu_file_get_return_path(mis->from_src_file);
> + }
> + migrate_send_rp_message(mis, MIG_RP_MSG_ERROR,
> + (uint16_t)(strlen(rpmsg) + 1),
> + (char *)rpmsg);
> +}
> +
> static bool migration_colo_enabled;
> bool migration_incoming_colo_enabled(void)
> {
> @@ -905,8 +917,12 @@ process_incoming_migration_co(void *opaque)
> }
>
> if (ret < 0) {
> - error_prepend(&local_err, "load of migration failed: %s: ",
> - strerror(-ret));
> + error_prepend(&local_err, "destination error : load of migration failed:
> + %s: ", strerror(-ret));
> + /* Check if return path is enabled and then send error to source */
> + if (migrate_postcopy_ram() || migrate_return_path()) {
> + migrate_send_rp_error(mis, local_err);
> + }
> goto fail;
> }
>
> @@ -2437,6 +2453,7 @@ static struct rp_cmd_args {
> [MIG_RP_MSG_RECV_BITMAP] = { .len = -1, .name = "RECV_BITMAP" },
> [MIG_RP_MSG_RESUME_ACK] = { .len = 4, .name = "RESUME_ACK" },
> [MIG_RP_MSG_SWITCHOVER_ACK] = { .len = 0, .name = "SWITCHOVER_ACK" },
> + [MIG_RP_MSG_ERROR] = { .len = -1, .name = "ERROR"},
> [MIG_RP_MSG_MAX] = { .len = -1, .name = "MAX" },
> };
>
> @@ -2667,6 +2684,10 @@ static void *source_return_path_thread(void *opaque)
> trace_source_return_path_thread_switchover_acked();
> break;
>
> + case MIG_RP_MSG_ERROR:
> + error_setg(&err, "%s", (char *)buf);
> + goto out;
> +
> default:
> break;
> }
> --
> 2.39.3
>
--
Peter Xu
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] Improve error propagation via return path
2025-10-21 14:34 ` Peter Xu
@ 2025-10-21 14:54 ` Vladimir Sementsov-Ogievskiy
2025-10-21 15:24 ` Peter Xu
0 siblings, 1 reply; 7+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2025-10-21 14:54 UTC (permalink / raw)
To: Peter Xu, Dhruv Choudhary
Cc: Fabiano Rosas, qemu-devel, Daniel P. Berrangé
On 21.10.25 17:34, Peter Xu wrote:
> On Tue, Oct 21, 2025 at 07:52:53AM +0000, Dhruv Choudhary wrote:
>> Use the return-path thread to send error details from the
>> destination to the source on a migration failure. Management
>> applications can then query the source QEMU for errors, as
>> the single source of truth, making failures easy to trace.
>>
>> Signed-off-by: Dhruv Choudhary <dhruv.choudhary@nutanix.com>
>
> +Vladimir, Dan
>
> IIUC we may still need to know whether the src QEMU supports this message
> or not.
>
> OTOH, we have introduced exit-on-error since 9.1:
>
> # @exit-on-error: Exit on incoming migration failure. Default true.
> # When set to false, the failure triggers a :qapi:event:`MIGRATION`
> # event, and error details could be retrieved with `query-migrate`.
> # (since 9.1)
>
> This patch is going the other way. That feature suggests the mgmt query
> the error from dest directly.
>
> We should stick with one plan rather than doing both.
>
Why?
exit-on-error=false is good anyway: when QMP connection is established, the
management of target QEMU process is the same: we do call qmp commands to
add devices, etc. We get QMP events. Actually, exiting is unexpected, better
to fit into QMP protocol, continuing to send events and wait for qmp quit
to exit.
Passing error back to the source simply improves error message on source,
which otherwise is often confusing.
Using both, we of course see same error in two places.. But we do have two
QEMU processes, which both handled by on-host managing services. We should
correctly report error on both parts anyway.
Improving error messages on source is just and improvement, which makes
current behavior better (with or without exit-on-error=false).
Removing exit-on-error=false semantics (with or without passing errors back)
would be a step backward, to violating of QMP protocol by unexpected exits.
>
>> ---
>> migration/migration.c | 25 +++++++++++++++++++++++--
>> 1 file changed, 23 insertions(+), 2 deletions(-)
>>
>> diff --git a/migration/migration.c b/migration/migration.c
>> index a63b46bbef..123cffb286 100644
>> --- a/migration/migration.c
>> +++ b/migration/migration.c
>> @@ -87,6 +87,7 @@ enum mig_rp_message_type {
>> MIG_RP_MSG_RECV_BITMAP, /* send recved_bitmap back to source */
>> MIG_RP_MSG_RESUME_ACK, /* tell source that we are ready to resume */
>> MIG_RP_MSG_SWITCHOVER_ACK, /* Tell source it's OK to do switchover */
>> + MIG_RP_MSG_ERROR, /* propogate error to source */
>>
>> MIG_RP_MSG_MAX
>> };
>> @@ -608,6 +609,17 @@ int migrate_send_rp_req_pages(MigrationIncomingState *mis,
>> return migrate_send_rp_message_req_pages(mis, rb, start);
>> }
>>
>> +static void migrate_send_rp_error(MigrationIncomingState *mis, Error *errp)
>> +{
>> + const char *rpmsg = error_get_pretty(errp);
>> + if (!mis->to_src_file) {
>> + mis->to_src_file = qemu_file_get_return_path(mis->from_src_file);
>> + }
>> + migrate_send_rp_message(mis, MIG_RP_MSG_ERROR,
>> + (uint16_t)(strlen(rpmsg) + 1),
>> + (char *)rpmsg);
>> +}
>> +
>> static bool migration_colo_enabled;
>> bool migration_incoming_colo_enabled(void)
>> {
>> @@ -905,8 +917,12 @@ process_incoming_migration_co(void *opaque)
>> }
>>
>> if (ret < 0) {
>> - error_prepend(&local_err, "load of migration failed: %s: ",
>> - strerror(-ret));
>> + error_prepend(&local_err, "destination error : load of migration failed:
>> + %s: ", strerror(-ret));
>> + /* Check if return path is enabled and then send error to source */
>> + if (migrate_postcopy_ram() || migrate_return_path()) {
>> + migrate_send_rp_error(mis, local_err);
>> + }
>> goto fail;
>> }
>>
>> @@ -2437,6 +2453,7 @@ static struct rp_cmd_args {
>> [MIG_RP_MSG_RECV_BITMAP] = { .len = -1, .name = "RECV_BITMAP" },
>> [MIG_RP_MSG_RESUME_ACK] = { .len = 4, .name = "RESUME_ACK" },
>> [MIG_RP_MSG_SWITCHOVER_ACK] = { .len = 0, .name = "SWITCHOVER_ACK" },
>> + [MIG_RP_MSG_ERROR] = { .len = -1, .name = "ERROR"},
>> [MIG_RP_MSG_MAX] = { .len = -1, .name = "MAX" },
>> };
>>
>> @@ -2667,6 +2684,10 @@ static void *source_return_path_thread(void *opaque)
>> trace_source_return_path_thread_switchover_acked();
>> break;
>>
>> + case MIG_RP_MSG_ERROR:
>> + error_setg(&err, "%s", (char *)buf);
>> + goto out;
>> +
>> default:
>> break;
>> }
>> --
>> 2.39.3
>>
>
--
Best regards,
Vladimir
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] Improve error propagation via return path
2025-10-21 14:54 ` Vladimir Sementsov-Ogievskiy
@ 2025-10-21 15:24 ` Peter Xu
2025-10-21 20:31 ` Fabiano Rosas
0 siblings, 1 reply; 7+ messages in thread
From: Peter Xu @ 2025-10-21 15:24 UTC (permalink / raw)
To: Vladimir Sementsov-Ogievskiy
Cc: Dhruv Choudhary, Fabiano Rosas, qemu-devel,
Daniel P. Berrangé
On Tue, Oct 21, 2025 at 05:54:09PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> On 21.10.25 17:34, Peter Xu wrote:
> > On Tue, Oct 21, 2025 at 07:52:53AM +0000, Dhruv Choudhary wrote:
> > > Use the return-path thread to send error details from the
> > > destination to the source on a migration failure. Management
> > > applications can then query the source QEMU for errors, as
> > > the single source of truth, making failures easy to trace.
> > >
> > > Signed-off-by: Dhruv Choudhary <dhruv.choudhary@nutanix.com>
> >
> > +Vladimir, Dan
> >
> > IIUC we may still need to know whether the src QEMU supports this message
> > or not.
> >
> > OTOH, we have introduced exit-on-error since 9.1:
> >
> > # @exit-on-error: Exit on incoming migration failure. Default true.
> > # When set to false, the failure triggers a :qapi:event:`MIGRATION`
> > # event, and error details could be retrieved with `query-migrate`.
> > # (since 9.1)
> >
> > This patch is going the other way. That feature suggests the mgmt query
> > the error from dest directly.
> >
> > We should stick with one plan rather than doing both.
> >
>
> Why?
>
> exit-on-error=false is good anyway: when QMP connection is established, the
> management of target QEMU process is the same: we do call qmp commands to
> add devices, etc. We get QMP events. Actually, exiting is unexpected, better
> to fit into QMP protocol, continuing to send events and wait for qmp quit
> to exit.
>
> Passing error back to the source simply improves error message on source,
> which otherwise is often confusing.
>
> Using both, we of course see same error in two places.. But we do have two
> QEMU processes, which both handled by on-host managing services. We should
> correctly report error on both parts anyway.
>
> Improving error messages on source is just and improvement, which makes
> current behavior better (with or without exit-on-error=false).
>
> Removing exit-on-error=false semantics (with or without passing errors back)
> would be a step backward, to violating of QMP protocol by unexpected exits.
I didn't mean to propose removing exit-on-error, what I meant is when with
it this patch doesn't look like helpful.
Has libvirt been integrated with exit-on-error? If so, IMHO we don't need
this patch anymore. To me it's not an improvement when with exit-on-error,
because duplicating the error from dest to src makes it harder to know
where the error happened.
--
Peter Xu
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] Improve error propagation via return path
2025-10-21 15:24 ` Peter Xu
@ 2025-10-21 20:31 ` Fabiano Rosas
2025-10-21 21:18 ` Peter Xu
2025-10-22 8:32 ` Daniel P. Berrangé
0 siblings, 2 replies; 7+ messages in thread
From: Fabiano Rosas @ 2025-10-21 20:31 UTC (permalink / raw)
To: Peter Xu, Vladimir Sementsov-Ogievskiy
Cc: Dhruv Choudhary, qemu-devel, Daniel P. Berrangé
Peter Xu <peterx@redhat.com> writes:
> On Tue, Oct 21, 2025 at 05:54:09PM +0300, Vladimir Sementsov-Ogievskiy wrote:
>> On 21.10.25 17:34, Peter Xu wrote:
>> > On Tue, Oct 21, 2025 at 07:52:53AM +0000, Dhruv Choudhary wrote:
>> > > Use the return-path thread to send error details from the
>> > > destination to the source on a migration failure. Management
>> > > applications can then query the source QEMU for errors, as
>> > > the single source of truth, making failures easy to trace.
>> > >
>> > > Signed-off-by: Dhruv Choudhary <dhruv.choudhary@nutanix.com>
>> >
>> > +Vladimir, Dan
>> >
>> > IIUC we may still need to know whether the src QEMU supports this message
>> > or not.
>> >
>> > OTOH, we have introduced exit-on-error since 9.1:
>> >
>> > # @exit-on-error: Exit on incoming migration failure. Default true.
>> > # When set to false, the failure triggers a :qapi:event:`MIGRATION`
>> > # event, and error details could be retrieved with `query-migrate`.
>> > # (since 9.1)
>> >
>> > This patch is going the other way. That feature suggests the mgmt query
>> > the error from dest directly.
>> >
>> > We should stick with one plan rather than doing both.
>> >
>>
>> Why?
>>
>> exit-on-error=false is good anyway: when QMP connection is established, the
>> management of target QEMU process is the same: we do call qmp commands to
>> add devices, etc. We get QMP events. Actually, exiting is unexpected, better
>> to fit into QMP protocol, continuing to send events and wait for qmp quit
>> to exit.
>>
>> Passing error back to the source simply improves error message on source,
>> which otherwise is often confusing.
>>
>> Using both, we of course see same error in two places.. But we do have two
>> QEMU processes, which both handled by on-host managing services. We should
>> correctly report error on both parts anyway.
>>
>> Improving error messages on source is just and improvement, which makes
>> current behavior better (with or without exit-on-error=false).
>>
>> Removing exit-on-error=false semantics (with or without passing errors back)
>> would be a step backward, to violating of QMP protocol by unexpected exits.
>
> I didn't mean to propose removing exit-on-error, what I meant is when with
> it this patch doesn't look like helpful.
>
> Has libvirt been integrated with exit-on-error? If so, IMHO we don't need
> this patch anymore. To me it's not an improvement when with exit-on-error,
> because duplicating the error from dest to src makes it harder to know
> where the error happened.
Yeah, this does introduce some complexity of the "whose error is this?"
kind. I can imagine future users of migrate_has_error() having to handle
the error differently whether it came from this machine or the remote
one. Maybe with current code there's no issue, but we need to think from
a design perspective. Another point is whether the source machine is
always prepared to see an error that has nothing to do with its own
operation as it usually gets to know about a destination error only when
TCP connections start to fail.
That said, from a usability perspective, I'm in favor of having the
source machine be able to inform the user about the destination
machine's error. It goes in the direction of relying less on the
management layer, which we already agree might be a good idea.
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] Improve error propagation via return path
2025-10-21 20:31 ` Fabiano Rosas
@ 2025-10-21 21:18 ` Peter Xu
2025-10-22 8:32 ` Daniel P. Berrangé
1 sibling, 0 replies; 7+ messages in thread
From: Peter Xu @ 2025-10-21 21:18 UTC (permalink / raw)
To: Fabiano Rosas
Cc: Vladimir Sementsov-Ogievskiy, Dhruv Choudhary, qemu-devel,
Daniel P. Berrangé
On Tue, Oct 21, 2025 at 05:31:19PM -0300, Fabiano Rosas wrote:
> Peter Xu <peterx@redhat.com> writes:
>
> > On Tue, Oct 21, 2025 at 05:54:09PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> >> On 21.10.25 17:34, Peter Xu wrote:
> >> > On Tue, Oct 21, 2025 at 07:52:53AM +0000, Dhruv Choudhary wrote:
> >> > > Use the return-path thread to send error details from the
> >> > > destination to the source on a migration failure. Management
> >> > > applications can then query the source QEMU for errors, as
> >> > > the single source of truth, making failures easy to trace.
> >> > >
> >> > > Signed-off-by: Dhruv Choudhary <dhruv.choudhary@nutanix.com>
> >> >
> >> > +Vladimir, Dan
> >> >
> >> > IIUC we may still need to know whether the src QEMU supports this message
> >> > or not.
> >> >
> >> > OTOH, we have introduced exit-on-error since 9.1:
> >> >
> >> > # @exit-on-error: Exit on incoming migration failure. Default true.
> >> > # When set to false, the failure triggers a :qapi:event:`MIGRATION`
> >> > # event, and error details could be retrieved with `query-migrate`.
> >> > # (since 9.1)
> >> >
> >> > This patch is going the other way. That feature suggests the mgmt query
> >> > the error from dest directly.
> >> >
> >> > We should stick with one plan rather than doing both.
> >> >
> >>
> >> Why?
> >>
> >> exit-on-error=false is good anyway: when QMP connection is established, the
> >> management of target QEMU process is the same: we do call qmp commands to
> >> add devices, etc. We get QMP events. Actually, exiting is unexpected, better
> >> to fit into QMP protocol, continuing to send events and wait for qmp quit
> >> to exit.
> >>
> >> Passing error back to the source simply improves error message on source,
> >> which otherwise is often confusing.
> >>
> >> Using both, we of course see same error in two places.. But we do have two
> >> QEMU processes, which both handled by on-host managing services. We should
> >> correctly report error on both parts anyway.
> >>
> >> Improving error messages on source is just and improvement, which makes
> >> current behavior better (with or without exit-on-error=false).
> >>
> >> Removing exit-on-error=false semantics (with or without passing errors back)
> >> would be a step backward, to violating of QMP protocol by unexpected exits.
> >
> > I didn't mean to propose removing exit-on-error, what I meant is when with
> > it this patch doesn't look like helpful.
> >
> > Has libvirt been integrated with exit-on-error? If so, IMHO we don't need
> > this patch anymore. To me it's not an improvement when with exit-on-error,
> > because duplicating the error from dest to src makes it harder to know
> > where the error happened.
>
> Yeah, this does introduce some complexity of the "whose error is this?"
> kind. I can imagine future users of migrate_has_error() having to handle
> the error differently whether it came from this machine or the remote
> one. Maybe with current code there's no issue, but we need to think from
> a design perspective. Another point is whether the source machine is
> always prepared to see an error that has nothing to do with its own
> operation as it usually gets to know about a destination error only when
> TCP connections start to fail.
>
> That said, from a usability perspective, I'm in favor of having the
> source machine be able to inform the user about the destination
> machine's error. It goes in the direction of relying less on the
> management layer, which we already agree might be a good idea.
Yes, but when the mgmt already has exit-on-error and be able to pull errors
on both sides, it'll be a problem because mgmt will start to see two errors.
But indeed even before this patch, mgmt can already see two errors.. the
src error being an IO failure. I wonder how libvirt decides which error to
use, and whether some patch like this one would confuse mgmt (consider mgmt
now ignores IO errors when the other error reports something meaningful,
but then mgmt starts to see both errors are not IO errors).
One thing we could do (without worrying about disturbing mgmt) might be we
store the dest error separately on src.. Then we expose it separately in
qeury-migrate too (then mgmt / HMP can decide what error to show).
--
Peter Xu
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] Improve error propagation via return path
2025-10-21 20:31 ` Fabiano Rosas
2025-10-21 21:18 ` Peter Xu
@ 2025-10-22 8:32 ` Daniel P. Berrangé
1 sibling, 0 replies; 7+ messages in thread
From: Daniel P. Berrangé @ 2025-10-22 8:32 UTC (permalink / raw)
To: Fabiano Rosas
Cc: Peter Xu, Vladimir Sementsov-Ogievskiy, Dhruv Choudhary,
qemu-devel
On Tue, Oct 21, 2025 at 05:31:19PM -0300, Fabiano Rosas wrote:
> Peter Xu <peterx@redhat.com> writes:
>
> > On Tue, Oct 21, 2025 at 05:54:09PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> >> On 21.10.25 17:34, Peter Xu wrote:
> >> > On Tue, Oct 21, 2025 at 07:52:53AM +0000, Dhruv Choudhary wrote:
> >> > > Use the return-path thread to send error details from the
> >> > > destination to the source on a migration failure. Management
> >> > > applications can then query the source QEMU for errors, as
> >> > > the single source of truth, making failures easy to trace.
> >> > >
> >> > > Signed-off-by: Dhruv Choudhary <dhruv.choudhary@nutanix.com>
> >> >
> >> > +Vladimir, Dan
> >> >
> >> > IIUC we may still need to know whether the src QEMU supports this message
> >> > or not.
> >> >
> >> > OTOH, we have introduced exit-on-error since 9.1:
> >> >
> >> > # @exit-on-error: Exit on incoming migration failure. Default true.
> >> > # When set to false, the failure triggers a :qapi:event:`MIGRATION`
> >> > # event, and error details could be retrieved with `query-migrate`.
> >> > # (since 9.1)
> >> >
> >> > This patch is going the other way. That feature suggests the mgmt query
> >> > the error from dest directly.
> >> >
> >> > We should stick with one plan rather than doing both.
> >> >
> >>
> >> Why?
> >>
> >> exit-on-error=false is good anyway: when QMP connection is established, the
> >> management of target QEMU process is the same: we do call qmp commands to
> >> add devices, etc. We get QMP events. Actually, exiting is unexpected, better
> >> to fit into QMP protocol, continuing to send events and wait for qmp quit
> >> to exit.
> >>
> >> Passing error back to the source simply improves error message on source,
> >> which otherwise is often confusing.
> >>
> >> Using both, we of course see same error in two places.. But we do have two
> >> QEMU processes, which both handled by on-host managing services. We should
> >> correctly report error on both parts anyway.
> >>
> >> Improving error messages on source is just and improvement, which makes
> >> current behavior better (with or without exit-on-error=false).
> >>
> >> Removing exit-on-error=false semantics (with or without passing errors back)
> >> would be a step backward, to violating of QMP protocol by unexpected exits.
> >
> > I didn't mean to propose removing exit-on-error, what I meant is when with
> > it this patch doesn't look like helpful.
> >
> > Has libvirt been integrated with exit-on-error? If so, IMHO we don't need
> > this patch anymore. To me it's not an improvement when with exit-on-error,
> > because duplicating the error from dest to src makes it harder to know
> > where the error happened.
>
> Yeah, this does introduce some complexity of the "whose error is this?"
> kind. I can imagine future users of migrate_has_error() having to handle
> the error differently whether it came from this machine or the remote
> one. Maybe with current code there's no issue, but we need to think from
> a design perspective. Another point is whether the source machine is
> always prepared to see an error that has nothing to do with its own
> operation as it usually gets to know about a destination error only when
> TCP connections start to fail.
>
> That said, from a usability perspective, I'm in favor of having the
> source machine be able to inform the user about the destination
> machine's error. It goes in the direction of relying less on the
> management layer, which we already agree might be a good idea.
Should we neccessarily assume that target machine's error is the "best"
error message ? Failures can result in errors being raised on both the
source and dest, and it is not clearcut which side will have the root
cause error, and which will just have a side effect error. If we pass
the target error back to the source, we need to ensure that we don't
replace a better error that the source already has.
Allowing use of 'query-migrate' to fetch errors on both the source and
dest means mgmt apps have both errors available, but that does then
mean the mgmt app needs to decide which error is "best".
With regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2025-10-22 8:33 UTC | newest]
Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-10-21 7:52 [PATCH] Improve error propagation via return path Dhruv Choudhary
2025-10-21 14:34 ` Peter Xu
2025-10-21 14:54 ` Vladimir Sementsov-Ogievskiy
2025-10-21 15:24 ` Peter Xu
2025-10-21 20:31 ` Fabiano Rosas
2025-10-21 21:18 ` Peter Xu
2025-10-22 8:32 ` Daniel P. Berrangé
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).