* [PATCH 0/2] migration: Fix return-path thread exit
@ 2024-02-01 18:48 Cédric Le Goater
2024-02-01 18:48 ` [PATCH 1/2] migration: Add a file_error argument to close_return_path_on_source() Cédric Le Goater
` (2 more replies)
0 siblings, 3 replies; 14+ messages in thread
From: Cédric Le Goater @ 2024-02-01 18:48 UTC (permalink / raw)
To: qemu-devel; +Cc: Peter Xu, Fabiano Rosas, Cédric Le Goater
Hello,
Today, close_return_path_on_source() can perform a shutdown to exit
the return-path thread if an error occured. However, migrate_fd_cleanup()
does cleanups too early and the shutdown in close_return_path_on_source()
fails, leaving the source and destination waiting for an event to occur.
This little series tries to fix that. Comments welcome !
Thanks,
C.
Cédric Le Goater (2):
migration: Add a file_error argument to close_return_path_on_source()
migration: Fix return-path thread exit
migration/migration.c | 33 ++++++++++++++++++---------------
1 file changed, 18 insertions(+), 15 deletions(-)
--
2.43.0
^ permalink raw reply [flat|nested] 14+ messages in thread
* [PATCH 1/2] migration: Add a file_error argument to close_return_path_on_source()
2024-02-01 18:48 [PATCH 0/2] migration: Fix return-path thread exit Cédric Le Goater
@ 2024-02-01 18:48 ` Cédric Le Goater
2024-02-02 14:30 ` Fabiano Rosas
2024-02-01 18:48 ` [PATCH 2/2] migration: Fix return-path thread exit Cédric Le Goater
2024-02-02 9:53 ` [PATCH 0/2] " Peter Xu
2 siblings, 1 reply; 14+ messages in thread
From: Cédric Le Goater @ 2024-02-01 18:48 UTC (permalink / raw)
To: qemu-devel; +Cc: Peter Xu, Fabiano Rosas, Cédric Le Goater
close_return_path_on_source() retrieves the migration error from the
the QEMUFile '->to_dst_file' to know if a shutdown is required to exit
the return-path thread. However, in migrate_fd_cleanup(), '->to_dst_file'
is cleaned up before calling close_return_path_on_source() and the
shutdown is never performed, leaving the source and destination
waiting for an event to occur.
Cache the file error in a temporary variable and pass it to
close_return_path_on_source() to avoid relying on '->to_dst_file'.
Signed-off-by: Cédric Le Goater <clg@redhat.com>
---
migration/migration.c | 21 +++++++++++++--------
1 file changed, 13 insertions(+), 8 deletions(-)
diff --git a/migration/migration.c b/migration/migration.c
index d5f705ceef4c925589aa49335969672c0d761fa2..2c3362235c7651c11d581f3c3639571f1f9636ef 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -100,7 +100,7 @@ static int migration_maybe_pause(MigrationState *s,
int *current_active_state,
int new_state);
static void migrate_fd_cancel(MigrationState *s);
-static bool close_return_path_on_source(MigrationState *s);
+static bool close_return_path_on_source(MigrationState *s, int eror);
static void migration_downtime_start(MigrationState *s)
{
@@ -1313,6 +1313,8 @@ void migrate_set_state(int *state, int old_state, int new_state)
static void migrate_fd_cleanup(MigrationState *s)
{
+ int file_error = 0;
+
g_free(s->hostname);
s->hostname = NULL;
json_writer_free(s->vmdesc);
@@ -1333,6 +1335,7 @@ static void migrate_fd_cleanup(MigrationState *s)
multifd_save_cleanup();
qemu_mutex_lock(&s->qemu_file_lock);
+ file_error = qemu_file_get_error(s->to_dst_file);
tmp = s->to_dst_file;
s->to_dst_file = NULL;
qemu_mutex_unlock(&s->qemu_file_lock);
@@ -1348,7 +1351,7 @@ static void migrate_fd_cleanup(MigrationState *s)
* We already cleaned up to_dst_file, so errors from the return
* path might be due to that, ignore them.
*/
- close_return_path_on_source(s);
+ close_return_path_on_source(s, file_error);
assert(!migration_is_active(s));
@@ -2357,7 +2360,7 @@ static int open_return_path_on_source(MigrationState *ms)
}
/* Return true if error detected, or false otherwise */
-static bool close_return_path_on_source(MigrationState *ms)
+static bool close_return_path_on_source(MigrationState *ms, int file_error)
{
if (!ms->rp_state.rp_thread_created) {
return false;
@@ -2372,8 +2375,7 @@ static bool close_return_path_on_source(MigrationState *ms)
* cause it to unblock if it's stuck waiting for the destination.
*/
WITH_QEMU_LOCK_GUARD(&ms->qemu_file_lock) {
- if (ms->to_dst_file && ms->rp_state.from_dst_file &&
- qemu_file_get_error(ms->to_dst_file)) {
+ if (file_error && ms->rp_state.from_dst_file) {
qemu_file_shutdown(ms->rp_state.from_dst_file);
}
}
@@ -2707,6 +2709,7 @@ static void migration_completion(MigrationState *s)
{
int ret = 0;
int current_active_state = s->state;
+ int file_error = qemu_file_get_error(s->to_dst_file);
if (s->state == MIGRATION_STATUS_ACTIVE) {
ret = migration_completion_precopy(s, ¤t_active_state);
@@ -2720,11 +2723,11 @@ static void migration_completion(MigrationState *s)
goto fail;
}
- if (close_return_path_on_source(s)) {
+ if (close_return_path_on_source(s, file_error)) {
goto fail;
}
- if (qemu_file_get_error(s->to_dst_file)) {
+ if (file_error) {
trace_migration_completion_file_err();
goto fail;
}
@@ -2861,6 +2864,7 @@ static MigThrError postcopy_pause(MigrationState *s)
while (true) {
QEMUFile *file;
+ int file_error;
/*
* Current channel is possibly broken. Release it. Note that this is
@@ -2874,6 +2878,7 @@ static MigThrError postcopy_pause(MigrationState *s)
assert(s->to_dst_file);
migration_ioc_unregister_yank_from_file(s->to_dst_file);
qemu_mutex_lock(&s->qemu_file_lock);
+ file_error = qemu_file_get_error(s->to_dst_file);
file = s->to_dst_file;
s->to_dst_file = NULL;
qemu_mutex_unlock(&s->qemu_file_lock);
@@ -2886,7 +2891,7 @@ static MigThrError postcopy_pause(MigrationState *s)
* path and just wait for the thread to finish. It will be
* re-created when we resume.
*/
- close_return_path_on_source(s);
+ close_return_path_on_source(s, file_error);
migrate_set_state(&s->state, s->state,
MIGRATION_STATUS_POSTCOPY_PAUSED);
--
2.43.0
^ permalink raw reply related [flat|nested] 14+ messages in thread
* [PATCH 2/2] migration: Fix return-path thread exit
2024-02-01 18:48 [PATCH 0/2] migration: Fix return-path thread exit Cédric Le Goater
2024-02-01 18:48 ` [PATCH 1/2] migration: Add a file_error argument to close_return_path_on_source() Cédric Le Goater
@ 2024-02-01 18:48 ` Cédric Le Goater
2024-02-02 14:42 ` Fabiano Rosas
2024-02-02 9:53 ` [PATCH 0/2] " Peter Xu
2 siblings, 1 reply; 14+ messages in thread
From: Cédric Le Goater @ 2024-02-01 18:48 UTC (permalink / raw)
To: qemu-devel; +Cc: Peter Xu, Fabiano Rosas, Cédric Le Goater
In case of error, close_return_path_on_source() can perform a shutdown
to exit the return-path thread. However, in migrate_fd_cleanup(),
'to_dst_file' is closed before calling close_return_path_on_source()
and the shutdown fails, leaving the source and destination waiting for
an event to occur.
Close the file after calling close_return_path_on_source() so that the
shutdown succeeds and the return-path thread exits.
Signed-off-by: Cédric Le Goater <clg@redhat.com>
---
migration/migration.c | 12 +++++-------
1 file changed, 5 insertions(+), 7 deletions(-)
diff --git a/migration/migration.c b/migration/migration.c
index 2c3362235c7651c11d581f3c3639571f1f9636ef..1e0b6acaedc272e8ce26ad40be2c42177f5fd14e 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1314,6 +1314,7 @@ void migrate_set_state(int *state, int old_state, int new_state)
static void migrate_fd_cleanup(MigrationState *s)
{
int file_error = 0;
+ QEMUFile *tmp = NULL;
g_free(s->hostname);
s->hostname = NULL;
@@ -1323,8 +1324,6 @@ static void migrate_fd_cleanup(MigrationState *s)
qemu_savevm_state_cleanup();
if (s->to_dst_file) {
- QEMUFile *tmp;
-
trace_migrate_fd_cleanup();
bql_unlock();
if (s->migration_thread_running) {
@@ -1344,15 +1343,14 @@ static void migrate_fd_cleanup(MigrationState *s)
* critical section won't block for long.
*/
migration_ioc_unregister_yank_from_file(tmp);
- qemu_fclose(tmp);
}
- /*
- * We already cleaned up to_dst_file, so errors from the return
- * path might be due to that, ignore them.
- */
close_return_path_on_source(s, file_error);
+ if (tmp) {
+ qemu_fclose(tmp);
+ }
+
assert(!migration_is_active(s));
if (s->state == MIGRATION_STATUS_CANCELLING) {
--
2.43.0
^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [PATCH 0/2] migration: Fix return-path thread exit
2024-02-01 18:48 [PATCH 0/2] migration: Fix return-path thread exit Cédric Le Goater
2024-02-01 18:48 ` [PATCH 1/2] migration: Add a file_error argument to close_return_path_on_source() Cédric Le Goater
2024-02-01 18:48 ` [PATCH 2/2] migration: Fix return-path thread exit Cédric Le Goater
@ 2024-02-02 9:53 ` Peter Xu
2024-02-02 13:04 ` Cédric Le Goater
2024-02-05 10:32 ` Daniel P. Berrangé
2 siblings, 2 replies; 14+ messages in thread
From: Peter Xu @ 2024-02-02 9:53 UTC (permalink / raw)
To: Cédric Le Goater, Daniel P. Berrangé; +Cc: qemu-devel, Fabiano Rosas
On Thu, Feb 01, 2024 at 07:48:51PM +0100, Cédric Le Goater wrote:
> Hello,
Hi, Cédric,
Thanks for the patches.
>
> Today, close_return_path_on_source() can perform a shutdown to exit
> the return-path thread if an error occured. However, migrate_fd_cleanup()
> does cleanups too early and the shutdown in close_return_path_on_source()
> fails, leaving the source and destination waiting for an event to occur.
>
> This little series tries to fix that. Comments welcome !
One thing I do agree is that relying on qemu_file_get_error(to_dst_file) in
close_return_path_on_source() is weird: IMHO we have better way to detect
"whether the migration has error" now, which is migrate_has_error().
For this specific issue, I think one long standing issue that might be
relevant is we have two QEMUFile (from_dst_file, to_dst_file) that share
the same QIOChannel now. Logically the two QEMUFile should be able to be
managed separately, say, close() of to_dst_file shouldn't affect the other.
However I don't think it's the case now, as qemu_fclose(to_dst_file) will
do qio_channel_close() already, which means there will be a side effect to
the other QEMUFile that its backing IOC is already closed.
Is this the issue we're facing? IOW, the close() of to_dst_file will not
properly kick the other thread who is blocked at reading from_dst_file,
while the shutdown() will kick it out?
If so, not sure whether we can somehow relay the real qio_channel_close()
to until the last user releases it? IOW, conditionally close() the channel
in qio_channel_finalize(), if the channel is still open? Would that make
sense?
Copy Dan too.
Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 0/2] migration: Fix return-path thread exit
2024-02-02 9:53 ` [PATCH 0/2] " Peter Xu
@ 2024-02-02 13:04 ` Cédric Le Goater
2024-02-05 10:32 ` Daniel P. Berrangé
1 sibling, 0 replies; 14+ messages in thread
From: Cédric Le Goater @ 2024-02-02 13:04 UTC (permalink / raw)
To: Peter Xu, Daniel P. Berrangé; +Cc: qemu-devel, Fabiano Rosas
Hello Peter,
>> Today, close_return_path_on_source() can perform a shutdown to exit
>> the return-path thread if an error occured. However, migrate_fd_cleanup()
>> does cleanups too early and the shutdown in close_return_path_on_source()
>> fails, leaving the source and destination waiting for an event to occur.
>>
>> This little series tries to fix that. Comments welcome !
>
> One thing I do agree is that relying on qemu_file_get_error(to_dst_file) in
> close_return_path_on_source() is weird: IMHO we have better way to detect
> "whether the migration has error" now, which is migrate_has_error().
ok. migrate_has_error() looks safe to use in that case. It works fine
with all the prereq VFIO cleanups (that I didn't send yet) and errors
in the setup of dirty tracking are reported correctly to the migration
core subsystem.
> For this specific issue, I think one long standing issue that might be
> relevant is we have two QEMUFile (from_dst_file, to_dst_file) that share
> the same QIOChannel now. Logically the two QEMUFile should be able to be
> managed separately, say, close() of to_dst_file shouldn't affect the other.
>
> However I don't think it's the case now, as qemu_fclose(to_dst_file) will
> do qio_channel_close() already, which means there will be a side effect to
> the other QEMUFile that its backing IOC is already closed.
>
> Is this the issue we're facing?
Yes. The socket is closed before calling close_return_path_on_source()
and ms->rp_state.from_dst_file becomes invalid, the shutdown silently
fails (we should maybe report error in qemu_file_shutdown()) and the
return-path thread does not exits.
> IOW, the close() of to_dst_file will not
> properly kick the other thread who is blocked at reading from_dst_file,
> while the shutdown() will kick it out?
Yes, that's how I understand the comment :
/*
* If this is a normal exit then the destination will send a SHUT
* and the rp_thread will exit, however if there's an error we
* need to cause it to exit. shutdown(2), if we have it, will
* cause it to unblock if it's stuck waiting for the destination.
*/
> If so, not sure whether we can somehow relay the real qio_channel_close()
> to until the last user releases it? IOW, conditionally close() the channel> in qio_channel_finalize(), if the channel is still open? Would that make
> sense?
It's the first time that I look at this code :/ I can't tell. Here is
the closing section :
qemu_mutex_unlock(&s->qemu_file_lock);
/*
* Close the file handle without the lock to make sure the
* critical section won't block for long.
*/
migration_ioc_unregister_yank_from_file(tmp);
qemu_fclose(tmp);
}
Thanks,
C.
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 1/2] migration: Add a file_error argument to close_return_path_on_source()
2024-02-01 18:48 ` [PATCH 1/2] migration: Add a file_error argument to close_return_path_on_source() Cédric Le Goater
@ 2024-02-02 14:30 ` Fabiano Rosas
2024-02-02 14:45 ` Cédric Le Goater
0 siblings, 1 reply; 14+ messages in thread
From: Fabiano Rosas @ 2024-02-02 14:30 UTC (permalink / raw)
To: Cédric Le Goater, qemu-devel; +Cc: Peter Xu, Cédric Le Goater
Cédric Le Goater <clg@redhat.com> writes:
> close_return_path_on_source() retrieves the migration error from the
> the QEMUFile '->to_dst_file' to know if a shutdown is required to exit
> the return-path thread. However, in migrate_fd_cleanup(), '->to_dst_file'
> is cleaned up before calling close_return_path_on_source() and the
> shutdown is never performed, leaving the source and destination
> waiting for an event to occur.
Isn't this just missing qemu_file_shutdown() at migrate_fd_cleanup?
if (s->to_dst_file) {
...
migration_ioc_unregister_yank_from_file(tmp);
+ qemu_file_shutdown(tmp);
qemu_fclose(tmp);
}
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 2/2] migration: Fix return-path thread exit
2024-02-01 18:48 ` [PATCH 2/2] migration: Fix return-path thread exit Cédric Le Goater
@ 2024-02-02 14:42 ` Fabiano Rosas
2024-02-02 14:51 ` Cédric Le Goater
0 siblings, 1 reply; 14+ messages in thread
From: Fabiano Rosas @ 2024-02-02 14:42 UTC (permalink / raw)
To: Cédric Le Goater, qemu-devel; +Cc: Peter Xu, Cédric Le Goater
Cédric Le Goater <clg@redhat.com> writes:
> In case of error, close_return_path_on_source() can perform a shutdown
> to exit the return-path thread. However, in migrate_fd_cleanup(),
> 'to_dst_file' is closed before calling close_return_path_on_source()
> and the shutdown fails, leaving the source and destination waiting for
> an event to occur.
At close_return_path_on_source, qemu_file_shutdown() and checking
ms->to_dst_file are done under the qemu_file_lock, so how could
migrate_fd_cleanup() have cleared the pointer but the ms->to_dst_file
check have passed?
>
> Close the file after calling close_return_path_on_source() so that the
> shutdown succeeds and the return-path thread exits.
>
> Signed-off-by: Cédric Le Goater <clg@redhat.com>
> ---
> migration/migration.c | 12 +++++-------
> 1 file changed, 5 insertions(+), 7 deletions(-)
>
> diff --git a/migration/migration.c b/migration/migration.c
> index 2c3362235c7651c11d581f3c3639571f1f9636ef..1e0b6acaedc272e8ce26ad40be2c42177f5fd14e 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -1314,6 +1314,7 @@ void migrate_set_state(int *state, int old_state, int new_state)
> static void migrate_fd_cleanup(MigrationState *s)
> {
> int file_error = 0;
> + QEMUFile *tmp = NULL;
>
> g_free(s->hostname);
> s->hostname = NULL;
> @@ -1323,8 +1324,6 @@ static void migrate_fd_cleanup(MigrationState *s)
> qemu_savevm_state_cleanup();
>
> if (s->to_dst_file) {
> - QEMUFile *tmp;
> -
> trace_migrate_fd_cleanup();
> bql_unlock();
> if (s->migration_thread_running) {
> @@ -1344,15 +1343,14 @@ static void migrate_fd_cleanup(MigrationState *s)
> * critical section won't block for long.
> */
> migration_ioc_unregister_yank_from_file(tmp);
> - qemu_fclose(tmp);
> }
>
> - /*
> - * We already cleaned up to_dst_file, so errors from the return
> - * path might be due to that, ignore them.
> - */
> close_return_path_on_source(s, file_error);
>
> + if (tmp) {
> + qemu_fclose(tmp);
> + }
> +
> assert(!migration_is_active(s));
>
> if (s->state == MIGRATION_STATUS_CANCELLING) {
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 1/2] migration: Add a file_error argument to close_return_path_on_source()
2024-02-02 14:30 ` Fabiano Rosas
@ 2024-02-02 14:45 ` Cédric Le Goater
0 siblings, 0 replies; 14+ messages in thread
From: Cédric Le Goater @ 2024-02-02 14:45 UTC (permalink / raw)
To: Fabiano Rosas, qemu-devel; +Cc: Peter Xu
On 2/2/24 15:30, Fabiano Rosas wrote:
> Cédric Le Goater <clg@redhat.com> writes:
>
>> close_return_path_on_source() retrieves the migration error from the
>> the QEMUFile '->to_dst_file' to know if a shutdown is required to exit
>> the return-path thread. However, in migrate_fd_cleanup(), '->to_dst_file'
>> is cleaned up before calling close_return_path_on_source() and the
>> shutdown is never performed, leaving the source and destination
>> waiting for an event to occur.
>
> Isn't this just missing qemu_file_shutdown() at migrate_fd_cleanup?
>
> if (s->to_dst_file) {
> ...
> migration_ioc_unregister_yank_from_file(tmp);
> + qemu_file_shutdown(tmp);
> qemu_fclose(tmp);
> }
>
That would make the return-path thread exit indeed. It should not
be necessary when there are no errors though and this is done
outside of the close_return_path_on_source() helper. There could
be side effects.
I took into account Peter's comment and replaced the changes of
PATCH 1 with :
@@ -2372,8 +2372,7 @@ static bool close_return_path_on_source(
* cause it to unblock if it's stuck waiting for the destination.
*/
WITH_QEMU_LOCK_GUARD(&ms->qemu_file_lock) {
- if (ms->to_dst_file && ms->rp_state.from_dst_file &&
- qemu_file_get_error(ms->to_dst_file)) {
+ if (migrate_has_error(ms) && ms->rp_state.from_dst_file) {
qemu_file_shutdown(ms->rp_state.from_dst_file);
}
}
Nevertheless, we need to qemu_file_shutdown() correctly the socket
for this to work and the problem seems more complex than just moving
code as I did in PATCH 2.
Thanks,
C.
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 2/2] migration: Fix return-path thread exit
2024-02-02 14:42 ` Fabiano Rosas
@ 2024-02-02 14:51 ` Cédric Le Goater
2024-02-02 15:11 ` Fabiano Rosas
0 siblings, 1 reply; 14+ messages in thread
From: Cédric Le Goater @ 2024-02-02 14:51 UTC (permalink / raw)
To: Fabiano Rosas, qemu-devel; +Cc: Peter Xu
On 2/2/24 15:42, Fabiano Rosas wrote:
> Cédric Le Goater <clg@redhat.com> writes:
>
>> In case of error, close_return_path_on_source() can perform a shutdown
>> to exit the return-path thread. However, in migrate_fd_cleanup(),
>> 'to_dst_file' is closed before calling close_return_path_on_source()
>> and the shutdown fails, leaving the source and destination waiting for
>> an event to occur.
>
> At close_return_path_on_source, qemu_file_shutdown() and checking
> ms->to_dst_file are done under the qemu_file_lock, so how could
> migrate_fd_cleanup() have cleared the pointer but the ms->to_dst_file
> check have passed?
This is not a locking issue, it's much simpler. migrate_fd_cleanup()
clears the ms->to_dst_file pointer and closes the QEMUFile and then
calls close_return_path_on_source() which then tries to use resources
which are not available anymore.
Thanks,
C.
>
>>
>> Close the file after calling close_return_path_on_source() so that the
>> shutdown succeeds and the return-path thread exits.
>>
>> Signed-off-by: Cédric Le Goater <clg@redhat.com>
>> ---
>> migration/migration.c | 12 +++++-------
>> 1 file changed, 5 insertions(+), 7 deletions(-)
>>
>> diff --git a/migration/migration.c b/migration/migration.c
>> index 2c3362235c7651c11d581f3c3639571f1f9636ef..1e0b6acaedc272e8ce26ad40be2c42177f5fd14e 100644
>> --- a/migration/migration.c
>> +++ b/migration/migration.c
>> @@ -1314,6 +1314,7 @@ void migrate_set_state(int *state, int old_state, int new_state)
>> static void migrate_fd_cleanup(MigrationState *s)
>> {
>> int file_error = 0;
>> + QEMUFile *tmp = NULL;
>>
>> g_free(s->hostname);
>> s->hostname = NULL;
>> @@ -1323,8 +1324,6 @@ static void migrate_fd_cleanup(MigrationState *s)
>> qemu_savevm_state_cleanup();
>>
>> if (s->to_dst_file) {
>> - QEMUFile *tmp;
>> -
>> trace_migrate_fd_cleanup();
>> bql_unlock();
>> if (s->migration_thread_running) {
>> @@ -1344,15 +1343,14 @@ static void migrate_fd_cleanup(MigrationState *s)
>> * critical section won't block for long.
>> */
>> migration_ioc_unregister_yank_from_file(tmp);
>> - qemu_fclose(tmp);
>> }
>>
>> - /*
>> - * We already cleaned up to_dst_file, so errors from the return
>> - * path might be due to that, ignore them.
>> - */
>> close_return_path_on_source(s, file_error);
>>
>> + if (tmp) {
>> + qemu_fclose(tmp);
>> + }
>> +
>> assert(!migration_is_active(s));
>>
>> if (s->state == MIGRATION_STATUS_CANCELLING) {
>
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 2/2] migration: Fix return-path thread exit
2024-02-02 14:51 ` Cédric Le Goater
@ 2024-02-02 15:11 ` Fabiano Rosas
2024-02-05 3:37 ` Peter Xu
0 siblings, 1 reply; 14+ messages in thread
From: Fabiano Rosas @ 2024-02-02 15:11 UTC (permalink / raw)
To: Cédric Le Goater, qemu-devel; +Cc: Peter Xu
Cédric Le Goater <clg@redhat.com> writes:
> On 2/2/24 15:42, Fabiano Rosas wrote:
>> Cédric Le Goater <clg@redhat.com> writes:
>>
>>> In case of error, close_return_path_on_source() can perform a shutdown
>>> to exit the return-path thread. However, in migrate_fd_cleanup(),
>>> 'to_dst_file' is closed before calling close_return_path_on_source()
>>> and the shutdown fails, leaving the source and destination waiting for
>>> an event to occur.
>>
>> At close_return_path_on_source, qemu_file_shutdown() and checking
>> ms->to_dst_file are done under the qemu_file_lock, so how could
>> migrate_fd_cleanup() have cleared the pointer but the ms->to_dst_file
>> check have passed?
>
> This is not a locking issue, it's much simpler. migrate_fd_cleanup()
> clears the ms->to_dst_file pointer and closes the QEMUFile and then
> calls close_return_path_on_source() which then tries to use resources
> which are not available anymore.
I'm missing something here. Which resources? I assume you're talking
about this:
WITH_QEMU_LOCK_GUARD(&ms->qemu_file_lock) {
if (ms->to_dst_file && ms->rp_state.from_dst_file &&
qemu_file_get_error(ms->to_dst_file)) {
qemu_file_shutdown(ms->rp_state.from_dst_file);
}
}
How do we get past the 'if (ms->to_dst_file)'?
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 2/2] migration: Fix return-path thread exit
2024-02-02 15:11 ` Fabiano Rosas
@ 2024-02-05 3:37 ` Peter Xu
2024-02-05 10:17 ` Cédric Le Goater
0 siblings, 1 reply; 14+ messages in thread
From: Peter Xu @ 2024-02-05 3:37 UTC (permalink / raw)
To: Fabiano Rosas; +Cc: Cédric Le Goater, qemu-devel
On Fri, Feb 02, 2024 at 12:11:09PM -0300, Fabiano Rosas wrote:
> Cédric Le Goater <clg@redhat.com> writes:
>
> > On 2/2/24 15:42, Fabiano Rosas wrote:
> >> Cédric Le Goater <clg@redhat.com> writes:
> >>
> >>> In case of error, close_return_path_on_source() can perform a shutdown
> >>> to exit the return-path thread. However, in migrate_fd_cleanup(),
> >>> 'to_dst_file' is closed before calling close_return_path_on_source()
> >>> and the shutdown fails, leaving the source and destination waiting for
> >>> an event to occur.
> >>
> >> At close_return_path_on_source, qemu_file_shutdown() and checking
> >> ms->to_dst_file are done under the qemu_file_lock, so how could
> >> migrate_fd_cleanup() have cleared the pointer but the ms->to_dst_file
> >> check have passed?
> >
> > This is not a locking issue, it's much simpler. migrate_fd_cleanup()
> > clears the ms->to_dst_file pointer and closes the QEMUFile and then
> > calls close_return_path_on_source() which then tries to use resources
> > which are not available anymore.
>
> I'm missing something here. Which resources? I assume you're talking
> about this:
>
> WITH_QEMU_LOCK_GUARD(&ms->qemu_file_lock) {
> if (ms->to_dst_file && ms->rp_state.from_dst_file &&
> qemu_file_get_error(ms->to_dst_file)) {
> qemu_file_shutdown(ms->rp_state.from_dst_file);
> }
> }
>
> How do we get past the 'if (ms->to_dst_file)'?
We don't; migrate_fd_cleanup() will release ms->to_dst_file, then call
close_return_path_on_source(), found that to_dst_file==NULL and then skip
the shutdown().
One other option might be that we do close_return_path_on_source() before
the chunk of releasing to_dst_file.
This "two qemufiles share the same ioc" issue had bitten us before IIRC,
and the only concern of that workaround is we keep postponing resolution of
the real issue, then we keep getting bitten by it..
Maybe we can wait a few days to see if Dan can join the conversation and if
we can reach a consensus on a complete solution. Otherwise I think we can
still work this around, but maybe that'll require a comment block
explaining the bits after such movement.
Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 2/2] migration: Fix return-path thread exit
2024-02-05 3:37 ` Peter Xu
@ 2024-02-05 10:17 ` Cédric Le Goater
0 siblings, 0 replies; 14+ messages in thread
From: Cédric Le Goater @ 2024-02-05 10:17 UTC (permalink / raw)
To: Peter Xu, Fabiano Rosas; +Cc: qemu-devel
On 2/5/24 04:37, Peter Xu wrote:
> On Fri, Feb 02, 2024 at 12:11:09PM -0300, Fabiano Rosas wrote:
>> Cédric Le Goater <clg@redhat.com> writes:
>>
>>> On 2/2/24 15:42, Fabiano Rosas wrote:
>>>> Cédric Le Goater <clg@redhat.com> writes:
>>>>
>>>>> In case of error, close_return_path_on_source() can perform a shutdown
>>>>> to exit the return-path thread. However, in migrate_fd_cleanup(),
>>>>> 'to_dst_file' is closed before calling close_return_path_on_source()
>>>>> and the shutdown fails, leaving the source and destination waiting for
>>>>> an event to occur.
>>>>
>>>> At close_return_path_on_source, qemu_file_shutdown() and checking
>>>> ms->to_dst_file are done under the qemu_file_lock, so how could
>>>> migrate_fd_cleanup() have cleared the pointer but the ms->to_dst_file
>>>> check have passed?
>>>
>>> This is not a locking issue, it's much simpler. migrate_fd_cleanup()
>>> clears the ms->to_dst_file pointer and closes the QEMUFile and then
>>> calls close_return_path_on_source() which then tries to use resources
>>> which are not available anymore.
>>
>> I'm missing something here. Which resources? I assume you're talking
>> about this:
>>
>> WITH_QEMU_LOCK_GUARD(&ms->qemu_file_lock) {
>> if (ms->to_dst_file && ms->rp_state.from_dst_file &&
>> qemu_file_get_error(ms->to_dst_file)) {
>> qemu_file_shutdown(ms->rp_state.from_dst_file);
>> }
>> }
>>
>> How do we get past the 'if (ms->to_dst_file)'?
>
> We don't; migrate_fd_cleanup() will release ms->to_dst_file, then call
> close_return_path_on_source(), found that to_dst_file==NULL and then skip
> the shutdown().
>
> One other option might be that we do close_return_path_on_source() before
> the chunk of releasing to_dst_file.
>
> This "two qemufiles share the same ioc" issue had bitten us before IIRC,
> and the only concern of that workaround is we keep postponing resolution of
> the real issue, then we keep getting bitten by it..
>
> Maybe we can wait a few days to see if Dan can join the conversation and if
> we can reach a consensus on a complete solution. Otherwise I think we can
> still work this around, but maybe that'll require a comment block
> explaining the bits after such movement.
yes. The series should have been sent with an RFC.
I changed PATCH 1 to use migrate_has_error() instead of
qemu_file_get_error(ms->to_dst_file). I will keep PATCH 2 as it is for
the time being and wait for more feedback.
The prereq series adds an Error** argument to the .save_setup() and
.log_global*() handlers. I should send this week.
Thanks,
C.
>
> Thanks,
>
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 0/2] migration: Fix return-path thread exit
2024-02-02 9:53 ` [PATCH 0/2] " Peter Xu
2024-02-02 13:04 ` Cédric Le Goater
@ 2024-02-05 10:32 ` Daniel P. Berrangé
2024-02-06 2:42 ` Peter Xu
1 sibling, 1 reply; 14+ messages in thread
From: Daniel P. Berrangé @ 2024-02-05 10:32 UTC (permalink / raw)
To: Peter Xu; +Cc: Cédric Le Goater, qemu-devel, Fabiano Rosas
On Fri, Feb 02, 2024 at 05:53:39PM +0800, Peter Xu wrote:
> On Thu, Feb 01, 2024 at 07:48:51PM +0100, Cédric Le Goater wrote:
> > Hello,
>
> Hi, Cédric,
>
> Thanks for the patches.
>
> >
> > Today, close_return_path_on_source() can perform a shutdown to exit
> > the return-path thread if an error occured. However, migrate_fd_cleanup()
> > does cleanups too early and the shutdown in close_return_path_on_source()
> > fails, leaving the source and destination waiting for an event to occur.
> >
> > This little series tries to fix that. Comments welcome !
>
> One thing I do agree is that relying on qemu_file_get_error(to_dst_file) in
> close_return_path_on_source() is weird: IMHO we have better way to detect
> "whether the migration has error" now, which is migrate_has_error().
>
> For this specific issue, I think one long standing issue that might be
> relevant is we have two QEMUFile (from_dst_file, to_dst_file) that share
> the same QIOChannel now. Logically the two QEMUFile should be able to be
> managed separately, say, close() of to_dst_file shouldn't affect the other.
>
> However I don't think it's the case now, as qemu_fclose(to_dst_file) will
> do qio_channel_close() already, which means there will be a side effect to
> the other QEMUFile that its backing IOC is already closed.
>
> Is this the issue we're facing? IOW, the close() of to_dst_file will not
> properly kick the other thread who is blocked at reading from_dst_file,
> while the shutdown() will kick it out?
>
> If so, not sure whether we can somehow relay the real qio_channel_close()
> to until the last user releases it? IOW, conditionally close() the channel
> in qio_channel_finalize(), if the channel is still open? Would that make
> sense?
IMHO the problem described above is a result of the design mistake of
having 2 separate QEMUFile instances for what is ultimately the same
channel. This was a convenient approach to take originally, but it has
likely outlived its purpose.
In the ideal world IMHO, QEMUFile would not exist at all, and we would
have a QIOChannelCached that adds the read/write buffering above the
base QIOChannel.
That's doable, but bigger than a quick fix. A natural stepping stone
to get there though is to move from 2 QEMUFile objs down to 1 QEMUFile,
which might be more practical as a quick fix.
With regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 0/2] migration: Fix return-path thread exit
2024-02-05 10:32 ` Daniel P. Berrangé
@ 2024-02-06 2:42 ` Peter Xu
0 siblings, 0 replies; 14+ messages in thread
From: Peter Xu @ 2024-02-06 2:42 UTC (permalink / raw)
To: Daniel P. Berrangé; +Cc: Cédric Le Goater, qemu-devel, Fabiano Rosas
On Mon, Feb 05, 2024 at 10:32:33AM +0000, Daniel P. Berrangé wrote:
> On Fri, Feb 02, 2024 at 05:53:39PM +0800, Peter Xu wrote:
> > On Thu, Feb 01, 2024 at 07:48:51PM +0100, Cédric Le Goater wrote:
> > > Hello,
> >
> > Hi, Cédric,
> >
> > Thanks for the patches.
> >
> > >
> > > Today, close_return_path_on_source() can perform a shutdown to exit
> > > the return-path thread if an error occured. However, migrate_fd_cleanup()
> > > does cleanups too early and the shutdown in close_return_path_on_source()
> > > fails, leaving the source and destination waiting for an event to occur.
> > >
> > > This little series tries to fix that. Comments welcome !
> >
> > One thing I do agree is that relying on qemu_file_get_error(to_dst_file) in
> > close_return_path_on_source() is weird: IMHO we have better way to detect
> > "whether the migration has error" now, which is migrate_has_error().
> >
> > For this specific issue, I think one long standing issue that might be
> > relevant is we have two QEMUFile (from_dst_file, to_dst_file) that share
> > the same QIOChannel now. Logically the two QEMUFile should be able to be
> > managed separately, say, close() of to_dst_file shouldn't affect the other.
> >
> > However I don't think it's the case now, as qemu_fclose(to_dst_file) will
> > do qio_channel_close() already, which means there will be a side effect to
> > the other QEMUFile that its backing IOC is already closed.
> >
> > Is this the issue we're facing? IOW, the close() of to_dst_file will not
> > properly kick the other thread who is blocked at reading from_dst_file,
> > while the shutdown() will kick it out?
> >
> > If so, not sure whether we can somehow relay the real qio_channel_close()
> > to until the last user releases it? IOW, conditionally close() the channel
> > in qio_channel_finalize(), if the channel is still open? Would that make
> > sense?
>
> IMHO the problem described above is a result of the design mistake of
> having 2 separate QEMUFile instances for what is ultimately the same
> channel. This was a convenient approach to take originally, but it has
> likely outlived its purpose.
>
> In the ideal world IMHO, QEMUFile would not exist at all, and we would
> have a QIOChannelCached that adds the read/write buffering above the
> base QIOChannel.
We have that in the TODO wiki page for a long time, I'll update it
slightly.
https://wiki.qemu.org/ToDo/LiveMigration#Rewrite_QEMUFile_for_migration
But yeah that might be too big a hammer to solve this specific issue.
AFAIU Fabiano is looking into that direction, but I assume it should still
be a long term thing.
>
> That's doable, but bigger than a quick fix. A natural stepping stone
> to get there though is to move from 2 QEMUFile objs down to 1 QEMUFile,
> which might be more practical as a quick fix.
Agree. However would this still be quite some change?
We still have a lot of references on the four qemufiles (to/from_dst_file,
to/from_src_file), at least that'll need a replacement; I didn't yet
further check whether all places can be done with a direct replacement of
such change, some tweaks may be needed here and there, but shouldn't be
major.
Meanwhile IIUC it'll also need a major rework on QEMUFile, allowing it to
be bi-directional? We may need to duplicate the cache layer, IIUC, one for
each direction IOs.
--
Peter Xu
^ permalink raw reply [flat|nested] 14+ messages in thread
end of thread, other threads:[~2024-02-06 2:43 UTC | newest]
Thread overview: 14+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-02-01 18:48 [PATCH 0/2] migration: Fix return-path thread exit Cédric Le Goater
2024-02-01 18:48 ` [PATCH 1/2] migration: Add a file_error argument to close_return_path_on_source() Cédric Le Goater
2024-02-02 14:30 ` Fabiano Rosas
2024-02-02 14:45 ` Cédric Le Goater
2024-02-01 18:48 ` [PATCH 2/2] migration: Fix return-path thread exit Cédric Le Goater
2024-02-02 14:42 ` Fabiano Rosas
2024-02-02 14:51 ` Cédric Le Goater
2024-02-02 15:11 ` Fabiano Rosas
2024-02-05 3:37 ` Peter Xu
2024-02-05 10:17 ` Cédric Le Goater
2024-02-02 9:53 ` [PATCH 0/2] " Peter Xu
2024-02-02 13:04 ` Cédric Le Goater
2024-02-05 10:32 ` Daniel P. Berrangé
2024-02-06 2:42 ` Peter Xu
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).