From: Fabiano Rosas <farosas@suse.de>
To: Peter Xu <peterx@redhat.com>, qemu-devel@nongnu.org
Cc: Leonardo Bras Soares Passos <lsoaresp@redhat.com>,
Juan Quintela <quintela@redhat.com>,
Lukas Straub <lukasstraub2@web.de>,
Laszlo Ersek <lersek@redhat.com>,
peterx@redhat.com
Subject: Re: [PATCH v2 0/7] migration: Better error handling in return path thread
Date: Tue, 25 Jul 2023 15:24:26 -0300 [thread overview]
Message-ID: <87r0ovonmt.fsf@suse.de> (raw)
In-Reply-To: <87351dz9fi.fsf@suse.de>
Fabiano Rosas <farosas@suse.de> writes:
> Peter Xu <peterx@redhat.com> writes:
>
>> v2:
>> - Patch "migration: Provide explicit error message for file shutdowns"
>> - Touched up qapi doc [Fabiano]
>> - Added Bugzilla link to commit which I didn't even notice that I was
>> fixing a bug.. but rightfully pointed out by Laszlo.
>> - Moved it to the 1st patch because it fixes a bug, please consider
>> review and merge it earlier.
>>
>> This is a small series that reworks error handling of postcopy return path
>> threads.
>>
>> We used to contain a bunch of error_report(), converting them into
>> error_setg() properly and deliver any of those errors to migration generic
>> error reports (via migrate_set_error()). Then these errors can also be
>> observed in query-migrate after postcopy is paused.
>>
>> Dropped the return-path specific error reporting: mark_source_rp_bad(),
>> because it's a duplication if we can always use migrate_set_error().
>>
>> Please have a look, thanks.
>>
>> Peter Xu (7):
>> migration: Display error in query-migrate irrelevant of status
>> migration: Let migrate_set_error() take ownership
>> migration: Introduce migrate_has_error()
>> migration: Refactor error handling in source return path
>> migration: Deliver return path file error to migrate state too
>> qemufile: Always return a verbose error
>> migration: Provide explicit error message for file shutdowns
>>
>> qapi/migration.json | 5 +-
>> migration/migration.h | 8 +-
>> migration/ram.h | 5 +-
>> migration/channel.c | 1 -
>> migration/migration.c | 168 +++++++++++++++++++++++----------------
>> migration/multifd.c | 10 +--
>> migration/postcopy-ram.c | 1 -
>> migration/qemu-file.c | 20 ++++-
>> migration/ram.c | 42 +++++-----
>> migration/trace-events | 2 +-
>> 10 files changed, 149 insertions(+), 113 deletions(-)
>
> Hi Peter,
>
> Were you aiming at solving any specific bug with this series? I'm seeing
> a bug on master (361d5397355) with the
> /x86_64/migration/postcopy/preempt/recovery/plain test around the areas
> that this series touches.
>
> It happens very rarely and I'm still investigating, but in case you have
> any thoughts:
>
> ====
> It seems there's a race condition between postcopy resume and the return
> path cleanup.
>
> It is possible for open_return_path_on_source() to setup the new
> QEMUFile *before* the cleanup path at source_return_path_thread() has
> had a chance to run, so we end up calling migration_release_dst_files()
> on the new file and ms->rp_state.from_dst_file gets set to NULL again,
> leading to a SIGSEGV at qemu_file_get_error(rp) due to rp being NULL.
I did some more digging and this is indeed what happens. When we pause
on the incoming side, the to_src_file is closed and the source return
path sees an error (EBADFD) which leads to the cleanup (from_dst_file =
NULL). This happens independently and without any synchronization with a
potential concurrent resume operation.
Is there a reason for not closing the return path thread and starting a
new one for resume? The from_dst_file is the only thing being changed
anyway. It would allow us to remove the retry logic along with the
problematic cleanup path and not need another synchronization point
between qmp_migrate() and the return path.
Here's the race (important bit is open_return_path happening before
migration_release_dst_files):
migration | qmp | return path
--------------------------+-----------------------------+---------------------------------
qmp_migrate_pause()
shutdown(ms->to_dst_file)
f->last_error = -EIO
migrate_detect_error()
postcopy_pause()
set_state(PAUSED)
wait(postcopy_pause_sem)
qmp_migrate(resume)
migrate_fd_connect()
resume = state == PAUSED
open_return_path <-- TOO SOON!
set_state(RECOVER)
post(postcopy_pause_sem)
(incoming closes to_src_file)
res = qemu_file_get_error(rp)
migration_release_dst_files()
ms->rp_state.from_dst_file = NULL
post(postcopy_pause_rp_sem)
postcopy_pause_return_path_thread()
wait(postcopy_pause_rp_sem)
rp = ms->rp_state.from_dst_file
goto retry
qemu_file_get_error(rp)
SIGSEGV
-------------------------------------------------------------------------------------------
next prev parent reply other threads:[~2023-07-25 18:25 UTC|newest]
Thread overview: 23+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-07-05 16:34 [PATCH v2 0/7] migration: Better error handling in return path thread Peter Xu
2023-07-05 16:34 ` [PATCH v2 1/7] migration: Display error in query-migrate irrelevant of status Peter Xu
2023-07-05 16:34 ` [PATCH v2 2/7] migration: Let migrate_set_error() take ownership Peter Xu
2023-07-05 20:53 ` Fabiano Rosas
2023-07-05 16:34 ` [PATCH v2 3/7] migration: Introduce migrate_has_error() Peter Xu
2023-07-05 20:55 ` Fabiano Rosas
2023-07-05 16:34 ` [PATCH v2 4/7] migration: Refactor error handling in source return path Peter Xu
2023-07-05 16:35 ` [PATCH v2 5/7] migration: Deliver return path file error to migrate state too Peter Xu
2023-07-05 16:35 ` [PATCH v2 6/7] qemufile: Always return a verbose error Peter Xu
2023-07-05 21:54 ` Fabiano Rosas
2023-07-05 22:24 ` Peter Xu
2023-07-06 13:56 ` Fabiano Rosas
2023-07-05 16:35 ` [PATCH v2 7/7] migration: Provide explicit error message for file shutdowns Peter Xu
2023-07-05 22:05 ` Fabiano Rosas
2023-07-05 22:34 ` Peter Xu
2023-07-06 13:50 ` Fabiano Rosas
2023-07-06 16:27 ` Peter Xu
2023-07-06 17:33 ` Fabiano Rosas
2023-07-06 18:08 ` Peter Xu
2023-07-06 18:47 ` Fabiano Rosas
2023-07-24 14:11 ` [PATCH v2 0/7] migration: Better error handling in return path thread Fabiano Rosas
2023-07-25 18:24 ` Fabiano Rosas [this message]
2023-07-26 16:36 ` Peter Xu
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=87r0ovonmt.fsf@suse.de \
--to=farosas@suse.de \
--cc=lersek@redhat.com \
--cc=lsoaresp@redhat.com \
--cc=lukasstraub2@web.de \
--cc=peterx@redhat.com \
--cc=qemu-devel@nongnu.org \
--cc=quintela@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).