From: Fabiano Rosas <farosas@suse.de>
To: qemu-devel@nongnu.org
Cc: Juan Quintela <quintela@redhat.com>, Peter Xu <peterx@redhat.com>,
Wei Wang <wei.w.wang@intel.com>
Subject: [PATCH v4 0/8] Fix segfault on migration return path
Date: Wed, 16 Aug 2023 11:25:02 -0300 [thread overview]
Message-ID: <20230816142510.5637-1-farosas@suse.de> (raw)
The major change from v3 is the dropping of shutdown() before close()
as suggested by Peter.
I also added a yank handler to rp_state.from_dst_file.
CI run: https://gitlab.com/farosas/qemu/-/pipelines/969570891
v3:
https://lore.kernel.org/r/20230811150836.2895-1-farosas@suse.de
I decided to fix the issues with the shutdown instead of complaining
about them. First 5 patches address all of the possible races I
found. The only problem left to figure out is the -EIO on shutdown
which will need more thought.
Patches 6 & 7 fix the original segfault.
Patches 8-10 make the cleanup of migration files more predictable and
centralized.
I also adjusted some small things that were mentioned by Peter:
- moved the rp.error = false outside of the thread;
- stopped checking for errors during postcopy_pause();
- dropped the tracepoint;
CI run: https://gitlab.com/farosas/qemu/-/pipelines/963407228
v2:
https://lore.kernel.org/r/20230802143644.7534-1-farosas@suse.de
- moved the await into postcopy_pause() as Peter suggested;
- brought back the mark_source_rp_bad call. Turns out that piece of
code is filled with nuance. I just moved it aside since it doesn't
make sense during pause/resume. We can tackle that when we get the
chance.
CI run: https://gitlab.com/farosas/qemu/-/pipelines/953420150
Also ran the switchover and preempt tests for 1000 times each on
x86_64.
v1:
https://lore.kernel.org/r/20230728121516.16258-1-farosas@suse.de
The /x86_64/migration/postcopy/preempt/recovery/plain test is
sometimes failing due a segmentation fault on the migration return
path. There is a race involving the retry logic of the return path and
the migration resume command.
The issue happens when the retry logic tries to cleanup the current
return path file, but ends up cleaning the new one and trying to use
it right after. Tracing shows it clearly:
open_return_path_on_source <-- at migration start
open_return_path_on_source_continue <-- rp thread created
postcopy_pause_incoming
postcopy_pause_fast_load
qemu-system-x86_64: Detected IO failure for postcopy. Migration paused. (incoming)
postcopy_pause_fault_thread
qemu-system-x86_64: Detected IO failure for postcopy. Migration paused. (source)
postcopy_pause_incoming_continued
open_return_path_on_source <-- NOK, too soon
postcopy_pause_continued
postcopy_pause_return_path <-- too late, already operating on the new from_dst_file
postcopy_pause_return_path_continued <-- will continue and crash
postcopy_pause_incoming
qemu-system-x86_64: Detected IO failure for postcopy. Migration paused.
postcopy_pause_incoming_continued
We could solve this by adding some form of synchronization to ensure
that we always do the cleanup before setting up the new file, but I
find it more straight-forward to move the retry logic outside of the
thread by letting it finish and starting a new thread when resuming
the migration.
More details on the commit message.
CI run: https://gitlab.com/farosas/qemu/-/pipelines/947875609
Fabiano Rosas (8):
migration: Fix possible race when setting rp_state.error
migration: Fix possible races when shutting down the return path
migration: Fix possible race when shutting down to_dst_file
migration: Remove redundant cleanup of postcopy_qemufile_src
migration: Consolidate return path closing code
migration: Replace the return path retry logic
migration: Move return path cleanup to main migration thread
migration: Add a wrapper to cleanup migration files
migration/migration.c | 229 +++++++++++++++---------------------------
migration/migration.h | 1 -
2 files changed, 80 insertions(+), 150 deletions(-)
--
2.35.3
next reply other threads:[~2023-08-16 14:25 UTC|newest]
Thread overview: 14+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-08-16 14:25 Fabiano Rosas [this message]
2023-08-16 14:25 ` [PATCH v4 1/8] migration: Fix possible race when setting rp_state.error Fabiano Rosas
2023-08-16 14:25 ` [PATCH v4 2/8] migration: Fix possible races when shutting down the return path Fabiano Rosas
2023-08-16 15:05 ` Peter Xu
2023-08-16 14:25 ` [PATCH v4 3/8] migration: Fix possible race when shutting down to_dst_file Fabiano Rosas
2023-08-16 14:25 ` [PATCH v4 4/8] migration: Remove redundant cleanup of postcopy_qemufile_src Fabiano Rosas
2023-08-16 14:25 ` [PATCH v4 5/8] migration: Consolidate return path closing code Fabiano Rosas
2023-08-16 14:25 ` [PATCH v4 6/8] migration: Replace the return path retry logic Fabiano Rosas
2023-08-16 14:25 ` [PATCH v4 7/8] migration: Move return path cleanup to main migration thread Fabiano Rosas
2023-08-16 14:25 ` [PATCH v4 8/8] migration: Add a wrapper to cleanup migration files Fabiano Rosas
2023-08-16 15:18 ` Peter Xu
[not found] ` <87leealt8h.fsf@suse.de>
[not found] ` <ZN0k/DFQQIeEpgjl@x1n>
2023-08-16 19:37 ` Peter Xu
2023-08-16 21:20 ` Fabiano Rosas
2023-08-16 21:55 ` Peter Xu
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20230816142510.5637-1-farosas@suse.de \
--to=farosas@suse.de \
--cc=peterx@redhat.com \
--cc=qemu-devel@nongnu.org \
--cc=quintela@redhat.com \
--cc=wei.w.wang@intel.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).