qemu-devel.nongnu.org archive mirror
* [PATCH v2 0/2] Fix segfault on migration return path
@ 2023-08-02 14:36 Fabiano Rosas
From: Fabiano Rosas @ 2023-08-02 14:36 UTC
  To: qemu-devel; +Cc: Juan Quintela, Peter Xu, Wei Wang

For this version:

- moved the await into postcopy_pause() as Peter suggested;

- brought back the mark_source_rp_bad call. Turns out that piece of
code is filled with nuance. I just moved it aside since it doesn't
make sense during pause/resume. We can tackle that when we get the
chance.

CI run: https://gitlab.com/farosas/qemu/-/pipelines/953420150
Also ran the switchover and preempt tests 1000 times each on
x86_64.

v1:
https://lore.kernel.org/r/20230728121516.16258-1-farosas@suse.de

The /x86_64/migration/postcopy/preempt/recovery/plain test is
sometimes failing due to a segmentation fault on the migration return
path. There is a race involving the retry logic of the return path and
the migration resume command.

The issue happens when the retry logic tries to clean up the current
return path file, but ends up cleaning the new one and trying to use
it right after. Tracing shows it clearly:

open_return_path_on_source  <-- at migration start
open_return_path_on_source_continue <-- rp thread created
postcopy_pause_incoming
postcopy_pause_fast_load
qemu-system-x86_64: Detected IO failure for postcopy. Migration paused. (incoming)
postcopy_pause_fault_thread
qemu-system-x86_64: Detected IO failure for postcopy. Migration paused. (source)
postcopy_pause_incoming_continued
open_return_path_on_source   <-- NOK, too soon
postcopy_pause_continued
postcopy_pause_return_path   <-- too late, already operating on the new from_dst_file
postcopy_pause_return_path_continued <-- will continue and crash
postcopy_pause_incoming
qemu-system-x86_64: Detected IO failure for postcopy. Migration paused.
postcopy_pause_incoming_continued

We could solve this by adding some form of synchronization to ensure
that we always do the cleanup before setting up the new file, but I
find it more straightforward to move the retry logic outside of the
thread by letting it finish and starting a new thread when resuming
the migration.
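
A minimal sketch of that second approach, for illustration only. This is
not the code from the patches: every name here (SketchChannel,
SketchSource, sketch_open_rp, sketch_close_rp, sketch_run) is
hypothetical, and the "broken" flag merely simulates an I/O failure. The
point is the ordering: the thread never retries, the pause side joins it
before freeing the old file, and only then does resume install a new
file and start a fresh thread.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for the return path file (from_dst_file). */
typedef struct {
    int id;
    bool broken;            /* simulates an I/O failure on this channel */
} SketchChannel;

typedef struct {
    SketchChannel *from_dst;
    pthread_t rp_thread;
    bool rp_thread_created;
    bool rp_error;
} SketchSource;

/* No retry loop in the thread: on failure it records the error and exits. */
static void *sketch_rp_thread(void *opaque)
{
    SketchSource *s = opaque;

    if (s->from_dst->broken) {
        s->rp_error = true;
    }
    return NULL;
}

/*
 * Pause side: join the thread first, then release the file it was using.
 * Returns true if the thread reported an error.
 */
static bool sketch_close_rp(SketchSource *s)
{
    if (s->rp_thread_created) {
        pthread_join(s->rp_thread, NULL);
        s->rp_thread_created = false;
    }
    free(s->from_dst);
    s->from_dst = NULL;
    return s->rp_error;
}

/* Start/resume side: install a new file and start a fresh thread. */
static int sketch_open_rp(SketchSource *s, int id, bool broken)
{
    /* The previous thread and file must already be gone. */
    assert(!s->rp_thread_created && s->from_dst == NULL);

    s->from_dst = malloc(sizeof(*s->from_dst));
    if (!s->from_dst) {
        return -1;
    }
    s->from_dst->id = id;
    s->from_dst->broken = broken;
    s->rp_error = false;

    if (pthread_create(&s->rp_thread, NULL, sketch_rp_thread, s) != 0) {
        free(s->from_dst);
        s->from_dst = NULL;
        return -1;
    }
    s->rp_thread_created = true;
    return 0;
}

/* Start, fail, pause, resume, finish. Returns 0 if the sequence holds. */
int sketch_run(void)
{
    SketchSource s = { 0 };

    if (sketch_open_rp(&s, 1, true) != 0) {
        return -1;
    }
    if (!sketch_close_rp(&s)) {         /* pause: error expected */
        return -1;
    }
    if (sketch_open_rp(&s, 2, false) != 0) {    /* resume: new thread */
        return -1;
    }
    if (sketch_close_rp(&s)) {          /* clean shutdown: no error */
        return -1;
    }
    return 0;
}
```

Built with -pthread and called from a main(), sketch_run() should return
0. Because the join happens before the free, the old thread can never
observe the new file, which is the window the trace above shows being
hit.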

More details in the commit message.

CI run: https://gitlab.com/farosas/qemu/-/pipelines/947875609

Fabiano Rosas (2):
  migration: Split await_return_path_close_on_source
  migration: Replace the return path retry logic

 migration/migration.c  | 110 ++++++++++++++++-------------------------
 migration/migration.h  |   1 -
 migration/trace-events |   1 +
 3 files changed, 44 insertions(+), 68 deletions(-)

-- 
2.35.3



Thread overview: 14+ messages
2023-08-02 14:36 [PATCH v2 0/2] Fix segfault on migration return path Fabiano Rosas
2023-08-02 14:36 ` [PATCH v2 1/2] migration: Split await_return_path_close_on_source Fabiano Rosas
2023-08-02 16:19   ` Peter Xu
2023-08-02 19:58     ` Fabiano Rosas
2023-08-02 20:40       ` Peter Xu
2023-08-03 14:45         ` Fabiano Rosas
2023-08-03 15:15           ` Peter Xu
2023-08-03 15:24             ` Daniel P. Berrangé
2023-08-03 15:39               ` Peter Xu
2023-08-02 14:36 ` [PATCH v2 2/2] migration: Replace the return path retry logic Fabiano Rosas
2023-08-02 16:02   ` Peter Xu
2023-08-02 20:04     ` Fabiano Rosas
2023-08-02 20:44       ` Peter Xu
2023-08-03 15:00         ` Fabiano Rosas
