From: Peter Xu <peterx@redhat.com>
To: Fabiano Rosas <farosas@suse.de>
Cc: qemu-devel@nongnu.org,
Leonardo Bras Soares Passos <lsoaresp@redhat.com>,
Juan Quintela <quintela@redhat.com>,
Lukas Straub <lukasstraub2@web.de>,
Laszlo Ersek <lersek@redhat.com>
Subject: Re: [PATCH v2 0/7] migration: Better error handling in return path thread
Date: Wed, 26 Jul 2023 12:36:23 -0400 [thread overview]
Message-ID: <ZMFLh7I62ckbDHaF@x1n> (raw)
In-Reply-To: <87r0ovonmt.fsf@suse.de>
Hi, Fabiano,
Sorry to be late on responding.
On Tue, Jul 25, 2023 at 03:24:26PM -0300, Fabiano Rosas wrote:
> Fabiano Rosas <farosas@suse.de> writes:
>
> > Peter Xu <peterx@redhat.com> writes:
> >
> >> v2:
> >> - Patch "migration: Provide explicit error message for file shutdowns"
> >> - Touched up qapi doc [Fabiano]
> >> - Added Bugzilla link to commit which I didn't even notice that I was
> >> fixing a bug.. but rightfully pointed out by Laszlo.
> >> - Moved it to the 1st patch because it fixes a bug, please consider
> >> review and merge it earlier.
> >>
> >> This is a small series that reworks error handling of postcopy return path
> >> threads.
> >>
> >> We used to contain a bunch of error_report(), converting them into
> >> error_setg() properly and deliver any of those errors to migration generic
> >> error reports (via migrate_set_error()). Then these errors can also be
> >> observed in query-migrate after postcopy is paused.
> >>
> >> Dropped the return-path specific error reporting: mark_source_rp_bad(),
> >> because it's a duplication if we can always use migrate_set_error().
> >>
> >> Please have a look, thanks.
> >>
> >> Peter Xu (7):
> >> migration: Display error in query-migrate irrelevant of status
> >> migration: Let migrate_set_error() take ownership
> >> migration: Introduce migrate_has_error()
> >> migration: Refactor error handling in source return path
> >> migration: Deliver return path file error to migrate state too
> >> qemufile: Always return a verbose error
> >> migration: Provide explicit error message for file shutdowns
> >>
> >> qapi/migration.json | 5 +-
> >> migration/migration.h | 8 +-
> >> migration/ram.h | 5 +-
> >> migration/channel.c | 1 -
> >> migration/migration.c | 168 +++++++++++++++++++++++----------------
> >> migration/multifd.c | 10 +--
> >> migration/postcopy-ram.c | 1 -
> >> migration/qemu-file.c | 20 ++++-
> >> migration/ram.c | 42 +++++-----
> >> migration/trace-events | 2 +-
> >> 10 files changed, 149 insertions(+), 113 deletions(-)
> >
> > Hi Peter,
> >
> > Were you aiming at solving any specific bug with this series?
Nop. I simply noticed that the error in return path cannot be proxied to
migration object and thought maybe we should do that.
Thanks for looing into this problem.
> > I'm seeing
> > a bug on master (361d5397355) with the
> > /x86_64/migration/postcopy/preempt/recovery/plain test around the areas
> > that this series touches.
> >
> > It happens very rarely and I'm still investigating, but in case you have
> > any thoughts:
> >
> > ====
> > It seems there's a race condition between postcopy resume and the return
> > path cleanup.
> >
> > It is possible for open_return_path_on_source() to setup the new
> > QEMUFile *before* the cleanup path at source_return_path_thread() has
> > had a chance to run, so we end up calling migration_release_dst_files()
> > on the new file and ms->rp_state.from_dst_file gets set to NULL again,
> > leading to a SIGSEGV at qemu_file_get_error(rp) due to rp being NULL.
>
> I did some more digging and this is indeed what happens. When we pause
> on the incoming side, the to_src_file is closed and the source return
> path sees an error (EBADFD) which leads to the cleanup (from_dst_file =
> NULL). This happens independently and without any synchronization with a
> potential concurrent resume operation.
>
> Is there a reason for not closing the return path thread and starting a
> new one for resume?
Not anything special. When I initially worked on that (quite a few years
ago) I thought it would be the simplest we keep hold as much things as
possible, including the threads. But maybe it's not too hard either to
just reinitiate the thread when resume indeed.
> The from_dst_file is the only thing being changed
> anyway. It would allow us to remove the retry logic along with the
> problematic cleanup path and not need another synchronization point
> between qmp_migrate() and the return path.
I'm fine if you want to remove the return path thread when a pause happens,
as long as we can cleanly do that; if you already drafted something and
looks all good from your side, happy to read it. We may somewhere need
another call to await_return_path_close_on_source() when pause from
migration thread.
The other way is the main thread can stupidly wait for the two files to be
released, namely, from_dst_file and postcopy_qemufile_src.
>
> Here's the race (important bit is open_return_path happening before
> migration_release_dst_files):
>
> migration | qmp | return path
> --------------------------+-----------------------------+---------------------------------
> qmp_migrate_pause()
> shutdown(ms->to_dst_file)
> f->last_error = -EIO
> migrate_detect_error()
> postcopy_pause()
> set_state(PAUSED)
> wait(postcopy_pause_sem)
> qmp_migrate(resume)
> migrate_fd_connect()
> resume = state == PAUSED
> open_return_path <-- TOO SOON!
> set_state(RECOVER)
> post(postcopy_pause_sem)
> (incoming closes to_src_file)
> res = qemu_file_get_error(rp)
> migration_release_dst_files()
> ms->rp_state.from_dst_file = NULL
> post(postcopy_pause_rp_sem)
> postcopy_pause_return_path_thread()
> wait(postcopy_pause_rp_sem)
> rp = ms->rp_state.from_dst_file
> goto retry
> qemu_file_get_error(rp)
> SIGSEGV
> -------------------------------------------------------------------------------------------
>
Thanks,
--
Peter Xu
prev parent reply other threads:[~2023-07-26 16:39 UTC|newest]
Thread overview: 23+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-07-05 16:34 [PATCH v2 0/7] migration: Better error handling in return path thread Peter Xu
2023-07-05 16:34 ` [PATCH v2 1/7] migration: Display error in query-migrate irrelevant of status Peter Xu
2023-07-05 16:34 ` [PATCH v2 2/7] migration: Let migrate_set_error() take ownership Peter Xu
2023-07-05 20:53 ` Fabiano Rosas
2023-07-05 16:34 ` [PATCH v2 3/7] migration: Introduce migrate_has_error() Peter Xu
2023-07-05 20:55 ` Fabiano Rosas
2023-07-05 16:34 ` [PATCH v2 4/7] migration: Refactor error handling in source return path Peter Xu
2023-07-05 16:35 ` [PATCH v2 5/7] migration: Deliver return path file error to migrate state too Peter Xu
2023-07-05 16:35 ` [PATCH v2 6/7] qemufile: Always return a verbose error Peter Xu
2023-07-05 21:54 ` Fabiano Rosas
2023-07-05 22:24 ` Peter Xu
2023-07-06 13:56 ` Fabiano Rosas
2023-07-05 16:35 ` [PATCH v2 7/7] migration: Provide explicit error message for file shutdowns Peter Xu
2023-07-05 22:05 ` Fabiano Rosas
2023-07-05 22:34 ` Peter Xu
2023-07-06 13:50 ` Fabiano Rosas
2023-07-06 16:27 ` Peter Xu
2023-07-06 17:33 ` Fabiano Rosas
2023-07-06 18:08 ` Peter Xu
2023-07-06 18:47 ` Fabiano Rosas
2023-07-24 14:11 ` [PATCH v2 0/7] migration: Better error handling in return path thread Fabiano Rosas
2023-07-25 18:24 ` Fabiano Rosas
2023-07-26 16:36 ` Peter Xu [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ZMFLh7I62ckbDHaF@x1n \
--to=peterx@redhat.com \
--cc=farosas@suse.de \
--cc=lersek@redhat.com \
--cc=lsoaresp@redhat.com \
--cc=lukasstraub2@web.de \
--cc=qemu-devel@nongnu.org \
--cc=quintela@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).