From: Fabiano Rosas <farosas@suse.de>
To: Peter Xu <peterx@redhat.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>,
Juan Quintela <quintela@redhat.com>,
Leonardo Bras <leobras@redhat.com>,
qemu-devel@nongnu.org
Subject: Re: QEMU migration-test CI intermittent failure
Date: Thu, 14 Sep 2023 22:56:23 -0300 [thread overview]
Message-ID: <87o7i4nqrc.fsf@suse.de> (raw)
In-Reply-To: <ZQOW4BS1ZcDTN7tK@x1n>
Peter Xu <peterx@redhat.com> writes:
> On Thu, Sep 14, 2023 at 07:54:17PM -0300, Fabiano Rosas wrote:
>> Fabiano Rosas <farosas@suse.de> writes:
>>
>> > Peter Xu <peterx@redhat.com> writes:
>> >
>> >> On Thu, Sep 14, 2023 at 12:57:08PM -0300, Fabiano Rosas wrote:
>> >>> I managed to reproduce it. It's not the return path error. In hindsight
>> >>> that's obvious because that error happens in the 'recovery' test and this
>> >>> one in the 'plain' one. Sorry about the noise.
>> >>
>> >> No worry. It's good to finally identify that.
>> >>
>> >>>
>> >>> This one reproduced with just 4 iterations of preempt/plain. I'll
>> >>> investigate.
>> >
>> > It seems that we're getting a tcp disconnect (ECONNRESET) on when doing
>> > that shutdown() on postcopy_qemufile_src. The one from commit 6621883f93
>> > ("migration: Fix potential race on postcopy_qemufile_src").
>> >
>> > I'm trying to determine why that happens when other times it just
>> > returns 0 as expected.
>> >
>> > Could this mean that we're kicking the dest too soon while it is still
>> > receiving valid data?
>>
>> Looking a bit more into this, what's happening is that
>> postcopy_ram_incoming_cleanup() is shutting the postcopy_qemufile_dst
>> while ram_load_postcopy() is still running.
>>
>> The postcopy_ram_listen_thread() function waits for the
>> main_thread_load_event, but that only works when not using preempt. With
>> the preempt thread, the event is set right away and we proceed to do the
>> cleanup without waiting.
>>
>> So the assumption of commit 6621883f93 that the incoming side knows when
>> it has finished migrating is wrong IMO. Without the EOS we're relying on
>> the chance that the shutdown() happens after the last recvmsg has
>> returned and not during it.
>>
>> Peter, what do you think?
>
> That's a good point.
>
> One thing to verify that (sorry, I still cannot reproduce it myself, which
> is so weirdly... it seems loads won't really help reproducing this) is to
> let the main thread wait for all requested pages to arrive:
>
> diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
> index 29aea9456d..df055c51ea 100644
> --- a/migration/postcopy-ram.c
> +++ b/migration/postcopy-ram.c
> @@ -597,6 +597,12 @@ int postcopy_ram_incoming_cleanup(MigrationIncomingState *mis)
> trace_postcopy_ram_incoming_cleanup_entry();
>
> if (mis->preempt_thread_status == PREEMPT_THREAD_CREATED) {
> + /*
> + * NOTE! it's possible that the preempt thread is still handling
> + * the last pages to arrive which were requested by faults. Making
> + * sure nothing is left behind.
> + */
> + while (qatomic_read(&mis->page_requested_count));
> /* Notify the fast load thread to quit */
> mis->preempt_thread_status = PREEMPT_THREAD_QUIT;
> if (mis->postcopy_qemufile_dst) {
>
> If that can work solidly, we can figure out a better way than a dead loop
> here.
Yep, 2000+ iterations so far and no error.
Should we add something that makes ram_load_postcopy return once it's
finished? Then this code could just set PREEMPT_THREAD_QUIT and join the
preempt thread.
next prev parent reply other threads:[~2023-09-15 1:57 UTC|newest]
Thread overview: 16+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-09-13 19:23 QEMU migration-test CI intermittent failure Stefan Hajnoczi
2023-09-13 19:42 ` Fabiano Rosas
2023-09-13 19:51 ` Stefan Hajnoczi
2023-09-14 14:56 ` Peter Xu
2023-09-14 15:10 ` Fabiano Rosas
2023-09-14 15:35 ` Peter Xu
2023-09-14 15:57 ` Fabiano Rosas
2023-09-14 16:39 ` Peter Xu
2023-09-14 21:13 ` Fabiano Rosas
2023-09-14 22:54 ` Fabiano Rosas
2023-09-14 23:27 ` Peter Xu
2023-09-15 1:56 ` Fabiano Rosas [this message]
2023-09-15 16:28 ` Peter Xu
2023-09-15 16:55 ` Peter Xu
2023-09-18 14:15 ` Fabiano Rosas
2023-09-18 15:35 ` Peter Xu
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=87o7i4nqrc.fsf@suse.de \
--to=farosas@suse.de \
--cc=leobras@redhat.com \
--cc=peterx@redhat.com \
--cc=qemu-devel@nongnu.org \
--cc=quintela@redhat.com \
--cc=stefanha@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.