From: Fabiano Rosas <farosas@suse.de>
To: Peter Xu <peterx@redhat.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>,
Juan Quintela <quintela@redhat.com>,
Leonardo Bras <leobras@redhat.com>,
qemu-devel@nongnu.org
Subject: Re: QEMU migration-test CI intermittent failure
Date: Mon, 18 Sep 2023 11:15:57 -0300 [thread overview]
Message-ID: <878r931s9u.fsf@suse.de> (raw)
In-Reply-To: <ZQSMekgtg5kIPl2H@x1n>
Peter Xu <peterx@redhat.com> writes:
> On Fri, Sep 15, 2023 at 12:28:36PM -0400, Peter Xu wrote:
>> On Thu, Sep 14, 2023 at 10:56:23PM -0300, Fabiano Rosas wrote:
>> > Peter Xu <peterx@redhat.com> writes:
>> >
>> > > On Thu, Sep 14, 2023 at 07:54:17PM -0300, Fabiano Rosas wrote:
>> > >> Fabiano Rosas <farosas@suse.de> writes:
>> > >>
>> > >> > Peter Xu <peterx@redhat.com> writes:
>> > >> >
>> > >> >> On Thu, Sep 14, 2023 at 12:57:08PM -0300, Fabiano Rosas wrote:
>> > >> >>> I managed to reproduce it. It's not the return path error. In hindsight
>> > >> >>> that's obvious because that error happens in the 'recovery' test and this
>> > >> >>> one in the 'plain' one. Sorry about the noise.
>> > >> >>
>> > >> >> No worry. It's good to finally identify that.
>> > >> >>
>> > >> >>>
>> > >> >>> This one reproduced with just 4 iterations of preempt/plain. I'll
>> > >> >>> investigate.
>> > >> >
>> > >> > It seems that we're getting a tcp disconnect (ECONNRESET) on when doing
>> > >> > that shutdown() on postcopy_qemufile_src. The one from commit 6621883f93
>> > >> > ("migration: Fix potential race on postcopy_qemufile_src").
>> > >> >
>> > >> > I'm trying to determine why that happens when other times it just
>> > >> > returns 0 as expected.
>> > >> >
>> > >> > Could this mean that we're kicking the dest too soon while it is still
>> > >> > receiving valid data?
>> > >>
>> > >> Looking a bit more into this, what's happening is that
>> > >> postcopy_ram_incoming_cleanup() is shutting the postcopy_qemufile_dst
>> > >> while ram_load_postcopy() is still running.
>> > >>
>> > >> The postcopy_ram_listen_thread() function waits for the
>> > >> main_thread_load_event, but that only works when not using preempt. With
>> > >> the preempt thread, the event is set right away and we proceed to do the
>> > >> cleanup without waiting.
>> > >>
>> > >> So the assumption of commit 6621883f93 that the incoming side knows when
>> > >> it has finished migrating is wrong IMO. Without the EOS we're relying on
>> > >> the chance that the shutdown() happens after the last recvmsg has
>> > >> returned and not during it.
>> > >>
>> > >> Peter, what do you think?
>> > >
>> > > That's a good point.
>> > >
>> > > One thing to verify that (sorry, I still cannot reproduce it myself, which
>> > > is so weirdly... it seems loads won't really help reproducing this) is to
>> > > let the main thread wait for all requested pages to arrive:
>> > >
>> > > diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
>> > > index 29aea9456d..df055c51ea 100644
>> > > --- a/migration/postcopy-ram.c
>> > > +++ b/migration/postcopy-ram.c
>> > > @@ -597,6 +597,12 @@ int postcopy_ram_incoming_cleanup(MigrationIncomingState *mis)
>> > > trace_postcopy_ram_incoming_cleanup_entry();
>> > >
>> > > if (mis->preempt_thread_status == PREEMPT_THREAD_CREATED) {
>> > > + /*
>> > > + * NOTE! it's possible that the preempt thread is still handling
>> > > + * the last pages to arrive which were requested by faults. Making
>> > > + * sure nothing is left behind.
>> > > + */
>> > > + while (qatomic_read(&mis->page_requested_count));
>> > > /* Notify the fast load thread to quit */
>> > > mis->preempt_thread_status = PREEMPT_THREAD_QUIT;
>> > > if (mis->postcopy_qemufile_dst) {
>> > >
>> > > If that can work solidly, we can figure out a better way than a dead loop
>> > > here.
>> >
>> > Yep, 2000+ iterations so far and no error.
>>
>> Definitely good news.. thanks.
>>
>> >
>> > Should we add something that makes ram_load_postcopy return once it's
>> > finished? Then this code could just set PREEMPT_THREAD_QUIT and join the
>> > preempt thread.
>>
>> Yeah it'll be nice to have that. We used to have an EOS message at the end
>> for the preempt channel, but that used to cause another race (which IIRC I
>> could easily reproduce at that time) and we removed it after
>> 6621883f9398bc3 for 7.2+.
>>
>> Now instead of fiddling with the protocol again, we do have a way to detect
>> that, but it may need some trick on counting on page_requested_count and
>> also provide a signal mechanism.
>>
>> I've drafted one solution here and attached here as a complete patch, feel
>> free to try. If you have anything better, that'll be even nicer; just let
>> me know your thoughts.
>>
>> ---8<---
>> From 384aff2264650872e15916dcfeaec593e5e9b781 Mon Sep 17 00:00:00 2001
>> From: Peter Xu <peterx@redhat.com>
>> Date: Fri, 15 Sep 2023 12:11:34 -0400
>> Subject: [PATCH] migration: Fix race that dest preempt thread close too early
>>
>> We hit intermit CI issue on failing at migration-test over the unit test
>> preempt/plain:
>>
>> qemu-system-x86_64: Unable to read from socket: Connection reset by peer
>> Memory content inconsistency at 5b43000 first_byte = bd last_byte = bc current = 4f hit_edge = 1
>> **
>> ERROR:../tests/qtest/migration-test.c:300:check_guests_ram: assertion failed: (bad == 0)
>> (test program exited with status code -6)
>>
>> Fabiano debugged into it and found that the preempt thread can quit even
>> without receiving all the pages, which can cause guest not receiving all
>> the pages and corrupt the guest memory.
>>
>> To make sure preempt thread finished receiving all the pages, we can rely
>> on the page_requested_count being zero because preempt channel will only
>> receive requested page faults. Note, not all the faulted pages are required
>> to be sent via the preempt channel/thread; imagine the case when a
>> requested page is just queued into the background main channel for
>> migration, the src qemu will just still send it via the background channel.
>>
>> Here instead of spinning over reading the count, we add a condvar so the
>> main thread can wait on it if that unusual case happened, without burning
>> the cpu for no good reason, even if the duration is short; so even if we
>> spin in this rare case is probably fine. It's just better to not do so.
>>
>> The condvar is only used when that special case is triggered. Some memory
>> ordering trick is needed to guarantee it from happening (against the
>> preempt thread status field), so the main thread will always get a kick
>> when that triggers correctly.
>>
>> Closes: https://gitlab.com/qemu-project/qemu/-/issues/1886
>> Debugged-by: Fabiano Rosas <farosas@suse.de>
>> Signed-off-by: Peter Xu <peterx@redhat.com>
>> ---
>> migration/migration.h | 13 ++++++++++++-
>> migration/migration.c | 1 +
>> migration/postcopy-ram.c | 35 +++++++++++++++++++++++++++++++++++
>> 3 files changed, 48 insertions(+), 1 deletion(-)
>>
>> diff --git a/migration/migration.h b/migration/migration.h
>> index c390500604..cdaa10d515 100644
>> --- a/migration/migration.h
>> +++ b/migration/migration.h
>> @@ -196,7 +196,10 @@ struct MigrationIncomingState {
>>
>> /* A tree of pages that we requested to the source VM */
>> GTree *page_requested;
>> - /* For debugging purpose only, but would be nice to keep */
>> + /*
>> + * For postcopy only, count the number of requested page faults that
>> + * still haven't been resolved.
>> + */
>> int page_requested_count;
>> /*
>> * The mutex helps to maintain the requested pages that we sent to the
>> @@ -210,6 +213,14 @@ struct MigrationIncomingState {
>> * contains valid information.
>> */
>> QemuMutex page_request_mutex;
>> + /*
>> + * If postcopy preempt is enabled, there is a chance that the main
>> + * thread finished loading its data before the preempt channel has
>> + * finished loading the urgent pages. If that happens, the two threads
>> + * will use this condvar to synchronize, so the main thread will always
>> + * wait until all pages received.
>> + */
>> + QemuCond page_request_cond;
>>
>> /*
>> * Number of devices that have yet to approve switchover. When this reaches
>> diff --git a/migration/migration.c b/migration/migration.c
>> index d61e572742..e86b3f7368 100644
>> --- a/migration/migration.c
>> +++ b/migration/migration.c
>> @@ -153,6 +153,7 @@ void migration_object_init(void)
>> qemu_sem_init(¤t_incoming->postcopy_qemufile_dst_done, 0);
>>
>> qemu_mutex_init(¤t_incoming->page_request_mutex);
>> + qemu_cond_init(¤t_incoming->page_request_cond);
>> current_incoming->page_requested = g_tree_new(page_request_addr_cmp);
>>
>> migration_object_check(current_migration, &error_fatal);
>> diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
>> index 29aea9456d..dbf02cd3ed 100644
>> --- a/migration/postcopy-ram.c
>> +++ b/migration/postcopy-ram.c
>> @@ -599,6 +599,30 @@ int postcopy_ram_incoming_cleanup(MigrationIncomingState *mis)
>> if (mis->preempt_thread_status == PREEMPT_THREAD_CREATED) {
>> /* Notify the fast load thread to quit */
>> mis->preempt_thread_status = PREEMPT_THREAD_QUIT;
>> + /*
>> + * Update preempt_thread_status before reading count. Note: mutex
>> + * lock only provide ACQUIRE semantic, and it doesn't stops this
>> + * write to be reordered after reading the count.
>> + */
>> + smp_mb();
>> + /*
>> + * It's possible that the preempt thread is still handling the last
>> + * pages to arrive which were requested by guest page faults.
>> + * Making sure nothing is left behind by waiting on the condvar if
>> + * that unlikely case happened.
>> + */
>> + WITH_QEMU_LOCK_GUARD(&mis->page_request_mutex) {
>> + if (qatomic_read(&mis->page_requested_count)) {
>> + /*
>> + * It is guaranteed to receive a signal later, because the
>> + * count>0 now, so it's destined to be decreased to zero
>> + * very soon by the preempt thread.
>> + */
>> + qemu_cond_wait(&mis->page_request_cond,
>> + &mis->page_request_mutex);
>> + }
>> + }
>> + /* Notify the fast load thread to quit */
>> if (mis->postcopy_qemufile_dst) {
>> qemu_file_shutdown(mis->postcopy_qemufile_dst);
>> }
>> @@ -1279,6 +1303,17 @@ static int qemu_ufd_copy_ioctl(MigrationIncomingState *mis, void *host_addr,
>> g_tree_remove(mis->page_requested, host_addr);
>> mis->page_requested_count--;
>> trace_postcopy_page_req_del(host_addr, mis->page_requested_count);
>> + /* Order the update of count and read of preempt status */
>> + smp_mb();
>> + if (qatomic_read(&mis->preempt_thread_status) ==
>> + PREEMPT_THREAD_QUIT) {
>> + /*
>> + * This probably means the main thread is waiting for us.
>> + * Notify that we've finished receiving the last requested
>> + * page.
>> + */
>> + qemu_cond_signal(&mis->page_request_cond);
>> + }
>> }
>> qemu_mutex_unlock(&mis->page_request_mutex);
>> mark_postcopy_blocktime_end((uintptr_t)host_addr);
>> --
>> 2.41.0
>
> I just made some changes on top of this patch..
>
> One major thing is to double check we only kick main thread only if it's
> the last page, as I think there can be >1 pages on the fly..
>
> Since at it, I cleaned up a bit on accessing either the preempt status
> field or the counter, to make it always consistent (always not use atomic
> ops for status as it's defined as volatile anyway; meanwhile always use
> qatomic for the counter).
>
> ---8<---
> From bfdd1f872fbeafe26ebad280a6d9e5a88de621de Mon Sep 17 00:00:00 2001
> From: Peter Xu <peterx@redhat.com>
> Date: Fri, 15 Sep 2023 12:45:01 -0400
> Subject: [PATCH] fixup! migration: Fix race that dest preempt thread close too
> early
>
> Three more small changes:
>
> - Always use qatomic_* for page_requested_count accesses
> - only kick main thread when we're sure it's the last page
> - drop qatomic_read() on reading preempt status, to make it match use
> cases elsewhere, not needed when defined as volatile already
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> migration/migration.c | 2 +-
> migration/postcopy-ram.c | 7 ++++---
> 2 files changed, 5 insertions(+), 4 deletions(-)
>
> diff --git a/migration/migration.c b/migration/migration.c
> index e86b3f7368..3ee1e6b0d6 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -368,7 +368,7 @@ int migrate_send_rp_req_pages(MigrationIncomingState *mis,
> * things like g_tree_lookup() will return TRUE (1) when found.
> */
> g_tree_insert(mis->page_requested, aligned, (gpointer)1);
> - mis->page_requested_count++;
> + qatomic_inc(&mis->page_requested_count);
> trace_postcopy_page_req_add(aligned, mis->page_requested_count);
> }
> }
> diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
> index dbf02cd3ed..5408e028c6 100644
> --- a/migration/postcopy-ram.c
> +++ b/migration/postcopy-ram.c
> @@ -1301,12 +1301,13 @@ static int qemu_ufd_copy_ioctl(MigrationIncomingState *mis, void *host_addr,
> */
> if (g_tree_lookup(mis->page_requested, host_addr)) {
> g_tree_remove(mis->page_requested, host_addr);
> - mis->page_requested_count--;
> + int left_pages = qatomic_dec_fetch(&mis->page_requested_count);
> +
> trace_postcopy_page_req_del(host_addr, mis->page_requested_count);
> /* Order the update of count and read of preempt status */
> smp_mb();
> - if (qatomic_read(&mis->preempt_thread_status) ==
> - PREEMPT_THREAD_QUIT) {
> + if (mis->preempt_thread_status == PREEMPT_THREAD_QUIT &&
> + left_pages == 0) {
> /*
> * This probably means the main thread is waiting for us.
> * Notify that we've finished receiving the last requested
> --
> 2.41.0
This fixes the issue. It looks ok to me. Do you want me to send a separate patch with it?
next prev parent reply other threads:[~2023-09-18 14:16 UTC|newest]
Thread overview: 16+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-09-13 19:23 QEMU migration-test CI intermittent failure Stefan Hajnoczi
2023-09-13 19:42 ` Fabiano Rosas
2023-09-13 19:51 ` Stefan Hajnoczi
2023-09-14 14:56 ` Peter Xu
2023-09-14 15:10 ` Fabiano Rosas
2023-09-14 15:35 ` Peter Xu
2023-09-14 15:57 ` Fabiano Rosas
2023-09-14 16:39 ` Peter Xu
2023-09-14 21:13 ` Fabiano Rosas
2023-09-14 22:54 ` Fabiano Rosas
2023-09-14 23:27 ` Peter Xu
2023-09-15 1:56 ` Fabiano Rosas
2023-09-15 16:28 ` Peter Xu
2023-09-15 16:55 ` Peter Xu
2023-09-18 14:15 ` Fabiano Rosas [this message]
2023-09-18 15:35 ` Peter Xu
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=878r931s9u.fsf@suse.de \
--to=farosas@suse.de \
--cc=leobras@redhat.com \
--cc=peterx@redhat.com \
--cc=qemu-devel@nongnu.org \
--cc=quintela@redhat.com \
--cc=stefanha@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.