qemu-devel.nongnu.org archive mirror
From: Fabiano Rosas <farosas@suse.de>
To: Prasad Pandit <ppandit@redhat.com>
Cc: qemu-devel@nongnu.org, peterx@redhat.com, berrange@redhat.com,
	Prasad Pandit <pjp@fedoraproject.org>
Subject: Re: [PATCH v8 0/7] Allow to enable multifd and postcopy migration together
Date: Thu, 10 Apr 2025 17:18:04 -0300
Message-ID: <87h62vhitf.fsf@suse.de>
In-Reply-To: <CAE8KmOxi3w5sf_sd04qLxkg8Lys=KA0pMPJMcZ-=CJVGZRmivw@mail.gmail.com>

Prasad Pandit <ppandit@redhat.com> writes:

> Hello Fabiano,
>
> On Thu, 3 Apr 2025 at 18:41, Fabiano Rosas <farosas@suse.de> wrote:
>> Prasad Pandit <ppandit@redhat.com> writes:
>> > * Thank you for the reproducer and traces. I'll try to check more and
>> > see if I'm able to reproduce it on my side.
>>
>> Thanks. I cannot merge this series until that issue is resolved. If it
>> reproduces on my machine there's a high chance that it will break CI at
>> some point and then it'll be a nightmare to debug. This has happened
>> many times before with multifd.
>
> ===
> qemu/build)$ for i in $(seq 1 9999); do echo "$i ====";
> QTEST_QEMU_BINARY=./qemu-system-x86_64 ./tests/qtest/migration-test
> --full -r '/x86_64/migration/postcopy/multifd/plain' || break; done |
> tee /tmp/migration-test.out | awk -e '/====/ { printf ("%s ", $_) };
> /slow test/ { printf("%s\n", $_); }'
>
> Host-1]
> ...
> 9980 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.51 secs
> 9981 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.47 secs
> 9982 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.42 secs
> 9983 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.56 secs
> 9984 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.44 secs
> 9985 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.43 secs
> 9986 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.45 secs
> 9987 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.53 secs
> 9988 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.46 secs
> 9989 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.49 secs
> 9990 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.48 secs
> 9991 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.47 secs
> 9992 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.45 secs
> 9993 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.47 secs
> 9994 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.41 secs
> 9995 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.42 secs
> 9996 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.58 secs
> 9997 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.45 secs
> 9998 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.51 secs
> 9999 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.51 secs
> --------
> Iter: 9999, low: 1.35, high: 1.73, avg: 1.47 secs
>
>
> Host-2]
> ...
> 9980 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.45 secs
> 9981 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.69 secs
> 9982 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.41 secs
> 9983 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.54 secs
> 9984 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.45 secs
> 9985 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.44 secs
> 9986 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.48 secs
> 9987 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.48 secs
> 9988 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.44 secs
> 9989 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.51 secs
> 9990 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.37 secs
> 9991 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.48 secs
> 9992 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.51 secs
> 9993 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.47 secs
> 9994 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.47 secs
> 9995 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.45 secs
> 9996 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.53 secs
> 9997 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.48 secs
> 9998 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.47 secs
> 9999 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.48 secs
> --------
> Iter: 9999, low: 1.34, high: 1.82, avg: 1.48 secs
>
>
> Host-3]
> ...
> 9980 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.50 secs
> 9981 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.55 secs
> 9982 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.54 secs
> 9983 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.49 secs
> 9984 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.49 secs
> 9985 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.52 secs
> 9986 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.48 secs
> 9987 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.52 secs
> 9988 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.54 secs
> 9989 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.51 secs
> 9990 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.51 secs
> 9991 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.50 secs
> 9992 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.53 secs
> 9993 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.50 secs
> 9994 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.53 secs
> 9995 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.49 secs
> 9996 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.48 secs
> 9997 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.54 secs
> 9998 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.44 secs
> 9999 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.54 secs
> --------
> Iter: 9999, low: 1.31, high: 2.49, avg: 1.48
> ===
>
> * I tried to reproduce the hang issue, with and without traces, across
> three different machines, but was unable to reproduce it on my side.
>
> * Going through the source and the backtrace you provided, you said
> gdb shows the postcopy_ram_fault_thread() function hanging in poll(),
> waiting for a missing page.
>    - But by this time, the postcopy_ram_listen thread is already
> preparing to clean up and exit
>        - That means postcopy migration is finished/ending
>    - ie. postcopy migration is ending without (or before) migrating
> all the RAM pages from the source side?
>

That's what it looks like. It could be some error condition that is not
being propagated properly. The thread hits an error and exits without
informing the rest of migration.
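
As an aside, the per-run "Iter/low/high/avg" summary lines in your logs
can be produced with a short script along these lines. This is a
hypothetical sketch, assuming only the "executed in N.NN secs" line
format shown above:

```python
# Hypothetical sketch: reproduce the "Iter: N, low: X, high: Y, avg: Z"
# summary from the migration-test loop output. Assumes lines like:
#   9999 ==== # slow test /x86_64/migration/postcopy/multifd/plain
#   executed in 1.51 secs
import re

def summarize(lines):
    """Collect per-iteration runtimes and return (count, low, high, avg)."""
    times = [float(m.group(1))
             for line in lines
             if (m := re.search(r"executed in ([0-9.]+) secs", line))]
    if not times:
        return (0, 0.0, 0.0, 0.0)
    return (len(times), min(times), max(times), sum(times) / len(times))

log = [
    "9998 ==== # slow test /x86_64/migration/postcopy/multifd/plain",
    "executed in 1.51 secs",
    "9999 ==== # slow test /x86_64/migration/postcopy/multifd/plain",
    "executed in 1.41 secs",
]
n, low, high, avg = summarize(log)
print(f"Iter: {n}, low: {low:.2f}, high: {high:.2f}, avg: {avg:.2f} secs")
# -> Iter: 2, low: 1.41, high: 1.51, avg: 1.46 secs
```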

> In postcopy mode:
>     * Is there a way to log the pages (#numers) that are sent from the
> source side?
>     * And log the pages (#numbers) that are received on the receive side?
>
> * That way we might be able to check/confirm the pages which were not
> received or not processed properly.

Some combination of the postcopy traces should give you that. Sorry,
Peter Xu really is the expert on postcopy, I just tag along.
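
Once you have trace logs from both sides, cross-checking them could look
roughly like this. The line format here is invented for illustration --
the real postcopy trace events (see migration/trace-events) print
different fields depending on the QEMU version:

```python
# Hypothetical sketch: parse source and destination trace logs for page
# (host) addresses and report pages the source logged but the destination
# never placed. The "addr=0x..." format is an assumption, not the actual
# trace-event output format.
import re

ADDR_RE = re.compile(r"addr=(0x[0-9a-f]+)", re.IGNORECASE)

def page_addrs(lines):
    """Extract the set of page addresses mentioned in a trace log."""
    return {m.group(1).lower() for line in lines for m in ADDR_RE.finditer(line)}

def missing_pages(src_lines, dst_lines):
    """Pages the source logged that the destination did not."""
    return sorted(page_addrs(src_lines) - page_addrs(dst_lines))

src = ["ram_save_page addr=0x7fffd68cc000", "ram_save_page addr=0x7fffd68cd000"]
dst = ["postcopy_place_page addr=0x7fffd68cd000"]
print(missing_pages(src, dst))  # -> ['0x7fffd68cc000']
```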

>
> * Can we connect the faulting/missing (HVA=0x7fffd68cc000)
> address/page in postcopy_ram_fault_thread() with the memcpy that the
> main thread seems to be performing via vapic_post_load()? ie. the main
> thread and postcopy_ram_fault_thread() above could be doing unrelated
> things.
>

The snippet I posted shows that it's the same page:

(gdb) x/i $pc
=> 0x7ffff5399d14 <__memcpy_evex_unaligned_erms+86>:    rep movsb %ds:(%rsi),%es:(%rdi)
(gdb) p/x $rsi
$1 = 0x7fffd68cc000
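
If you want to map a faulting HVA back to a RAM block yourself, a rough
helper could look like the following. The block name and host-address
range below are invented for illustration; the real ranges would come
from e.g. the HMP "info ramblock" output, whose exact format varies:

```python
# Hypothetical helper: given (name, host_start, size) RAM block ranges,
# find which block contains a faulting HVA and its page-aligned offset.
def find_block(hva, blocks, page_size=4096):
    """Return (name, offset_in_block) for the block containing hva, or None."""
    for name, start, size in blocks:
        if start <= hva < start + size:
            offset = (hva - start) & ~(page_size - 1)  # page-align the offset
            return (name, offset)
    return None

blocks = [("pc.ram", 0x7fffd0000000, 0x10000000)]  # invented range
name, off = find_block(0x7fffd68cc123, blocks)
print(f"{name} +0x{off:x}")  # -> pc.ram +0x68cc000
```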

> * Other than this, I've revised the patch-set as suggested. How do we
> proceed further?

Send your next version and I'll set some time aside to debug this.

heads-up: I'll be off from 2025/04/18 until 2025/05/05. Peter should
already be back by then.

>
> Thank you.
> ---
>   - Prasad


Thread overview: 30+ messages
2025-03-18 12:38 [PATCH v8 0/7] Allow to enable multifd and postcopy migration together Prasad Pandit
2025-03-18 12:38 ` [PATCH v8 1/7] migration/multifd: move macros to multifd header Prasad Pandit
2025-03-18 12:38 ` [PATCH v8 2/7] migration: Refactor channel discovery mechanism Prasad Pandit
2025-03-31 15:01   ` Fabiano Rosas
2025-04-03  7:01     ` Prasad Pandit
2025-04-03 12:59       ` Fabiano Rosas
2025-04-04  9:48         ` Prasad Pandit
2025-03-18 12:38 ` [PATCH v8 3/7] migration: enable multifd and postcopy together Prasad Pandit
2025-03-31 15:27   ` Fabiano Rosas
2025-04-03 10:57     ` Prasad Pandit
2025-04-03 13:03       ` Fabiano Rosas
2025-03-18 12:38 ` [PATCH v8 4/7] tests/qtest/migration: consolidate set capabilities Prasad Pandit
2025-03-18 12:38 ` [PATCH v8 5/7] tests/qtest/migration: add postcopy tests with multifd Prasad Pandit
2025-03-18 12:38 ` [PATCH v8 6/7] migration: Add save_postcopy_prepare() savevm handler Prasad Pandit
2025-03-31 15:08   ` Fabiano Rosas
2025-04-03  7:03     ` Prasad Pandit
2025-03-18 12:38 ` [PATCH v8 7/7] migration/ram: Implement save_postcopy_prepare() Prasad Pandit
2025-03-31 15:18   ` Fabiano Rosas
2025-04-03  7:21     ` Prasad Pandit
2025-04-03 13:07       ` Fabiano Rosas
2025-04-04  9:50         ` Prasad Pandit
2025-03-25  9:53 ` [PATCH v8 0/7] Allow to enable multifd and postcopy migration together Prasad Pandit
2025-03-27 14:35   ` Fabiano Rosas
2025-03-27 16:01     ` Prasad Pandit
2025-03-31 20:54 ` Fabiano Rosas
2025-04-03  7:24   ` Prasad Pandit
2025-04-03 13:11     ` Fabiano Rosas
2025-04-10 12:22       ` Prasad Pandit
2025-04-10 20:18         ` Fabiano Rosas [this message]
2025-04-11  7:25           ` Prasad Pandit
