From: Fabiano Rosas <farosas@suse.de>
To: Prasad Pandit <ppandit@redhat.com>
Cc: qemu-devel@nongnu.org, peterx@redhat.com, berrange@redhat.com,
Prasad Pandit <pjp@fedoraproject.org>
Subject: Re: [PATCH v8 0/7] Allow to enable multifd and postcopy migration together
Date: Thu, 10 Apr 2025 17:18:04 -0300
Message-ID: <87h62vhitf.fsf@suse.de>
In-Reply-To: <CAE8KmOxi3w5sf_sd04qLxkg8Lys=KA0pMPJMcZ-=CJVGZRmivw@mail.gmail.com>
Prasad Pandit <ppandit@redhat.com> writes:
> Hello Fabiano,
>
> On Thu, 3 Apr 2025 at 18:41, Fabiano Rosas <farosas@suse.de> wrote:
>> Prasad Pandit <ppandit@redhat.com> writes:
>> > * Thank you for the reproducer and traces. I'll try to check more and
>> > see if I'm able to reproduce it on my side.
>>
>> Thanks. I cannot merge this series until that issue is resolved. If it
>> reproduces on my machine there's a high chance that it will break CI at
>> some point and then it'll be a nightmare to debug. This has happened
>> many times before with multifd.
>
> ===
> qemu/build)$ for i in $(seq 1 9999); do echo "$i ====";
> QTEST_QEMU_BINARY=./qemu-system-x86_64 ./tests/qtest/migration-test
> --full -r '/x86_64/migration/postcopy/multifd/plain' || break; done |
> tee /tmp/migration-test.out | awk -e '/====/ { printf ("%s ", $_) };
> /slow test/ { printf("%s\n", $_); }'
>
> Host-1]
> ...
> 9980 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.51 secs
> 9981 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.47 secs
> 9982 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.42 secs
> 9983 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.56 secs
> 9984 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.44 secs
> 9985 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.43 secs
> 9986 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.45 secs
> 9987 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.53 secs
> 9988 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.46 secs
> 9989 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.49 secs
> 9990 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.48 secs
> 9991 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.47 secs
> 9992 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.45 secs
> 9993 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.47 secs
> 9994 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.41 secs
> 9995 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.42 secs
> 9996 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.58 secs
> 9997 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.45 secs
> 9998 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.51 secs
> 9999 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.51 secs
> --------
> Iter: 9999, low: 1.35, high: 1.73, avg: 1.47 secs
>
>
> Host-2]
> ...
> 9980 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.45 secs
> 9981 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.69 secs
> 9982 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.41 secs
> 9983 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.54 secs
> 9984 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.45 secs
> 9985 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.44 secs
> 9986 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.48 secs
> 9987 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.48 secs
> 9988 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.44 secs
> 9989 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.51 secs
> 9990 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.37 secs
> 9991 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.48 secs
> 9992 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.51 secs
> 9993 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.47 secs
> 9994 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.47 secs
> 9995 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.45 secs
> 9996 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.53 secs
> 9997 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.48 secs
> 9998 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.47 secs
> 9999 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.48 secs
> --------
> Iter: 9999, low: 1.34, high: 1.82, avg: 1.48 secs
>
>
> Host-3]
> ...
> 9980 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.50 secs
> 9981 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.55 secs
> 9982 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.54 secs
> 9983 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.49 secs
> 9984 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.49 secs
> 9985 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.52 secs
> 9986 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.48 secs
> 9987 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.52 secs
> 9988 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.54 secs
> 9989 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.51 secs
> 9990 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.51 secs
> 9991 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.50 secs
> 9992 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.53 secs
> 9993 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.50 secs
> 9994 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.53 secs
> 9995 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.49 secs
> 9996 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.48 secs
> 9997 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.54 secs
> 9998 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.44 secs
> 9999 ==== # slow test /x86_64/migration/postcopy/multifd/plain
> executed in 1.54 secs
> --------
> Iter: 9999, low: 1.31, high: 2.49, avg: 1.48 secs
> ===
>
> * I tried to reproduce the hang issue with and without -traces across
> 3 different machines but am unable to reproduce it on my side.
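
For reference, the per-host summary lines (Iter/low/high/avg) can be
recomputed from the saved test output. What follows is only a minimal
sketch: it assumes the log contains "executed in N secs" lines like the
ones quoted above, and the function name summarize() and the inline
sample text are made up for illustration.

```python
import re

# Matches the timing that migration-test prints for a slow test,
# e.g. "executed in 1.51 secs" (format taken from the output quoted above).
EXEC_RE = re.compile(r"executed in (\d+(?:\.\d+)?) secs")


def summarize(text):
    """Return iteration count and low/high/avg run time, or None if no timings."""
    times = [float(m.group(1)) for m in EXEC_RE.finditer(text)]
    if not times:
        return None
    return {
        "iter": len(times),
        "low": min(times),
        "high": max(times),
        "avg": sum(times) / len(times),
    }


# Illustrative sample only; in practice read the file written by tee,
# e.g. open("/tmp/migration-test.out").read().
sample = """\
1 ==== # slow test /x86_64/migration/postcopy/multifd/plain executed in 1.51 secs
2 ==== # slow test /x86_64/migration/postcopy/multifd/plain executed in 1.47 secs
3 ==== # slow test /x86_64/migration/postcopy/multifd/plain executed in 1.42 secs
"""

stats = summarize(sample)
print("Iter: {iter}, low: {low:.2f}, high: {high:.2f}, avg: {avg:.2f} secs".format(**stats))
```

A summary like this only shows that the runs completed; a hang would
show up as a missing final iteration rather than as a high value.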
>
> * Going through the source and the backtrace you provided, you said
> the postcopy_ram_fault_thread() function is stuck in poll(), waiting
> for a missing page.
> - But by this time, postcopy_ram_listen thread is already preparing
> to cleanup and exit
> - That means postcopy migration is finished/ending
> - ie. postcopy migration is ending without (or before) migrating
> all the RAM pages from the source side?
>
That's what it looks like. It could be some error condition that is not
being propagated properly. The thread hits an error and exits without
informing the rest of migration.
> In postcopy mode:
> * Is there a way to log the pages (#numbers) that are sent from the
> source side?
> * And log the pages (#numbers) that are received on the receive side?
>
> * That way we might be able to check/confirm the pages which were not
> received or not processed properly.
Some combination of the postcopy traces should give you that. Sorry,
Peter Xu really is the expert on postcopy, I just tag along.
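
Once you have per-page lines from both sides, the comparison itself is
simple. A minimal sketch, assuming each relevant line carries one page
address as a hex number and that both logs use the same address space
(for example offsets within a RAMBlock, since host virtual addresses
differ between source and destination). The event names and line
formats in the sample are hypothetical, not actual QEMU trace output.

```python
import re

HEX_RE = re.compile(r"0x[0-9a-fA-F]+")


def page_addrs(log_text, page_size=4096):
    """Collect page-aligned addresses from the first hex number on each line."""
    pages = set()
    for line in log_text.splitlines():
        m = HEX_RE.search(line)
        if m:
            addr = int(m.group(0), 16)
            pages.add(addr - addr % page_size)  # round down to page start
    return pages


def missing_pages(sent_log, received_log, page_size=4096):
    """Pages that appear in the sender log but never in the receiver log."""
    sent = page_addrs(sent_log, page_size)
    received = page_addrs(received_log, page_size)
    return sorted(sent - received)


# Hypothetical log lines, for illustration only.
sent = """\
hypothetical_send offset=0x1000
hypothetical_send offset=0x2000
hypothetical_send offset=0x3000
"""
received = """\
hypothetical_recv offset=0x1000
hypothetical_recv offset=0x3000
"""

missing = missing_pages(sent, received)
print([hex(a) for a in missing])
```

The 4096-byte page size is also an assumption; it would need to match
the target page size of the guest being migrated.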
>
> * Can we connect the faulting/missing (HVA=0x7fffd68cc000)
> address/page in postcopy_ram_fault_thread() with the memcpy that the
> main thread seems to be loading via vapic_post_load()? ie. the main
> thread and postcopy_ram_fault_thread() above could be doing unrelated
> things.
>
The snippet I posted shows that it's the same page. The memcpy source
address in $rsi matches the faulting HVA:

  (gdb) x/i $pc
  => 0x7ffff5399d14 <__memcpy_evex_unaligned_erms+86>: rep movsb %ds:(%rsi),%es:(%rdi)
  (gdb) p/x $rsi
  $1 = 0x7fffd68cc000

> * Other than this, I've revised the patch-set as suggested. How do we
> proceed further?
Send your next version and I'll set some time aside to debug this.

Heads-up: I'll be off from 2025/04/18 until 2025/05/05. Peter should
already be back during that time.
>
> Thank you.
> ---
> - Prasad