qemu-devel.nongnu.org archive mirror
From: Fabiano Rosas <farosas@suse.de>
To: Prasad Pandit <ppandit@redhat.com>
Cc: qemu-devel@nongnu.org, peterx@redhat.com, berrange@redhat.com,
	Prasad Pandit <pjp@fedoraproject.org>
Subject: Re: [PATCH v8 0/7] Allow to enable multifd and postcopy migration together
Date: Thu, 03 Apr 2025 10:11:00 -0300
Message-ID: <87zfgxjspn.fsf@suse.de>
In-Reply-To: <CAE8KmOyS+nPexU_NbF0yhK_=ubnGgKs5Lv+j7bH=xowgqQ2zkA@mail.gmail.com>

Prasad Pandit <ppandit@redhat.com> writes:

> On Tue, 1 Apr 2025 at 02:24, Fabiano Rosas <farosas@suse.de> wrote:
>> The postcopy/multifd/plain test is still hanging from time to time. I
>> see a vmstate load function trying to access guest memory and the
>> postcopy-listen thread already finished, waiting for that
>> qemu_loadvm_state() (frame #18) to return and set the
>> main_thread_load_event.
>>
>> Thread 1 (Thread 0x7fbc4849df80 (LWP 7487) "qemu-system-x86"):
>> #0  __memcpy_evex_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:274
>> #1  0x0000560b135103aa in flatview_read_continue_step (attrs=..., buf=0x560b168a5930 "U\252\022\006\016\a1\300\271", len=9216, mr_addr=831488, l=0x7fbc465ff980, mr=0x560b166c5070) at ../system/physmem.c:3056
>> #2  0x0000560b1351042e in flatview_read_continue (fv=0x560b16c606a0, addr=831488, attrs=..., ptr=0x560b168a5930, len=9216, mr_addr=831488, l=9216, mr=0x560b166c5070) at ../system/physmem.c:3073
>> #3  0x0000560b13510533 in flatview_read (fv=0x560b16c606a0, addr=831488, attrs=..., buf=0x560b168a5930, len=9216) at ../system/physmem.c:3103
>> #4  0x0000560b135105be in address_space_read_full (as=0x560b14970fc0 <address_space_memory>, addr=831488, attrs=..., buf=0x560b168a5930, len=9216) at ../system/physmem.c:3116
>> #5  0x0000560b135106e7 in address_space_rw (as=0x560b14970fc0 <address_space_memory>, addr=831488, attrs=..., buf=0x560b168a5930, len=9216, is_write=false) at ../system/physmem.c:3144
>> #6  0x0000560b13510848 in cpu_physical_memory_rw (addr=831488, buf=0x560b168a5930, len=9216, is_write=false) at ../system/physmem.c:3170
>> #7  0x0000560b1338f5a5 in cpu_physical_memory_read (addr=831488, buf=0x560b168a5930, len=9216) at qemu/include/exec/cpu-common.h:148
>> #8  0x0000560b1339063c in patch_hypercalls (s=0x560b168840c0) at ../hw/i386/vapic.c:547
>> #9  0x0000560b1339096d in vapic_prepare (s=0x560b168840c0) at ../hw/i386/vapic.c:629
>> #10 0x0000560b13390e8b in vapic_post_load (opaque=0x560b168840c0, version_id=1) at ../hw/i386/vapic.c:789
>> #11 0x0000560b135b4924 in vmstate_load_state (f=0x560b16c53400, vmsd=0x560b147c6cc0 <vmstate_vapic>, opaque=0x560b168840c0, version_id=1) at ../migration/vmstate.c:234
>> #12 0x0000560b132a15b8 in vmstate_load (f=0x560b16c53400, se=0x560b16893390) at ../migration/savevm.c:972
>> #13 0x0000560b132a4f28 in qemu_loadvm_section_start_full (f=0x560b16c53400, type=4 '\004') at ../migration/savevm.c:2746
>> #14 0x0000560b132a5ae8 in qemu_loadvm_state_main (f=0x560b16c53400, mis=0x560b16877f20) at ../migration/savevm.c:3058
>> #15 0x0000560b132a45d0 in loadvm_handle_cmd_packaged (mis=0x560b16877f20) at ../migration/savevm.c:2451
>> #16 0x0000560b132a4b36 in loadvm_process_command (f=0x560b168c3b60) at ../migration/savevm.c:2614
>> #17 0x0000560b132a5b96 in qemu_loadvm_state_main (f=0x560b168c3b60, mis=0x560b16877f20) at ../migration/savevm.c:3073
>> #18 0x0000560b132a5db7 in qemu_loadvm_state (f=0x560b168c3b60) at ../migration/savevm.c:3150
>> #19 0x0000560b13286271 in process_incoming_migration_co (opaque=0x0) at ../migration/migration.c:892
>> #20 0x0000560b137cb6d4 in coroutine_trampoline (i0=377836416, i1=22027) at ../util/coroutine-ucontext.c:175
>> #21 0x00007fbc4786a79e in ??? () at ../sysdeps/unix/sysv/linux/x86_64/__start_context.S:103
>>
>>
>> Thread 10 (Thread 0x7fffce7fc700 (LWP 11778) "mig/dst/listen"):
>> #0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
>> #1  0x000055555614e33f in qemu_futex_wait (f=0x5555576f6fc0, val=4294967295) at qemu/include/qemu/futex.h:29
>> #2  0x000055555614e505 in qemu_event_wait (ev=0x5555576f6fc0) at ../util/qemu-thread-posix.c:464
>> #3  0x0000555555c44eb1 in postcopy_ram_listen_thread (opaque=0x5555576f6f20) at ../migration/savevm.c:2135
>> #4  0x000055555614e6b8 in qemu_thread_start (args=0x5555582c8480) at ../util/qemu-thread-posix.c:541
>> #5  0x00007ffff72626ea in start_thread (arg=0x7fffce7fc700) at pthread_create.c:477
>> #6  0x00007ffff532158f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
>>
>> Thread 9 (Thread 0x7fffceffd700 (LWP 11777) "mig/dst/fault"):
>> #0  0x00007ffff5314a89 in __GI___poll (fds=0x7fffc0000b60, nfds=2, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
>> #1  0x0000555555c3be3f in postcopy_ram_fault_thread (opaque=0x5555576f6f20) at ../migration/postcopy-ram.c:999
>> #2  0x000055555614e6b8 in qemu_thread_start (args=0x555557735be0) at ../util/qemu-thread-posix.c:541
>> #3  0x00007ffff72626ea in start_thread (arg=0x7fffceffd700) at pthread_create.c:477
>> #4  0x00007ffff532158f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
>>
>> Breaking with gdb and stepping through the memcpy code generates a
>> request for a page that's seemingly already in the receivedmap:
>>
>> (gdb) x/i $pc
>> => 0x7ffff5399d14 <__memcpy_evex_unaligned_erms+86>:    rep movsb %ds:(%rsi),%es:(%rdi)
>> (gdb) p/x $rsi
>> $1 = 0x7fffd68cc000
>> (gdb) si
>> postcopy_ram_fault_thread_request Request for HVA=0x7fffd68cc000 rb=pc.ram offset=0xcc000 pid=11754
>> // these are my printfs:
>> postcopy_request_page:
>> migrate_send_rp_req_pages:
>> migrate_send_rp_req_pages: mutex
>> migrate_send_rp_req_pages: received
>>
>> // gdb hangs here, it looks like the page wasn't populated?
>>
>> I've had my share of postcopy for the day. Hopefully you'll be able to
>> figure out what the issue is.
>>
>> - reproducer (2nd iter already hangs for me):
>>
>> $ for i in $(seq 1 9999); do echo "$i ============="; \
>> QTEST_QEMU_BINARY=./qemu-system-x86_64 ./tests/qtest/migration-test \
>> --full -r /x86_64/migration/postcopy/multifd/plain || break ; done
>>
>> - reproducer with traces and gdb:
>>
>> $ for i in $(seq 1 9999); do echo "$i ============="; \
>> QTEST_TRACE="multifd_* -trace source_* -trace postcopy_* -trace savevm_* \
>> -trace loadvm_*" QTEST_QEMU_BINARY_DST='gdb --ex "handle SIGUSR1 \
>> noprint" --ex "run" --args ./qemu-system-x86_64' \
>> QTEST_QEMU_BINARY=./qemu-system-x86_64 ./tests/qtest/migration-test \
>> --full -r /x86_64/migration/postcopy/multifd/plain || break ; done
>
> * Thank you for the reproducer and traces. I'll try to check more and
> see if I'm able to reproduce it on my side.
>

Thanks. I cannot merge this series until that issue is resolved. If it
reproduces on my machine, there's a high chance that it will break CI at
some point, and then it'll be a nightmare to debug. This has happened
many times before with multifd.
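
Since the failure mode here is a hang rather than a crash, a loop that
treats a stuck iteration as a failure may make the reproducer easier to
leave unattended. The following is only a minimal sketch, assuming GNU
coreutils timeout(1) is available; the helper name repeat_with_timeout
and the 120-second limit in the usage line are hypothetical and not part
of the QEMU tree:

```shell
#!/bin/sh
# Hypothetical helper (not part of QEMU): run a command repeatedly and
# stop at the first failure, counting a hang as a failure.
# Usage: repeat_with_timeout MAX_ITERS TIMEOUT_SECS cmd [args...]
repeat_with_timeout() {
    max=$1
    limit=$2
    shift 2
    i=1
    while [ "$i" -le "$max" ]; do
        echo "$i ============="
        rc=0
        # timeout(1) exits with status 124 when the time limit is hit.
        timeout "$limit" "$@" || rc=$?
        if [ "$rc" -ne 0 ]; then
            echo "iteration $i failed (rc=$rc)" >&2
            return "$rc"
        fi
        i=$((i + 1))
    done
    return 0
}
```

It could be used with the same test invocation as above, for example:

  repeat_with_timeout 9999 120 env \
    QTEST_QEMU_BINARY=./qemu-system-x86_64 ./tests/qtest/migration-test \
    --full -r /x86_64/migration/postcopy/multifd/plain

A return status of 124 would then point at a hang rather than a test
assertion. Note that timeout only signals the process it started, so
QEMU processes spawned by the test may need to be cleaned up separately.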

> Thank you.
> ---
>   - Prasad

