All of lore.kernel.org
 help / color / mirror / Atom feed
From: Fabiano Rosas <farosas@suse.de>
To: Prasad Pandit <ppandit@redhat.com>
Cc: qemu-devel@nongnu.org, peterx@redhat.com, berrange@redhat.com,
	Prasad Pandit <pjp@fedoraproject.org>
Subject: Re: [PATCH v8 0/7] Allow to enable multifd and postcopy migration together
Date: Thu, 03 Apr 2025 10:11:00 -0300	[thread overview]
Message-ID: <87zfgxjspn.fsf@suse.de> (raw)
In-Reply-To: <CAE8KmOyS+nPexU_NbF0yhK_=ubnGgKs5Lv+j7bH=xowgqQ2zkA@mail.gmail.com>

Prasad Pandit <ppandit@redhat.com> writes:

> On Tue, 1 Apr 2025 at 02:24, Fabiano Rosas <farosas@suse.de> wrote:
>> The postcopy/multifd/plain test is still hanging from time to time. I
>> see a vmstate load function trying to access guest memory and the
>> postcopy-listen thread already finished, waiting for that
>> qemu_loadvm_state() (frame #18) to return and set the
>> main_thread_load_event.
>>
>> Thread 1 (Thread 0x7fbc4849df80 (LWP 7487) "qemu-system-x86"):
>> #0  __memcpy_evex_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:274
>> #1  0x0000560b135103aa in flatview_read_continue_step (attrs=..., buf=0x560b168a5930 "U\252\022\006\016\a1\300\271", len=9216, mr_addr=831488, l=0x7fbc465ff980, mr=0x560b166c5070) at ../system/physmem.c:3056
>> #2  0x0000560b1351042e in flatview_read_continue (fv=0x560b16c606a0, addr=831488, attrs=..., ptr=0x560b168a5930, len=9216, mr_addr=831488, l=9216, mr=0x560b166c5070) at ../system/physmem.c:3073
>> #3  0x0000560b13510533 in flatview_read (fv=0x560b16c606a0, addr=831488, attrs=..., buf=0x560b168a5930, len=9216) at ../system/physmem.c:3103
>> #4  0x0000560b135105be in address_space_read_full (as=0x560b14970fc0 <address_space_memory>, addr=831488, attrs=..., buf=0x560b168a5930, len=9216) at ../system/physmem.c:3116
>> #5  0x0000560b135106e7 in address_space_rw (as=0x560b14970fc0 <address_space_memory>, addr=831488, attrs=..., buf=0x560b168a5930, len=9216, is_write=false) at ../system/physmem.c:3144
>> #6  0x0000560b13510848 in cpu_physical_memory_rw (addr=831488, buf=0x560b168a5930, len=9216, is_write=false) at ../system/physmem.c:3170
>> #7  0x0000560b1338f5a5 in cpu_physical_memory_read (addr=831488, buf=0x560b168a5930, len=9216) at qemu/include/exec/cpu-common.h:148
>> #8  0x0000560b1339063c in patch_hypercalls (s=0x560b168840c0) at ../hw/i386/vapic.c:547
>> #9  0x0000560b1339096d in vapic_prepare (s=0x560b168840c0) at ../hw/i386/vapic.c:629
>> #10 0x0000560b13390e8b in vapic_post_load (opaque=0x560b168840c0, version_id=1) at ../hw/i386/vapic.c:789
>> #11 0x0000560b135b4924 in vmstate_load_state (f=0x560b16c53400, vmsd=0x560b147c6cc0 <vmstate_vapic>, opaque=0x560b168840c0, version_id=1) at ../migration/vmstate.c:234
>> #12 0x0000560b132a15b8 in vmstate_load (f=0x560b16c53400, se=0x560b16893390) at ../migration/savevm.c:972
>> #13 0x0000560b132a4f28 in qemu_loadvm_section_start_full (f=0x560b16c53400, type=4 '\004') at ../migration/savevm.c:2746
>> #14 0x0000560b132a5ae8 in qemu_loadvm_state_main (f=0x560b16c53400, mis=0x560b16877f20) at ../migration/savevm.c:3058
>> #15 0x0000560b132a45d0 in loadvm_handle_cmd_packaged (mis=0x560b16877f20) at ../migration/savevm.c:2451
>> #16 0x0000560b132a4b36 in loadvm_process_command (f=0x560b168c3b60) at ../migration/savevm.c:2614
>> #17 0x0000560b132a5b96 in qemu_loadvm_state_main (f=0x560b168c3b60, mis=0x560b16877f20) at ../migration/savevm.c:3073
>> #18 0x0000560b132a5db7 in qemu_loadvm_state (f=0x560b168c3b60) at ../migration/savevm.c:3150
>> #19 0x0000560b13286271 in process_incoming_migration_co (opaque=0x0) at ../migration/migration.c:892
>> #20 0x0000560b137cb6d4 in coroutine_trampoline (i0=377836416, i1=22027) at ../util/coroutine-ucontext.c:175
>> #21 0x00007fbc4786a79e in ??? () at ../sysdeps/unix/sysv/linux/x86_64/__start_context.S:103
>>
>>
>> Thread 10 (Thread 0x7fffce7fc700 (LWP 11778) "mig/dst/listen"):
>> #0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
>> #1  0x000055555614e33f in qemu_futex_wait (f=0x5555576f6fc0, val=4294967295) at qemu/include/qemu/futex.h:29
>> #2  0x000055555614e505 in qemu_event_wait (ev=0x5555576f6fc0) at ../util/qemu-thread-posix.c:464
>> #3  0x0000555555c44eb1 in postcopy_ram_listen_thread (opaque=0x5555576f6f20) at ../migration/savevm.c:2135
>> #4  0x000055555614e6b8 in qemu_thread_start (args=0x5555582c8480) at ../util/qemu-thread-posix.c:541
>> #5  0x00007ffff72626ea in start_thread (arg=0x7fffce7fc700) at pthread_create.c:477
>> #6  0x00007ffff532158f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
>>
>> Thread 9 (Thread 0x7fffceffd700 (LWP 11777) "mig/dst/fault"):
>> #0  0x00007ffff5314a89 in __GI___poll (fds=0x7fffc0000b60, nfds=2, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
>> #1  0x0000555555c3be3f in postcopy_ram_fault_thread (opaque=0x5555576f6f20) at ../migration/postcopy-ram.c:999
>> #2  0x000055555614e6b8 in qemu_thread_start (args=0x555557735be0) at ../util/qemu-thread-posix.c:541
>> #3  0x00007ffff72626ea in start_thread (arg=0x7fffceffd700) at pthread_create.c:477
>> #4  0x00007ffff532158f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
>>
>> Breaking with gdb and stepping through the memcpy code generates a
>> request for a page that's seemingly already in the receivedmap:
>>
>> (gdb) x/i $pc
>> => 0x7ffff5399d14 <__memcpy_evex_unaligned_erms+86>:    rep movsb %ds:(%rsi),%es:(%rdi)
>> (gdb) p/x $rsi
>> $1 = 0x7fffd68cc000
>> (gdb) si
>> postcopy_ram_fault_thread_request Request for HVA=0x7fffd68cc000 rb=pc.ram offset=0xcc000 pid=11754
>> // these are my printfs:
>> postcopy_request_page:
>> migrate_send_rp_req_pages:
>> migrate_send_rp_req_pages: mutex
>> migrate_send_rp_req_pages: received
>>
>> // gdb hangs here, it looks like the page wasn't populated?
>>
>> I've had my share of postcopy for the day. Hopefully you'll be able to
>> figure out what the issue is.
>>
>> - reproducer (2nd iter already hangs for me):
>>
>> $ for i in $(seq 1 9999); do echo "$i ============="; \
>> QTEST_QEMU_BINARY=./qemu-system-x86_64 ./tests/qtest/migration-test \
>> --full -r /x86_64/migration/postcopy/multifd/plain || break ; done
>>
>> - reproducer with traces and gdb:
>>
>> $ for i in $(seq 1 9999); do echo "$i ============="; \
>> QTEST_TRACE="multifd_* -trace source_* -trace postcopy_* -trace savevm_* \
>> -trace loadvm_*" QTEST_QEMU_BINARY_DST='gdb --ex "handle SIGUSR1 \
>> noprint" --ex "run" --args ./qemu-system-x86_64' \
>> QTEST_QEMU_BINARY=./qemu-system-x86_64 ./tests/qtest/migration-test \
>> --full -r /x86_64/migration/postcopy/multifd/plain || break ; done
>
> * Thank you for the reproducer and traces. I'll try to check more and
> see if I'm able to reproduce it on my side.
>

Thanks. I cannot merge this series until that issue is resolved. If it
reproduces on my machine there's a high chance that it will break CI at
some point and then it'll be a nightmare to debug. This has happened
many times before with multifd.

> Thank you.
> ---
>   - Prasad


  reply	other threads:[~2025-04-03 13:11 UTC|newest]

Thread overview: 30+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-03-18 12:38 [PATCH v8 0/7] Allow to enable multifd and postcopy migration together Prasad Pandit
2025-03-18 12:38 ` [PATCH v8 1/7] migration/multifd: move macros to multifd header Prasad Pandit
2025-03-18 12:38 ` [PATCH v8 2/7] migration: Refactor channel discovery mechanism Prasad Pandit
2025-03-31 15:01   ` Fabiano Rosas
2025-04-03  7:01     ` Prasad Pandit
2025-04-03 12:59       ` Fabiano Rosas
2025-04-04  9:48         ` Prasad Pandit
2025-03-18 12:38 ` [PATCH v8 3/7] migration: enable multifd and postcopy together Prasad Pandit
2025-03-31 15:27   ` Fabiano Rosas
2025-04-03 10:57     ` Prasad Pandit
2025-04-03 13:03       ` Fabiano Rosas
2025-03-18 12:38 ` [PATCH v8 4/7] tests/qtest/migration: consolidate set capabilities Prasad Pandit
2025-03-18 12:38 ` [PATCH v8 5/7] tests/qtest/migration: add postcopy tests with multifd Prasad Pandit
2025-03-18 12:38 ` [PATCH v8 6/7] migration: Add save_postcopy_prepare() savevm handler Prasad Pandit
2025-03-31 15:08   ` Fabiano Rosas
2025-04-03  7:03     ` Prasad Pandit
2025-03-18 12:38 ` [PATCH v8 7/7] migration/ram: Implement save_postcopy_prepare() Prasad Pandit
2025-03-31 15:18   ` Fabiano Rosas
2025-04-03  7:21     ` Prasad Pandit
2025-04-03 13:07       ` Fabiano Rosas
2025-04-04  9:50         ` Prasad Pandit
2025-03-25  9:53 ` [PATCH v8 0/7] Allow to enable multifd and postcopy migration together Prasad Pandit
2025-03-27 14:35   ` Fabiano Rosas
2025-03-27 16:01     ` Prasad Pandit
2025-03-31 20:54 ` Fabiano Rosas
2025-04-03  7:24   ` Prasad Pandit
2025-04-03 13:11     ` Fabiano Rosas [this message]
2025-04-10 12:22       ` Prasad Pandit
2025-04-10 20:18         ` Fabiano Rosas
2025-04-11  7:25           ` Prasad Pandit

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87zfgxjspn.fsf@suse.de \
    --to=farosas@suse.de \
    --cc=berrange@redhat.com \
    --cc=peterx@redhat.com \
    --cc=pjp@fedoraproject.org \
    --cc=ppandit@redhat.com \
    --cc=qemu-devel@nongnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.