qemu-devel.nongnu.org archive mirror
* recent flakiness (intermittent hangs) of migration-test
@ 2020-10-29 17:20 Peter Maydell
  2020-10-29 17:41 ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 10+ messages in thread
From: Peter Maydell @ 2020-10-29 17:20 UTC (permalink / raw)
  To: QEMU Developers, Juan Quintela, Dr. David Alan Gilbert

Recently I've seen the 'make check' migration-test developing an
intermittent hang; I've seen this now on aarch32 host several times
and also on s390x host. The symptom is that the test just hangs
with a couple of child qemu processes sitting around doing nothing.
^Cing out of 'make check' doesn't kill the qemu processes; they
seem to need a 'kill -9'.
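
For anyone cleaning up after such a hang, something along these lines should
work; this is only a rough sketch and assumes no other qemu instances are
running on the box:

  pgrep -af 'qemu-system|migration-test'   # list the stuck processes
  pkill -9 -f 'qemu-system'                # SIGKILL the leftover qemu instances

Adjust the patterns to whatever binaries are actually in use.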

Sorry for the low-information-density bug report, but I don't really
have time at the moment to debug failures in merge build test runs
because the queue of stuff still to merge is enormous...

thanks
-- PMM



* Re: recent flakiness (intermittent hangs) of migration-test
  2020-10-29 17:20 recent flakiness (intermittent hangs) of migration-test Peter Maydell
@ 2020-10-29 17:41 ` Dr. David Alan Gilbert
  2020-10-29 18:55   ` Peter Maydell
  0 siblings, 1 reply; 10+ messages in thread
From: Dr. David Alan Gilbert @ 2020-10-29 17:41 UTC (permalink / raw)
  To: Peter Maydell; +Cc: QEMU Developers, Juan Quintela

* Peter Maydell (peter.maydell@linaro.org) wrote:
> Recently I've seen the 'make check' migration-test developing an
> intermittent hang; I've seen this now on aarch32 host several times
> and also on s390x host. The symptom is that the test just hangs
> with a couple of child qemu processes sitting around doing nothing.
> ^Cing out of 'make check' doesn't kill the qemu processes; they
> seem to need a 'kill -9'.

If you can send us a copy of the 'ps -eaf' output showing the qemu
command lines in the failing case, it would be interesting to see.
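
Something like the following (only a sketch; it assumes the test was started
from the QEMU build tree) should catch the relevant lines:

  ps -eaf | grep -E 'qemu-system|migration-test' | grep -v grep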

> Sorry for the low-information-density bug report, but I don't really
> have time at the moment to debug failures in merge build test runs
> because the queue of stuff still to merge is enormous...

If you get a moment to breathe, then a backtrace of the migration-test
process itself would be useful to see where it's hanging.
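
For reference, one quick way to grab that without an interactive session
(assuming gdb is installed and <pid> is the hung migration-test process)
would be something like:

  gdb -p <pid> -batch -ex 'thread apply all bt'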

Dave

> thanks
> -- PMM
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK




* Re: recent flakiness (intermittent hangs) of migration-test
  2020-10-29 17:41 ` Dr. David Alan Gilbert
@ 2020-10-29 18:55   ` Peter Maydell
  2020-10-29 19:34     ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 10+ messages in thread
From: Peter Maydell @ 2020-10-29 18:55 UTC (permalink / raw)
  To: Dr. David Alan Gilbert; +Cc: QEMU Developers, Juan Quintela

On Thu, 29 Oct 2020 at 17:41, Dr. David Alan Gilbert
<dgilbert@redhat.com> wrote:
>
> * Peter Maydell (peter.maydell@linaro.org) wrote:
> > Recently I've seen the 'make check' migration-test developing an
> > intermittent hang; I've seen this now on aarch32 host several times
> > and also on s390x host. The symptom is that the test just hangs
> > with a couple of child qemu processes sitting around doing nothing.
> > ^Cing out of 'make check' doesn't kill the qemu processes; they
> > seem to need a 'kill -9'.
>
> If you can send us a copy of the 'ps -eaf' output showing the qemu
> command lines in the failing case, it would be interesting to see.

ubuntu    2078 46210  0 13:24 pts/0    00:00:00 bash -o pipefail -c
echo 'MALLOC_PERTURB_=${MALLOC_PERTURB_:-$(( ${RANDOM:-0} % 255 + 1))}
QTEST_QEMU_IMG=./qemu-img
G_TEST_DBUS_DAEMON=/home/ubuntu/qemu/tests/dbus-vmstate-daemon.sh
QTEST_QEMU_BINARY=./qemu-system-x86_64 tests/qtest/migration-test
--tap -k' && MALLOC_PERTURB_=${MALLOC_PERTURB_:-$(( ${RANDOM:-0} % 255
+ 1))} QTEST_QEMU_IMG=./qemu-img
G_TEST_DBUS_DAEMON=/home/ubuntu/qemu/tests/dbus-vmstate-daemon.sh
QTEST_QEMU_BINARY=./qemu-system-x86_64 tests/qtest/migration-test
--tap -k < /dev/null | ./scripts/tap-driver.pl
--test-name="qtest-x86_64/migration-test"
ubuntu    2079  2078  8 13:24 pts/0    00:07:07
tests/qtest/migration-test --tap -k
ubuntu    2080  2078  0 13:24 pts/0    00:00:00 perl
./scripts/tap-driver.pl --test-name=qtest-x86_64/migration-test
ubuntu    3514  2079  6 13:24 pts/0    00:05:56 ./qemu-system-x86_64
-qtest unix:/tmp/qtest-2079.sock -qtest-log /dev/null -chardev
socket,path=/tmp/qtest-2079.qmp,id=char0 -mon
chardev=char0,mode=control -display none -accel kvm -accel tcg -name
source,debug-threads=on -m 150M -serial
file:/tmp/migration-test-EOJPDc/src_serial -drive
file=/tmp/migration-test-EOJPDc/bootsect,format=raw -accel qtest
ubuntu    3528  2079  0 13:24 pts/0    00:00:00 ./qemu-system-x86_64
-qtest unix:/tmp/qtest-2079.sock -qtest-log /dev/null -chardev
socket,path=/tmp/qtest-2079.qmp,id=char0 -mon
chardev=char0,mode=control -display none -accel kvm -accel tcg -name
target,debug-threads=on -m 150M -serial
file:/tmp/migration-test-EOJPDc/dest_serial -incoming
unix:/tmp/migration-test-EOJPDc/migsocket -drive
file=/tmp/migration-test-EOJPDc/bootsect,format=raw -accel qtest


> > Sorry for the low-information-density bug report, but I don't really
> > have time at the moment to debug failures in merge build test runs
> > because the queue of stuff still to merge is enormous...
>
> If you get a moment to breathe, then a backtrace of the migration-test
> process itself would be useful to see where it's hanging.

Here you go:


Thread 2 (Thread 0x3ff9bcff910 (LWP 2081)):
#0  syscall () at ../sysdeps/unix/sysv/linux/s390/s390-64/syscall.S:53
#1  0x000002aa35e353cc in qemu_futex_wait (val=<optimized out>,
f=<optimized out>)
    at /home/ubuntu/qemu/include/qemu/futex.h:29
#2  qemu_event_wait (ev=ev@entry=0x2aa35e69388 <rcu_call_ready_event>)
at ../../util/qemu-thread-posix.c:460
#3  0x000002aa35e37a8c in call_rcu_thread (opaque=opaque@entry=0x0) at
../../util/rcu.c:258
#4  0x000002aa35e34512 in qemu_thread_start (args=<optimized out>) at
../../util/qemu-thread-posix.c:521
#5  0x000003ff9bf87aa8 in start_thread (arg=0x3ff9bcff910) at
pthread_create.c:463
#6  0x000003ff9be7a896 in thread_start () at
../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65

Thread 1 (Thread 0x3ff9c378750 (LWP 2079)):
#0  0x000003ff9be44040 in __GI___nanosleep
(requested_time=requested_time@entry=0x3ffe49fe4b8,
remaining=remaining@entry=0x0)
    at ../sysdeps/unix/sysv/linux/nanosleep.c:28
#1  0x000003ff9be72d7c in usleep (useconds=useconds@entry=1000) at
../sysdeps/posix/usleep.c:32
#2  0x000002aa35e1074e in wait_for_migration_status (who=<optimized
out>, goal=<optimized out>, ungoals=0x0)
    at ../../tests/qtest/migration-helpers.c:157
#3  0x000002aa35e0ecd2 in migrate_postcopy_complete
(from=0x2aa371feb00, to=0x2aa372092e0)
    at ../../tests/qtest/migration-test.c:746
#4  0x000002aa35e0f31a in test_postcopy_recovery () at
../../tests/qtest/migration-test.c:830
#5  0x000003ff9c17c604 in ?? () from /usr/lib/s390x-linux-gnu/libglib-2.0.so.0
#6  0x000003ff9c17c540 in ?? () from /usr/lib/s390x-linux-gnu/libglib-2.0.so.0
#7  0x000003ff9c17c540 in ?? () from /usr/lib/s390x-linux-gnu/libglib-2.0.so.0
#8  0x000003ff9c17c540 in ?? () from /usr/lib/s390x-linux-gnu/libglib-2.0.so.0
#9  0x000003ff9c17c7e6 in g_test_run_suite () from
/usr/lib/s390x-linux-gnu/libglib-2.0.so.0
#10 0x000003ff9c17c820 in g_test_run () from
/usr/lib/s390x-linux-gnu/libglib-2.0.so.0
#11 0x000002aa35e0ce7c in main (argc=<optimized out>, argv=<optimized
out>) at ../../tests/qtest/migration-test.c:1511

Here's qemu process 3514:
Thread 5 (Thread 0x3ff4affd910 (LWP 3628)):
#0  0x000003ff94c8d936 in futex_wait_cancelable (private=<optimized
out>, expected=0, futex_word=0x2aa26cd74dc)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1  0x000003ff94c8d936 in __pthread_cond_wait_common (abstime=0x0,
mutex=0x2aa26cd7488, cond=0x2aa26cd74b0)
    at pthread_cond_wait.c:502
#2  0x000003ff94c8d936 in __pthread_cond_wait
(cond=cond@entry=0x2aa26cd74b0, mutex=mutex@entry=0x2aa26cd7488)
    at pthread_cond_wait.c:655
#3  0x000002aa2497072c in qemu_sem_wait (sem=sem@entry=0x2aa26cd7488)
at ../../util/qemu-thread-posix.c:328
#4  0x000002aa244f4a02 in postcopy_pause (s=0x2aa26cd7000) at
../../migration/migration.c:3192
#5  0x000002aa244f4a02 in migration_detect_error (s=0x2aa26cd7000) at
../../migration/migration.c:3255
#6  0x000002aa244f4a02 in migration_thread
(opaque=opaque@entry=0x2aa26cd7000) at
../../migration/migration.c:3564
#7  0x000002aa2496fa3a in qemu_thread_start (args=<optimized out>) at
../../util/qemu-thread-posix.c:521
#8  0x000003ff94c87aa8 in start_thread (arg=0x3ff4affd910) at
pthread_create.c:463
#9  0x000003ff94b7a896 in thread_start () at
../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65

Thread 4 (Thread 0x3ff68adb910 (LWP 3522)):
#0  0x000003ff94c8d936 in futex_wait_cancelable (private=<optimized
out>, expected=0, futex_word=0x2aa26dd2d58)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1  0x000003ff94c8d936 in __pthread_cond_wait_common (abstime=0x0,
mutex=0x2aa2525d870 <qemu_global_mutex>, cond=0x2aa26dd2d30) at
pthread_cond_wait.c:502
#2  0x000003ff94c8d936 in __pthread_cond_wait
(cond=cond@entry=0x2aa26dd2d30, mutex=mutex@entry=0x2aa2525d870
<qemu_global_mutex>) at pthread_cond_wait.c:655
#3  0x000002aa24970196 in qemu_cond_wait_impl (cond=0x2aa26dd2d30,
mutex=0x2aa2525d870 <qemu_global_mutex>, file=0x2aa24a8d162
"../../softmmu/cpus.c", line=<optimized out>) at
../../util/qemu-thread-posix.c:174
#4  0x000002aa2467cbe0 in qemu_wait_io_event
(cpu=cpu@entry=0x2aa26da31e0) at ../../softmmu/cpus.c:411
#5  0x000002aa24760976 in tcg_cpu_thread_fn
(arg=arg@entry=0x2aa26da31e0) at ../../accel/tcg/tcg-cpus.c:455
#6  0x000002aa2496fa3a in qemu_thread_start (args=<optimized out>) at
../../util/qemu-thread-posix.c:521
#7  0x000003ff94c87aa8 in start_thread (arg=0x3ff68adb910) at
pthread_create.c:463
#8  0x000003ff94b7a896 in thread_start () at
../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65

Thread 3 (Thread 0x3ff693cd910 (LWP 3521)):
#0  0x000003ff94b6f5c0 in __GI___poll (fds=0x3ff64007700, nfds=3,
timeout=<optimized out>)
    at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x000003ff96fd250c in  () at /usr/lib/s390x-linux-gnu/libglib-2.0.so.0
#2  0x000003ff96fd2990 in g_main_loop_run () at
/usr/lib/s390x-linux-gnu/libglib-2.0.so.0
#3  0x000002aa247b76e4 in iothread_run
(opaque=opaque@entry=0x2aa26a47a30) at ../../iothread.c:80
#4  0x000002aa2496fa3a in qemu_thread_start (args=<optimized out>) at
../../util/qemu-thread-posix.c:521
#5  0x000003ff94c87aa8 in start_thread (arg=0x3ff693cd910) at
pthread_create.c:463
#6  0x000003ff94b7a896 in thread_start () at
../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65

Thread 2 (Thread 0x3ff879b7910 (LWP 3520)):
#0  0x000003ff94b75a62 in syscall () at
../sysdeps/unix/sysv/linux/s390/s390-64/syscall.S:53
#1  0x000002aa249708f4 in qemu_futex_wait (val=<optimized out>,
f=<optimized out>)
    at /home/ubuntu/qemu/include/qemu/futex.h:29
#2  0x000002aa249708f4 in qemu_event_wait (ev=ev@entry=0x2aa2527dbf8
<rcu_call_ready_event>)
    at ../../util/qemu-thread-posix.c:460
#3  0x000002aa2497b57c in call_rcu_thread (opaque=opaque@entry=0x0) at
../../util/rcu.c:258
#4  0x000002aa2496fa3a in qemu_thread_start (args=<optimized out>) at
../../util/qemu-thread-posix.c:521
#5  0x000003ff94c87aa8 in start_thread (arg=0x3ff879b7910) at
pthread_create.c:463
#6  0x000003ff94b7a896 in thread_start () at
../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65

Thread 1 (Thread 0x3ff977b8450 (LWP 3514)):
#0  0x000003ff94b6f712 in __GI_ppoll (fds=0x2aa2755ecd0, nfds=6,
timeout=<optimized out>,
    timeout@entry=0x3ffec7fe4c8, sigmask=sigmask@entry=0x0) at
../sysdeps/unix/sysv/linux/ppoll.c:39
#1  0x000002aa24961c9c in ppoll (__ss=0x0, __timeout=0x3ffec7fe4c8,
__nfds=<optimized out>, __fds=<optimized out>)
    at /usr/include/s390x-linux-gnu/bits/poll2.h:77
#2  0x000002aa24961c9c in qemu_poll_ns (fds=<optimized out>,
nfds=<optimized out>, timeout=timeout@entry=1000000000)
    at ../../util/qemu-timer.c:349
#3  0x000002aa2497bb88 in os_host_main_loop_wait (timeout=1000000000)
at ../../util/main-loop.c:239
#4  0x000002aa2497bb88 in main_loop_wait
(nonblocking=nonblocking@entry=0) at ../../util/main-loop.c:520
#5  0x000002aa2475aaf0 in qemu_main_loop () at ../../softmmu/vl.c:1678
#6  0x000002aa243ee178 in main (argc=<optimized out>, argv=<optimized
out>, envp=<optimized out>) at ../../softmmu/main.c:50

And here's 3528:
Thread 6 (Thread 0x3ff6ccfd910 (LWP 3841)):
#0  0x000003ffb1b8d936 in futex_wait_cancelable (private=<optimized
out>, expected=0, futex_word=0x2aa387a6aac)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1  0x000003ffb1b8d936 in __pthread_cond_wait_common (abstime=0x0,
mutex=0x2aa387a6a58, cond=0x2aa387a6a80)
    at pthread_cond_wait.c:502
#2  0x000003ffb1b8d936 in __pthread_cond_wait
(cond=cond@entry=0x2aa387a6a80, mutex=mutex@entry=0x2aa387a6a58)
    at pthread_cond_wait.c:655
#3  0x000002aa36bf072c in qemu_sem_wait (sem=sem@entry=0x2aa387a6a58)
at ../../util/qemu-thread-posix.c:328
#4  0x000002aa366c369a in postcopy_pause_incoming (mis=<optimized
out>) at ../../migration/savevm.c:2541
#5  0x000002aa366c369a in qemu_loadvm_state_main
(f=f@entry=0x2aa38897930, mis=mis@entry=0x2aa387a6820)
    at ../../migration/savevm.c:2615
#6  0x000002aa366c44fa in postcopy_ram_listen_thread
(opaque=opaque@entry=0x0) at ../../migration/savevm.c:1830
#7  0x000002aa36befa3a in qemu_thread_start (args=<optimized out>) at
../../util/qemu-thread-posix.c:521
#8  0x000003ffb1b87aa8 in start_thread (arg=0x3ff6ccfd910) at
pthread_create.c:463
#9  0x000003ffb1a7a896 in thread_start () at
../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65

Thread 5 (Thread 0x3ff6d4fe910 (LWP 3840)):
#0  0x000003ffb1b8d936 in futex_wait_cancelable (private=<optimized
out>, expected=0, futex_word=0x2aa387a6b0c)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1  0x000003ffb1b8d936 in __pthread_cond_wait_common (abstime=0x0,
mutex=0x2aa387a6ab8, cond=0x2aa387a6ae0)
    at pthread_cond_wait.c:502
#2  0x000003ffb1b8d936 in __pthread_cond_wait
(cond=cond@entry=0x2aa387a6ae0, mutex=mutex@entry=0x2aa387a6ab8)
    at pthread_cond_wait.c:655
#3  0x000002aa36bf072c in qemu_sem_wait (sem=0x2aa387a6ab8,
sem@entry=<error reading variable: value has been optimized out>)
    at ../../util/qemu-thread-posix.c:328
#4  0x000002aa3685282c in postcopy_pause_fault_thread (mis=<optimized
out>) at ../../migration/postcopy-ram.c:841
#5  0x000002aa3685282c in postcopy_ram_fault_thread
(opaque=opaque@entry=0x2aa387a6820) at
../../migration/postcopy-ram.c:905
#6  0x000002aa36befa3a in qemu_thread_start (args=<optimized out>) at
../../util/qemu-thread-posix.c:521
#7  0x000003ffb1b87aa8 in start_thread (arg=0x3ff6d4fe910) at
pthread_create.c:463
#8  0x000003ffb1a7a896 in thread_start () at
../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65

Thread 4 (Thread 0x3ff859db910 (LWP 3536)):
#0  0x000002aa369854ec in load_memop (op=MO_8, haddr=0x3ff6fa61000) at
../../accel/tcg/cputlb.c:1794
#1  0x000002aa369854ec in load_helper (full_load=<optimized out>,
code_read=false, op=MO_8, retaddr=4396002632260, oi=<optimized out>,
addr=18223104, env=0x2aa38879a90) at ../../accel/tcg/cputlb.c:1914
#2  0x000002aa369854ec in full_ldub_mmu (env=0x2aa38879a90,
addr=18223104, oi=<optimized out>, retaddr=4396002632260)
    at ../../accel/tcg/cputlb.c:1930
#3  0x000003ff862ce304 in code_gen_buffer ()
#4  0x000002aa3693ab62 in cpu_tb_exec (itb=<optimized out>,
cpu=<optimized out>) at ../../accel/tcg/cpu-exec.c:178
#5  0x000002aa3693ab62 in cpu_loop_exec_tb (tb_exit=<synthetic
pointer>, last_tb=<synthetic pointer>, tb=<optimized out>,
cpu=<optimized out>) at ../../accel/tcg/cpu-exec.c:658
#6  0x000002aa3693ab62 in cpu_exec (cpu=<optimized out>) at
../../accel/tcg/cpu-exec.c:771
#7  0x000002aa0000000c in  ()

Thread 3 (Thread 0x3ff862cd910 (LWP 3535)):
#0  0x000003ffb1a6f5c0 in __GI___poll (fds=0x3ff78007700, nfds=3,
timeout=<optimized out>)
    at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x000003ffb3ed250c in  () at /usr/lib/s390x-linux-gnu/libglib-2.0.so.0
#2  0x000003ffb3ed2990 in g_main_loop_run () at
/usr/lib/s390x-linux-gnu/libglib-2.0.so.0
#3  0x000002aa36a376e4 in iothread_run
(opaque=opaque@entry=0x2aa38515a30) at ../../iothread.c:80
#4  0x000002aa36befa3a in qemu_thread_start (args=<optimized out>) at
../../util/qemu-thread-posix.c:521
#5  0x000003ffb1b87aa8 in start_thread (arg=0x3ff862cd910) at
pthread_create.c:463
#6  0x000003ffb1a7a896 in thread_start () at
../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65

Thread 2 (Thread 0x3ffa48b7910 (LWP 3533)):
#0  0x000003ffb1a75a62 in syscall () at
../sysdeps/unix/sysv/linux/s390/s390-64/syscall.S:53
#1  0x000002aa36bf08f4 in qemu_futex_wait (val=<optimized out>,
f=<optimized out>)
    at /home/ubuntu/qemu/include/qemu/futex.h:29
#2  0x000002aa36bf08f4 in qemu_event_wait (ev=ev@entry=0x2aa374fdbf0
<rcu_gp_event>) at ../../util/qemu-thread-posix.c:460
#3  0x000002aa36bfb190 in wait_for_readers () at ../../util/rcu.c:135
#4  0x000002aa36bfb190 in synchronize_rcu () at ../../util/rcu.c:171
#5  0x000002aa36bfb5b2 in call_rcu_thread (opaque=opaque@entry=0x0) at
../../util/rcu.c:265
#6  0x000002aa36befa3a in qemu_thread_start (args=<optimized out>) at
../../util/qemu-thread-posix.c:521
#7  0x000003ffb1b87aa8 in start_thread (arg=0x3ffa48b7910) at
pthread_create.c:463
#8  0x000003ffb1a7a896 in thread_start () at
../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65

Thread 1 (Thread 0x3ffb46b8450 (LWP 3528)):
#0  0x000003ffb1a6f712 in __GI_ppoll (fds=0x2aa39312cb0, nfds=6,
timeout=<optimized out>,
    timeout@entry=0x3ffe76feb78, sigmask=sigmask@entry=0x0) at
../sysdeps/unix/sysv/linux/ppoll.c:39
#1  0x000002aa36be1c9c in ppoll (__ss=0x0, __timeout=0x3ffe76feb78,
__nfds=<optimized out>, __fds=<optimized out>)
    at /usr/include/s390x-linux-gnu/bits/poll2.h:77
#2  0x000002aa36be1c9c in qemu_poll_ns (fds=<optimized out>,
nfds=<optimized out>, timeout=timeout@entry=54858292)
    at ../../util/qemu-timer.c:349
#3  0x000002aa36bfbb88 in os_host_main_loop_wait (timeout=54858292) at
../../util/main-loop.c:239
#4  0x000002aa36bfbb88 in main_loop_wait
(nonblocking=nonblocking@entry=0) at ../../util/main-loop.c:520
#5  0x000002aa369daaf0 in qemu_main_loop () at ../../softmmu/vl.c:1678
#6  0x000002aa3666e178 in main (argc=<optimized out>, argv=<optimized
out>, envp=<optimized out>) at ../../softmmu/main.c:50


thanks
-- PMM



* Re: recent flakiness (intermittent hangs) of migration-test
  2020-10-29 18:55   ` Peter Maydell
@ 2020-10-29 19:34     ` Dr. David Alan Gilbert
  2020-10-29 20:28       ` Peter Xu
  0 siblings, 1 reply; 10+ messages in thread
From: Dr. David Alan Gilbert @ 2020-10-29 19:34 UTC (permalink / raw)
  To: Peter Maydell, peterx; +Cc: QEMU Developers, Juan Quintela

* Peter Maydell (peter.maydell@linaro.org) wrote:
> On Thu, 29 Oct 2020 at 17:41, Dr. David Alan Gilbert
> <dgilbert@redhat.com> wrote:
> >
> > * Peter Maydell (peter.maydell@linaro.org) wrote:
> > > Recently I've seen the 'make check' migration-test developing an
> > > intermittent hang; I've seen this now on aarch32 host several times
> > > and also on s390x host. The symptom is that the test just hangs
> > > with a couple of child qemu processes sitting around doing nothing.
> > > ^Cing out of 'make check' doesn't kill the qemu processes; they
> > > seem to need a 'kill -9'.
> >
> > If you can send us a copy of the 'ps -eaf' output showing the qemu
> > command lines in the failing case, it would be interesting to see.
> 
> ubuntu    2078 46210  0 13:24 pts/0    00:00:00 bash -o pipefail -c
> echo 'MALLOC_PERTURB_=${MALLOC_PERTURB_:-$(( ${RANDOM:-0} % 255 + 1))}
> QTEST_QEMU_IMG=./qemu-img
> G_TEST_DBUS_DAEMON=/home/ubuntu/qemu/tests/dbus-vmstate-daemon.sh
> QTEST_QEMU_BINARY=./qemu-system-x86_64 tests/qtest/migration-test
> --tap -k' && MALLOC_PERTURB_=${MALLOC_PERTURB_:-$(( ${RANDOM:-0} % 255
> + 1))} QTEST_QEMU_IMG=./qemu-img
> G_TEST_DBUS_DAEMON=/home/ubuntu/qemu/tests/dbus-vmstate-daemon.sh
> QTEST_QEMU_BINARY=./qemu-system-x86_64 tests/qtest/migration-test
> --tap -k < /dev/null | ./scripts/tap-driver.pl
> --test-name="qtest-x86_64/migration-test"
> ubuntu    2079  2078  8 13:24 pts/0    00:07:07
> tests/qtest/migration-test --tap -k
> ubuntu    2080  2078  0 13:24 pts/0    00:00:00 perl
> ./scripts/tap-driver.pl --test-name=qtest-x86_64/migration-test
> ubuntu    3514  2079  6 13:24 pts/0    00:05:56 ./qemu-system-x86_64
> -qtest unix:/tmp/qtest-2079.sock -qtest-log /dev/null -chardev
> socket,path=/tmp/qtest-2079.qmp,id=char0 -mon
> chardev=char0,mode=control -display none -accel kvm -accel tcg -name
> source,debug-threads=on -m 150M -serial
> file:/tmp/migration-test-EOJPDc/src_serial -drive
> file=/tmp/migration-test-EOJPDc/bootsect,format=raw -accel qtest
> ubuntu    3528  2079  0 13:24 pts/0    00:00:00 ./qemu-system-x86_64
> -qtest unix:/tmp/qtest-2079.sock -qtest-log /dev/null -chardev
> socket,path=/tmp/qtest-2079.qmp,id=char0 -mon
> chardev=char0,mode=control -display none -accel kvm -accel tcg -name
> target,debug-threads=on -m 150M -serial
> file:/tmp/migration-test-EOJPDc/dest_serial -incoming
> unix:/tmp/migration-test-EOJPDc/migsocket -drive
> file=/tmp/migration-test-EOJPDc/bootsect,format=raw -accel qtest
>

OK, so x86-64 tcg,

> 
> > > Sorry for the low-information-density bug report, but I don't really
> > > have time at the moment to debug failures in merge build test runs
> > > because the queue of stuff still to merge is enormous...
> >
> > If you get a moment to breathe, then a backtrace of the migration-test
> > process itself would be useful to see where it's hanging.
> 
> Here you go:
> 
> 
> Thread 2 (Thread 0x3ff9bcff910 (LWP 2081)):
> #0  syscall () at ../sysdeps/unix/sysv/linux/s390/s390-64/syscall.S:53
> #1  0x000002aa35e353cc in qemu_futex_wait (val=<optimized out>,
> f=<optimized out>)
>     at /home/ubuntu/qemu/include/qemu/futex.h:29
> #2  qemu_event_wait (ev=ev@entry=0x2aa35e69388 <rcu_call_ready_event>)
> at ../../util/qemu-thread-posix.c:460
> #3  0x000002aa35e37a8c in call_rcu_thread (opaque=opaque@entry=0x0) at
> ../../util/rcu.c:258
> #4  0x000002aa35e34512 in qemu_thread_start (args=<optimized out>) at
> ../../util/qemu-thread-posix.c:521
> #5  0x000003ff9bf87aa8 in start_thread (arg=0x3ff9bcff910) at
> pthread_create.c:463
> #6  0x000003ff9be7a896 in thread_start () at
> ../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65
> 
> Thread 1 (Thread 0x3ff9c378750 (LWP 2079)):
> #0  0x000003ff9be44040 in __GI___nanosleep
> (requested_time=requested_time@entry=0x3ffe49fe4b8,
> remaining=remaining@entry=0x0)
>     at ../sysdeps/unix/sysv/linux/nanosleep.c:28
> #1  0x000003ff9be72d7c in usleep (useconds=useconds@entry=1000) at
> ../sysdeps/posix/usleep.c:32
> #2  0x000002aa35e1074e in wait_for_migration_status (who=<optimized
> out>, goal=<optimized out>, ungoals=0x0)
>     at ../../tests/qtest/migration-helpers.c:157
> #3  0x000002aa35e0ecd2 in migrate_postcopy_complete
> (from=0x2aa371feb00, to=0x2aa372092e0)
>     at ../../tests/qtest/migration-test.c:746
> #4  0x000002aa35e0f31a in test_postcopy_recovery () at
> ../../tests/qtest/migration-test.c:830

Ah, postcopy-recovery, one for Peter Xu.

> #5  0x000003ff9c17c604 in ?? () from /usr/lib/s390x-linux-gnu/libglib-2.0.so.0

On s390x host.

Dave

> #6  0x000003ff9c17c540 in ?? () from /usr/lib/s390x-linux-gnu/libglib-2.0.so.0
> #7  0x000003ff9c17c540 in ?? () from /usr/lib/s390x-linux-gnu/libglib-2.0.so.0
> #8  0x000003ff9c17c540 in ?? () from /usr/lib/s390x-linux-gnu/libglib-2.0.so.0
> #9  0x000003ff9c17c7e6 in g_test_run_suite () from
> /usr/lib/s390x-linux-gnu/libglib-2.0.so.0
> #10 0x000003ff9c17c820 in g_test_run () from
> /usr/lib/s390x-linux-gnu/libglib-2.0.so.0
> #11 0x000002aa35e0ce7c in main (argc=<optimized out>, argv=<optimized
> out>) at ../../tests/qtest/migration-test.c:1511
> 
> Here's qemu process 3514:
> Thread 5 (Thread 0x3ff4affd910 (LWP 3628)):
> #0  0x000003ff94c8d936 in futex_wait_cancelable (private=<optimized
> out>, expected=0, futex_word=0x2aa26cd74dc)
>     at ../sysdeps/unix/sysv/linux/futex-internal.h:88
> #1  0x000003ff94c8d936 in __pthread_cond_wait_common (abstime=0x0,
> mutex=0x2aa26cd7488, cond=0x2aa26cd74b0)
>     at pthread_cond_wait.c:502
> #2  0x000003ff94c8d936 in __pthread_cond_wait
> (cond=cond@entry=0x2aa26cd74b0, mutex=mutex@entry=0x2aa26cd7488)
>     at pthread_cond_wait.c:655
> #3  0x000002aa2497072c in qemu_sem_wait (sem=sem@entry=0x2aa26cd7488)
> at ../../util/qemu-thread-posix.c:328
> #4  0x000002aa244f4a02 in postcopy_pause (s=0x2aa26cd7000) at
> ../../migration/migration.c:3192
> #5  0x000002aa244f4a02 in migration_detect_error (s=0x2aa26cd7000) at
> ../../migration/migration.c:3255
> #6  0x000002aa244f4a02 in migration_thread
> (opaque=opaque@entry=0x2aa26cd7000) at
> ../../migration/migration.c:3564
> #7  0x000002aa2496fa3a in qemu_thread_start (args=<optimized out>) at
> ../../util/qemu-thread-posix.c:521
> #8  0x000003ff94c87aa8 in start_thread (arg=0x3ff4affd910) at
> pthread_create.c:463
> #9  0x000003ff94b7a896 in thread_start () at
> ../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65
> 
> Thread 4 (Thread 0x3ff68adb910 (LWP 3522)):
> #0  0x000003ff94c8d936 in futex_wait_cancelable (private=<optimized
> out>, expected=0, futex_word=0x2aa26dd2d58)
>     at ../sysdeps/unix/sysv/linux/futex-internal.h:88
> #1  0x000003ff94c8d936 in __pthread_cond_wait_common (abstime=0x0,
> mutex=0x2aa2525d870 <qemu_global_mutex>, cond=0x2aa26dd2d30) at
> pthread_cond_wait.c:502
> #2  0x000003ff94c8d936 in __pthread_cond_wait
> (cond=cond@entry=0x2aa26dd2d30, mutex=mutex@entry=0x2aa2525d870
> <qemu_global_mutex>) at pthread_cond_wait.c:655
> #3  0x000002aa24970196 in qemu_cond_wait_impl (cond=0x2aa26dd2d30,
> mutex=0x2aa2525d870 <qemu_global_mutex>, file=0x2aa24a8d162
> "../../softmmu/cpus.c", line=<optimized out>) at
> ../../util/qemu-thread-posix.c:174
> #4  0x000002aa2467cbe0 in qemu_wait_io_event
> (cpu=cpu@entry=0x2aa26da31e0) at ../../softmmu/cpus.c:411
> #5  0x000002aa24760976 in tcg_cpu_thread_fn
> (arg=arg@entry=0x2aa26da31e0) at ../../accel/tcg/tcg-cpus.c:455
> #6  0x000002aa2496fa3a in qemu_thread_start (args=<optimized out>) at
> ../../util/qemu-thread-posix.c:521
> #7  0x000003ff94c87aa8 in start_thread (arg=0x3ff68adb910) at
> pthread_create.c:463
> #8  0x000003ff94b7a896 in thread_start () at
> ../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65
> 
> Thread 3 (Thread 0x3ff693cd910 (LWP 3521)):
> #0  0x000003ff94b6f5c0 in __GI___poll (fds=0x3ff64007700, nfds=3,
> timeout=<optimized out>)
>     at ../sysdeps/unix/sysv/linux/poll.c:29
> #1  0x000003ff96fd250c in  () at /usr/lib/s390x-linux-gnu/libglib-2.0.so.0
> #2  0x000003ff96fd2990 in g_main_loop_run () at
> /usr/lib/s390x-linux-gnu/libglib-2.0.so.0
> #3  0x000002aa247b76e4 in iothread_run
> (opaque=opaque@entry=0x2aa26a47a30) at ../../iothread.c:80
> #4  0x000002aa2496fa3a in qemu_thread_start (args=<optimized out>) at
> ../../util/qemu-thread-posix.c:521
> #5  0x000003ff94c87aa8 in start_thread (arg=0x3ff693cd910) at
> pthread_create.c:463
> #6  0x000003ff94b7a896 in thread_start () at
> ../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65
> 
> Thread 2 (Thread 0x3ff879b7910 (LWP 3520)):
> #0  0x000003ff94b75a62 in syscall () at
> ../sysdeps/unix/sysv/linux/s390/s390-64/syscall.S:53
> #1  0x000002aa249708f4 in qemu_futex_wait (val=<optimized out>,
> f=<optimized out>)
>     at /home/ubuntu/qemu/include/qemu/futex.h:29
> #2  0x000002aa249708f4 in qemu_event_wait (ev=ev@entry=0x2aa2527dbf8
> <rcu_call_ready_event>)
>     at ../../util/qemu-thread-posix.c:460
> #3  0x000002aa2497b57c in call_rcu_thread (opaque=opaque@entry=0x0) at
> ../../util/rcu.c:258
> #4  0x000002aa2496fa3a in qemu_thread_start (args=<optimized out>) at
> ../../util/qemu-thread-posix.c:521
> #5  0x000003ff94c87aa8 in start_thread (arg=0x3ff879b7910) at
> pthread_create.c:463
> #6  0x000003ff94b7a896 in thread_start () at
> ../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65
> 
> Thread 1 (Thread 0x3ff977b8450 (LWP 3514)):
> #0  0x000003ff94b6f712 in __GI_ppoll (fds=0x2aa2755ecd0, nfds=6,
> timeout=<optimized out>,
>     timeout@entry=0x3ffec7fe4c8, sigmask=sigmask@entry=0x0) at
> ../sysdeps/unix/sysv/linux/ppoll.c:39
> #1  0x000002aa24961c9c in ppoll (__ss=0x0, __timeout=0x3ffec7fe4c8,
> __nfds=<optimized out>, __fds=<optimized out>)
>     at /usr/include/s390x-linux-gnu/bits/poll2.h:77
> #2  0x000002aa24961c9c in qemu_poll_ns (fds=<optimized out>,
> nfds=<optimized out>, timeout=timeout@entry=1000000000)
>     at ../../util/qemu-timer.c:349
> #3  0x000002aa2497bb88 in os_host_main_loop_wait (timeout=1000000000)
> at ../../util/main-loop.c:239
> #4  0x000002aa2497bb88 in main_loop_wait
> (nonblocking=nonblocking@entry=0) at ../../util/main-loop.c:520
> #5  0x000002aa2475aaf0 in qemu_main_loop () at ../../softmmu/vl.c:1678
> #6  0x000002aa243ee178 in main (argc=<optimized out>, argv=<optimized
> out>, envp=<optimized out>) at ../../softmmu/main.c:50
> 
> And here's 3528:
> Thread 6 (Thread 0x3ff6ccfd910 (LWP 3841)):
> #0  0x000003ffb1b8d936 in futex_wait_cancelable (private=<optimized
> out>, expected=0, futex_word=0x2aa387a6aac)
>     at ../sysdeps/unix/sysv/linux/futex-internal.h:88
> #1  0x000003ffb1b8d936 in __pthread_cond_wait_common (abstime=0x0,
> mutex=0x2aa387a6a58, cond=0x2aa387a6a80)
>     at pthread_cond_wait.c:502
> #2  0x000003ffb1b8d936 in __pthread_cond_wait
> (cond=cond@entry=0x2aa387a6a80, mutex=mutex@entry=0x2aa387a6a58)
>     at pthread_cond_wait.c:655
> #3  0x000002aa36bf072c in qemu_sem_wait (sem=sem@entry=0x2aa387a6a58)
> at ../../util/qemu-thread-posix.c:328
> #4  0x000002aa366c369a in postcopy_pause_incoming (mis=<optimized
> out>) at ../../migration/savevm.c:2541
> #5  0x000002aa366c369a in qemu_loadvm_state_main
> (f=f@entry=0x2aa38897930, mis=mis@entry=0x2aa387a6820)
>     at ../../migration/savevm.c:2615
> #6  0x000002aa366c44fa in postcopy_ram_listen_thread
> (opaque=opaque@entry=0x0) at ../../migration/savevm.c:1830
> #7  0x000002aa36befa3a in qemu_thread_start (args=<optimized out>) at
> ../../util/qemu-thread-posix.c:521
> #8  0x000003ffb1b87aa8 in start_thread (arg=0x3ff6ccfd910) at
> pthread_create.c:463
> #9  0x000003ffb1a7a896 in thread_start () at
> ../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65
> 
> Thread 5 (Thread 0x3ff6d4fe910 (LWP 3840)):
> #0  0x000003ffb1b8d936 in futex_wait_cancelable (private=<optimized
> out>, expected=0, futex_word=0x2aa387a6b0c)
>     at ../sysdeps/unix/sysv/linux/futex-internal.h:88
> #1  0x000003ffb1b8d936 in __pthread_cond_wait_common (abstime=0x0,
> mutex=0x2aa387a6ab8, cond=0x2aa387a6ae0)
>     at pthread_cond_wait.c:502
> #2  0x000003ffb1b8d936 in __pthread_cond_wait
> (cond=cond@entry=0x2aa387a6ae0, mutex=mutex@entry=0x2aa387a6ab8)
>     at pthread_cond_wait.c:655
> #3  0x000002aa36bf072c in qemu_sem_wait (sem=0x2aa387a6ab8,
> sem@entry=<error reading variable: value has been optimized out>)
>     at ../../util/qemu-thread-posix.c:328
> #4  0x000002aa3685282c in postcopy_pause_fault_thread (mis=<optimized
> out>) at ../../migration/postcopy-ram.c:841
> #5  0x000002aa3685282c in postcopy_ram_fault_thread
> (opaque=opaque@entry=0x2aa387a6820) at
> ../../migration/postcopy-ram.c:905
> #6  0x000002aa36befa3a in qemu_thread_start (args=<optimized out>) at
> ../../util/qemu-thread-posix.c:521
> #7  0x000003ffb1b87aa8 in start_thread (arg=0x3ff6d4fe910) at
> pthread_create.c:463
> #8  0x000003ffb1a7a896 in thread_start () at
> ../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65
> 
> Thread 4 (Thread 0x3ff859db910 (LWP 3536)):
> #0  0x000002aa369854ec in load_memop (op=MO_8, haddr=0x3ff6fa61000) at
> ../../accel/tcg/cputlb.c:1794
> #1  0x000002aa369854ec in load_helper (full_load=<optimized out>,
> code_read=false, op=MO_8, retaddr=4396002632260, oi=<optimized out>,
> addr=18223104, env=0x2aa38879a90) at ../../accel/tcg/cputlb.c:1914
> #2  0x000002aa369854ec in full_ldub_mmu (env=0x2aa38879a90,
> addr=18223104, oi=<optimized out>, retaddr=4396002632260)
>     at ../../accel/tcg/cputlb.c:1930
> #3  0x000003ff862ce304 in code_gen_buffer ()
> #4  0x000002aa3693ab62 in cpu_tb_exec (itb=<optimized out>,
> cpu=<optimized out>) at ../../accel/tcg/cpu-exec.c:178
> #5  0x000002aa3693ab62 in cpu_loop_exec_tb (tb_exit=<synthetic
> pointer>, last_tb=<synthetic pointer>, tb=<optimized out>,
> cpu=<optimized out>) at ../../accel/tcg/cpu-exec.c:658
> #6  0x000002aa3693ab62 in cpu_exec (cpu=<optimized out>) at
> ../../accel/tcg/cpu-exec.c:771
> #7  0x000002aa0000000c in  ()
> 
> Thread 3 (Thread 0x3ff862cd910 (LWP 3535)):
> #0  0x000003ffb1a6f5c0 in __GI___poll (fds=0x3ff78007700, nfds=3,
> timeout=<optimized out>)
>     at ../sysdeps/unix/sysv/linux/poll.c:29
> #1  0x000003ffb3ed250c in  () at /usr/lib/s390x-linux-gnu/libglib-2.0.so.0
> #2  0x000003ffb3ed2990 in g_main_loop_run () at
> /usr/lib/s390x-linux-gnu/libglib-2.0.so.0
> #3  0x000002aa36a376e4 in iothread_run
> (opaque=opaque@entry=0x2aa38515a30) at ../../iothread.c:80
> #4  0x000002aa36befa3a in qemu_thread_start (args=<optimized out>) at
> ../../util/qemu-thread-posix.c:521
> #5  0x000003ffb1b87aa8 in start_thread (arg=0x3ff862cd910) at
> pthread_create.c:463
> #6  0x000003ffb1a7a896 in thread_start () at
> ../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65
> 
> Thread 2 (Thread 0x3ffa48b7910 (LWP 3533)):
> #0  0x000003ffb1a75a62 in syscall () at
> ../sysdeps/unix/sysv/linux/s390/s390-64/syscall.S:53
> #1  0x000002aa36bf08f4 in qemu_futex_wait (val=<optimized out>,
> f=<optimized out>)
>     at /home/ubuntu/qemu/include/qemu/futex.h:29
> #2  0x000002aa36bf08f4 in qemu_event_wait (ev=ev@entry=0x2aa374fdbf0
> <rcu_gp_event>) at ../../util/qemu-thread-posix.c:460
> #3  0x000002aa36bfb190 in wait_for_readers () at ../../util/rcu.c:135
> #4  0x000002aa36bfb190 in synchronize_rcu () at ../../util/rcu.c:171
> #5  0x000002aa36bfb5b2 in call_rcu_thread (opaque=opaque@entry=0x0) at
> ../../util/rcu.c:265
> #6  0x000002aa36befa3a in qemu_thread_start (args=<optimized out>) at
> ../../util/qemu-thread-posix.c:521
> #7  0x000003ffb1b87aa8 in start_thread (arg=0x3ffa48b7910) at
> pthread_create.c:463
> #8  0x000003ffb1a7a896 in thread_start () at
> ../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65
> 
> Thread 1 (Thread 0x3ffb46b8450 (LWP 3528)):
> #0  0x000003ffb1a6f712 in __GI_ppoll (fds=0x2aa39312cb0, nfds=6,
> timeout=<optimized out>,
>     timeout@entry=0x3ffe76feb78, sigmask=sigmask@entry=0x0) at
> ../sysdeps/unix/sysv/linux/ppoll.c:39
> #1  0x000002aa36be1c9c in ppoll (__ss=0x0, __timeout=0x3ffe76feb78,
> __nfds=<optimized out>, __fds=<optimized out>)
>     at /usr/include/s390x-linux-gnu/bits/poll2.h:77
> #2  0x000002aa36be1c9c in qemu_poll_ns (fds=<optimized out>,
> nfds=<optimized out>, timeout=timeout@entry=54858292)
>     at ../../util/qemu-timer.c:349
> #3  0x000002aa36bfbb88 in os_host_main_loop_wait (timeout=54858292) at
> ../../util/main-loop.c:239
> #4  0x000002aa36bfbb88 in main_loop_wait
> (nonblocking=nonblocking@entry=0) at ../../util/main-loop.c:520
> #5  0x000002aa369daaf0 in qemu_main_loop () at ../../softmmu/vl.c:1678
> #6  0x000002aa3666e178 in main (argc=<optimized out>, argv=<optimized
> out>, envp=<optimized out>) at ../../softmmu/main.c:50
> 
> 
> thanks
> -- PMM
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK




* Re: recent flakiness (intermittent hangs) of migration-test
  2020-10-29 19:34     ` Dr. David Alan Gilbert
@ 2020-10-29 20:28       ` Peter Xu
  2020-10-30 11:48         ` Peter Maydell
  0 siblings, 1 reply; 10+ messages in thread
From: Peter Xu @ 2020-10-29 20:28 UTC (permalink / raw)
  To: Dr. David Alan Gilbert, Peter Maydell
  Cc: Peter Maydell, QEMU Developers, Juan Quintela

On Thu, Oct 29, 2020 at 07:34:33PM +0000, Dr. David Alan Gilbert wrote:
> > Here's qemu process 3514:
> > Thread 5 (Thread 0x3ff4affd910 (LWP 3628)):
> > #0  0x000003ff94c8d936 in futex_wait_cancelable (private=<optimized
> > out>, expected=0, futex_word=0x2aa26cd74dc)
> >     at ../sysdeps/unix/sysv/linux/futex-internal.h:88
> > #1  0x000003ff94c8d936 in __pthread_cond_wait_common (abstime=0x0,
> > mutex=0x2aa26cd7488, cond=0x2aa26cd74b0)
> >     at pthread_cond_wait.c:502
> > #2  0x000003ff94c8d936 in __pthread_cond_wait
> > (cond=cond@entry=0x2aa26cd74b0, mutex=mutex@entry=0x2aa26cd7488)
> >     at pthread_cond_wait.c:655
> > #3  0x000002aa2497072c in qemu_sem_wait (sem=sem@entry=0x2aa26cd7488)
> > at ../../util/qemu-thread-posix.c:328
> > #4  0x000002aa244f4a02 in postcopy_pause (s=0x2aa26cd7000) at
> > ../../migration/migration.c:3192

So the postcopy pause state didn't continue successfully on the source for
some reason ...

> > #5  0x000002aa244f4a02 in migration_detect_error (s=0x2aa26cd7000) at
> > ../../migration/migration.c:3255
> > #6  0x000002aa244f4a02 in migration_thread
> > (opaque=opaque@entry=0x2aa26cd7000) at
> > ../../migration/migration.c:3564
> > #7  0x000002aa2496fa3a in qemu_thread_start (args=<optimized out>) at
> > ../../util/qemu-thread-posix.c:521
> > #8  0x000003ff94c87aa8 in start_thread (arg=0x3ff4affd910) at
> > pthread_create.c:463
> > #9  0x000003ff94b7a896 in thread_start () at
> > ../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65

[...]

> > And here's 3528:
> > Thread 6 (Thread 0x3ff6ccfd910 (LWP 3841)):
> > #0  0x000003ffb1b8d936 in futex_wait_cancelable (private=<optimized
> > out>, expected=0, futex_word=0x2aa387a6aac)
> >     at ../sysdeps/unix/sysv/linux/futex-internal.h:88
> > #1  0x000003ffb1b8d936 in __pthread_cond_wait_common (abstime=0x0,
> > mutex=0x2aa387a6a58, cond=0x2aa387a6a80)
> >     at pthread_cond_wait.c:502
> > #2  0x000003ffb1b8d936 in __pthread_cond_wait
> > (cond=cond@entry=0x2aa387a6a80, mutex=mutex@entry=0x2aa387a6a58)
> >     at pthread_cond_wait.c:655
> > #3  0x000002aa36bf072c in qemu_sem_wait (sem=sem@entry=0x2aa387a6a58)
> > at ../../util/qemu-thread-posix.c:328
> > #4  0x000002aa366c369a in postcopy_pause_incoming (mis=<optimized
> > out>) at ../../migration/savevm.c:2541

Same on the destination side.

> > #5  0x000002aa366c369a in qemu_loadvm_state_main
> > (f=f@entry=0x2aa38897930, mis=mis@entry=0x2aa387a6820)
> >     at ../../migration/savevm.c:2615
> > #6  0x000002aa366c44fa in postcopy_ram_listen_thread
> > (opaque=opaque@entry=0x0) at ../../migration/savevm.c:1830
> > #7  0x000002aa36befa3a in qemu_thread_start (args=<optimized out>) at
> > ../../util/qemu-thread-posix.c:521
> > #8  0x000003ffb1b87aa8 in start_thread (arg=0x3ff6ccfd910) at
> > pthread_create.c:463
> > #9  0x000003ffb1a7a896 in thread_start () at
> > ../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65

Peter, is it possible that you enable QTEST_LOG=1 in your future migration-test
testcase and try to capture the stderr?  With the help of commit a47295014d
("migration-test: Only hide error if !QTEST_LOG", 2020-10-26), the test should
be able to dump quite some helpful information to further identify the issue.
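
Something along these lines should do it; this is only a sketch based on the
command line visible in the ps output above (run from the build directory,
with the G_TEST_DBUS_DAEMON setting omitted for brevity):

  QTEST_LOG=1 QTEST_QEMU_BINARY=./qemu-system-x86_64 QTEST_QEMU_IMG=./qemu-img \
      tests/qtest/migration-test --tap -k 2>migration-test.stderr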

I'll also try to find another s390 host to try reproduce on my side.

Thanks,

-- 
Peter Xu




* Re: recent flakiness (intermittent hangs) of migration-test
  2020-10-29 20:28       ` Peter Xu
@ 2020-10-30 11:48         ` Peter Maydell
  2020-10-30 13:53           ` Peter Xu
  0 siblings, 1 reply; 10+ messages in thread
From: Peter Maydell @ 2020-10-30 11:48 UTC (permalink / raw)
  To: Peter Xu; +Cc: Juan Quintela, Dr. David Alan Gilbert, QEMU Developers

On Thu, 29 Oct 2020 at 20:28, Peter Xu <peterx@redhat.com> wrote:
>
> On Thu, Oct 29, 2020 at 07:34:33PM +0000, Dr. David Alan Gilbert wrote:
> > > Here's qemu process 3514:
> > > Thread 5 (Thread 0x3ff4affd910 (LWP 3628)):
> > > #0  0x000003ff94c8d936 in futex_wait_cancelable (private=<optimized
> > > out>, expected=0, futex_word=0x2aa26cd74dc)
> > >     at ../sysdeps/unix/sysv/linux/futex-internal.h:88
> > > #1  0x000003ff94c8d936 in __pthread_cond_wait_common (abstime=0x0,
> > > mutex=0x2aa26cd7488, cond=0x2aa26cd74b0)
> > >     at pthread_cond_wait.c:502
> > > #2  0x000003ff94c8d936 in __pthread_cond_wait
> > > (cond=cond@entry=0x2aa26cd74b0, mutex=mutex@entry=0x2aa26cd7488)
> > >     at pthread_cond_wait.c:655
> > > #3  0x000002aa2497072c in qemu_sem_wait (sem=sem@entry=0x2aa26cd7488)
> > > at ../../util/qemu-thread-posix.c:328
> > > #4  0x000002aa244f4a02 in postcopy_pause (s=0x2aa26cd7000) at
> > > ../../migration/migration.c:3192
>
> So the postcopy pause state didn't continue successfully on the source for
> some reason ...
>
> > > #5  0x000002aa244f4a02 in migration_detect_error (s=0x2aa26cd7000) at
> > > ../../migration/migration.c:3255
> > > #6  0x000002aa244f4a02 in migration_thread
> > > (opaque=opaque@entry=0x2aa26cd7000) at
> > > ../../migration/migration.c:3564
> > > #7  0x000002aa2496fa3a in qemu_thread_start (args=<optimized out>) at
> > > ../../util/qemu-thread-posix.c:521
> > > #8  0x000003ff94c87aa8 in start_thread (arg=0x3ff4affd910) at
> > > pthread_create.c:463
> > > #9  0x000003ff94b7a896 in thread_start () at
> > > ../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65
>
> [...]
>
> > > And here's 3528:
> > > Thread 6 (Thread 0x3ff6ccfd910 (LWP 3841)):
> > > #0  0x000003ffb1b8d936 in futex_wait_cancelable (private=<optimized
> > > out>, expected=0, futex_word=0x2aa387a6aac)
> > >     at ../sysdeps/unix/sysv/linux/futex-internal.h:88
> > > #1  0x000003ffb1b8d936 in __pthread_cond_wait_common (abstime=0x0,
> > > mutex=0x2aa387a6a58, cond=0x2aa387a6a80)
> > >     at pthread_cond_wait.c:502
> > > #2  0x000003ffb1b8d936 in __pthread_cond_wait
> > > (cond=cond@entry=0x2aa387a6a80, mutex=mutex@entry=0x2aa387a6a58)
> > >     at pthread_cond_wait.c:655
> > > #3  0x000002aa36bf072c in qemu_sem_wait (sem=sem@entry=0x2aa387a6a58)
> > > at ../../util/qemu-thread-posix.c:328
> > > #4  0x000002aa366c369a in postcopy_pause_incoming (mis=<optimized
> > > out>) at ../../migration/savevm.c:2541
>
> Same on the destination side.
>
> > > #5  0x000002aa366c369a in qemu_loadvm_state_main
> > > (f=f@entry=0x2aa38897930, mis=mis@entry=0x2aa387a6820)
> > >     at ../../migration/savevm.c:2615
> > > #6  0x000002aa366c44fa in postcopy_ram_listen_thread
> > > (opaque=opaque@entry=0x0) at ../../migration/savevm.c:1830
> > > #7  0x000002aa36befa3a in qemu_thread_start (args=<optimized out>) at
> > > ../../util/qemu-thread-posix.c:521
> > > #8  0x000003ffb1b87aa8 in start_thread (arg=0x3ff6ccfd910) at
> > > pthread_create.c:463
> > > #9  0x000003ffb1a7a896 in thread_start () at
> > > ../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65
>
> Peter, is it possible that you enable QTEST_LOG=1 in your future migration-test
> testcase and try to capture the stderr?  With the help of commit a47295014d
> ("migration-test: Only hide error if !QTEST_LOG", 2020-10-26), the test should
> be able to dump quite some helpful information to further identify the issue.

Here's the result of running just the migration test with
QTEST_LOG=1:
https://people.linaro.org/~peter.maydell/migration.log
It's 300MB because when the test hangs one of the processes
is apparently in a polling state and continues to send status
queries.

My impression is that the test is OK on an unloaded machine but
more likely to fail if the box is doing other things at the
same time. Alternatively it might be a 'parallel make check' bug.
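
As a rough way to test that theory (purely a sketch; stress-ng here just
stands in for whatever else the box might be doing), one could loop the test
while keeping the host busy:

  stress-ng --cpu "$(nproc)" --timeout 1h &
  while QTEST_QEMU_BINARY=./qemu-system-x86_64 QTEST_QEMU_IMG=./qemu-img \
        tests/qtest/migration-test --tap -k; do :; done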

thanks
-- PMM



* Re: recent flakiness (intermittent hangs) of migration-test
  2020-10-30 11:48         ` Peter Maydell
@ 2020-10-30 13:53           ` Peter Xu
  2020-11-02 13:55             ` Philippe Mathieu-Daudé
  0 siblings, 1 reply; 10+ messages in thread
From: Peter Xu @ 2020-10-30 13:53 UTC (permalink / raw)
  To: Peter Maydell, Dr. David Alan Gilbert
  Cc: Juan Quintela, Dr. David Alan Gilbert, QEMU Developers

[-- Attachment #1: Type: text/plain, Size: 2043 bytes --]

On Fri, Oct 30, 2020 at 11:48:28AM +0000, Peter Maydell wrote:
> > Peter, is it possible that you enable QTEST_LOG=1 in your future migration-test
> > testcase and try to capture the stderr?  With the help of commit a47295014d
> > ("migration-test: Only hide error if !QTEST_LOG", 2020-10-26), the test should
> > be able to dump quite some helpful information to further identify the issue.
> 
> Here's the result of running just the migration test with
> QTEST_LOG=1:
> https://people.linaro.org/~peter.maydell/migration.log
> It's 300MB because when the test hangs one of the processes
> is apparently in a polling state and continues to send status
> queries.
> 
> My impression is that the test is OK on an unloaded machine but
> more likely to fail if the box is doing other things at the
> same time. Alternatively it might be a 'parallel make check' bug.

Thanks for collecting that, Peter.

I'm copy-pasting the important information out here (with some moves and
indents to make things even clearer):

...
{"execute": "migrate-recover", "arguments": {"uri": "unix:/tmp/migration-test-nGzu4q/migsocket-recover"}, "id": "recover-cmd"}
{"timestamp": {"seconds": 1604056292, "microseconds": 177955}, "event": "MIGRATION", "data": {"status": "setup"}}
{"return": {}, "id": "recover-cmd"}
{"execute": "query-migrate"}
...
{"execute": "migrate", "arguments": {"resume": true, "uri": "unix:/tmp/migration-test-nGzu4q/migsocket-recover"}}
qemu-system-x86_64: ram_save_queue_pages no previous block
qemu-system-x86_64: Detected IO failure for postcopy. Migration paused.
{"return": {}}
{"execute": "migrate-set-parameters", "arguments": {"max-postcopy-bandwidth": 0}}
...

The problem is probably a misuse of last_rb on the destination node.  When looking
at it, I also found a race.  So I guess I should fix both...

Peter, would it be easy to try applying the two patches I attached to see whether
the test hang would be resolved?  Dave, feel free to give early comments too on
the two fixes before I post them on the list.
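
(If it helps: assuming the two attachments below are saved into the QEMU
source tree, applying them should just be

  git am 0001-migration-Unify-reset-of-last_rb-on-destination-node.patch \
         0002-migration-Postpone-the-kick-of-the-fault-thread-afte.patch

followed by a rebuild and a re-run of the migration test.)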

Thanks!

-- 
Peter Xu

[-- Attachment #2: 0001-migration-Unify-reset-of-last_rb-on-destination-node.patch --]
[-- Type: text/plain, Size: 2851 bytes --]

From b9af559df89ae9c2eed506ebdd811356ea799f63 Mon Sep 17 00:00:00 2001
From: Peter Xu <peterx@redhat.com>
Date: Fri, 30 Oct 2020 09:37:44 -0400
Subject: [PATCH 1/2] migration: Unify reset of last_rb on destination node
 when recover

When postcopy recover happens, we need to reset last_rb after each return of
postcopy_pause_fault_thread() because that means we just got the postcopy
migration continued.

Unify this reset to the place right before we want to kick the fault thread
again, when we get the command MIG_CMD_POSTCOPY_RESUME from source.

This is actually more than that - because the main thread on destination will
now be able to call migrate_send_rp_req_pages_pending() too, so the fault
thread is not the only user of last_rb now.  Moving the reset earlier will allow
the first call to migrate_send_rp_req_pages_pending() to use the reset value
even if called from the main thread.

(NOTE: this is not a real fix to 0c26781c09 mentioned below, however it is just
 a mark that when picking up 0c26781c09 we'd better have this one too; the real
 fix will come later)

Fixes: 0c26781c09 ("migration: Sync requested pages after postcopy recovery")
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/postcopy-ram.c | 2 --
 migration/savevm.c       | 6 ++++++
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index d3bb3a744b..d99842eb1b 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -903,7 +903,6 @@ static void *postcopy_ram_fault_thread(void *opaque)
              * the channel is rebuilt.
              */
             if (postcopy_pause_fault_thread(mis)) {
-                mis->last_rb = NULL;
                 /* Continue to read the userfaultfd */
             } else {
                 error_report("%s: paused but don't allow to continue",
@@ -985,7 +984,6 @@ retry:
                 /* May be network failure, try to wait for recovery */
                 if (ret == -EIO && postcopy_pause_fault_thread(mis)) {
                     /* We got reconnected somehow, try to continue */
-                    mis->last_rb = NULL;
                     goto retry;
                 } else {
                     /* This is a unavoidable fault */
diff --git a/migration/savevm.c b/migration/savevm.c
index 21ccba9fb3..e8834991ec 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -2061,6 +2061,12 @@ static int loadvm_postcopy_handle_resume(MigrationIncomingState *mis)
         return 0;
     }
 
+    /*
+     * Reset the last_rb before we resend any page req to source again, since
+     * the source should have it reset already.
+     */
+    mis->last_rb = NULL;
+
     /*
      * This means source VM is ready to resume the postcopy migration.
      * It's time to switch state and release the fault thread to
-- 
2.26.2


[-- Attachment #3: 0002-migration-Postpone-the-kick-of-the-fault-thread-afte.patch --]
[-- Type: text/plain, Size: 2176 bytes --]

From 2a28241de49d27d7b0968f1d1b137aea153fc3b4 Mon Sep 17 00:00:00 2001
From: Peter Xu <peterx@redhat.com>
Date: Fri, 30 Oct 2020 09:44:32 -0400
Subject: [PATCH 2/2] migration: Postpone the kick of the fault thread after
 recover

The new migrate_send_rp_req_pages_pending() call should greatly improve
destination responsiveness because it will resync faulted address after
postcopy recovery.  However it is also the 1st place to initiate the page
request from the main thread.

One thing that was overlooked is that migrate_send_rp_message_req_pages() is
not designed to be thread-safe.  So if we wake the fault thread before syncing
all the faulted pages in the main thread, it means they can race.

Postpone the wake up operation after the sync of faulted addresses.

Fixes: 0c26781c09 ("migration: Sync requested pages after postcopy recovery")
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/savevm.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/migration/savevm.c b/migration/savevm.c
index e8834991ec..5f937a2762 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -2069,12 +2069,9 @@ static int loadvm_postcopy_handle_resume(MigrationIncomingState *mis)
 
     /*
      * This means source VM is ready to resume the postcopy migration.
-     * It's time to switch state and release the fault thread to
-     * continue service page faults.
      */
     migrate_set_state(&mis->state, MIGRATION_STATUS_POSTCOPY_RECOVER,
                       MIGRATION_STATUS_POSTCOPY_ACTIVE);
-    qemu_sem_post(&mis->postcopy_pause_sem_fault);
 
     trace_loadvm_postcopy_handle_resume();
 
@@ -2095,6 +2092,14 @@ static int loadvm_postcopy_handle_resume(MigrationIncomingState *mis)
      */
     migrate_send_rp_req_pages_pending(mis);
 
+    /*
+     * It's time to switch state and release the fault thread to continue
+     * service page faults.  Note that this should be explicitly after the
+     * above call to migrate_send_rp_req_pages_pending().  In short:
+     * migrate_send_rp_message_req_pages() is not thread safe, yet.
+     */
+    qemu_sem_post(&mis->postcopy_pause_sem_fault);
+
     return 0;
 }
 
-- 
2.26.2



* Re: recent flakiness (intermittent hangs) of migration-test
  2020-10-30 13:53           ` Peter Xu
@ 2020-11-02 13:55             ` Philippe Mathieu-Daudé
  2020-11-02 14:19               ` Christian Schoenebeck
  0 siblings, 1 reply; 10+ messages in thread
From: Philippe Mathieu-Daudé @ 2020-11-02 13:55 UTC (permalink / raw)
  To: Peter Xu, Peter Maydell, Dr. David Alan Gilbert,
	Christian Schoenebeck
  Cc: QEMU Developers, Juan Quintela

On 10/30/20 2:53 PM, Peter Xu wrote:
> On Fri, Oct 30, 2020 at 11:48:28AM +0000, Peter Maydell wrote:
>>> Peter, is it possible that you enable QTEST_LOG=1 in your future migration-test
>>> testcase and try to capture the stderr?  With the help of commit a47295014d
>>> ("migration-test: Only hide error if !QTEST_LOG", 2020-10-26), the test should
>>> be able to dump quite some helpful information to further identify the issue.
>>
>> Here's the result of running just the migration test with
>> QTEST_LOG=1:
>> https://people.linaro.org/~peter.maydell/migration.log
>> It's 300MB because when the test hangs one of the processes
>> is apparently in a polling state and continues to send status
>> queries.
>>
>> My impression is that the test is OK on an unloaded machine but
>> more likely to fail if the box is doing other things at the
>> same time. Alternatively it might be a 'parallel make check' bug.
> 
> Thanks for collecting that, Peter.
> 
> I'm copy-pasting the important information out here (with some moves and
> indents to make things even clearer):
> 
> ...
> {"execute": "migrate-recover", "arguments": {"uri": "unix:/tmp/migration-test-nGzu4q/migsocket-recover"}, "id": "recover-cmd"}
> {"timestamp": {"seconds": 1604056292, "microseconds": 177955}, "event": "MIGRATION", "data": {"status": "setup"}}
> {"return": {}, "id": "recover-cmd"}
> {"execute": "query-migrate"}
> ...
> {"execute": "migrate", "arguments": {"resume": true, "uri": "unix:/tmp/migration-test-nGzu4q/migsocket-recover"}}
> qemu-system-x86_64: ram_save_queue_pages no previous block
> qemu-system-x86_64: Detected IO failure for postcopy. Migration paused.
> {"return": {}}
> {"execute": "migrate-set-parameters", "arguments": {"max-postcopy-bandwidth": 0}}
> ...
> 
> The problem is probably a misuse of last_rb on the destination node.  When looking
> at it, I also found a race.  So I guess I should fix both...
> 
> Peter, would it be easy to try applying the two patches I attached to see whether
> the test hang would be resolved?  Dave, feel free to give early comments too on
> the two fixes before I post them on the list.

Per this comment:
https://www.mail-archive.com/qemu-devel@nongnu.org/msg756235.html
You could add:
Tested-by: Christian Schoenebeck <qemu_oss@crudebyte.com>




* Re: recent flakiness (intermittent hangs) of migration-test
  2020-11-02 13:55             ` Philippe Mathieu-Daudé
@ 2020-11-02 14:19               ` Christian Schoenebeck
  2020-11-02 15:14                 ` Peter Xu
  0 siblings, 1 reply; 10+ messages in thread
From: Christian Schoenebeck @ 2020-11-02 14:19 UTC (permalink / raw)
  To: qemu-devel
  Cc: Philippe Mathieu-Daudé, Peter Xu, Peter Maydell,
	Dr. David Alan Gilbert, Juan Quintela

On Monday, 2 November 2020 14:55:04 CET Philippe Mathieu-Daudé wrote:
> On 10/30/20 2:53 PM, Peter Xu wrote:
> > On Fri, Oct 30, 2020 at 11:48:28AM +0000, Peter Maydell wrote:
> >>> Peter, is it possible that you enable QTEST_LOG=1 in your future
> >>> migration-test testcase and try to capture the stderr?  With the help
> >>> of commit a47295014d ("migration-test: Only hide error if !QTEST_LOG",
> >>> 2020-10-26), the test should be able to dump quite some helpful
> >>> information to further identify the issue.>> 
> >> Here's the result of running just the migration test with
> >> QTEST_LOG=1:
> >> https://people.linaro.org/~peter.maydell/migration.log
> >> It's 300MB because when the test hangs one of the processes
> >> is apparently in a polling state and continues to send status
> >> queries.
> >> 
> >> My impression is that the test is OK on an unloaded machine but
> >> more likely to fail if the box is doing other things at the
> >> same time. Alternatively it might be a 'parallel make check' bug.
> > 
> > Thanks for collecting that, Peter.
> > 
> > I'm copy-pasting the important information out here (with some moves and
> > indents to make things even clearer):
> > 
> > ...
> > {"execute": "migrate-recover", "arguments": {"uri":
> > "unix:/tmp/migration-test-nGzu4q/migsocket-recover"}, "id":
> > "recover-cmd"} {"timestamp": {"seconds": 1604056292, "microseconds":
> > 177955}, "event": "MIGRATION", "data": {"status": "setup"}} {"return":
> > {}, "id": "recover-cmd"}
> > {"execute": "query-migrate"}
> > ...
> > {"execute": "migrate", "arguments": {"resume": true, "uri":
> > "unix:/tmp/migration-test-nGzu4q/migsocket-recover"}} qemu-system-x86_64:
> > ram_save_queue_pages no previous block
> > qemu-system-x86_64: Detected IO failure for postcopy. Migration paused.
> > {"return": {}}
> > {"execute": "migrate-set-parameters", "arguments":
> > {"max-postcopy-bandwidth": 0}} ...
> > 
> > The problem is probably a misuse of last_rb on the destination node.  When
> > looking at it, I also found a race.  So I guess I should fix both...
> > 
> > Peter, would it be easy to try applying the two patches I attached to see
> > whether the test hang would be resolved?  Dave, feel free to give early
> > comments too on the two fixes before I post them on the list.
> 
> Per this comment:
> https://www.mail-archive.com/qemu-devel@nongnu.org/msg756235.html
> You could add:
> Tested-by: Christian Schoenebeck <qemu_oss@crudebyte.com>

Yes, you can do that.

We've tested extensively with Peter Xu's patches over the last couple of days
on various systems and haven't encountered any further lockups since then.

Best regards,
Christian Schoenebeck





* Re: recent flakiness (intermittent hangs) of migration-test
  2020-11-02 14:19               ` Christian Schoenebeck
@ 2020-11-02 15:14                 ` Peter Xu
  0 siblings, 0 replies; 10+ messages in thread
From: Peter Xu @ 2020-11-02 15:14 UTC (permalink / raw)
  To: Christian Schoenebeck
  Cc: Peter Maydell, Juan Quintela, Philippe Mathieu-Daudé,
	qemu-devel, Dr. David Alan Gilbert

On Mon, Nov 02, 2020 at 03:19:50PM +0100, Christian Schoenebeck wrote:
> On Monday, 2 November 2020 14:55:04 CET Philippe Mathieu-Daudé wrote:
> > On 10/30/20 2:53 PM, Peter Xu wrote:
> > > On Fri, Oct 30, 2020 at 11:48:28AM +0000, Peter Maydell wrote:
> > >>> Peter, is it possible that you enable QTEST_LOG=1 in your future
> > >>> migration-test testcase and try to capture the stderr?  With the help
> > >>> of commit a47295014d ("migration-test: Only hide error if !QTEST_LOG",
> > >>> 2020-10-26), the test should be able to dump quite some helpful
> > >>> information to further identify the issue.>> 
> > >> Here's the result of running just the migration test with
> > >> QTEST_LOG=1:
> > >> https://people.linaro.org/~peter.maydell/migration.log
> > >> It's 300MB because when the test hangs one of the processes
> > >> is apparently in a polling state and continues to send status
> > >> queries.
> > >> 
> > >> My impression is that the test is OK on an unloaded machine but
> > >> more likely to fail if the box is doing other things at the
> > >> same time. Alternatively it might be a 'parallel make check' bug.
> > > 
> > > Thanks for collecting that, Peter.
> > > 
> > > I'm copy-pasting the important information out here (with some moves and
> > > indents to make things even clearer):
> > > 
> > > ...
> > > {"execute": "migrate-recover", "arguments": {"uri":
> > > "unix:/tmp/migration-test-nGzu4q/migsocket-recover"}, "id":
> > > "recover-cmd"} {"timestamp": {"seconds": 1604056292, "microseconds":
> > > 177955}, "event": "MIGRATION", "data": {"status": "setup"}} {"return":
> > > {}, "id": "recover-cmd"}
> > > {"execute": "query-migrate"}
> > > ...
> > > {"execute": "migrate", "arguments": {"resume": true, "uri":
> > > "unix:/tmp/migration-test-nGzu4q/migsocket-recover"}} qemu-system-x86_64:
> > > ram_save_queue_pages no previous block
> > > qemu-system-x86_64: Detected IO failure for postcopy. Migration paused.
> > > {"return": {}}
> > > {"execute": "migrate-set-parameters", "arguments":
> > > {"max-postcopy-bandwidth": 0}} ...
> > > 
> > > The problem is probably a misuse of last_rb on the destination node.  When
> > > looking at it, I also found a race.  So I guess I should fix both...
> > > 
> > > Peter, would it be easy to try applying the two patches I attached to see
> > > whether the test hang would be resolved?  Dave, feel free to give early
> > > comments too on the two fixes before I post them on the list.
> > 
> > Per this comment:
> > https://www.mail-archive.com/qemu-devel@nongnu.org/msg756235.html
> > You could add:
> > Tested-by: Christian Schoenebeck <qemu_oss@crudebyte.com>
> 
> Yes, you can do that.
> 
> We've extensively tested with Peter Xu's patches in the last couple days on 
> various systems and haven't encountered any further lockup since then.

Thanks, Christian & all.  I'll post them formally soon with your tags.

-- 
Peter Xu




Thread overview: 10+ messages
2020-10-29 17:20 recent flakiness (intermittent hangs) of migration-test Peter Maydell
2020-10-29 17:41 ` Dr. David Alan Gilbert
2020-10-29 18:55   ` Peter Maydell
2020-10-29 19:34     ` Dr. David Alan Gilbert
2020-10-29 20:28       ` Peter Xu
2020-10-30 11:48         ` Peter Maydell
2020-10-30 13:53           ` Peter Xu
2020-11-02 13:55             ` Philippe Mathieu-Daudé
2020-11-02 14:19               ` Christian Schoenebeck
2020-11-02 15:14                 ` Peter Xu
