From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([209.51.188.92]:59588) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1gi2C5-000240-7f for qemu-devel@nongnu.org; Fri, 11 Jan 2019 14:10:14 -0500 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1gi2C3-000452-6S for qemu-devel@nongnu.org; Fri, 11 Jan 2019 14:10:12 -0500 Received: from mail-wr1-x430.google.com ([2a00:1450:4864:20::430]:37460) by eggs.gnu.org with esmtps (TLS1.0:RSA_AES_128_CBC_SHA1:16) (Exim 4.71) (envelope-from ) id 1gi2C2-000447-5a for qemu-devel@nongnu.org; Fri, 11 Jan 2019 14:10:11 -0500 Received: by mail-wr1-x430.google.com with SMTP id s12so16359344wrt.4 for ; Fri, 11 Jan 2019 11:10:09 -0800 (PST) From: Alex =?utf-8?Q?Benn=C3=A9e?= Date: Fri, 11 Jan 2019 19:10:07 +0000 Message-ID: <87lg3rui28.fsf@linaro.org> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Subject: [Qemu-devel] Remaining CI failures List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: "qemu-devel@nongnu.org" Cc: Paolo Bonzini , Philippe =?utf-8?Q?Mathieu-Daud?= =?utf-8?Q?=C3=A9?= , "cota@braap.org" , Peter Maydell , =?utf-8?Q?Daniel_P=2E_Berrang?= =?utf-8?Q?=C3=A9?= Hi, So trying to narrow down the remaining failures in the CI system. There is one with a patch in flight (use g_usleep instead of sleep) but there remains two failure modes, both erratic. tests/qht-par: I can trigger this on my dev machine with a gprof enabled build: # QEMU configure log Fri Jan 11 14:10:45 GMT 2019 # Configured with: './configure' '--disable-tools' '--disable-docs' '--en= able-gprof' '--enable-gcov' I only seem to be able to trigger it when running via the wrapper in the make system: retry.py -n 30 --invert make check-tests/test-qht-par Eventually this crashes with: ERROR:tests/test-qht-par.c:20:test_qht: assertion failed (rc =3D=3D 0): (= 35584 =3D=3D 0) Leaving a core dump for the child: Core was generated by `tests/qht-bench -R -S0.1 -D10000 -N1 -n 2 -u 20 -d= 1' (gdb) info thread Id Target Id Frame * 1 Thread 0x7ffff6a7e700 (LWP 15473) 0x000055555557c306 in call_rcu_t= hread (opaque=3D0x0) at util/rcu.c:255 2 Thread 0x7ffff7fbe780 (LWP 15472) 0x00005555555b8d50 in gcov_read_= words () (gdb) bt #0 __mcount_internal (frompc=3D, selfpc=3D93824992383630)= at mcount.c:98 #1 0x00007ffff6e15e24 in mcount () at ../sysdeps/x86_64/_mcount.S:51 #2 0x000055555557928e in qemu_event_reset (ev=3D0x3cc692b8d452f400) at u= til/qemu-thread-posix.c:408 #3 0x000055555557c306 in call_rcu_thread (opaque=3D0x0) at util/rcu.c:255 #4 0x0000555555579630 in qemu_thread_start (args=3D0x555555808080) at ut= il/qemu-thread-posix.c:502 #5 0x00007ffff70e96db in start_thread (arg=3D0x7ffff6a7e700) at pthread_= create.c:463 #6 0x00007ffff6e1288f in clone () at ../sysdeps/unix/sysv/linux/x86_64/c= lone.S:95 (gdb) thread 2 [Switching to thread 2 (Thread 0x7ffff7fbe780 (LWP 15472))] #0 0x00005555555b8d50 in gcov_read_words () (gdb) bt #0 0x00005555555b8d50 in gcov_read_words () #1 0x00005555555b9453 in __gcov_read_summary () #2 0x00005555555ba461 in gcov_do_dump () #3 0x00005555555bab62 in __gcov_exit () #4 0x00005555555b8c22 in _GLOBAL__sub_D_00100_1_json_lexer_init () at qo= bject/json-lexer.c:365 #5 0x00007ffff7de5b73 in _dl_fini () at dl-fini.c:138 #6 0x00007ffff6d34041 in __run_exit_handlers (status=3D0, listp=3D0x7fff= f70dc718 <__exit_funcs>, run_list_atexit=3Drun_list_atexit@entry=3Dtrue, ru= n_dtors=3Drun_dtors@entry=3Dtrue) at exit.c:108 #7 0x00007ffff6d3413a in __GI_exit (status=3D) at exit.c:= 139 #8 0x00007ffff6d12b9e in __libc_start_main (main=3D0x555555575d50
= , argc=3D11, argv=3D0x7fffffffdf78, init=3D, fini=3D, rtld_fini=3D, stack_end=3D0x7fffffffdf68) at ../csu= /libc-start.c:344 #9 0x0000555555573c1a in _start () To trigger the second failure I had to run on a limited Xenial machine (16.04, 2 cores, 8Gb RAM) again with gprof build: # QEMU configure log Thu 10 Jan 22:22:52 GMT 2019 # Configured with: './configure' '--enable-gprof' '--enable-gcov' '--disa= ble-pie' '--target-list=3Daarch64-softmmu,arm-softmmu,i386-softmmu,mips-sof= tmmu,mips64-softmmu,ppc64-softmmu,riscv64-softmmu,s390x-softmmu,x86_64-soft= mmu' Running it like Travis does: retry.py -n 40 --invert -- make -j 3 check V=3D1 It eventually fails with: PASS: tests/test-hmp make: write error: stdout It's hard to tell from the output what was running that failed. So far I've managed to get the following information out of execsnoop: qemu-system-x86 1345 1332 0 x86_64-softmmu/qemu-system-x86_64 -qte= st unix:/tmp/qtest-1332.sock,nowait -qtest-log /dev/null -chardev socket,pa= th=3D/tmp/qtest-1332.qmp,no wait,id=3Dchar0 -mon chardev=3Dchar0,mode=3Dcontrol -machine accel=3Dqtes= t -display none -S -M pc-i440fx-4.0 sh 1350 1332 0 /bin/sh -c exec x86_64-softmmu/qemu-sy= stem-x86_64 -qtest unix:/tmp/qtest-1332.sock,nowait -qtest-log /dev/null -c= hardev socket,path=3D/tmp/q t qemu-system-x86 1350 1332 0 x86_64-softmmu/qemu-system-x86_64 -qte= st unix:/tmp/qtest-1332.sock,nowait -qtest-log /dev/null -chardev socket,pa= th=3D/tmp/qtest-1332.qmp,no wait,id=3Dchar0 -mon chardev=3Dchar0,mode=3Dcontrol -machine accel=3Dqtes= t -display none -S -M pc-q35-3.1 sh 1356 1332 0 /bin/sh -c exec x86_64-softmmu/qemu-sy= stem-x86_64 -qtest unix:/tmp/qtest-1332.sock,nowait -qtest-log /dev/null -c= hardev socket,path=3D/tmp/q t qemu-system-x86 1356 1332 0 x86_64-softmmu/qemu-system-x86_64 -qte= st unix:/tmp/qtest-1332.sock,nowait -qtest-log /dev/null -chardev socket,pa= th=3D/tmp/qtest-1332.qmp,no wait,id=3Dchar0 -mon chardev=3Dchar0,mode=3Dcontrol -machine accel=3Dqtes= t -display none -S -M pc-i440fx-3.1 sh 1361 1332 0 /bin/sh -c exec x86_64-softmmu/qemu-sy= stem-x86_64 -qtest unix:/tmp/qtest-1332.sock,nowait -qtest-log /dev/null -c= hardev socket,path=3D/tmp/q t qemu-system-x86 1361 1332 0 x86_64-softmmu/qemu-system-x86_64 -qte= st unix:/tmp/qtest-1332.sock,nowait -qtest-log /dev/null -chardev socket,pa= th=3D/tmp/qtest-1332.qmp,no wait,id=3Dchar0 -mon chardev=3Dchar0,mode=3Dcontrol -machine accel=3Dqtes= t -display none -S -M pc-q35-4.0 sh 1366 1332 0 /bin/sh -c exec x86_64-softmmu/qemu-sy= stem-x86_64 -qtest unix:/tmp/qtest-1332.sock,nowait -qtest-log /dev/null -c= hardev socket,path=3D/tmp/q t qemu-system-x86 1366 1332 0 x86_64-softmmu/qemu-system-x86_64 -qte= st unix:/tmp/qtest-1332.sock,nowait -qtest-log /dev/null -chardev socket,pa= th=3D/tmp/qtest-1332.qmp,no wait,id=3Dchar0 -mon chardev=3Dchar0,mode=3Dcontrol -machine accel=3Dqtes= t -display none -S -M none -m 2 tset 1370 2129 0 /usr/bin/tset -c git 1374 1373 0 /usr/bin/git rev-parse --git-dir git 1376 1375 0 /usr/bin/git symbolic-ref HEAD git 1378 1377 0 /usr/bin/git config branch.testing/nex= t.remote So even though tests/test-hmp has nominally passed I think it has to be one of the x86_64 tests unless there is something that was triggered a lot longer ago finally crashing out somehow. I did run the make under strace to see who is using O_NONBLOCK but even after filtering out a bunch of stuff it seems to be quite embedded: ag "O_NONBLOCK" check.strace | ag -v "/sys" | ag -v "/dev" | grep -v ".gi= t" | wc -l 2449 Anyway I thought it would be worth dumping my current debug state to the list in case anyone has any bright ideas about whats going on or fancies some weekend debugging. The retry.py is just a hacky script I use, I'm sure everyone else has something similar. See https://github.com/stsquad/retry -- Alex Benn=C3=A9e