intermittent hang, s390x host, bios-tables-test test, TPM

qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed

* intermittent hang, s390x host, bios-tables-test test, TPM
@ 2023-01-06 12:10 Peter Maydell
  2023-01-06 13:53 ` Stefan Berger
  2023-01-06 15:16 ` Stefan Berger
  0 siblings, 2 replies; 15+ messages in thread
From: Peter Maydell @ 2023-01-06 12:10 UTC (permalink / raw)
  To: QEMU Developers
  Cc: Eric Auger, Alex Bennée, Thomas Huth, Stefan Berger,
	Laurent Vivier, Paolo Bonzini

I'm seeing an intermittent hang on the s390 CI runner in the
bios-tables-test test. It looks like we've deadlocked because:

 * the TPM device is waiting for data on its socket that never arrives,
   and it's holding the iothread lock
 * QEMU is therefore not making forward progress;
   in particular it is unable to handle qtest queries/responses
 * the test binary thread 1 is waiting to get a response to its
   qtest command, which is not going to arrive
 * test binary thread 3 (tpm_emu_ctrl_thread) is has hit an
   assertion and is trying to kill QEMU via qtest_kill_qemu()
 * qtest_kill_qemu() is only a "SIGTERM and wait", so will wait
   forever, because QEMU won't respond to the SIGTERM while it's
   blocked waiting for the TPM device to release the iothread lock
 * because the ctrl-thread is waiting for QEMU to exit, it's never
   going to send the data that would unblock the TPM device emulation

A secondary problem revealed by this is that gitlab CI runner
jobs can be cancelled or time out in the gitlab CI, whilst leaving
processes lying around (and consuming CPU in the case of livelocks)
on the CI host. This means that future CI jobs then time out because
the host's CPU is all being wasted by these stuck old jobs. Why does
this happen and how can we configure the CI runner so timed out
jobs reliably clean themselves up ?


Process tree:
gitlab-+ 2659701  0.0  0.0 233272  5952 ?        Ssl  Jan05   0:02
/home/gitlab-runner/builds/-LCfcJ2T/0/qemu-project/qemu/build/tests/qtest/bios-tables-test
--tap -k
gitlab-+ 2661317  0.0  0.3 1783272 61500 ?       Sl   Jan05   0:00  \_
./qemu-system-x86_64 -qtest unix:/tmp/qtest-2659701.sock -qtest-log
/dev/null -chardev socket,path=/tmp/qtest-2659701.qmp,id=char0 -mon
chardev=char0,mode=control -display none -machine q35 -accel kvm
-accel tcg -net none -chardev
socket,id=chr,path=/tmp/qemu-test_acpi_q35_tcg_tis.Y88AY1/sock -tpmdev
emulator,id=dev,chardev=chr -device tpm-tis,tpmdev=dev -drive
id=hd0,if=none,file=tests/acpi-test-disk-bjuUil,format=raw -device
ide-hd,drive=hd0 -accel qtest

Here's the backtrace of the qemu-system-x86_64 process:


Thread 4 (Thread 0x3ff2bfff900 (LWP 2661322)):
#0  __libc_recvmsg (fd=<optimized out>, msg=msg@entry=0x3ff2bff9e48,
flags=flags@entry=1073741824) at
../sysdeps/unix/sysv/linux/recvmsg.c:30
#1  0x000002aa124aab4e in qio_channel_socket_readv (ioc=<optimized
out>, iov=<optimized out>, niov=<optimized out>, fds=0x3ff2bffa030,
nfds=0x3ff2bffa038, errp=0x0) at ../io/channel-socket.c:521
#2  0x000002aa124afca4 in qio_channel_readv_full (ioc=0x2aa142f5280,
iov=iov@entry=0x3ff2bffa040, niov=niov@entry=1,
fds=fds@entry=0x3ff2bffa030, nfds=nfds@entry=0x3ff2bffa038, errp=0x0)
at ../io/channel.c:66
#3  0x000002aa12579098 in tcp_chr_recv (chr=chr@entry=0x2aa141aaa60,
buf=buf@entry=0x3ff2bffa308 "", len=len@entry=4) at
../chardev/char-socket.c:284
#4  0x000002aa1257b75a in tcp_chr_sync_read (chr=0x2aa141aaa60,
buf=0x3ff2bffa308 "", len=<optimized out>) at
../chardev/char-socket.c:535
#5  0x000002aa12575aa0 in qemu_chr_fe_read_all
(be=be@entry=0x2aa1423df00, buf=buf@entry=0x3ff2bffa308 "",
len=len@entry=4) at
/home/gitlab-runner/builds/-LCfcJ2T/0/qemu-project/qemu/include/chardev/char.h:231
#6  0x000002aa12249872 in tpm_emulator_ctrlcmd (tpm=0x2aa1423dea0,
cmd=cmd@entry=9, msg=msg@entry=0x3ff2bffa308,
msg_len_in=msg_len_in@entry=0, msg_len_out=msg_len_out@entry=4) at
../backends/tpm/tpm_emulator.c:145
#7  0x000002aa1224a2ca in tpm_emulator_cancel_cmd (tb=<optimized out>)
at ../backends/tpm/tpm_emulator.c:500
#8  0x000002aa121e68c4 in tpm_tis_mmio_write (opaque=0x2aa1529ec20,
addr=24, val=64, size=<optimized out>) at
../hw/tpm/tpm_tis_common.c:663
#9  0x000002aa12402348 in memory_region_write_accessor
(mr=0x2aa1529ec20, addr=24, value=<optimized out>, size=<optimized
out>, shift=<optimized out>, mask=255, attrs=...) at
../softmmu/memory.c:493
#10 0x000002aa123fde26 in access_with_adjusted_size
(addr=addr@entry=24, value=value@entry=0x3ff2bffa678,
size=size@entry=1, access_size_min=<optimized out>,
access_size_max=<optimized out>, access_size_max@entry=0,
access_fn=0x2aa12402298 <memory_region_write_accessor>, mr=<optimized
out>, attrs=...) at ../softmmu/memory.c:555
Python Exception <class 'gdb.error'> PC not saved:

Thread 3 (Thread 0x3ff70aa0900 (LWP 2661321)):
#0  0x000003ff801f1b32 in __GI___poll (fds=0x3ff24003280, nfds=3,
timeout=<optimized out>) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x000003ff829d4386 in  () at /lib/s390x-linux-gnu/libglib-2.0.so.0
#2  0x000003ff829d4790 in g_main_loop_run () at
/lib/s390x-linux-gnu/libglib-2.0.so.0
#3  0x000002aa124d29d6 in iothread_run
(opaque=opaque@entry=0x2aa1434d300) at ../iothread.c:74
#4  0x000002aa1261c36a in qemu_thread_start (args=<optimized out>) at
../util/qemu-thread-posix.c:505
#5  0x000003ff80307e66 in start_thread (arg=0x3ff70aa0900) at
pthread_create.c:477
#6  0x000003ff801fcbe6 in thread_start () at
../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65

Thread 2 (Thread 0x3ff71423900 (LWP 2661319)):
#0  syscall () at ../sysdeps/unix/sysv/linux/s390/s390-64/syscall.S:37
#1  0x000002aa1261d30c in qemu_futex_wait (val=<optimized out>,
f=<optimized out>) at
/home/gitlab-runner/builds/-LCfcJ2T/0/qemu-project/qemu/include/qemu/futex.h:29
#2  qemu_event_wait (ev=ev@entry=0x2aa12e87220 <rcu_call_ready_event>)
at ../util/qemu-thread-posix.c:430
#3  0x000002aa12625eea in call_rcu_thread (opaque=opaque@entry=0x0) at
../util/rcu.c:261
#4  0x000002aa1261c36a in qemu_thread_start (args=<optimized out>) at
../util/qemu-thread-posix.c:505
#5  0x000003ff80307e66 in start_thread (arg=0x3ff71423900) at
pthread_create.c:477
#6  0x000003ff801fcbe6 in thread_start () at
../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65

Thread 1 (Thread 0x3ff83b7e460 (LWP 2661317)):
#0  __lll_lock_wait (futex=futex@entry=0x2aa12e55a00
<qemu_global_mutex>, private=<optimized out>) at lowlevellock.c:52
#1  0x000003ff8030a6ce in __GI___pthread_mutex_lock
(mutex=mutex@entry=0x2aa12e55a00 <qemu_global_mutex>) at
../nptl/pthread_mutex_lock.c:80
#2  0x000002aa1261c6ce in qemu_mutex_lock_impl (mutex=0x2aa12e55a00
<qemu_global_mutex>, file=0x2aa12809064 "../util/main-loop.c",
line=<optimized out>) at ../util/qemu-thread-posix.c:89
#3  0x000002aa122317a4 in qemu_mutex_lock_iothread_impl
(file=file@entry=0x2aa12809064 "../util/main-loop.c",
line=line@entry=318) at ../softmmu/cpus.c:503
#4  0x000002aa1262eec2 in os_host_main_loop_wait (timeout=<optimized
out>) at ../util/main-loop.c:318
#5  main_loop_wait (nonblocking=nonblocking@entry=0) at ../util/main-loop.c:606
#6  0x000002aa12237b64 in qemu_main_loop () at ../softmmu/runstate.c:731
#7  0x000002aa1206d7ca in qemu_default_main () at ../softmmu/main.c:37
#8  0x000003ff8012440a in __libc_start_main (main=0x2aa1206aea8
<main>, argc=<optimized out>, argv=0x3fff997a7c8, init=<optimized
out>, fini=<optimized out>, rtld_fini=0x3ff83b90e50 <_dl_fini>,
stack_end=0x3fff997a710) at libc-start.c:308
#9  0x000002aa1206d6f4 in _start () at ../softmmu/main.c:48

And here's the test binary:

Thread 4 (Thread 0x3ff8d2fd900 (LWP 2661320)):
#0  __libc_recvmsg (fd=<optimized out>, msg=msg@entry=0x3ff8d2fc7f8,
flags=flags@entry=0) at ../sysdeps/unix/sysv/linux/recvmsg.c:30
#1  0x000002aa095479e6 in qio_channel_socket_readv (ioc=<optimized
out>, iov=<optimized out>, niov=<optimized out>, fds=0x0, nfds=0x0,
errp=0x2aa096105f0 <error_abort>) at ../io/channel-socket.c:521
#2  0x000002aa0954cb3c in qio_channel_readv_full (ioc=0x3ff80004680,
iov=0x3ff8d2fc9e0, niov=1, fds=0x0, nfds=0x0, errp=0x2aa096105f0
<error_abort>) at ../io/channel.c:66
#3  0x000002aa0954cd78 in qio_channel_read
(ioc=ioc@entry=0x3ff80004680, buf=<optimized out>,
buflen=buflen@entry=6, errp=<optimized out>) at ../io/channel.c:306
#4  0x000002aa095224e4 in tpm_emu_tpm_thread (data=0x3fff5ffa160) at
../tests/qtest/tpm-emu.c:48
#5  0x000003ff8f0ffb7c in ?? () from /lib/s390x-linux-gnu/libglib-2.0.so.0
#6  0x000003ff8eb07e66 in start_thread (arg=0x3ff8d2fd900) at
pthread_create.c:477
#7  0x000003ff8e9fcbe6 in thread_start () at
../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65

Thread 3 (Thread 0x3ff8dafe900 (LWP 2661316)):
#0  0x000003ff8e9c6002 in __GI___wait4 (pid=<optimized out>,
stat_loc=stat_loc@entry=0x2aa0b42c9bc, options=<optimized out>,
usage=usage@entry=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:27
#1  0x000003ff8e9c5f72 in __GI___waitpid (pid=<optimized out>,
stat_loc=stat_loc@entry=0x2aa0b42c9bc, options=options@entry=0) at
waitpid.c:38
#2  0x000002aa0952a516 in qtest_wait_qemu (s=0x2aa0b42c9b0) at
../tests/qtest/libqtest.c:206
#3  0x000002aa0952a58a in qtest_kill_qemu (s=0x2aa0b42c9b0) at
../tests/qtest/libqtest.c:229
#4  0x000003ff8f0c288e in g_hook_list_invoke () from
/lib/s390x-linux-gnu/libglib-2.0.so.0
#5  <signal handler called>
#6  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#7  0x000003ff8e9240a2 in __GI_abort () at abort.c:79
#8  0x000003ff8f0feda8 in g_assertion_message () from
/lib/s390x-linux-gnu/libglib-2.0.so.0
#9  0x000003ff8f0fedfe in g_assertion_message_expr () from
/lib/s390x-linux-gnu/libglib-2.0.so.0
#10 0x000002aa09522904 in tpm_emu_ctrl_thread (data=0x3fff5ffa160) at
../tests/qtest/tpm-emu.c:189
#11 0x000003ff8f0ffb7c in ?? () from /lib/s390x-linux-gnu/libglib-2.0.so.0
#12 0x000003ff8eb07e66 in start_thread (arg=0x3ff8dafe900) at
pthread_create.c:477
#13 0x000003ff8e9fcbe6 in thread_start () at
../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65

Thread 2 (Thread 0x3ff8e2ff900 (LWP 2659702)):
#0  syscall () at ../sysdeps/unix/sysv/linux/s390/s390-64/syscall.S:37
#1  0x000002aa0957d044 in qemu_futex_wait (val=<optimized out>,
f=<optimized out>) at
/home/gitlab-runner/builds/-LCfcJ2T/0/qemu-project/qemu/include/qemu/futex.h:29
#2  qemu_event_wait (ev=ev@entry=0x2aa09610890 <rcu_call_ready_event>)
at ../util/qemu-thread-posix.c:430
#3  0x000002aa0958541a in call_rcu_thread (opaque=opaque@entry=0x0) at
../util/rcu.c:261
#4  0x000002aa0957c0a2 in qemu_thread_start (args=<optimized out>) at
../util/qemu-thread-posix.c:505
#5  0x000003ff8eb07e66 in start_thread (arg=0x3ff8e2ff900) at
pthread_create.c:477
#6  0x000003ff8e9fcbe6 in thread_start () at
../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65

Thread 1 (Thread 0x3ff8f36db80 (LWP 2659701)):
#0  __libc_recv (fd=<optimized out>, buf=buf@entry=0x3fff5ff96c0,
len=len@entry=1024, flags=flags@entry=0) at
../sysdeps/unix/sysv/linux/recv.c:30
#1  0x000002aa0952998a in recv (__flags=0, __n=1024,
__buf=0x3fff5ff96c0, __fd=<optimized out>) at
/usr/include/s390x-linux-gnu/bits/socket2.h:44
#2  qtest_client_socket_recv_line (s=0x2aa0b42c9b0) at
../tests/qtest/libqtest.c:578
#3  0x000002aa09529b16 in qtest_rsp_args (s=s@entry=0x2aa0b42c9b0,
expected_args=expected_args@entry=2) at ../tests/qtest/libqtest.c:605
#4  0x000002aa0952a3e6 in qtest_read (s=0x2aa0b42c9b0, cmd=<optimized
out>, addr=<optimized out>) at ../tests/qtest/libqtest.c:1044
#5  0x000002aa0952bcac in qtest_readb (s=s@entry=0x2aa0b42c9b0,
addr=addr@entry=31761) at ../tests/qtest/libqtest.c:1054
#6  0x000002aa09521c72 in boot_sector_test (qts=0x2aa0b42c9b0) at
../tests/qtest/boot-sector.c:153
#7  0x000002aa0951e19e in test_acpi_load_tables (data=0x3fff5ffa1a8,
use_uefi=<optimized out>) at ../tests/qtest/bios-tables-test.c:757
#8  0x000002aa0951ea32 in test_acpi_one (params=<optimized out>,
data=0x3fff5ffa1a8) at ../tests/qtest/bios-tables-test.c:812
#9  0x000002aa0952197a in test_acpi_tcg_tpm (tpm_version=<optimized
out>, base=4275306496, tpm_if=0x2aa095b7a26 "tis",
machine=0x2aa095b5a18 "q35") at ../tests/qtest/bios-tables-test.c:1293
#10 0x000003ff8f0fe608 in ?? () from /lib/s390x-linux-gnu/libglib-2.0.so.0
#11 0x000003ff8f0fe392 in ?? () from /lib/s390x-linux-gnu/libglib-2.0.so.0
#12 0x000003ff8f0fe392 in ?? () from /lib/s390x-linux-gnu/libglib-2.0.so.0
#13 0x000003ff8f0fe392 in ?? () from /lib/s390x-linux-gnu/libglib-2.0.so.0
#14 0x000003ff8f0feada in g_test_run_suite () from
/lib/s390x-linux-gnu/libglib-2.0.so.0
#15 0x000003ff8f0feb10 in g_test_run () from
/lib/s390x-linux-gnu/libglib-2.0.so.0
#16 0x000002aa0951cea8 in main (argc=<optimized out>, argv=<optimized
out>) at ../tests/qtest/bios-tables-test.c:2067


thanks
-- PMM


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: intermittent hang, s390x host, bios-tables-test test, TPM
  2023-01-06 12:10 intermittent hang, s390x host, bios-tables-test test, TPM Peter Maydell
@ 2023-01-06 13:53 ` Stefan Berger
  2023-01-06 14:04   ` Peter Maydell
  2023-01-06 15:16 ` Stefan Berger
  1 sibling, 1 reply; 15+ messages in thread
From: Stefan Berger @ 2023-01-06 13:53 UTC (permalink / raw)
  To: Peter Maydell, QEMU Developers
  Cc: Eric Auger, Alex Bennée, Thomas Huth, Laurent Vivier,
	Paolo Bonzini



On 1/6/23 07:10, Peter Maydell wrote:
> I'm seeing an intermittent hang on the s390 CI runner in the
Is this a new hang or has this been occurring for a while? I am asking because the test case is not new.

    Stefan


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: intermittent hang, s390x host, bios-tables-test test, TPM
  2023-01-06 13:53 ` Stefan Berger
@ 2023-01-06 14:04   ` Peter Maydell
  0 siblings, 0 replies; 15+ messages in thread
From: Peter Maydell @ 2023-01-06 14:04 UTC (permalink / raw)
  To: Stefan Berger
  Cc: QEMU Developers, Eric Auger, Alex Bennée, Thomas Huth,
	Laurent Vivier, Paolo Bonzini

On Fri, 6 Jan 2023 at 13:53, Stefan Berger <stefanb@linux.ibm.com> wrote:
>
>
>
> On 1/6/23 07:10, Peter Maydell wrote:
> > I'm seeing an intermittent hang on the s390 CI runner in the
> Is this a new hang or has this been occurring for a while? I am asking because the test case is not new.

It's intermittent, so no idea.

-- PMM


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: intermittent hang, s390x host, bios-tables-test test, TPM
  2023-01-06 12:10 intermittent hang, s390x host, bios-tables-test test, TPM Peter Maydell
  2023-01-06 13:53 ` Stefan Berger
@ 2023-01-06 15:16 ` Stefan Berger
  2023-01-06 15:39   ` Peter Maydell
                     ` (2 more replies)
  1 sibling, 3 replies; 15+ messages in thread
From: Stefan Berger @ 2023-01-06 15:16 UTC (permalink / raw)
  To: Peter Maydell, QEMU Developers
  Cc: Eric Auger, Alex Bennée, Thomas Huth, Laurent Vivier,
	Paolo Bonzini



On 1/6/23 07:10, Peter Maydell wrote:
> I'm seeing an intermittent hang on the s390 CI runner in the
> bios-tables-test test. It looks like we've deadlocked because:
> 
>   * the TPM device is waiting for data on its socket that never arrives,
>     and it's holding the iothread lock
>   * QEMU is therefore not making forward progress;
>     in particular it is unable to handle qtest queries/responses
>   * the test binary thread 1 is waiting to get a response to its
>     qtest command, which is not going to arrive
>   * test binary thread 3 (tpm_emu_ctrl_thread) is has hit an
>     assertion and is trying to kill QEMU via qtest_kill_qemu()
>   * qtest_kill_qemu() is only a "SIGTERM and wait", so will wait
>     forever, because QEMU won't respond to the SIGTERM while it's
>     blocked waiting for the TPM device to release the iothread lock
>   * because the ctrl-thread is waiting for QEMU to exit, it's never
>     going to send the data that would unblock the TPM device emulation
> 
[...]

> 
> Thread 3 (Thread 0x3ff8dafe900 (LWP 2661316)):
> #0  0x000003ff8e9c6002 in __GI___wait4 (pid=<optimized out>,
> stat_loc=stat_loc@entry=0x2aa0b42c9bc, options=<optimized out>,
> usage=usage@entry=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:27
> #1  0x000003ff8e9c5f72 in __GI___waitpid (pid=<optimized out>,
> stat_loc=stat_loc@entry=0x2aa0b42c9bc, options=options@entry=0) at
> waitpid.c:38
> #2  0x000002aa0952a516 in qtest_wait_qemu (s=0x2aa0b42c9b0) at
> ../tests/qtest/libqtest.c:206
> #3  0x000002aa0952a58a in qtest_kill_qemu (s=0x2aa0b42c9b0) at
> ../tests/qtest/libqtest.c:229
> #4  0x000003ff8f0c288e in g_hook_list_invoke () from
> /lib/s390x-linux-gnu/libglib-2.0.so.0
> #5  <signal handler called>
> #6  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
> #7  0x000003ff8e9240a2 in __GI_abort () at abort.c:79
> #8  0x000003ff8f0feda8 in g_assertion_message () from
> /lib/s390x-linux-gnu/libglib-2.0.so.0
> #9  0x000003ff8f0fedfe in g_assertion_message_expr () from
> /lib/s390x-linux-gnu/libglib-2.0.so.0
> #10 0x000002aa09522904 in tpm_emu_ctrl_thread (data=0x3fff5ffa160) at
> ../tests/qtest/tpm-emu.c:189

This here seems to be the root cause. An unknown control channel command was received from the TPM emulator backend by the control channel thread and we end up in g_assert_not_reached().

https://github.com/qemu/qemu/blob/master/tests/qtest/tpm-emu.c#L189



         ret = qio_channel_read(ioc, (char *)&cmd, sizeof(cmd), NULL);
         if (ret <= 0) {
             break;
         }

         cmd = be32_to_cpu(cmd);
         switch (cmd) {
  [...]
         default:
             g_debug("unimplemented %u", cmd);
             g_assert_not_reached();                <------------------
         }

I will run this test case in an endless loop on an x86_64 host and see what we get there ...

   Stefan


> #11 0x000003ff8f0ffb7c in ?? () from /lib/s390x-linux-gnu/libglib-2.0.so.0
> #12 0x000003ff8eb07e66 in start_thread (arg=0x3ff8dafe900) at
> pthread_create.c:477
> #13 0x000003ff8e9fcbe6 in thread_start () at
> ../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: intermittent hang, s390x host, bios-tables-test test, TPM
  2023-01-06 15:16 ` Stefan Berger
@ 2023-01-06 15:39   ` Peter Maydell
  2023-01-06 15:58     ` Stefan Berger
  2023-01-10 19:25     ` Daniel P. Berrangé
  2023-01-10 18:50   ` Stefan Berger
  2023-01-10 19:44   ` Daniel P. Berrangé
  2 siblings, 2 replies; 15+ messages in thread
From: Peter Maydell @ 2023-01-06 15:39 UTC (permalink / raw)
  To: Stefan Berger
  Cc: QEMU Developers, Eric Auger, Alex Bennée, Thomas Huth,
	Laurent Vivier, Paolo Bonzini

On Fri, 6 Jan 2023 at 15:16, Stefan Berger <stefanb@linux.ibm.com> wrote:
>
>
>
> On 1/6/23 07:10, Peter Maydell wrote:
> > I'm seeing an intermittent hang on the s390 CI runner in the
> > bios-tables-test test. It looks like we've deadlocked because:
> >
> >   * the TPM device is waiting for data on its socket that never arrives,
> >     and it's holding the iothread lock
> >   * QEMU is therefore not making forward progress;
> >     in particular it is unable to handle qtest queries/responses
> >   * the test binary thread 1 is waiting to get a response to its
> >     qtest command, which is not going to arrive
> >   * test binary thread 3 (tpm_emu_ctrl_thread) is has hit an
> >     assertion and is trying to kill QEMU via qtest_kill_qemu()
> >   * qtest_kill_qemu() is only a "SIGTERM and wait", so will wait
> >     forever, because QEMU won't respond to the SIGTERM while it's
> >     blocked waiting for the TPM device to release the iothread lock
> >   * because the ctrl-thread is waiting for QEMU to exit, it's never
> >     going to send the data that would unblock the TPM device emulation
> >
> [...]
>
> >
> > Thread 3 (Thread 0x3ff8dafe900 (LWP 2661316)):
> > #0  0x000003ff8e9c6002 in __GI___wait4 (pid=<optimized out>,
> > stat_loc=stat_loc@entry=0x2aa0b42c9bc, options=<optimized out>,
> > usage=usage@entry=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:27
> > #1  0x000003ff8e9c5f72 in __GI___waitpid (pid=<optimized out>,
> > stat_loc=stat_loc@entry=0x2aa0b42c9bc, options=options@entry=0) at
> > waitpid.c:38
> > #2  0x000002aa0952a516 in qtest_wait_qemu (s=0x2aa0b42c9b0) at
> > ../tests/qtest/libqtest.c:206
> > #3  0x000002aa0952a58a in qtest_kill_qemu (s=0x2aa0b42c9b0) at
> > ../tests/qtest/libqtest.c:229
> > #4  0x000003ff8f0c288e in g_hook_list_invoke () from
> > /lib/s390x-linux-gnu/libglib-2.0.so.0
> > #5  <signal handler called>
> > #6  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
> > #7  0x000003ff8e9240a2 in __GI_abort () at abort.c:79
> > #8  0x000003ff8f0feda8 in g_assertion_message () from
> > /lib/s390x-linux-gnu/libglib-2.0.so.0
> > #9  0x000003ff8f0fedfe in g_assertion_message_expr () from
> > /lib/s390x-linux-gnu/libglib-2.0.so.0
> > #10 0x000002aa09522904 in tpm_emu_ctrl_thread (data=0x3fff5ffa160) at
> > ../tests/qtest/tpm-emu.c:189
>
> This here seems to be the root cause. An unknown control channel command
> was received from the TPM emulator backend by the control channel thread
> and we end up in g_assert_not_reached().

Yeah. It would be good if we didn't deadlock without printing
the assertion, though...

I guess we could improve qtest_kill_qemu() so it doesn't wait
indefinitely for QEMU to exit but instead sends a SIGKILL 20
seconds after the SIGTERM. (Annoyingly, there is no convenient
"waitpid but with a timeout" function...)

thanks
-- PMM


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: intermittent hang, s390x host, bios-tables-test test, TPM
  2023-01-06 15:39   ` Peter Maydell
@ 2023-01-06 15:58     ` Stefan Berger
  2023-01-10 19:25     ` Daniel P. Berrangé
  1 sibling, 0 replies; 15+ messages in thread
From: Stefan Berger @ 2023-01-06 15:58 UTC (permalink / raw)
  To: Peter Maydell
  Cc: QEMU Developers, Eric Auger, Alex Bennée, Thomas Huth,
	Laurent Vivier, Paolo Bonzini



On 1/6/23 10:39, Peter Maydell wrote:
> On Fri, 6 Jan 2023 at 15:16, Stefan Berger <stefanb@linux.ibm.com> wrote:
>>
>>
>>
>> On 1/6/23 07:10, Peter Maydell wrote:
>>> I'm seeing an intermittent hang on the s390 CI runner in the
>>> bios-tables-test test. It looks like we've deadlocked because:

>>
>>>
>>> Thread 3 (Thread 0x3ff8dafe900 (LWP 2661316)):
>>> #0  0x000003ff8e9c6002 in __GI___wait4 (pid=<optimized out>,
>>> stat_loc=stat_loc@entry=0x2aa0b42c9bc, options=<optimized out>,
>>> usage=usage@entry=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:27
>>> #1  0x000003ff8e9c5f72 in __GI___waitpid (pid=<optimized out>,
>>> stat_loc=stat_loc@entry=0x2aa0b42c9bc, options=options@entry=0) at
>>> waitpid.c:38
>>> #2  0x000002aa0952a516 in qtest_wait_qemu (s=0x2aa0b42c9b0) at
>>> ../tests/qtest/libqtest.c:206
>>> #3  0x000002aa0952a58a in qtest_kill_qemu (s=0x2aa0b42c9b0) at
>>> ../tests/qtest/libqtest.c:229
>>> #4  0x000003ff8f0c288e in g_hook_list_invoke () from
>>> /lib/s390x-linux-gnu/libglib-2.0.so.0
>>> #5  <signal handler called>
>>> #6  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
>>> #7  0x000003ff8e9240a2 in __GI_abort () at abort.c:79
>>> #8  0x000003ff8f0feda8 in g_assertion_message () from
>>> /lib/s390x-linux-gnu/libglib-2.0.so.0
>>> #9  0x000003ff8f0fedfe in g_assertion_message_expr () from
>>> /lib/s390x-linux-gnu/libglib-2.0.so.0
>>> #10 0x000002aa09522904 in tpm_emu_ctrl_thread (data=0x3fff5ffa160) at
>>> ../tests/qtest/tpm-emu.c:189
>>
>> This here seems to be the root cause. An unknown control channel command
>> was received from the TPM emulator backend by the control channel thread
>> and we end up in g_assert_not_reached().
> 
> Yeah. It would be good if we didn't deadlock without printing
> the assertion, though...
> 
> I guess we could improve qtest_kill_qemu() so it doesn't wait
> indefinitely for QEMU to exit but instead sends a SIGKILL 20
> seconds after the SIGTERM. (Annoyingly, there is no convenient
> "waitpid but with a timeout" function...)

Yes, wait5(&to,...) doesn't exist, yet. I guess one would have to use a loop sending signal 0 to the pid for 20 seconds?

Though I'd really like to know where that data race is coming from and why we get an unknown command. I am now running this on a ppc64 and x86_64 host over the weekend to see what happens. All good so far.

    Stefan
> 
> thanks
> -- PMM


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: intermittent hang, s390x host, bios-tables-test test, TPM
  2023-01-06 15:16 ` Stefan Berger
  2023-01-06 15:39   ` Peter Maydell
@ 2023-01-10 18:50   ` Stefan Berger
  2023-01-10 19:27     ` Daniel P. Berrangé
  2023-01-10 19:44   ` Daniel P. Berrangé
  2 siblings, 1 reply; 15+ messages in thread
From: Stefan Berger @ 2023-01-10 18:50 UTC (permalink / raw)
  To: Peter Maydell, QEMU Developers
  Cc: Eric Auger, Alex Bennée, Thomas Huth, Laurent Vivier,
	Paolo Bonzini



On 1/6/23 10:16, Stefan Berger wrote:
> 
> 
> On 1/6/23 07:10, Peter Maydell wrote:
>> I'm seeing an intermittent hang on the s390 CI runner in the
>> bios-tables-test test. It looks like we've deadlocked because:
>>
>>   * the TPM device is waiting for data on its socket that never arrives,
>>     and it's holding the iothread lock
>>   * QEMU is therefore not making forward progress;
>>     in particular it is unable to handle qtest queries/responses
>>   * the test binary thread 1 is waiting to get a response to its
>>     qtest command, which is not going to arrive
>>   * test binary thread 3 (tpm_emu_ctrl_thread) is has hit an
>>     assertion and is trying to kill QEMU via qtest_kill_qemu()
>>   * qtest_kill_qemu() is only a "SIGTERM and wait", so will wait
>>     forever, because QEMU won't respond to the SIGTERM while it's
>>     blocked waiting for the TPM device to release the iothread lock
>>   * because the ctrl-thread is waiting for QEMU to exit, it's never
>>     going to send the data that would unblock the TPM device emulation
>>
> [...]
> 
>>
>> Thread 3 (Thread 0x3ff8dafe900 (LWP 2661316)):
>> #0  0x000003ff8e9c6002 in __GI___wait4 (pid=<optimized out>,
>> stat_loc=stat_loc@entry=0x2aa0b42c9bc, options=<optimized out>,
>> usage=usage@entry=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:27
>> #1  0x000003ff8e9c5f72 in __GI___waitpid (pid=<optimized out>,
>> stat_loc=stat_loc@entry=0x2aa0b42c9bc, options=options@entry=0) at
>> waitpid.c:38
>> #2  0x000002aa0952a516 in qtest_wait_qemu (s=0x2aa0b42c9b0) at
>> ../tests/qtest/libqtest.c:206
>> #3  0x000002aa0952a58a in qtest_kill_qemu (s=0x2aa0b42c9b0) at
>> ../tests/qtest/libqtest.c:229
>> #4  0x000003ff8f0c288e in g_hook_list_invoke () from
>> /lib/s390x-linux-gnu/libglib-2.0.so.0
>> #5  <signal handler called>
>> #6  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
>> #7  0x000003ff8e9240a2 in __GI_abort () at abort.c:79
>> #8  0x000003ff8f0feda8 in g_assertion_message () from
>> /lib/s390x-linux-gnu/libglib-2.0.so.0
>> #9  0x000003ff8f0fedfe in g_assertion_message_expr () from
>> /lib/s390x-linux-gnu/libglib-2.0.so.0
>> #10 0x000002aa09522904 in tpm_emu_ctrl_thread (data=0x3fff5ffa160) at
>> ../tests/qtest/tpm-emu.c:189
> 
> This here seems to be the root cause. An unknown control channel command was received from the TPM emulator backend by the control channel thread and we end up in g_assert_not_reached().
> 
> https://github.com/qemu/qemu/blob/master/tests/qtest/tpm-emu.c#L189
> 
> 
> 
>          ret = qio_channel_read(ioc, (char *)&cmd, sizeof(cmd), NULL);
>          if (ret <= 0) {
>              break;
>          }
> 
>          cmd = be32_to_cpu(cmd);
>          switch (cmd) {
>   [...]
>          default:
>              g_debug("unimplemented %u", cmd);
>              g_assert_not_reached();                <------------------
>          }
> 
> I will run this test case in an endless loop on an x86_64 host and see what we get there ...

I could not recreate the issue running the  test on a ppc64 and x86_64 host. There we like >100k test runs on ppc64 and >40k on x86_64. Also simulating the reception of an unsupported command did not lead to a hang like shown here.

Further, it's not clear to me how to check the status of a process before calling wait() in a portable way. kill -0 on a <defunct> process, which has exited to to SIGTERM, still returns 0, so this check doesn't work to determine whether the process has exited.

    Stefan


> 
>    Stefan
> 
> 
>> #11 0x000003ff8f0ffb7c in ?? () from /lib/s390x-linux-gnu/libglib-2.0.so.0
>> #12 0x000003ff8eb07e66 in start_thread (arg=0x3ff8dafe900) at
>> pthread_create.c:477
>> #13 0x000003ff8e9fcbe6 in thread_start () at
>> ../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: intermittent hang, s390x host, bios-tables-test test, TPM
  2023-01-06 15:39   ` Peter Maydell
  2023-01-06 15:58     ` Stefan Berger
@ 2023-01-10 19:25     ` Daniel P. Berrangé
  2023-01-10 22:10       ` Peter Maydell
  1 sibling, 1 reply; 15+ messages in thread
From: Daniel P. Berrangé @ 2023-01-10 19:25 UTC (permalink / raw)
  To: Peter Maydell
  Cc: Stefan Berger, QEMU Developers, Eric Auger, Alex Bennée,
	Thomas Huth, Laurent Vivier, Paolo Bonzini

On Fri, Jan 06, 2023 at 03:39:31PM +0000, Peter Maydell wrote:
> On Fri, 6 Jan 2023 at 15:16, Stefan Berger <stefanb@linux.ibm.com> wrote:
> >
> >
> >
> > On 1/6/23 07:10, Peter Maydell wrote:
> > > I'm seeing an intermittent hang on the s390 CI runner in the
> > > bios-tables-test test. It looks like we've deadlocked because:
> > >
> > >   * the TPM device is waiting for data on its socket that never arrives,
> > >     and it's holding the iothread lock
> > >   * QEMU is therefore not making forward progress;
> > >     in particular it is unable to handle qtest queries/responses
> > >   * the test binary thread 1 is waiting to get a response to its
> > >     qtest command, which is not going to arrive
> > >   * test binary thread 3 (tpm_emu_ctrl_thread) is has hit an
> > >     assertion and is trying to kill QEMU via qtest_kill_qemu()
> > >   * qtest_kill_qemu() is only a "SIGTERM and wait", so will wait
> > >     forever, because QEMU won't respond to the SIGTERM while it's
> > >     blocked waiting for the TPM device to release the iothread lock
> > >   * because the ctrl-thread is waiting for QEMU to exit, it's never
> > >     going to send the data that would unblock the TPM device emulation
> > >
> > [...]
> >
> > >
> > > Thread 3 (Thread 0x3ff8dafe900 (LWP 2661316)):
> > > #0  0x000003ff8e9c6002 in __GI___wait4 (pid=<optimized out>,
> > > stat_loc=stat_loc@entry=0x2aa0b42c9bc, options=<optimized out>,
> > > usage=usage@entry=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:27
> > > #1  0x000003ff8e9c5f72 in __GI___waitpid (pid=<optimized out>,
> > > stat_loc=stat_loc@entry=0x2aa0b42c9bc, options=options@entry=0) at
> > > waitpid.c:38
> > > #2  0x000002aa0952a516 in qtest_wait_qemu (s=0x2aa0b42c9b0) at
> > > ../tests/qtest/libqtest.c:206
> > > #3  0x000002aa0952a58a in qtest_kill_qemu (s=0x2aa0b42c9b0) at
> > > ../tests/qtest/libqtest.c:229
> > > #4  0x000003ff8f0c288e in g_hook_list_invoke () from
> > > /lib/s390x-linux-gnu/libglib-2.0.so.0
> > > #5  <signal handler called>
> > > #6  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
> > > #7  0x000003ff8e9240a2 in __GI_abort () at abort.c:79
> > > #8  0x000003ff8f0feda8 in g_assertion_message () from
> > > /lib/s390x-linux-gnu/libglib-2.0.so.0
> > > #9  0x000003ff8f0fedfe in g_assertion_message_expr () from
> > > /lib/s390x-linux-gnu/libglib-2.0.so.0
> > > #10 0x000002aa09522904 in tpm_emu_ctrl_thread (data=0x3fff5ffa160) at
> > > ../tests/qtest/tpm-emu.c:189
> >
> > This here seems to be the root cause. An unknown control channel command
> > was received from the TPM emulator backend by the control channel thread
> > and we end up in g_assert_not_reached().
> 
> Yeah. It would be good if we didn't deadlock without printing
> the assertion, though...
> 
> I guess we could improve qtest_kill_qemu() so it doesn't wait
> indefinitely for QEMU to exit but instead sends a SIGKILL 20
> seconds after the SIGTERM. (Annoyingly, there is no convenient
> "waitpid but with a timeout" function...)

We don't need to touch that. Instead the tpm-emu.c file needs to
call  qtest_add_abrt_handler() passing a callback that will invoke
qio_channel_close on its end of the socket. This will cause the
QEMU process to get EOF on the other end of the socket. It then
won't be stuck holding the iothread lock, and will be able to
respond to SIGTERM.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: intermittent hang, s390x host, bios-tables-test test, TPM
  2023-01-10 18:50   ` Stefan Berger
@ 2023-01-10 19:27     ` Daniel P. Berrangé
  2023-01-10 19:47       ` Stefan Berger
  0 siblings, 1 reply; 15+ messages in thread
From: Daniel P. Berrangé @ 2023-01-10 19:27 UTC (permalink / raw)
  To: Stefan Berger
  Cc: Peter Maydell, QEMU Developers, Eric Auger, Alex Bennée,
	Thomas Huth, Laurent Vivier, Paolo Bonzini

On Tue, Jan 10, 2023 at 01:50:26PM -0500, Stefan Berger wrote:
> 
> 
> On 1/6/23 10:16, Stefan Berger wrote:
>> This here seems to be the root cause. An unknown control channel
>> command was received from the TPM emulator backend by the control channel thread and we end up in g_assert_not_reached().
> > 
> > https://github.com/qemu/qemu/blob/master/tests/qtest/tpm-emu.c#L189
> > 
> > 
> > 
> >          ret = qio_channel_read(ioc, (char *)&cmd, sizeof(cmd), NULL);
> >          if (ret <= 0) {
> >              break;
> >          }
> > 
> >          cmd = be32_to_cpu(cmd);
> >          switch (cmd) {
> >   [...]
> >          default:
> >              g_debug("unimplemented %u", cmd);
> >              g_assert_not_reached();                <------------------
> >          }
> > 
> > I will run this test case in an endless loop on an x86_64 host and see what we get there ...
> 
> I could not recreate the issue running the  test on a ppc64 and x86_64
> host. There we like >100k test runs on ppc64 and >40k on x86_64. Also
> simulating the reception of an unsupported command did not lead to a
> hang like shown here.

Assuming your ppc64 host is running an little endian OS, and
we're only seeing the test failure on s390x, then it points towards
the problem being an endianness issue in the TPM code. Something
missing a byteswap somewhere along the way ?


With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: intermittent hang, s390x host, bios-tables-test test, TPM
  2023-01-06 15:16 ` Stefan Berger
  2023-01-06 15:39   ` Peter Maydell
  2023-01-10 18:50   ` Stefan Berger
@ 2023-01-10 19:44   ` Daniel P. Berrangé
  2 siblings, 0 replies; 15+ messages in thread
From: Daniel P. Berrangé @ 2023-01-10 19:44 UTC (permalink / raw)
  To: Stefan Berger
  Cc: Peter Maydell, QEMU Developers, Eric Auger, Alex Bennée,
	Thomas Huth, Laurent Vivier, Paolo Bonzini

On Fri, Jan 06, 2023 at 10:16:36AM -0500, Stefan Berger wrote:
> 
> 
> On 1/6/23 07:10, Peter Maydell wrote:
> > I'm seeing an intermittent hang on the s390 CI runner in the
> > bios-tables-test test. It looks like we've deadlocked because:
> > 
> >   * the TPM device is waiting for data on its socket that never arrives,
> >     and it's holding the iothread lock
> >   * QEMU is therefore not making forward progress;
> >     in particular it is unable to handle qtest queries/responses
> >   * the test binary thread 1 is waiting to get a response to its
> >     qtest command, which is not going to arrive
> >   * test binary thread 3 (tpm_emu_ctrl_thread) is has hit an
> >     assertion and is trying to kill QEMU via qtest_kill_qemu()
> >   * qtest_kill_qemu() is only a "SIGTERM and wait", so will wait
> >     forever, because QEMU won't respond to the SIGTERM while it's
> >     blocked waiting for the TPM device to release the iothread lock
> >   * because the ctrl-thread is waiting for QEMU to exit, it's never
> >     going to send the data that would unblock the TPM device emulation
> > 
> [...]
> 
> > 
> > Thread 3 (Thread 0x3ff8dafe900 (LWP 2661316)):
> > #0  0x000003ff8e9c6002 in __GI___wait4 (pid=<optimized out>,
> > stat_loc=stat_loc@entry=0x2aa0b42c9bc, options=<optimized out>,
> > usage=usage@entry=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:27
> > #1  0x000003ff8e9c5f72 in __GI___waitpid (pid=<optimized out>,
> > stat_loc=stat_loc@entry=0x2aa0b42c9bc, options=options@entry=0) at
> > waitpid.c:38
> > #2  0x000002aa0952a516 in qtest_wait_qemu (s=0x2aa0b42c9b0) at
> > ../tests/qtest/libqtest.c:206
> > #3  0x000002aa0952a58a in qtest_kill_qemu (s=0x2aa0b42c9b0) at
> > ../tests/qtest/libqtest.c:229
> > #4  0x000003ff8f0c288e in g_hook_list_invoke () from
> > /lib/s390x-linux-gnu/libglib-2.0.so.0
> > #5  <signal handler called>
> > #6  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
> > #7  0x000003ff8e9240a2 in __GI_abort () at abort.c:79
> > #8  0x000003ff8f0feda8 in g_assertion_message () from
> > /lib/s390x-linux-gnu/libglib-2.0.so.0
> > #9  0x000003ff8f0fedfe in g_assertion_message_expr () from
> > /lib/s390x-linux-gnu/libglib-2.0.so.0
> > #10 0x000002aa09522904 in tpm_emu_ctrl_thread (data=0x3fff5ffa160) at
> > ../tests/qtest/tpm-emu.c:189
> 
> This here seems to be the root cause. An unknown control channel command was received from the TPM emulator backend by the control channel thread and we end up in g_assert_not_reached().
> 
> https://github.com/qemu/qemu/blob/master/tests/qtest/tpm-emu.c#L189
> 
> 
> 
>         ret = qio_channel_read(ioc, (char *)&cmd, sizeof(cmd), NULL);
>         if (ret <= 0) {
>             break;
>         }
> 
>         cmd = be32_to_cpu(cmd);
>         switch (cmd) {
>  [...]
>         default:
>             g_debug("unimplemented %u", cmd);
>             g_assert_not_reached();                <------------------
>         }
> 
> I will run this test case in an endless loop on an x86_64 host and see what we get there ...

The QEMU stack trace shows:

#7  0x000002aa1224a2ca in tpm_emulator_cancel_cmd (tb=<optimized out>)
at ../backends/tpm/tpm_emulator.c:500
#8  0x000002aa121e68c4 in tpm_tis_mmio_write (opaque=0x2aa1529ec20,
addr=24, val=64, size=<optimized out>) at
../hw/tpm/tpm_tis_common.c:663


IOW, we're getting CMD_CANCEL_TPM_CMD, which is indeed not handled
by any 'case:' in the switch in qtest/tpm-emu.c


With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: intermittent hang, s390x host, bios-tables-test test, TPM
  2023-01-10 19:27     ` Daniel P. Berrangé
@ 2023-01-10 19:47       ` Stefan Berger
  2023-01-10 22:02         ` Stefan Berger
  0 siblings, 1 reply; 15+ messages in thread
From: Stefan Berger @ 2023-01-10 19:47 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Peter Maydell, QEMU Developers, Eric Auger, Alex Bennée,
	Thomas Huth, Laurent Vivier, Paolo Bonzini



On 1/10/23 14:27, Daniel P. Berrangé wrote:
> On Tue, Jan 10, 2023 at 01:50:26PM -0500, Stefan Berger wrote:
>>
>>
>> On 1/6/23 10:16, Stefan Berger wrote:
>>> This here seems to be the root cause. An unknown control channel
>>> command was received from the TPM emulator backend by the control channel thread and we end up in g_assert_not_reached().
>>>
>>> https://github.com/qemu/qemu/blob/master/tests/qtest/tpm-emu.c#L189
>>>
>>>
>>>
>>>           ret = qio_channel_read(ioc, (char *)&cmd, sizeof(cmd), NULL);
>>>           if (ret <= 0) {
>>>               break;
>>>           }
>>>
>>>           cmd = be32_to_cpu(cmd);
>>>           switch (cmd) {
>>>    [...]
>>>           default:
>>>               g_debug("unimplemented %u", cmd);
>>>               g_assert_not_reached();                <------------------
>>>           }
>>>
>>> I will run this test case in an endless loop on an x86_64 host and see what we get there ...
>>
>> I could not recreate the issue running the  test on a ppc64 and x86_64
>> host. There we like >100k test runs on ppc64 and >40k on x86_64. Also
>> simulating the reception of an unsupported command did not lead to a
>> hang like shown here.
> 
> Assuming your ppc64 host is running an little endian OS, and
> we're only seeing the test failure on s390x, then it points towards
> the problem being an endianness issue in the TPM code. Something
> missing a byteswap somewhere along the way ?

Yes, my ppc64 machine is also little endian. If the issue  was not an intermittent but a permanent
failure I would look for something like that. I would think it's more some sort of initialization
issue, like a value on the stack that occasionally set to an undesirable value -- maybe even in a
dependency.

    Stefan

> 
> 
> With regards,
> Daniel


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: intermittent hang, s390x host, bios-tables-test test, TPM
  2023-01-10 19:47       ` Stefan Berger
@ 2023-01-10 22:02         ` Stefan Berger
  2023-01-11  9:05           ` Daniel P. Berrangé
  0 siblings, 1 reply; 15+ messages in thread
From: Stefan Berger @ 2023-01-10 22:02 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Peter Maydell, QEMU Developers, Eric Auger, Alex Bennée,
	Thomas Huth, Laurent Vivier, Paolo Bonzini



On 1/10/23 14:47, Stefan Berger wrote:
> 
> 
> On 1/10/23 14:27, Daniel P. Berrangé wrote:
>> On Tue, Jan 10, 2023 at 01:50:26PM -0500, Stefan Berger wrote:
>>>
>>>
>>> On 1/6/23 10:16, Stefan Berger wrote:
>>>> This here seems to be the root cause. An unknown control channel
>>>> command was received from the TPM emulator backend by the control channel thread and we end up in g_assert_not_reached().
>>>>
>>>> https://github.com/qemu/qemu/blob/master/tests/qtest/tpm-emu.c#L189
>>>>
>>>>
>>>>
>>>>           ret = qio_channel_read(ioc, (char *)&cmd, sizeof(cmd), NULL);
>>>>           if (ret <= 0) {
>>>>               break;
>>>>           }
>>>>
>>>>           cmd = be32_to_cpu(cmd);
>>>>           switch (cmd) {
>>>>    [...]
>>>>           default:
>>>>               g_debug("unimplemented %u", cmd);
>>>>               g_assert_not_reached();                <------------------
>>>>           }
>>>>
>>>> I will run this test case in an endless loop on an x86_64 host and see what we get there ...
>>>
>>> I could not recreate the issue running the  test on a ppc64 and x86_64
>>> host. There we like >100k test runs on ppc64 and >40k on x86_64. Also
>>> simulating the reception of an unsupported command did not lead to a
>>> hang like shown here.
>>
>> Assuming your ppc64 host is running an little endian OS, and
>> we're only seeing the test failure on s390x, then it points towards
>> the problem being an endianness issue in the TPM code. Something
>> missing a byteswap somewhere along the way ?
> 
> Yes, my ppc64 machine is also little endian. If the issue  was not an intermittent but a permanent
> failure I would look for something like that. I would think it's more some sort of initialization
> issue, like a value on the stack that occasionally set to an undesirable value -- maybe even in a
> dependency.

I found I still had access to an s390x machine. ~2700 loops on this test case
so far but nothing... it would be good to be able to recreate the issue and
apply the fix but we'll have to do it without testing then I guess.

Does this look about right? From my tests with injecting an error it at least
seems to do what it is intended to do.

diff --git a/tests/qtest/tpm-emu.c b/tests/qtest/tpm-emu.c
index 2994d1cf42..dbc308a572 100644
--- a/tests/qtest/tpm-emu.c
+++ b/tests/qtest/tpm-emu.c
@@ -36,11 +36,19 @@ void tpm_emu_test_wait_cond(TPMTestState *s)
      g_mutex_unlock(&s->data_mutex);
  }

+static void tpm_emu_close_data_ioc(void *ioc)
+{
+    g_debug("CLOSE DATA IOC");
+    qio_channel_close(ioc, NULL);
+}
+
  static void *tpm_emu_tpm_thread(void *data)
  {
      TPMTestState *s = data;
      QIOChannel *ioc = s->tpm_ioc;

+    qtest_add_abrt_handler(tpm_emu_close_data_ioc, ioc);
+
      s->tpm_msg = g_new(struct tpm_hdr, 1);
      while (true) {
          int minhlen = sizeof(s->tpm_msg->tag) + sizeof(s->tpm_msg->len);
@@ -77,12 +85,19 @@ static void *tpm_emu_tpm_thread(void *data)
                            &error_abort);
      }

+    qtest_remove_abrt_handler(ioc);
      g_free(s->tpm_msg);
      s->tpm_msg = NULL;
      object_unref(OBJECT(s->tpm_ioc));
      return NULL;
  }

+static void tpm_emu_close_ctrl_ioc(void *ioc)
+{
+    g_debug("CLOSE CTRL IOC");
+    qio_channel_close(ioc, NULL);
+}
+
  void *tpm_emu_ctrl_thread(void *data)
  {
      TPMTestState *s = data;
@@ -119,6 +134,8 @@ void *tpm_emu_ctrl_thread(void *data)
          s->emu_tpm_thread = g_thread_new(NULL, tpm_emu_tpm_thread, s);
      }

+    qtest_add_abrt_handler(tpm_emu_close_ctrl_ioc, ioc);
+
      while (true) {
          uint32_t cmd;
          ssize_t ret;
@@ -129,6 +146,9 @@ void *tpm_emu_ctrl_thread(void *data)
          }

          cmd = be32_to_cpu(cmd);
+        //g_debug("cmd=%u", cmd);
+        //if (cmd == 14)
+        //    cmd = 100;
          switch (cmd) {
          case CMD_GET_CAPABILITY: {
              ptm_cap cap = cpu_to_be64(0x3fff);
@@ -190,6 +210,8 @@ void *tpm_emu_ctrl_thread(void *data)
          }
      }

+    qtest_remove_abrt_handler(ioc);
+
      object_unref(OBJECT(ioc));
      object_unref(OBJECT(lioc));
      return NULL;

> 
>     Stefan
> 
>>
>>
>> With regards,
>> Daniel
> 


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: intermittent hang, s390x host, bios-tables-test test, TPM
  2023-01-10 19:25     ` Daniel P. Berrangé
@ 2023-01-10 22:10       ` Peter Maydell
  0 siblings, 0 replies; 15+ messages in thread
From: Peter Maydell @ 2023-01-10 22:10 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Stefan Berger, QEMU Developers, Eric Auger, Alex Bennée,
	Thomas Huth, Laurent Vivier, Paolo Bonzini

On Tue, 10 Jan 2023 at 19:25, Daniel P. Berrangé <berrange@redhat.com> wrote:
>
> On Fri, Jan 06, 2023 at 03:39:31PM +0000, Peter Maydell wrote:
> > Yeah. It would be good if we didn't deadlock without printing
> > the assertion, though...
> >
> > I guess we could improve qtest_kill_qemu() so it doesn't wait
> > indefinitely for QEMU to exit but instead sends a SIGKILL 20
> > seconds after the SIGTERM. (Annoyingly, there is no convenient
> > "waitpid but with a timeout" function...)
>
> We don't need to touch that. Instead the tpm-emu.c file needs to
> call  qtest_add_abrt_handler() passing a callback that will invoke
> qio_channel_close on its end of the socket. This will cause the
> QEMU process to get EOF on the other end of the socket. It then
> won't be stuck holding the iothread lock, and will be able to
> respond to SIGTERM.

That sounds straightforward and will fix this specific case
of "the QEMU process didn't exit on SIGTERM", but it would
be nice more generally if the test harness did not sit there
forever without printing the assertion in this situation.
"QEMU got permanently stuck" is something that can happen
in more than on way, after all...

thanks
-- PMM


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: intermittent hang, s390x host, bios-tables-test test, TPM
  2023-01-10 22:02         ` Stefan Berger
@ 2023-01-11  9:05           ` Daniel P. Berrangé
  2023-01-11 13:00             ` Stefan Berger
  0 siblings, 1 reply; 15+ messages in thread
From: Daniel P. Berrangé @ 2023-01-11  9:05 UTC (permalink / raw)
  To: Stefan Berger
  Cc: Peter Maydell, QEMU Developers, Eric Auger, Alex Bennée,
	Thomas Huth, Laurent Vivier, Paolo Bonzini

On Tue, Jan 10, 2023 at 05:02:58PM -0500, Stefan Berger wrote:
> 
> 
> On 1/10/23 14:47, Stefan Berger wrote:
> > 
> > 
> > On 1/10/23 14:27, Daniel P. Berrangé wrote:
> > > On Tue, Jan 10, 2023 at 01:50:26PM -0500, Stefan Berger wrote:
> > > > 
> > > > 
> > > > On 1/6/23 10:16, Stefan Berger wrote:
> > > > > This here seems to be the root cause. An unknown control channel
> > > > > command was received from the TPM emulator backend by the control channel thread and we end up in g_assert_not_reached().
> > > > > 
> > > > > https://github.com/qemu/qemu/blob/master/tests/qtest/tpm-emu.c#L189
> > > > > 
> > > > > 
> > > > > 
> > > > >           ret = qio_channel_read(ioc, (char *)&cmd, sizeof(cmd), NULL);
> > > > >           if (ret <= 0) {
> > > > >               break;
> > > > >           }
> > > > > 
> > > > >           cmd = be32_to_cpu(cmd);
> > > > >           switch (cmd) {
> > > > >    [...]
> > > > >           default:
> > > > >               g_debug("unimplemented %u", cmd);
> > > > >               g_assert_not_reached();                <------------------
> > > > >           }
> > > > > 
> > > > > I will run this test case in an endless loop on an x86_64 host and see what we get there ...
> > > > 
> > > > I could not recreate the issue running the  test on a ppc64 and x86_64
> > > > host. There we like >100k test runs on ppc64 and >40k on x86_64. Also
> > > > simulating the reception of an unsupported command did not lead to a
> > > > hang like shown here.
> > > 
> > > Assuming your ppc64 host is running an little endian OS, and
> > > we're only seeing the test failure on s390x, then it points towards
> > > the problem being an endianness issue in the TPM code. Something
> > > missing a byteswap somewhere along the way ?
> > 
> > Yes, my ppc64 machine is also little endian. If the issue  was not an intermittent but a permanent
> > failure I would look for something like that. I would think it's more some sort of initialization
> > issue, like a value on the stack that occasionally set to an undesirable value -- maybe even in a
> > dependency.
> 
> I found I still had access to an s390x machine. ~2700 loops on this test case
> so far but nothing... it would be good to be able to recreate the issue and
> apply the fix but we'll have to do it without testing then I guess.
> 
> Does this look about right? From my tests with injecting an error it at least
> seems to do what it is intended to do.
> 
> diff --git a/tests/qtest/tpm-emu.c b/tests/qtest/tpm-emu.c
> index 2994d1cf42..dbc308a572 100644
> --- a/tests/qtest/tpm-emu.c
> +++ b/tests/qtest/tpm-emu.c
> @@ -36,11 +36,19 @@ void tpm_emu_test_wait_cond(TPMTestState *s)
>      g_mutex_unlock(&s->data_mutex);
>  }
> 
> +static void tpm_emu_close_data_ioc(void *ioc)
> +{
> +    g_debug("CLOSE DATA IOC");
> +    qio_channel_close(ioc, NULL);
> +}
> +
>  static void *tpm_emu_tpm_thread(void *data)
>  {
>      TPMTestState *s = data;
>      QIOChannel *ioc = s->tpm_ioc;
> 
> +    qtest_add_abrt_handler(tpm_emu_close_data_ioc, ioc);
> +
>      s->tpm_msg = g_new(struct tpm_hdr, 1);
>      while (true) {
>          int minhlen = sizeof(s->tpm_msg->tag) + sizeof(s->tpm_msg->len);
> @@ -77,12 +85,19 @@ static void *tpm_emu_tpm_thread(void *data)
>                            &error_abort);
>      }
> 
> +    qtest_remove_abrt_handler(ioc);
>      g_free(s->tpm_msg);
>      s->tpm_msg = NULL;
>      object_unref(OBJECT(s->tpm_ioc));
>      return NULL;
>  }
> 
> +static void tpm_emu_close_ctrl_ioc(void *ioc)
> +{
> +    g_debug("CLOSE CTRL IOC");
> +    qio_channel_close(ioc, NULL);
> +}
> +
>  void *tpm_emu_ctrl_thread(void *data)
>  {
>      TPMTestState *s = data;
> @@ -119,6 +134,8 @@ void *tpm_emu_ctrl_thread(void *data)
>          s->emu_tpm_thread = g_thread_new(NULL, tpm_emu_tpm_thread, s);
>      }
> 
> +    qtest_add_abrt_handler(tpm_emu_close_ctrl_ioc, ioc);

I'd suggest this be done before starting tpm_emu_tpm_thread,
immediately after the "ioc" is created. 

> +
>      while (true) {
>          uint32_t cmd;
>          ssize_t ret;
> @@ -129,6 +146,9 @@ void *tpm_emu_ctrl_thread(void *data)
>          }
> 
>          cmd = be32_to_cpu(cmd);
> +        //g_debug("cmd=%u", cmd);
> +        //if (cmd == 14)
> +        //    cmd = 100;
>          switch (cmd) {
>          case CMD_GET_CAPABILITY: {
>              ptm_cap cap = cpu_to_be64(0x3fff);
> @@ -190,6 +210,8 @@ void *tpm_emu_ctrl_thread(void *data)
>          }
>      }
> 
> +    qtest_remove_abrt_handler(ioc);
> +
>      object_unref(OBJECT(ioc));
>      object_unref(OBJECT(lioc));
>      return NULL;
> 
> > 
> >     Stefan
> > 
> > > 
> > > 
> > > With regards,
> > > Daniel
> > 
> 

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: intermittent hang, s390x host, bios-tables-test test, TPM
  2023-01-11  9:05           ` Daniel P. Berrangé
@ 2023-01-11 13:00             ` Stefan Berger
  0 siblings, 0 replies; 15+ messages in thread
From: Stefan Berger @ 2023-01-11 13:00 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Peter Maydell, QEMU Developers, Eric Auger, Alex Bennée,
	Thomas Huth, Laurent Vivier, Paolo Bonzini



On 1/11/23 04:05, Daniel P. Berrangé wrote:
> On Tue, Jan 10, 2023 at 05:02:58PM -0500, Stefan Berger wrote:
>>
>>

>>>
>>> Yes, my ppc64 machine is also little endian. If the issue  was not an intermittent but a permanent
>>> failure I would look for something like that. I would think it's more some sort of initialization
>>> issue, like a value on the stack that occasionally set to an undesirable value -- maybe even in a
>>> dependency.
>>
>> I found I still had access to an s390x machine. ~2700 loops on this test case
>> so far but nothing... it would be good to be able to recreate the issue and
>> apply the fix but we'll have to do it without testing then I guess.

I just stopped the tests after looping for ~37000 times. No hang here.

    Stefan


^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2023-01-11 13:01 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-01-06 12:10 intermittent hang, s390x host, bios-tables-test test, TPM Peter Maydell
2023-01-06 13:53 ` Stefan Berger
2023-01-06 14:04   ` Peter Maydell
2023-01-06 15:16 ` Stefan Berger
2023-01-06 15:39   ` Peter Maydell
2023-01-06 15:58     ` Stefan Berger
2023-01-10 19:25     ` Daniel P. Berrangé
2023-01-10 22:10       ` Peter Maydell
2023-01-10 18:50   ` Stefan Berger
2023-01-10 19:27     ` Daniel P. Berrangé
2023-01-10 19:47       ` Stefan Berger
2023-01-10 22:02         ` Stefan Berger
2023-01-11  9:05           ` Daniel P. Berrangé
2023-01-11 13:00             ` Stefan Berger
2023-01-10 19:44   ` Daniel P. Berrangé

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).