* Problem with iotest 233
@ 2025-02-25  7:20 Thomas Huth
  2025-02-25 10:12 ` Kevin Wolf
  0 siblings, 1 reply; 10+ messages in thread
From: Thomas Huth @ 2025-02-25  7:20 UTC (permalink / raw)
  To: Kevin Wolf, Hanna Reitz, Daniel P. Berrange, Eric Blake,
	Qemu-block
  Cc: QEMU Developers
  Hi!
I'm facing a weird hang in iotest 233 on my Fedora 41 laptop. When running
  ./check -raw 233
the test simply hangs. Looking at the log, the last message is "== check 
plain client to TLS server fails ==". I added some debug messages, and it 
seems like the previous NBD server is not correctly terminated here.
The test works fine again if I apply this patch:
diff --git a/tests/qemu-iotests/common.nbd b/tests/qemu-iotests/common.nbd
--- a/tests/qemu-iotests/common.nbd
+++ b/tests/qemu-iotests/common.nbd
@@ -35,7 +35,7 @@ nbd_server_stop()
          read NBD_PID < "$nbd_pid_file"
          rm -f "$nbd_pid_file"
          if [ -n "$NBD_PID" ]; then
-            kill "$NBD_PID"
+            kill -9 "$NBD_PID"
          fi
      fi
      rm -f "$nbd_unix_socket" "$nbd_stderr_fifo"
... but that does not look like the right solution to me. What could prevent 
the qemu-nbd from correctly shutting down when it receives a normal SIGTERM 
signal?
  Thomas
^ permalink raw reply	[flat|nested] 10+ messages in thread* Re: Problem with iotest 233 2025-02-25 7:20 Problem with iotest 233 Thomas Huth @ 2025-02-25 10:12 ` Kevin Wolf 2025-02-25 17:44 ` Thomas Huth 0 siblings, 1 reply; 10+ messages in thread From: Kevin Wolf @ 2025-02-25 10:12 UTC (permalink / raw) To: Thomas Huth Cc: Hanna Reitz, Daniel P. Berrange, Eric Blake, Qemu-block, QEMU Developers Am 25.02.2025 um 08:20 hat Thomas Huth geschrieben: > > Hi! > > I'm facing a weird hang in iotest 233 on my Fedora 41 laptop. When running > > ./check -raw 233 > > the test simply hangs. Looking at the log, the last message is "== check > plain client to TLS server fails ==". I added some debug messages, and it > seems like the previous NBD server is not correctly terminated here. > The test works fine again if I apply this patch: > > diff --git a/tests/qemu-iotests/common.nbd b/tests/qemu-iotests/common.nbd > --- a/tests/qemu-iotests/common.nbd > +++ b/tests/qemu-iotests/common.nbd > @@ -35,7 +35,7 @@ nbd_server_stop() > read NBD_PID < "$nbd_pid_file" > rm -f "$nbd_pid_file" > if [ -n "$NBD_PID" ]; then > - kill "$NBD_PID" > + kill -9 "$NBD_PID" > fi > fi > rm -f "$nbd_unix_socket" "$nbd_stderr_fifo" > > ... but that does not look like the right solution to me. What could prevent > the qemu-nbd from correctly shutting down when it receives a normal SIGTERM > signal? Not sure. In theory, qemu_system_killed() should set state = TERMINATE and make main_loop_wait() return through the notification, which should then make it shut down. Maybe you can attach gdb and check what 'state' is when it hangs and if it's still in the main loop? I can't reproduce the problem, though I'm on F40. I tried it both on my working branch and with current git master (b69801dd). Kevin ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Problem with iotest 233 2025-02-25 10:12 ` Kevin Wolf @ 2025-02-25 17:44 ` Thomas Huth 2025-02-25 17:52 ` Thomas Huth 0 siblings, 1 reply; 10+ messages in thread From: Thomas Huth @ 2025-02-25 17:44 UTC (permalink / raw) To: Kevin Wolf Cc: Hanna Reitz, Daniel P. Berrange, Eric Blake, Qemu-block, QEMU Developers On 25/02/2025 11.12, Kevin Wolf wrote: > Am 25.02.2025 um 08:20 hat Thomas Huth geschrieben: >> >> Hi! >> >> I'm facing a weird hang in iotest 233 on my Fedora 41 laptop. When running >> >> ./check -raw 233 >> >> the test simply hangs. Looking at the log, the last message is "== check >> plain client to TLS server fails ==". I added some debug messages, and it >> seems like the previous NBD server is not correctly terminated here. >> The test works fine again if I apply this patch: >> >> diff --git a/tests/qemu-iotests/common.nbd b/tests/qemu-iotests/common.nbd >> --- a/tests/qemu-iotests/common.nbd >> +++ b/tests/qemu-iotests/common.nbd >> @@ -35,7 +35,7 @@ nbd_server_stop() >> read NBD_PID < "$nbd_pid_file" >> rm -f "$nbd_pid_file" >> if [ -n "$NBD_PID" ]; then >> - kill "$NBD_PID" >> + kill -9 "$NBD_PID" >> fi >> fi >> rm -f "$nbd_unix_socket" "$nbd_stderr_fifo" >> >> ... but that does not look like the right solution to me. What could prevent >> the qemu-nbd from correctly shutting down when it receives a normal SIGTERM >> signal? > > Not sure. In theory, qemu_system_killed() should set state = TERMINATE > and make main_loop_wait() return through the notification, which should > then make it shut down. Maybe you can attach gdb and check what 'state' > is when it hangs and if it's still in the main loop? I attached a gdb and ran "bt", and it looks like it is hanging in an exit() handler: (gdb) bt #0 0x00007f127f8fff1d in syscall () from /lib64/libc.so.6 #1 0x00007f127fd32e1d in g_cond_wait () from /lib64/libglib-2.0.so.0 #2 0x00005583df3048b2 in flush_trace_file (wait=true) at ../../devel/qemu/trace/simple.c:140 #3 st_flush_trace_buffer () at ../../devel/qemu/trace/simple.c:383 #4 0x00007f127f8296c1 in __run_exit_handlers () from /lib64/libc.so.6 #5 0x00007f127f82978e in exit () from /lib64/libc.so.6 #6 0x00005583df1ae9e1 in main (argc=<optimized out>, argv=<optimized out>) at ../../devel/qemu/qemu-nbd.c:1242 Thomas ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Problem with iotest 233 2025-02-25 17:44 ` Thomas Huth @ 2025-02-25 17:52 ` Thomas Huth 2025-02-25 17:57 ` Daniel P. Berrangé 0 siblings, 1 reply; 10+ messages in thread From: Thomas Huth @ 2025-02-25 17:52 UTC (permalink / raw) To: Kevin Wolf Cc: Hanna Reitz, Daniel P. Berrange, Eric Blake, Qemu-block, QEMU Developers, Stefan Hajnoczi On 25/02/2025 18.44, Thomas Huth wrote: > On 25/02/2025 11.12, Kevin Wolf wrote: >> Am 25.02.2025 um 08:20 hat Thomas Huth geschrieben: >>> >>> Hi! >>> >>> I'm facing a weird hang in iotest 233 on my Fedora 41 laptop. When running >>> >>> ./check -raw 233 >>> >>> the test simply hangs. Looking at the log, the last message is "== check >>> plain client to TLS server fails ==". I added some debug messages, and it >>> seems like the previous NBD server is not correctly terminated here. >>> The test works fine again if I apply this patch: >>> >>> diff --git a/tests/qemu-iotests/common.nbd b/tests/qemu-iotests/common.nbd >>> --- a/tests/qemu-iotests/common.nbd >>> +++ b/tests/qemu-iotests/common.nbd >>> @@ -35,7 +35,7 @@ nbd_server_stop() >>> read NBD_PID < "$nbd_pid_file" >>> rm -f "$nbd_pid_file" >>> if [ -n "$NBD_PID" ]; then >>> - kill "$NBD_PID" >>> + kill -9 "$NBD_PID" >>> fi >>> fi >>> rm -f "$nbd_unix_socket" "$nbd_stderr_fifo" >>> >>> ... but that does not look like the right solution to me. What could prevent >>> the qemu-nbd from correctly shutting down when it receives a normal SIGTERM >>> signal? >> >> Not sure. In theory, qemu_system_killed() should set state = TERMINATE >> and make main_loop_wait() return through the notification, which should >> then make it shut down. Maybe you can attach gdb and check what 'state' >> is when it hangs and if it's still in the main loop? > > I attached a gdb and ran "bt", and it looks like it is hanging in an exit() > handler: > > (gdb) bt > #0 0x00007f127f8fff1d in syscall () from /lib64/libc.so.6 > #1 0x00007f127fd32e1d in g_cond_wait () from /lib64/libglib-2.0.so.0 > #2 0x00005583df3048b2 in flush_trace_file (wait=true) at ../../devel/qemu/ > trace/simple.c:140 > #3 st_flush_trace_buffer () at ../../devel/qemu/trace/simple.c:383 > #4 0x00007f127f8296c1 in __run_exit_handlers () from /lib64/libc.so.6 > #5 0x00007f127f82978e in exit () from /lib64/libc.so.6 > #6 0x00005583df1ae9e1 in main (argc=<optimized out>, argv=<optimized out>) > at ../../devel/qemu/qemu-nbd.c:1242 Ah, now that I wrote that: I recently ran "configure" with --enable-trace-backends=simple ... when I remove that from "config.status" again, then the test works fine again 8-) Still, I think it should not hang with the simple trace backend here, should it? Thomas ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Problem with iotest 233 2025-02-25 17:52 ` Thomas Huth @ 2025-02-25 17:57 ` Daniel P. Berrangé 2025-02-25 20:35 ` Thomas Huth 0 siblings, 1 reply; 10+ messages in thread From: Daniel P. Berrangé @ 2025-02-25 17:57 UTC (permalink / raw) To: Thomas Huth Cc: Kevin Wolf, Hanna Reitz, Eric Blake, Qemu-block, QEMU Developers, Stefan Hajnoczi On Tue, Feb 25, 2025 at 06:52:43PM +0100, Thomas Huth wrote: > On 25/02/2025 18.44, Thomas Huth wrote: > > On 25/02/2025 11.12, Kevin Wolf wrote: > > > Am 25.02.2025 um 08:20 hat Thomas Huth geschrieben: > > > > > > > > Hi! > > > > > > > > I'm facing a weird hang in iotest 233 on my Fedora 41 laptop. When running > > > > > > > > ./check -raw 233 > > > > > > > > the test simply hangs. Looking at the log, the last message is "== check > > > > plain client to TLS server fails ==". I added some debug messages, and it > > > > seems like the previous NBD server is not correctly terminated here. > > > > The test works fine again if I apply this patch: > > > > > > > > diff --git a/tests/qemu-iotests/common.nbd b/tests/qemu-iotests/common.nbd > > > > --- a/tests/qemu-iotests/common.nbd > > > > +++ b/tests/qemu-iotests/common.nbd > > > > @@ -35,7 +35,7 @@ nbd_server_stop() > > > > read NBD_PID < "$nbd_pid_file" > > > > rm -f "$nbd_pid_file" > > > > if [ -n "$NBD_PID" ]; then > > > > - kill "$NBD_PID" > > > > + kill -9 "$NBD_PID" > > > > fi > > > > fi > > > > rm -f "$nbd_unix_socket" "$nbd_stderr_fifo" > > > > > > > > ... but that does not look like the right solution to me. What could prevent > > > > the qemu-nbd from correctly shutting down when it receives a normal SIGTERM > > > > signal? > > > > > > Not sure. In theory, qemu_system_killed() should set state = TERMINATE > > > and make main_loop_wait() return through the notification, which should > > > then make it shut down. Maybe you can attach gdb and check what 'state' > > > is when it hangs and if it's still in the main loop? > > > > I attached a gdb and ran "bt", and it looks like it is hanging in an > > exit() handler: > > > > (gdb) bt > > #0 0x00007f127f8fff1d in syscall () from /lib64/libc.so.6 > > #1 0x00007f127fd32e1d in g_cond_wait () from /lib64/libglib-2.0.so.0 > > #2 0x00005583df3048b2 in flush_trace_file (wait=true) at > > ../../devel/qemu/ trace/simple.c:140 > > #3 st_flush_trace_buffer () at ../../devel/qemu/trace/simple.c:383 > > #4 0x00007f127f8296c1 in __run_exit_handlers () from /lib64/libc.so.6 > > #5 0x00007f127f82978e in exit () from /lib64/libc.so.6 > > #6 0x00005583df1ae9e1 in main (argc=<optimized out>, argv=<optimized > > out>) at ../../devel/qemu/qemu-nbd.c:1242 > > Ah, now that I wrote that: I recently ran "configure" with > --enable-trace-backends=simple ... when I remove that from "config.status" > again, then the test works fine again 8-) > > Still, I think it should not hang with the simple trace backend here, should it? IIUC this is waiting on trace_empty_cond. This condition should be signalled from wait_for_trace_records_available which is in turn called from writeout_thread. This thread is started from st_init, which is called from trace_init_backends which should be called from qemu-nbd. I would expect this thread to still be running when exit() handlers are run. Does GDB show any other threads running at the time of this hang ? With regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :| ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Problem with iotest 233 2025-02-25 17:57 ` Daniel P. Berrangé @ 2025-02-25 20:35 ` Thomas Huth 2025-02-25 21:00 ` Thomas Huth 0 siblings, 1 reply; 10+ messages in thread From: Thomas Huth @ 2025-02-25 20:35 UTC (permalink / raw) To: Daniel P. Berrangé Cc: Kevin Wolf, Hanna Reitz, Eric Blake, Qemu-block, QEMU Developers, Stefan Hajnoczi On 25/02/2025 18.57, Daniel P. Berrangé wrote: > On Tue, Feb 25, 2025 at 06:52:43PM +0100, Thomas Huth wrote: >> On 25/02/2025 18.44, Thomas Huth wrote: >>> On 25/02/2025 11.12, Kevin Wolf wrote: >>>> Am 25.02.2025 um 08:20 hat Thomas Huth geschrieben: >>>>> >>>>> Hi! >>>>> >>>>> I'm facing a weird hang in iotest 233 on my Fedora 41 laptop. When running >>>>> >>>>> ./check -raw 233 >>>>> >>>>> the test simply hangs. Looking at the log, the last message is "== check >>>>> plain client to TLS server fails ==". I added some debug messages, and it >>>>> seems like the previous NBD server is not correctly terminated here. >>>>> The test works fine again if I apply this patch: >>>>> >>>>> diff --git a/tests/qemu-iotests/common.nbd b/tests/qemu-iotests/common.nbd >>>>> --- a/tests/qemu-iotests/common.nbd >>>>> +++ b/tests/qemu-iotests/common.nbd >>>>> @@ -35,7 +35,7 @@ nbd_server_stop() >>>>> read NBD_PID < "$nbd_pid_file" >>>>> rm -f "$nbd_pid_file" >>>>> if [ -n "$NBD_PID" ]; then >>>>> - kill "$NBD_PID" >>>>> + kill -9 "$NBD_PID" >>>>> fi >>>>> fi >>>>> rm -f "$nbd_unix_socket" "$nbd_stderr_fifo" >>>>> >>>>> ... but that does not look like the right solution to me. What could prevent >>>>> the qemu-nbd from correctly shutting down when it receives a normal SIGTERM >>>>> signal? >>>> >>>> Not sure. In theory, qemu_system_killed() should set state = TERMINATE >>>> and make main_loop_wait() return through the notification, which should >>>> then make it shut down. Maybe you can attach gdb and check what 'state' >>>> is when it hangs and if it's still in the main loop? >>> >>> I attached a gdb and ran "bt", and it looks like it is hanging in an >>> exit() handler: >>> >>> (gdb) bt >>> #0 0x00007f127f8fff1d in syscall () from /lib64/libc.so.6 >>> #1 0x00007f127fd32e1d in g_cond_wait () from /lib64/libglib-2.0.so.0 >>> #2 0x00005583df3048b2 in flush_trace_file (wait=true) at >>> ../../devel/qemu/ trace/simple.c:140 >>> #3 st_flush_trace_buffer () at ../../devel/qemu/trace/simple.c:383 >>> #4 0x00007f127f8296c1 in __run_exit_handlers () from /lib64/libc.so.6 >>> #5 0x00007f127f82978e in exit () from /lib64/libc.so.6 >>> #6 0x00005583df1ae9e1 in main (argc=<optimized out>, argv=<optimized >>> out>) at ../../devel/qemu/qemu-nbd.c:1242 >> >> Ah, now that I wrote that: I recently ran "configure" with >> --enable-trace-backends=simple ... when I remove that from "config.status" >> again, then the test works fine again 8-) >> >> Still, I think it should not hang with the simple trace backend here, should it? > > IIUC this is waiting on trace_empty_cond. > > This condition should be signalled from wait_for_trace_records_available > which is in turn called from writeout_thread. > > This thread is started from st_init, which is called from trace_init_backends > which should be called from qemu-nbd. I would expect this thread to still > be running when exit() handlers are run. > > Does GDB show any other threads running at the time of this hang ? There is indeed a second thread running: (gdb) thread apply all bt Thread 2 (Thread 0x7f657096b6c0 (LWP 1117884) "qemu-nbd"): #0 0x00007f6573419f1d in syscall () from /lib64/libc.so.6 #1 0x0000562bbad9b783 in qemu_futex_wait (f=0x562bbaed25d8 <rcu_call_ready_event>, val=4294967295) at ../../devel/qemu/include/qemu/futex.h:29 #2 0x0000562bbad9b9af in qemu_event_wait (ev=0x562bbaed25d8 <rcu_call_ready_event>) at ../../devel/qemu/util/qemu-thread-posix.c:465 #3 0x0000562bbada86a6 in call_rcu_thread (opaque=0x0) at ../../devel/qemu/util/rcu.c:278 #4 0x0000562bbad9bba3 in qemu_thread_start (args=0x562bf958a5c0) at ../../devel/qemu/util/qemu-thread-posix.c:542 #5 0x00007f6573398168 in start_thread () from /lib64/libc.so.6 #6 0x00007f657341c14c in __clone3 () from /lib64/libc.so.6 Thread 1 (Thread 0x7f65711c1240 (LWP 1117883) "qemu-nbd"): #0 0x00007f6573419f1d in syscall () from /lib64/libc.so.6 #1 0x00007f6573932e1d in g_cond_wait () from /lib64/libglib-2.0.so.0 #2 0x0000562bbadc8d4f in flush_trace_file (wait=true) at ../../devel/qemu/trace/simple.c:140 #3 0x0000562bbadc96fa in st_flush_trace_buffer () at ../../devel/qemu/trace/simple.c:383 #4 0x00007f65733436c1 in __run_exit_handlers () from /lib64/libc.so.6 #5 0x00007f657334378e in exit () from /lib64/libc.so.6 #6 0x0000562bbad2952e in main (argc=12, argv=0x7ffc6939ee58) at ../../devel/qemu/qemu-nbd.c:1242 Thomas ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Problem with iotest 233 2025-02-25 20:35 ` Thomas Huth @ 2025-02-25 21:00 ` Thomas Huth 2025-02-26 6:40 ` Thomas Huth 0 siblings, 1 reply; 10+ messages in thread From: Thomas Huth @ 2025-02-25 21:00 UTC (permalink / raw) To: Daniel P. Berrangé Cc: Kevin Wolf, Hanna Reitz, Eric Blake, Qemu-block, QEMU Developers, Stefan Hajnoczi On 25/02/2025 21.35, Thomas Huth wrote: > On 25/02/2025 18.57, Daniel P. Berrangé wrote: >> On Tue, Feb 25, 2025 at 06:52:43PM +0100, Thomas Huth wrote: >>> On 25/02/2025 18.44, Thomas Huth wrote: >>>> On 25/02/2025 11.12, Kevin Wolf wrote: >>>>> Am 25.02.2025 um 08:20 hat Thomas Huth geschrieben: >>>>>> >>>>>> Hi! >>>>>> >>>>>> I'm facing a weird hang in iotest 233 on my Fedora 41 laptop. When >>>>>> running >>>>>> >>>>>> ./check -raw 233 >>>>>> >>>>>> the test simply hangs. Looking at the log, the last message is "== check >>>>>> plain client to TLS server fails ==". I added some debug messages, and it >>>>>> seems like the previous NBD server is not correctly terminated here. >>>>>> The test works fine again if I apply this patch: >>>>>> >>>>>> diff --git a/tests/qemu-iotests/common.nbd b/tests/qemu-iotests/ >>>>>> common.nbd >>>>>> --- a/tests/qemu-iotests/common.nbd >>>>>> +++ b/tests/qemu-iotests/common.nbd >>>>>> @@ -35,7 +35,7 @@ nbd_server_stop() >>>>>> read NBD_PID < "$nbd_pid_file" >>>>>> rm -f "$nbd_pid_file" >>>>>> if [ -n "$NBD_PID" ]; then >>>>>> - kill "$NBD_PID" >>>>>> + kill -9 "$NBD_PID" >>>>>> fi >>>>>> fi >>>>>> rm -f "$nbd_unix_socket" "$nbd_stderr_fifo" >>>>>> >>>>>> ... but that does not look like the right solution to me. What could >>>>>> prevent >>>>>> the qemu-nbd from correctly shutting down when it receives a normal >>>>>> SIGTERM >>>>>> signal? >>>>> >>>>> Not sure. In theory, qemu_system_killed() should set state = TERMINATE >>>>> and make main_loop_wait() return through the notification, which should >>>>> then make it shut down. Maybe you can attach gdb and check what 'state' >>>>> is when it hangs and if it's still in the main loop? >>>> >>>> I attached a gdb and ran "bt", and it looks like it is hanging in an >>>> exit() handler: >>>> >>>> (gdb) bt >>>> #0 0x00007f127f8fff1d in syscall () from /lib64/libc.so.6 >>>> #1 0x00007f127fd32e1d in g_cond_wait () from /lib64/libglib-2.0.so.0 >>>> #2 0x00005583df3048b2 in flush_trace_file (wait=true) at >>>> ../../devel/qemu/ trace/simple.c:140 >>>> #3 st_flush_trace_buffer () at ../../devel/qemu/trace/simple.c:383 >>>> #4 0x00007f127f8296c1 in __run_exit_handlers () from /lib64/libc.so.6 >>>> #5 0x00007f127f82978e in exit () from /lib64/libc.so.6 >>>> #6 0x00005583df1ae9e1 in main (argc=<optimized out>, argv=<optimized >>>> out>) at ../../devel/qemu/qemu-nbd.c:1242 >>> >>> Ah, now that I wrote that: I recently ran "configure" with >>> --enable-trace-backends=simple ... when I remove that from "config.status" >>> again, then the test works fine again 8-) >>> >>> Still, I think it should not hang with the simple trace backend here, >>> should it? >> >> IIUC this is waiting on trace_empty_cond. >> >> This condition should be signalled from wait_for_trace_records_available >> which is in turn called from writeout_thread. >> >> This thread is started from st_init, which is called from trace_init_backends >> which should be called from qemu-nbd. I would expect this thread to still >> be running when exit() handlers are run. >> >> Does GDB show any other threads running at the time of this hang ? > > There is indeed a second thread running: > > (gdb) thread apply all bt > > Thread 2 (Thread 0x7f657096b6c0 (LWP 1117884) "qemu-nbd"): > #0 0x00007f6573419f1d in syscall () from /lib64/libc.so.6 > #1 0x0000562bbad9b783 in qemu_futex_wait (f=0x562bbaed25d8 > <rcu_call_ready_event>, val=4294967295) at ../../devel/qemu/include/qemu/ > futex.h:29 > #2 0x0000562bbad9b9af in qemu_event_wait (ev=0x562bbaed25d8 > <rcu_call_ready_event>) at ../../devel/qemu/util/qemu-thread-posix.c:465 > #3 0x0000562bbada86a6 in call_rcu_thread (opaque=0x0) at ../../devel/qemu/ > util/rcu.c:278 > #4 0x0000562bbad9bba3 in qemu_thread_start (args=0x562bf958a5c0) at ../../ > devel/qemu/util/qemu-thread-posix.c:542 > #5 0x00007f6573398168 in start_thread () from /lib64/libc.so.6 > #6 0x00007f657341c14c in __clone3 () from /lib64/libc.so.6 Though, that does not look like the thread from the simpletrace, but the the QEMU RCU thread instead ... so no clue where that writer thread might have gone... Thomas ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Problem with iotest 233 2025-02-25 21:00 ` Thomas Huth @ 2025-02-26 6:40 ` Thomas Huth 2025-02-26 8:55 ` Thomas Huth 0 siblings, 1 reply; 10+ messages in thread From: Thomas Huth @ 2025-02-26 6:40 UTC (permalink / raw) To: Daniel P. Berrangé Cc: Kevin Wolf, Hanna Reitz, Eric Blake, Qemu-block, QEMU Developers, Stefan Hajnoczi On 25/02/2025 22.00, Thomas Huth wrote: > On 25/02/2025 21.35, Thomas Huth wrote: >> On 25/02/2025 18.57, Daniel P. Berrangé wrote: >>> On Tue, Feb 25, 2025 at 06:52:43PM +0100, Thomas Huth wrote: >>>> On 25/02/2025 18.44, Thomas Huth wrote: >>>>> On 25/02/2025 11.12, Kevin Wolf wrote: >>>>>> Am 25.02.2025 um 08:20 hat Thomas Huth geschrieben: >>>>>>> >>>>>>> Hi! >>>>>>> >>>>>>> I'm facing a weird hang in iotest 233 on my Fedora 41 laptop. When >>>>>>> running >>>>>>> >>>>>>> ./check -raw 233 >>>>>>> >>>>>>> the test simply hangs. Looking at the log, the last message is "== check >>>>>>> plain client to TLS server fails ==". I added some debug messages, >>>>>>> and it >>>>>>> seems like the previous NBD server is not correctly terminated here. >>>>>>> The test works fine again if I apply this patch: >>>>>>> >>>>>>> diff --git a/tests/qemu-iotests/common.nbd b/tests/qemu-iotests/ >>>>>>> common.nbd >>>>>>> --- a/tests/qemu-iotests/common.nbd >>>>>>> +++ b/tests/qemu-iotests/common.nbd >>>>>>> @@ -35,7 +35,7 @@ nbd_server_stop() >>>>>>> read NBD_PID < "$nbd_pid_file" >>>>>>> rm -f "$nbd_pid_file" >>>>>>> if [ -n "$NBD_PID" ]; then >>>>>>> - kill "$NBD_PID" >>>>>>> + kill -9 "$NBD_PID" >>>>>>> fi >>>>>>> fi >>>>>>> rm -f "$nbd_unix_socket" "$nbd_stderr_fifo" >>>>>>> >>>>>>> ... but that does not look like the right solution to me. What could >>>>>>> prevent >>>>>>> the qemu-nbd from correctly shutting down when it receives a normal >>>>>>> SIGTERM >>>>>>> signal? >>>>>> >>>>>> Not sure. In theory, qemu_system_killed() should set state = TERMINATE >>>>>> and make main_loop_wait() return through the notification, which should >>>>>> then make it shut down. Maybe you can attach gdb and check what 'state' >>>>>> is when it hangs and if it's still in the main loop? >>>>> >>>>> I attached a gdb and ran "bt", and it looks like it is hanging in an >>>>> exit() handler: >>>>> >>>>> (gdb) bt >>>>> #0 0x00007f127f8fff1d in syscall () from /lib64/libc.so.6 >>>>> #1 0x00007f127fd32e1d in g_cond_wait () from /lib64/libglib-2.0.so.0 >>>>> #2 0x00005583df3048b2 in flush_trace_file (wait=true) at >>>>> ../../devel/qemu/ trace/simple.c:140 >>>>> #3 st_flush_trace_buffer () at ../../devel/qemu/trace/simple.c:383 >>>>> #4 0x00007f127f8296c1 in __run_exit_handlers () from /lib64/libc.so.6 >>>>> #5 0x00007f127f82978e in exit () from /lib64/libc.so.6 >>>>> #6 0x00005583df1ae9e1 in main (argc=<optimized out>, argv=<optimized >>>>> out>) at ../../devel/qemu/qemu-nbd.c:1242 >>>> >>>> Ah, now that I wrote that: I recently ran "configure" with >>>> --enable-trace-backends=simple ... when I remove that from "config.status" >>>> again, then the test works fine again 8-) >>>> >>>> Still, I think it should not hang with the simple trace backend here, >>>> should it? >>> >>> IIUC this is waiting on trace_empty_cond. >>> >>> This condition should be signalled from wait_for_trace_records_available >>> which is in turn called from writeout_thread. >>> >>> This thread is started from st_init, which is called from >>> trace_init_backends >>> which should be called from qemu-nbd. I would expect this thread to still >>> be running when exit() handlers are run. >>> >>> Does GDB show any other threads running at the time of this hang ? >> >> There is indeed a second thread running: >> >> (gdb) thread apply all bt >> >> Thread 2 (Thread 0x7f657096b6c0 (LWP 1117884) "qemu-nbd"): >> #0 0x00007f6573419f1d in syscall () from /lib64/libc.so.6 >> #1 0x0000562bbad9b783 in qemu_futex_wait (f=0x562bbaed25d8 >> <rcu_call_ready_event>, val=4294967295) at ../../devel/qemu/include/qemu/ >> futex.h:29 >> #2 0x0000562bbad9b9af in qemu_event_wait (ev=0x562bbaed25d8 >> <rcu_call_ready_event>) at ../../devel/qemu/util/qemu-thread-posix.c:465 >> #3 0x0000562bbada86a6 in call_rcu_thread (opaque=0x0) at ../../devel/ >> qemu/ util/rcu.c:278 >> #4 0x0000562bbad9bba3 in qemu_thread_start (args=0x562bf958a5c0) >> at ../../ devel/qemu/util/qemu-thread-posix.c:542 >> #5 0x00007f6573398168 in start_thread () from /lib64/libc.so.6 >> #6 0x00007f657341c14c in __clone3 () from /lib64/libc.so.6 > > Though, that does not look like the thread from the simpletrace, but the the > QEMU RCU thread instead ... so no clue where that writer thread might have > gone... OK, I think I now understood the problem: qemu-nbd is calling trace_init_backends() first, which creates the simpletrace threads and installs the atexit() handler. Then it is calling fork() since the test uses the --fork command line option. But fork() does not clone the simpletrace thread into the new process, only the main thread (see man-page of fork, the new process starts single-threaded). So when the new child process exits, the exit handler calls the simple trace flush function which tries to wait for a thread that has never been created in that process. The test works when I move the trace_init_backends() behind the fork() in the main function... but I am not sure if we would miss some logs this way, so I don't know whether that's the right solution. Could a qemu-nbd expert please have a look at this? Thanks, Thomas ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Problem with iotest 233 2025-02-26 6:40 ` Thomas Huth @ 2025-02-26 8:55 ` Thomas Huth 2025-02-27 19:18 ` Eric Blake 0 siblings, 1 reply; 10+ messages in thread From: Thomas Huth @ 2025-02-26 8:55 UTC (permalink / raw) To: Daniel P. Berrangé Cc: Kevin Wolf, Hanna Reitz, Eric Blake, Qemu-block, QEMU Developers, Stefan Hajnoczi On 26/02/2025 07.40, Thomas Huth wrote: > On 25/02/2025 22.00, Thomas Huth wrote: >> On 25/02/2025 21.35, Thomas Huth wrote: >>> On 25/02/2025 18.57, Daniel P. Berrangé wrote: >>>> On Tue, Feb 25, 2025 at 06:52:43PM +0100, Thomas Huth wrote: >>>>> On 25/02/2025 18.44, Thomas Huth wrote: >>>>>> On 25/02/2025 11.12, Kevin Wolf wrote: >>>>>>> Am 25.02.2025 um 08:20 hat Thomas Huth geschrieben: >>>>>>>> >>>>>>>> Hi! >>>>>>>> >>>>>>>> I'm facing a weird hang in iotest 233 on my Fedora 41 laptop. When >>>>>>>> running >>>>>>>> >>>>>>>> ./check -raw 233 >>>>>>>> >>>>>>>> the test simply hangs. Looking at the log, the last message is "== >>>>>>>> check >>>>>>>> plain client to TLS server fails ==". I added some debug messages, >>>>>>>> and it >>>>>>>> seems like the previous NBD server is not correctly terminated here. >>>>>>>> The test works fine again if I apply this patch: >>>>>>>> >>>>>>>> diff --git a/tests/qemu-iotests/common.nbd b/tests/qemu-iotests/ >>>>>>>> common.nbd >>>>>>>> --- a/tests/qemu-iotests/common.nbd >>>>>>>> +++ b/tests/qemu-iotests/common.nbd >>>>>>>> @@ -35,7 +35,7 @@ nbd_server_stop() >>>>>>>> read NBD_PID < "$nbd_pid_file" >>>>>>>> rm -f "$nbd_pid_file" >>>>>>>> if [ -n "$NBD_PID" ]; then >>>>>>>> - kill "$NBD_PID" >>>>>>>> + kill -9 "$NBD_PID" >>>>>>>> fi >>>>>>>> fi >>>>>>>> rm -f "$nbd_unix_socket" "$nbd_stderr_fifo" >>>>>>>> >>>>>>>> ... but that does not look like the right solution to me. What could >>>>>>>> prevent >>>>>>>> the qemu-nbd from correctly shutting down when it receives a normal >>>>>>>> SIGTERM >>>>>>>> signal? >>>>>>> >>>>>>> Not sure. In theory, qemu_system_killed() should set state = TERMINATE >>>>>>> and make main_loop_wait() return through the notification, which should >>>>>>> then make it shut down. Maybe you can attach gdb and check what 'state' >>>>>>> is when it hangs and if it's still in the main loop? >>>>>> >>>>>> I attached a gdb and ran "bt", and it looks like it is hanging in an >>>>>> exit() handler: >>>>>> >>>>>> (gdb) bt >>>>>> #0 0x00007f127f8fff1d in syscall () from /lib64/libc.so.6 >>>>>> #1 0x00007f127fd32e1d in g_cond_wait () from /lib64/libglib-2.0.so.0 >>>>>> #2 0x00005583df3048b2 in flush_trace_file (wait=true) at >>>>>> ../../devel/qemu/ trace/simple.c:140 >>>>>> #3 st_flush_trace_buffer () at ../../devel/qemu/trace/simple.c:383 >>>>>> #4 0x00007f127f8296c1 in __run_exit_handlers () from /lib64/libc.so.6 >>>>>> #5 0x00007f127f82978e in exit () from /lib64/libc.so.6 >>>>>> #6 0x00005583df1ae9e1 in main (argc=<optimized out>, argv=<optimized >>>>>> out>) at ../../devel/qemu/qemu-nbd.c:1242 >>>>> >>>>> Ah, now that I wrote that: I recently ran "configure" with >>>>> --enable-trace-backends=simple ... when I remove that from "config.status" >>>>> again, then the test works fine again 8-) >>>>> >>>>> Still, I think it should not hang with the simple trace backend here, >>>>> should it? >>>> >>>> IIUC this is waiting on trace_empty_cond. >>>> >>>> This condition should be signalled from wait_for_trace_records_available >>>> which is in turn called from writeout_thread. >>>> >>>> This thread is started from st_init, which is called from >>>> trace_init_backends >>>> which should be called from qemu-nbd. I would expect this thread to still >>>> be running when exit() handlers are run. >>>> >>>> Does GDB show any other threads running at the time of this hang ? >>> >>> There is indeed a second thread running: >>> >>> (gdb) thread apply all bt >>> >>> Thread 2 (Thread 0x7f657096b6c0 (LWP 1117884) "qemu-nbd"): >>> #0 0x00007f6573419f1d in syscall () from /lib64/libc.so.6 >>> #1 0x0000562bbad9b783 in qemu_futex_wait (f=0x562bbaed25d8 >>> <rcu_call_ready_event>, val=4294967295) at ../../devel/qemu/include/qemu/ >>> futex.h:29 >>> #2 0x0000562bbad9b9af in qemu_event_wait (ev=0x562bbaed25d8 >>> <rcu_call_ready_event>) at ../../devel/qemu/util/qemu-thread-posix.c:465 >>> #3 0x0000562bbada86a6 in call_rcu_thread (opaque=0x0) at ../../devel/ >>> qemu/ util/rcu.c:278 >>> #4 0x0000562bbad9bba3 in qemu_thread_start (args=0x562bf958a5c0) >>> at ../../ devel/qemu/util/qemu-thread-posix.c:542 >>> #5 0x00007f6573398168 in start_thread () from /lib64/libc.so.6 >>> #6 0x00007f657341c14c in __clone3 () from /lib64/libc.so.6 >> >> Though, that does not look like the thread from the simpletrace, but the >> the QEMU RCU thread instead ... so no clue where that writer thread might >> have gone... > > OK, I think I now understood the problem: qemu-nbd is calling > trace_init_backends() first, which creates the simpletrace threads and > installs the atexit() handler. Then it is calling fork() since the test uses > the --fork command line option. But fork() does not clone the simpletrace > thread into the new process, only the main thread (see man-page of fork, the > new process starts single-threaded). So when the new child process exits, > the exit handler calls the simple trace flush function which tries to wait > for a thread that has never been created in that process. > > The test works when I move the trace_init_backends() behind the fork() in > the main function... but I am not sure if we would miss some logs this way, > so I don't know whether that's the right solution. Could a qemu-nbd expert > please have a look at this? After pondering about it for a while, maybe the best solution is to handle it within the simpletrace backend itself, by using pthread_atfork() : https://lore.kernel.org/qemu-devel/20250226085015.1143991-1-thuth@redhat.com/ Thomas ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Problem with iotest 233 2025-02-26 8:55 ` Thomas Huth @ 2025-02-27 19:18 ` Eric Blake 0 siblings, 0 replies; 10+ messages in thread From: Eric Blake @ 2025-02-27 19:18 UTC (permalink / raw) To: Thomas Huth Cc: Daniel P. Berrangé, Kevin Wolf, Hanna Reitz, Qemu-block, QEMU Developers, Stefan Hajnoczi On Wed, Feb 26, 2025 at 09:55:18AM +0100, Thomas Huth wrote: > > > Though, that does not look like the thread from the simpletrace, but > > > the the QEMU RCU thread instead ... so no clue where that writer > > > thread might have gone... > > > > OK, I think I now understood the problem: qemu-nbd is calling > > trace_init_backends() first, which creates the simpletrace threads and > > installs the atexit() handler. Then it is calling fork() since the test > > uses the --fork command line option. But fork() does not clone the > > simpletrace thread into the new process, only the main thread (see > > man-page of fork, the new process starts single-threaded). So when the > > new child process exits, the exit handler calls the simple trace flush > > function which tries to wait for a thread that has never been created in > > that process. That definitely explains the symptoms. > > > > The test works when I move the trace_init_backends() behind the fork() > > in the main function... but I am not sure if we would miss some logs > > this way, so I don't know whether that's the right solution. Could a > > qemu-nbd expert please have a look at this? I'm also thinking about ways to avoid it. > > After pondering about it for a while, maybe the best solution is to handle > it within the simpletrace backend itself, by using pthread_atfork() : > > https://lore.kernel.org/qemu-devel/20250226085015.1143991-1-thuth@redhat.com/ pthread_atfork() is an odd function - POSIX itself says it is unreliable, because there is NO sane way you can possibly know every action that any library you call into that might possibly need protection before fork. That doesn't mean we can't try it, just that we can't expect it to solve every fork-related problem we might encounter. -- Eric Blake, Principal Software Engineer Red Hat, Inc. Virtualization: qemu.org | libguestfs.org ^ permalink raw reply [flat|nested] 10+ messages in thread
end of thread, other threads:[~2025-02-27 19:20 UTC | newest] Thread overview: 10+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2025-02-25 7:20 Problem with iotest 233 Thomas Huth 2025-02-25 10:12 ` Kevin Wolf 2025-02-25 17:44 ` Thomas Huth 2025-02-25 17:52 ` Thomas Huth 2025-02-25 17:57 ` Daniel P. Berrangé 2025-02-25 20:35 ` Thomas Huth 2025-02-25 21:00 ` Thomas Huth 2025-02-26 6:40 ` Thomas Huth 2025-02-26 8:55 ` Thomas Huth 2025-02-27 19:18 ` Eric Blake
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).