Re: Problem with iotest 233

From: Thomas Huth <thuth@redhat.com>
To: "Daniel P. Berrangé" <berrange@redhat.com>
Cc: Kevin Wolf <kwolf@redhat.com>, Hanna Reitz <hreitz@redhat.com>,
	Eric Blake <eblake@redhat.com>,
	Qemu-block <qemu-block@nongnu.org>,
	QEMU Developers <qemu-devel@nongnu.org>,
	Stefan Hajnoczi <stefanha@redhat.com>
Subject: Re: Problem with iotest 233
Date: Wed, 26 Feb 2025 07:40:48 +0100
Message-ID: <4dbc253d-fd6c-4feb-9597-81afbfa8d670@redhat.com>
In-Reply-To: <7a81b491-d982-4a5b-b250-f7b772f64eb4@redhat.com>

On 25/02/2025 22.00, Thomas Huth wrote:
> On 25/02/2025 21.35, Thomas Huth wrote:
>> On 25/02/2025 18.57, Daniel P. Berrangé wrote:
>>> On Tue, Feb 25, 2025 at 06:52:43PM +0100, Thomas Huth wrote:
>>>> On 25/02/2025 18.44, Thomas Huth wrote:
>>>>> On 25/02/2025 11.12, Kevin Wolf wrote:
>>>>>> Am 25.02.2025 um 08:20 hat Thomas Huth geschrieben:
>>>>>>>
>>>>>>>    Hi!
>>>>>>>
>>>>>>> I'm facing a weird hang in iotest 233 on my Fedora 41 laptop. When running
>>>>>>>
>>>>>>>    ./check -raw 233
>>>>>>>
>>>>>>> the test simply hangs. Looking at the log, the last message is "== check
>>>>>>> plain client to TLS server fails ==". I added some debug messages, and it
>>>>>>> seems like the previous NBD server is not correctly terminated here.
>>>>>>> The test works fine again if I apply this patch:
>>>>>>>
>>>>>>> diff --git a/tests/qemu-iotests/common.nbd b/tests/qemu-iotests/common.nbd
>>>>>>> --- a/tests/qemu-iotests/common.nbd
>>>>>>> +++ b/tests/qemu-iotests/common.nbd
>>>>>>> @@ -35,7 +35,7 @@ nbd_server_stop()
>>>>>>>            read NBD_PID < "$nbd_pid_file"
>>>>>>>            rm -f "$nbd_pid_file"
>>>>>>>            if [ -n "$NBD_PID" ]; then
>>>>>>> -            kill "$NBD_PID"
>>>>>>> +            kill -9 "$NBD_PID"
>>>>>>>            fi
>>>>>>>        fi
>>>>>>>        rm -f "$nbd_unix_socket" "$nbd_stderr_fifo"
>>>>>>>
>>>>>>> ... but that does not look like the right solution to me. What could
>>>>>>> prevent the qemu-nbd from correctly shutting down when it receives a
>>>>>>> normal SIGTERM signal?
>>>>>>
>>>>>> Not sure. In theory, qemu_system_killed() should set state = TERMINATE
>>>>>> and make main_loop_wait() return through the notification, which should
>>>>>> then make it shut down. Maybe you can attach gdb and check what 'state'
>>>>>> is when it hangs and if it's still in the main loop?
>>>>>
>>>>> I attached gdb and ran "bt", and it looks like it is hanging in an
>>>>> exit() handler:
>>>>>
>>>>> (gdb) bt
>>>>> #0  0x00007f127f8fff1d in syscall () from /lib64/libc.so.6
>>>>> #1  0x00007f127fd32e1d in g_cond_wait () from /lib64/libglib-2.0.so.0
>>>>> #2  0x00005583df3048b2 in flush_trace_file (wait=true) at ../../devel/qemu/trace/simple.c:140
>>>>> #3  st_flush_trace_buffer () at ../../devel/qemu/trace/simple.c:383
>>>>> #4  0x00007f127f8296c1 in __run_exit_handlers () from /lib64/libc.so.6
>>>>> #5  0x00007f127f82978e in exit () from /lib64/libc.so.6
>>>>> #6  0x00005583df1ae9e1 in main (argc=<optimized out>, argv=<optimized out>) at ../../devel/qemu/qemu-nbd.c:1242
>>>>
>>>> Ah, now that I wrote that: I recently ran "configure" with
>>>> --enable-trace-backends=simple ... when I remove that from "config.status"
>>>> again, then the test works fine again 8-)
>>>>
>>>> Still, I think it should not hang with the simple trace backend here,
>>>> should it?
>>>
>>> IIUC this is waiting on trace_empty_cond.
>>>
>>> This condition should be signalled from wait_for_trace_records_available
>>> which is in turn called from writeout_thread.
>>>
>>> This thread is started from st_init, which is called from
>>> trace_init_backends, which should be called from qemu-nbd. I would expect
>>> this thread to still be running when exit() handlers are run.
>>>
>>> Does GDB show any other threads running at the time of this hang ?
>>
>> There is indeed a second thread running:
>>
>> (gdb) thread apply all bt
>>
>> Thread 2 (Thread 0x7f657096b6c0 (LWP 1117884) "qemu-nbd"):
>> #0  0x00007f6573419f1d in syscall () from /lib64/libc.so.6
>> #1  0x0000562bbad9b783 in qemu_futex_wait (f=0x562bbaed25d8 <rcu_call_ready_event>, val=4294967295) at ../../devel/qemu/include/qemu/futex.h:29
>> #2  0x0000562bbad9b9af in qemu_event_wait (ev=0x562bbaed25d8 <rcu_call_ready_event>) at ../../devel/qemu/util/qemu-thread-posix.c:465
>> #3  0x0000562bbada86a6 in call_rcu_thread (opaque=0x0) at ../../devel/qemu/util/rcu.c:278
>> #4  0x0000562bbad9bba3 in qemu_thread_start (args=0x562bf958a5c0) at ../../devel/qemu/util/qemu-thread-posix.c:542
>> #5  0x00007f6573398168 in start_thread () from /lib64/libc.so.6
>> #6  0x00007f657341c14c in __clone3 () from /lib64/libc.so.6
> 
> Though, that does not look like the thread from the simpletrace, but the
> QEMU RCU thread instead ... so no clue where that writer thread might have
> gone...

OK, I think I now understand the problem: qemu-nbd calls
trace_init_backends() first, which creates the simpletrace writer thread and
installs the atexit() handler. Then it calls fork(), since the test uses the
--fork command line option. But fork() does not clone the simpletrace
thread into the new process, only the calling thread (see the fork(2) man
page: the new process starts single-threaded). So when the child process
exits, the exit handler calls the simple trace flush function, which waits
forever for a thread that was never created in that process.

The test works when I move the trace_init_backends() call after the fork()
in the main function... but I am not sure whether we would miss some logs
this way, so I don't know whether that's the right solution. Could a
qemu-nbd expert please have a look at this?

  Thanks,
   Thomas
