Re: Problem with iotest 233

qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed

From: Thomas Huth <thuth@redhat.com>
To: "Daniel P. Berrangé" <berrange@redhat.com>
Cc: Kevin Wolf <kwolf@redhat.com>, Hanna Reitz <hreitz@redhat.com>,
	Eric Blake <eblake@redhat.com>,
	Qemu-block <qemu-block@nongnu.org>,
	QEMU Developers <qemu-devel@nongnu.org>,
	Stefan Hajnoczi <stefanha@redhat.com>
Subject: Re: Problem with iotest 233
Date: Wed, 26 Feb 2025 09:55:18 +0100	[thread overview]
Message-ID: <5eb59c74-f426-4bee-a854-08c7a5f5050d@redhat.com> (raw)
In-Reply-To: <4dbc253d-fd6c-4feb-9597-81afbfa8d670@redhat.com>

On 26/02/2025 07.40, Thomas Huth wrote:
> On 25/02/2025 22.00, Thomas Huth wrote:
>> On 25/02/2025 21.35, Thomas Huth wrote:
>>> On 25/02/2025 18.57, Daniel P. Berrangé wrote:
>>>> On Tue, Feb 25, 2025 at 06:52:43PM +0100, Thomas Huth wrote:
>>>>> On 25/02/2025 18.44, Thomas Huth wrote:
>>>>>> On 25/02/2025 11.12, Kevin Wolf wrote:
>>>>>>> Am 25.02.2025 um 08:20 hat Thomas Huth geschrieben:
>>>>>>>>
>>>>>>>>    Hi!
>>>>>>>>
>>>>>>>> I'm facing a weird hang in iotest 233 on my Fedora 41 laptop. When 
>>>>>>>> running
>>>>>>>>
>>>>>>>>    ./check -raw 233
>>>>>>>>
>>>>>>>> the test simply hangs. Looking at the log, the last message is "== 
>>>>>>>> check
>>>>>>>> plain client to TLS server fails ==". I added some debug messages, 
>>>>>>>> and it
>>>>>>>> seems like the previous NBD server is not correctly terminated here.
>>>>>>>> The test works fine again if I apply this patch:
>>>>>>>>
>>>>>>>> diff --git a/tests/qemu-iotests/common.nbd b/tests/qemu-iotests/ 
>>>>>>>> common.nbd
>>>>>>>> --- a/tests/qemu-iotests/common.nbd
>>>>>>>> +++ b/tests/qemu-iotests/common.nbd
>>>>>>>> @@ -35,7 +35,7 @@ nbd_server_stop()
>>>>>>>>            read NBD_PID < "$nbd_pid_file"
>>>>>>>>            rm -f "$nbd_pid_file"
>>>>>>>>            if [ -n "$NBD_PID" ]; then
>>>>>>>> -            kill "$NBD_PID"
>>>>>>>> +            kill -9 "$NBD_PID"
>>>>>>>>            fi
>>>>>>>>        fi
>>>>>>>>        rm -f "$nbd_unix_socket" "$nbd_stderr_fifo"
>>>>>>>>
>>>>>>>> ... but that does not look like the right solution to me. What could 
>>>>>>>> prevent
>>>>>>>> the qemu-nbd from correctly shutting down when it receives a normal 
>>>>>>>> SIGTERM
>>>>>>>> signal?
>>>>>>>
>>>>>>> Not sure. In theory, qemu_system_killed() should set state = TERMINATE
>>>>>>> and make main_loop_wait() return through the notification, which should
>>>>>>> then make it shut down. Maybe you can attach gdb and check what 'state'
>>>>>>> is when it hangs and if it's still in the main loop?
>>>>>>
>>>>>> I attached a gdb and ran "bt", and it looks like it is hanging in an
>>>>>> exit() handler:
>>>>>>
>>>>>> (gdb) bt
>>>>>> #0  0x00007f127f8fff1d in syscall () from /lib64/libc.so.6
>>>>>> #1  0x00007f127fd32e1d in g_cond_wait () from /lib64/libglib-2.0.so.0
>>>>>> #2  0x00005583df3048b2 in flush_trace_file (wait=true) at
>>>>>> ../../devel/qemu/ trace/simple.c:140
>>>>>> #3  st_flush_trace_buffer () at ../../devel/qemu/trace/simple.c:383
>>>>>> #4  0x00007f127f8296c1 in __run_exit_handlers () from /lib64/libc.so.6
>>>>>> #5  0x00007f127f82978e in exit () from /lib64/libc.so.6
>>>>>> #6  0x00005583df1ae9e1 in main (argc=<optimized out>, argv=<optimized
>>>>>> out>) at ../../devel/qemu/qemu-nbd.c:1242
>>>>>
>>>>> Ah, now that I wrote that: I recently ran "configure" with
>>>>> --enable-trace-backends=simple ... when I remove that from "config.status"
>>>>> again, then the test works fine again 8-)
>>>>>
>>>>> Still, I think it should not hang with the simple trace backend here, 
>>>>> should it?
>>>>
>>>> IIUC this is waiting on trace_empty_cond.
>>>>
>>>> This condition should be signalled from wait_for_trace_records_available
>>>> which is in turn called from writeout_thread.
>>>>
>>>> This thread is started from st_init, which is called from 
>>>> trace_init_backends
>>>> which should be called from qemu-nbd. I would expect this thread to still
>>>> be running when exit() handlers are run.
>>>>
>>>> Does GDB show any other threads running at the time of this hang ?
>>>
>>> There is indeed a second thread running:
>>>
>>> (gdb) thread apply all bt
>>>
>>> Thread 2 (Thread 0x7f657096b6c0 (LWP 1117884) "qemu-nbd"):
>>> #0  0x00007f6573419f1d in syscall () from /lib64/libc.so.6
>>> #1  0x0000562bbad9b783 in qemu_futex_wait (f=0x562bbaed25d8 
>>> <rcu_call_ready_event>, val=4294967295) at ../../devel/qemu/include/qemu/ 
>>> futex.h:29
>>> #2  0x0000562bbad9b9af in qemu_event_wait (ev=0x562bbaed25d8 
>>> <rcu_call_ready_event>) at ../../devel/qemu/util/qemu-thread-posix.c:465
>>> #3  0x0000562bbada86a6 in call_rcu_thread (opaque=0x0) at ../../devel/ 
>>> qemu/ util/rcu.c:278
>>> #4  0x0000562bbad9bba3 in qemu_thread_start (args=0x562bf958a5c0) 
>>> at ../../ devel/qemu/util/qemu-thread-posix.c:542
>>> #5  0x00007f6573398168 in start_thread () from /lib64/libc.so.6
>>> #6  0x00007f657341c14c in __clone3 () from /lib64/libc.so.6
>>
>> Though, that does not look like the thread from the simpletrace, but the 
>> the QEMU RCU thread instead ... so no clue where that writer thread might 
>> have gone...
> 
> OK, I think I now understood the problem: qemu-nbd is calling 
> trace_init_backends() first, which creates the simpletrace threads and 
> installs the atexit() handler. Then it is calling fork() since the test uses 
> the --fork command line option. But fork() does not clone the simpletrace 
> thread into the new process, only the main thread (see man-page of fork, the 
> new process starts single-threaded). So when the new child process exits, 
> the exit handler calls the simple trace flush function which tries to wait 
> for a thread that has never been created in that process.
> 
> The test works when I move the trace_init_backends() behind the fork() in 
> the main function... but I am not sure if we would miss some logs this way, 
> so I don't know whether that's the right solution. Could a qemu-nbd expert 
> please have a look at this?

After pondering about it for a while, maybe the best solution is to handle 
it within the simpletrace backend itself, by using pthread_atfork() :

  https://lore.kernel.org/qemu-devel/20250226085015.1143991-1-thuth@redhat.com/

  Thomas

next prev parent reply	other threads:[~2025-02-26  8:56 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-02-25  7:20 Problem with iotest 233 Thomas Huth
2025-02-25 10:12 ` Kevin Wolf
2025-02-25 17:44   ` Thomas Huth
2025-02-25 17:52     ` Thomas Huth
2025-02-25 17:57       ` Daniel P. Berrangé
2025-02-25 20:35         ` Thomas Huth
2025-02-25 21:00           ` Thomas Huth
2025-02-26  6:40             ` Thomas Huth
2025-02-26  8:55               ` Thomas Huth [this message]
2025-02-27 19:18                 ` Eric Blake

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=5eb59c74-f426-4bee-a854-08c7a5f5050d@redhat.com \
    --to=thuth@redhat.com \
    --cc=berrange@redhat.com \
    --cc=eblake@redhat.com \
    --cc=hreitz@redhat.com \
    --cc=kwolf@redhat.com \
    --cc=qemu-block@nongnu.org \
    --cc=qemu-devel@nongnu.org \
    --cc=stefanha@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).