* deadlock when using iothread during backup_clean()
@ 2023-09-05 10:01 Fiona Ebner
2023-09-05 11:25 ` Fiona Ebner
2023-09-05 11:42 ` Paolo Bonzini
0 siblings, 2 replies; 16+ messages in thread
From: Fiona Ebner @ 2023-09-05 10:01 UTC (permalink / raw)
To: QEMU Developers
Cc: Kevin Wolf, Hanna Reitz, John Snow, Vladimir Sementsov-Ogievskiy,
open list:Network Block Dev..., Thomas Lamprecht
Hi,
as the title says, a deadlock is possible in backup_clean() when using a
drive with an IO thread. The main thread keeps waiting in
bdrv_graph_wrlock() for reader_count() to become 0, while the IO thread
waits for its AioContext lock, which the main thread holds (because it
also is the backup job's AioContext). See the stack trace below [1].
The reason why it happens becomes clear with the following commit message:
> commit 31b2ddfea304afc498aca8cac171020ef33eb89b
> Author: Kevin Wolf <kwolf@redhat.com>
> Date: Mon Jun 5 10:57:10 2023 +0200
>
> graph-lock: Unlock the AioContext while polling
>
> If the caller keeps the AioContext lock for a block node in an iothread,
> polling in bdrv_graph_wrlock() deadlocks if the condition isn't
> fulfilled immediately.
>
> Now that all callers make sure to actually have the AioContext locked
> when they call bdrv_replace_child_noperm() like they should, we can
> change bdrv_graph_wrlock() to take a BlockDriverState whose AioContext
> lock the caller holds (NULL if it doesn't) and unlock it temporarily
> while polling.
>
and noting that bdrv_graph_wrlock() is called with bs being NULL while
the caller holds the IO thread's AioContext lock.
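In other words, bdrv_graph_wrlock() only temporarily drops the AioContext
lock it is told about via the bs argument while it polls. A simplified
sketch of that pattern (not the actual implementation in
block/graph-lock.c, which differs in the details):

void bdrv_graph_wrlock(BlockDriverState *bs)
{
    AioContext *ctx = NULL;

    if (bs) {
        ctx = bdrv_get_aio_context(bs);
        if (ctx == qemu_get_aio_context()) {
            ctx = NULL; /* main context: nothing to drop */
        } else {
            /* temporarily drop the lock so the IO thread can make progress */
            aio_context_release(ctx);
        }
    }

    /* poll until all readers are gone; this is where the main thread hangs */
    AIO_WAIT_WHILE_UNLOCKED(NULL, reader_count() >= 1);

    if (ctx) {
        aio_context_acquire(ctx);
    }
}

With bs being NULL, nothing is released, so when the caller does hold an IO
thread's AioContext lock, like backup_clean() -> block_job_remove_all_bdrv()
does here, the polling deadlocks as described above.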
The question is where the IO thread's AioContext lock should be dropped
by the main thread. The "Graph locking part 4 (node management)" series
[0] (the deadlock also reproduces with that series applied) moves the
bdrv_graph_wrlock() further up to block_job_remove_all_bdrv(), but it
still passes along NULL as the argument.
Can we assume block_job_remove_all_bdrv() to always hold the job's
AioContext? And if yes, can we just tell bdrv_graph_wrlock() that it
needs to release that before polling to fix the deadlock?
Best Regards,
Fiona
[0]: https://lists.nongnu.org/archive/html/qemu-devel/2023-08/msg02831.html
[1]:
IO thread:
> Thread 3 (Thread 0x7fa2209596c0 (LWP 119031) "qemu-system-x86"):
> #0 futex_wait (private=0, expected=2, futex_word=0x5648fc93b2c0) at ../sysdeps/nptl/futex-internal.h:146
> #1 __GI___lll_lock_wait (futex=futex@entry=0x5648fc93b2c0, private=0) at ./nptl/lowlevellock.c:49
> #2 0x00007fa2240e532a in lll_mutex_lock_optimized (mutex=0x5648fc93b2c0) at ./nptl/pthread_mutex_lock.c:48
> #3 ___pthread_mutex_lock (mutex=0x5648fc93b2c0) at ./nptl/pthread_mutex_lock.c:128
> #4 0x00005648fa1742f8 in qemu_mutex_lock_impl (mutex=0x5648fc93b2c0, file=0x5648fa446579 "../util/async.c", line=728) at ../util/qemu-thread-posix.c:94
> #5 0x00005648fa174528 in qemu_rec_mutex_lock_impl (mutex=0x5648fc93b2c0, file=0x5648fa446579 "../util/async.c", line=728) at ../util/qemu-thread-posix.c:149
> #6 0x00005648fa18dd6f in aio_context_acquire (ctx=0x5648fc93b260) at ../util/async.c:728
> #7 0x00005648fa18dce4 in aio_co_enter (ctx=0x5648fc93b260, co=0x7fa217f49210) at ../util/async.c:710
> #8 0x00005648fa18dc31 in aio_co_wake (co=0x7fa217f49210) at ../util/async.c:695
> #9 0x00005648fa08e4a2 in luring_process_completions (s=0x5648fdf2dd00) at ../block/io_uring.c:216
> #10 0x00005648fa08e6c7 in luring_process_completions_and_submit (s=0x5648fdf2dd00) at ../block/io_uring.c:268
> #11 0x00005648fa08e727 in qemu_luring_completion_cb (opaque=0x5648fdf2dd00) at ../block/io_uring.c:284
> #12 0x00005648fa16f433 in aio_dispatch_handler (ctx=0x5648fc93b260, node=0x5648fdf2de50) at ../util/aio-posix.c:372
> #13 0x00005648fa16f506 in aio_dispatch_ready_handlers (ctx=0x5648fc93b260, ready_list=0x7fa220954180) at ../util/aio-posix.c:401
> #14 0x00005648fa16ff7d in aio_poll (ctx=0x5648fc93b260, blocking=true) at ../util/aio-posix.c:723
> #15 0x00005648f9fbaa59 in iothread_run (opaque=0x5648fc5ed6b0) at ../iothread.c:63
> #16 0x00005648fa175046 in qemu_thread_start (args=0x5648fc93b8f0) at ../util/qemu-thread-posix.c:541
> #17 0x00007fa2240e2044 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
> #18 0x00007fa2241625fc in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
>
main thread:
> Thread 1 (Thread 0x7fa2214bf340 (LWP 119029) "qemu-system-x86"):
> #0 0x00007fa224155136 in __ppoll (fds=0x5648fdf2ce50, nfds=2, timeout=<optimized out>, sigmask=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:42
> #1 0x00005648fa193102 in qemu_poll_ns (fds=0x5648fdf2ce50, nfds=2, timeout=-1) at ../util/qemu-timer.c:339
> #2 0x00005648fa17042f in fdmon_poll_wait (ctx=0x5648fc6c9740, ready_list=0x7ffff2757260, timeout=-1) at ../util/fdmon-poll.c:79
> #3 0x00005648fa16fd6b in aio_poll (ctx=0x5648fc6c9740, blocking=true) at ../util/aio-posix.c:670
> #4 0x00005648f9ffc0bc in bdrv_graph_wrlock (bs=0x0) at ../block/graph-lock.c:145
> #5 0x00005648f9fc78fa in bdrv_replace_child_noperm (child=0x5648fd729fd0, new_bs=0x0) at ../block.c:2897
> #6 0x00005648f9fc8487 in bdrv_root_unref_child (child=0x5648fd729fd0) at ../block.c:3227
> #7 0x00005648f9fd4b50 in block_job_remove_all_bdrv (job=0x5648fe0b6960) at ../blockjob.c:208
> #8 0x00005648f9feb11b in backup_clean (job=0x5648fe0b6960) at ../block/backup.c:104
> #9 0x00005648f9fd8632 in job_clean (job=0x5648fe0b6960) at ../job.c:836
> #10 0x00005648f9fd86db in job_finalize_single_locked (job=0x5648fe0b6960) at ../job.c:863
> #11 0x00005648f9fd89f5 in job_completed_txn_abort_locked (job=0x5648fe0b6960) at ../job.c:970
> #12 0x00005648f9fd8e6e in job_completed_locked (job=0x5648fe0b6960) at ../job.c:1080
> #13 0x00005648f9fd8f23 in job_exit (opaque=0x5648fe0b6960) at ../job.c:1103
> #14 0x00005648fa18cc01 in aio_bh_call (bh=0x7f9f68202010) at ../util/async.c:169
> #15 0x00005648fa18cd1c in aio_bh_poll (ctx=0x5648fc6c9740) at ../util/async.c:216
> #16 0x00005648fa16f615 in aio_dispatch (ctx=0x5648fc6c9740) at ../util/aio-posix.c:423
> #17 0x00005648fa18d15b in aio_ctx_dispatch (source=0x5648fc6c9740, callback=0x0, user_data=0x0) at ../util/async.c:358
> #18 0x00007fa22533e7a9 in g_main_context_dispatch () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
> #19 0x00005648fa18e916 in glib_pollfds_poll () at ../util/main-loop.c:290
> #20 0x00005648fa18e993 in os_host_main_loop_wait (timeout=991053) at ../util/main-loop.c:313
> #21 0x00005648fa18eaa1 in main_loop_wait (nonblocking=0) at ../util/main-loop.c:592
> #22 0x00005648f9cb027d in qemu_main_loop () at ../softmmu/runstate.c:732
> #23 0x00005648f9f6aff9 in qemu_default_main () at ../softmmu/main.c:37
> #24 0x00005648f9f6b02f in main (argc=55, argv=0x7ffff2757848) at ../softmmu/main.c:48
The main thread still holds the IO thread's AioContext lock (__owner 119029 is the main thread's LWP):
> (gdb) p ((AioContext*)0x5648fc93b260)->lock
> $10 = {m = {lock = {__data = {__lock = 2, __count = 1, __owner = 119029, __nusers = 1, __kind = 1, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}},
> __size = "\002\000\000\000\001\000\000\000\365\320\001\000\001\000\000\000\001", '\000' <repeats 22 times>, __align = 4294967298}, file = 0x0, line = 0, initialized = true}}
That AioContext is also the job's AioContext:
> (gdb) p ((Job*)0x5648fe0b6960)->aio_context
> $13 = (AioContext *) 0x5648fc93b260
* Re: deadlock when using iothread during backup_clean()
2023-09-05 10:01 deadlock when using iothread during backup_clean() Fiona Ebner
@ 2023-09-05 11:25 ` Fiona Ebner
2023-10-04 16:51 ` Vladimir Sementsov-Ogievskiy
2023-09-05 11:42 ` Paolo Bonzini
1 sibling, 1 reply; 16+ messages in thread
From: Fiona Ebner @ 2023-09-05 11:25 UTC (permalink / raw)
To: QEMU Developers
Cc: Kevin Wolf, Hanna Reitz, John Snow, Vladimir Sementsov-Ogievskiy,
open list:Network Block Dev..., Thomas Lamprecht
On 05.09.23 at 12:01, Fiona Ebner wrote:
>
> Can we assume block_job_remove_all_bdrv() to always hold the job's
> AioContext? And if yes, can we just tell bdrv_graph_wrlock() that it
> needs to release that before polling to fix the deadlock?
>
I tried doing something similar as a proof-of-concept:
> diff --git a/blockjob.c b/blockjob.c
> index 58c5d64539..1a696241a0 100644
> --- a/blockjob.c
> +++ b/blockjob.c
> @@ -198,19 +198,19 @@ void block_job_remove_all_bdrv(BlockJob *job)
> * one to make sure that such a concurrent access does not attempt
> * to process an already freed BdrvChild.
> */
> - bdrv_graph_wrlock(NULL);
> while (job->nodes) {
> GSList *l = job->nodes;
> BdrvChild *c = l->data;
>
> job->nodes = l->next;
>
> + bdrv_graph_wrlock(c->bs);
> bdrv_op_unblock_all(c->bs, job->blocker);
> bdrv_root_unref_child(c);
> + bdrv_graph_wrunlock();
>
> g_slist_free_1(l);
> }
> - bdrv_graph_wrunlock();
> }
and while it did get slightly further, I ran into another deadlock with
> #0 0x00007f1941155136 in __ppoll (fds=0x55992068fb20, nfds=2, timeout=<optimized out>, sigmask=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:42
> #1 0x000055991c6a1a3f in qemu_poll_ns (fds=0x55992068fb20, nfds=2, timeout=-1) at ../util/qemu-timer.c:339
> #2 0x000055991c67ed6c in fdmon_poll_wait (ctx=0x55991f058810, ready_list=0x7ffda8c987b0, timeout=-1) at ../util/fdmon-poll.c:79
> #3 0x000055991c67e6a8 in aio_poll (ctx=0x55991f058810, blocking=true) at ../util/aio-posix.c:670
> #4 0x000055991c50a763 in bdrv_graph_wrlock (bs=0x0) at ../block/graph-lock.c:145
> #5 0x000055991c4daf85 in bdrv_close (bs=0x55991fff2f30) at ../block.c:5166
> #6 0x000055991c4dc050 in bdrv_delete (bs=0x55991fff2f30) at ../block.c:5606
> #7 0x000055991c4df205 in bdrv_unref (bs=0x55991fff2f30) at ../block.c:7173
> #8 0x000055991c4fb8ca in bdrv_cbw_drop (bs=0x55991fff2f30) at ../block/copy-before-write.c:566
> #9 0x000055991c4f9685 in backup_clean (job=0x55992016d0b0) at ../block/backup.c:105
* Re: deadlock when using iothread during backup_clean()
2023-09-05 10:01 deadlock when using iothread during backup_clean() Fiona Ebner
2023-09-05 11:25 ` Fiona Ebner
@ 2023-09-05 11:42 ` Paolo Bonzini
2023-09-28 8:06 ` Fiona Ebner
1 sibling, 1 reply; 16+ messages in thread
From: Paolo Bonzini @ 2023-09-05 11:42 UTC (permalink / raw)
To: Fiona Ebner, QEMU Developers
Cc: Kevin Wolf, Hanna Reitz, John Snow, Vladimir Sementsov-Ogievskiy,
open list:Network Block Dev..., Thomas Lamprecht
On 9/5/23 12:01, Fiona Ebner wrote:
> Can we assume block_job_remove_all_bdrv() to always hold the job's
> AioContext?
I think so, see job_unref_locked(), job_prepare_locked() and
job_finalize_single_locked(). These call the callbacks that ultimately
get to block_job_remove_all_bdrv().
> And if yes, can we just tell bdrv_graph_wrlock() that it
> needs to release that before polling to fix the deadlock?
No, but I think it should be released and re-acquired in
block_job_remove_all_bdrv() itself.
mirror_exit_common(), however, holds _two_ AioContext locks at the time it
calls block_job_remove_all_bdrv(); the lock for qemu_get_aio_context() has
to be released and reacquired in mirror_exit_common() itself.
Paolo
* Re: deadlock when using iothread during backup_clean()
2023-09-05 11:42 ` Paolo Bonzini
@ 2023-09-28 8:06 ` Fiona Ebner
2023-10-04 17:08 ` Vladimir Sementsov-Ogievskiy
0 siblings, 1 reply; 16+ messages in thread
From: Fiona Ebner @ 2023-09-28 8:06 UTC (permalink / raw)
To: Paolo Bonzini, QEMU Developers
Cc: Kevin Wolf, Hanna Reitz, John Snow, Vladimir Sementsov-Ogievskiy,
open list:Network Block Dev..., Thomas Lamprecht
On 05.09.23 at 13:42, Paolo Bonzini wrote:
> On 9/5/23 12:01, Fiona Ebner wrote:
>> Can we assume block_job_remove_all_bdrv() to always hold the job's
>> AioContext?
>
> I think so, see job_unref_locked(), job_prepare_locked() and
> job_finalize_single_locked(). These call the callbacks that ultimately
> get to block_job_remove_all_bdrv().
>
>> And if yes, can we just tell bdrv_graph_wrlock() that it
>> needs to release that before polling to fix the deadlock?
>
> No, but I think it should be released and re-acquired in
> block_job_remove_all_bdrv() itself.
>
For fixing the backup cancel deadlock, I tried the following:
> diff --git a/blockjob.c b/blockjob.c
> index 58c5d64539..fd6132ebfe 100644
> --- a/blockjob.c
> +++ b/blockjob.c
> @@ -198,7 +198,9 @@ void block_job_remove_all_bdrv(BlockJob *job)
> * one to make sure that such a concurrent access does not attempt
> * to process an already freed BdrvChild.
> */
> + aio_context_release(job->job.aio_context);
> bdrv_graph_wrlock(NULL);
> + aio_context_acquire(job->job.aio_context);
> while (job->nodes) {
> GSList *l = job->nodes;
> BdrvChild *c = l->data;
but it's not enough, unfortunately. And I don't just mean the later
deadlock during bdrv_close() (via bdrv_cbw_drop()) as mentioned in the
other mail.
Even when I got lucky and that deadlock didn't trigger by chance, or with
an additional change to try and avoid that one
> diff --git a/block.c b/block.c
> index e7f349b25c..02d2c4e777 100644
> --- a/block.c
> +++ b/block.c
> @@ -5165,7 +5165,7 @@ static void bdrv_close(BlockDriverState *bs)
> bs->drv = NULL;
> }
>
> - bdrv_graph_wrlock(NULL);
> + bdrv_graph_wrlock(bs);
> QLIST_FOREACH_SAFE(child, &bs->children, next, next) {
> bdrv_unref_child(bs, child);
> }
often guest IO would get completely stuck after canceling the backup.
There's nothing obvious to me in the backtraces at that point, but it
seems the vCPU and main threads are running like usual, while the IO thread
is stuck in aio_poll(), i.e. never returns from the __ppoll() call. This
would happen with both a VirtIO SCSI and a VirtIO block disk, and with
both aio=io_uring and aio=threads.
I should also mention I'm using
> fio --name=file --size=4k --direct=1 --rw=randwrite --bs=4k --ioengine=psync --numjobs=5 --runtime=6000 --time_based
inside the guest during canceling of the backup.
I'd be glad for any pointers on what to look for and happy to provide more
information.
Best Regards,
Fiona
* Re: deadlock when using iothread during backup_clean()
2023-09-05 11:25 ` Fiona Ebner
@ 2023-10-04 16:51 ` Vladimir Sementsov-Ogievskiy
0 siblings, 0 replies; 16+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2023-10-04 16:51 UTC (permalink / raw)
To: Fiona Ebner, QEMU Developers
Cc: Kevin Wolf, Hanna Reitz, John Snow,
open list:Network Block Dev..., Thomas Lamprecht
On 05.09.23 14:25, Fiona Ebner wrote:
> Am 05.09.23 um 12:01 schrieb Fiona Ebner:
>>
>> Can we assume block_job_remove_all_bdrv() to always hold the job's
>> AioContext? And if yes, can we just tell bdrv_graph_wrlock() that it
>> needs to release that before polling to fix the deadlock?
>>
>
> I tried a doing something similar as a proof-of-concept
>
>> diff --git a/blockjob.c b/blockjob.c
>> index 58c5d64539..1a696241a0 100644
>> --- a/blockjob.c
>> +++ b/blockjob.c
>> @@ -198,19 +198,19 @@ void block_job_remove_all_bdrv(BlockJob *job)
>> * one to make sure that such a concurrent access does not attempt
>> * to process an already freed BdrvChild.
>> */
>> - bdrv_graph_wrlock(NULL);
>> while (job->nodes) {
>> GSList *l = job->nodes;
>> BdrvChild *c = l->data;
>>
>> job->nodes = l->next;
>>
>> + bdrv_graph_wrlock(c->bs);
>> bdrv_op_unblock_all(c->bs, job->blocker);
>> bdrv_root_unref_child(c);
>> + bdrv_graph_wrunlock();
>>
>> g_slist_free_1(l);
>> }
>> - bdrv_graph_wrunlock();
>> }
>
> and while it did get slightly further, I ran into another deadlock with
>
>> #0 0x00007f1941155136 in __ppoll (fds=0x55992068fb20, nfds=2, timeout=<optimized out>, sigmask=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:42
>> #1 0x000055991c6a1a3f in qemu_poll_ns (fds=0x55992068fb20, nfds=2, timeout=-1) at ../util/qemu-timer.c:339
>> #2 0x000055991c67ed6c in fdmon_poll_wait (ctx=0x55991f058810, ready_list=0x7ffda8c987b0, timeout=-1) at ../util/fdmon-poll.c:79
>> #3 0x000055991c67e6a8 in aio_poll (ctx=0x55991f058810, blocking=true) at ../util/aio-posix.c:670
>> #4 0x000055991c50a763 in bdrv_graph_wrlock (bs=0x0) at ../block/graph-lock.c:145
Interesting, why do we pass NULL to bdrv_graph_wrlock() in bdrv_close()? Shouldn't bdrv_close() be called with the bs AioContext locked?
>> #5 0x000055991c4daf85 in bdrv_close (bs=0x55991fff2f30) at ../block.c:5166
>> #6 0x000055991c4dc050 in bdrv_delete (bs=0x55991fff2f30) at ../block.c:5606
>> #7 0x000055991c4df205 in bdrv_unref (bs=0x55991fff2f30) at ../block.c:7173
>> #8 0x000055991c4fb8ca in bdrv_cbw_drop (bs=0x55991fff2f30) at ../block/copy-before-write.c:566
>> #9 0x000055991c4f9685 in backup_clean (job=0x55992016d0b0) at ../block/backup.c:105
>
--
Best regards,
Vladimir
* Re: deadlock when using iothread during backup_clean()
2023-09-28 8:06 ` Fiona Ebner
@ 2023-10-04 17:08 ` Vladimir Sementsov-Ogievskiy
2023-10-06 12:18 ` Fiona Ebner
0 siblings, 1 reply; 16+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2023-10-04 17:08 UTC (permalink / raw)
To: Fiona Ebner, Paolo Bonzini, QEMU Developers
Cc: Kevin Wolf, Hanna Reitz, John Snow,
open list:Network Block Dev..., Thomas Lamprecht
On 28.09.23 11:06, Fiona Ebner wrote:
> Am 05.09.23 um 13:42 schrieb Paolo Bonzini:
>> On 9/5/23 12:01, Fiona Ebner wrote:
>>> Can we assume block_job_remove_all_bdrv() to always hold the job's
>>> AioContext?
>>
>> I think so, see job_unref_locked(), job_prepare_locked() and
>> job_finalize_single_locked(). These call the callbacks that ultimately
>> get to block_job_remove_all_bdrv().
>>
>>> And if yes, can we just tell bdrv_graph_wrlock() that it
>>> needs to release that before polling to fix the deadlock?
>>
>> No, but I think it should be released and re-acquired in
>> block_job_remove_all_bdrv() itself.
>>
>
> For fixing the backup cancel deadlock, I tried the following:
>
>> diff --git a/blockjob.c b/blockjob.c
>> index 58c5d64539..fd6132ebfe 100644
>> --- a/blockjob.c
>> +++ b/blockjob.c
>> @@ -198,7 +198,9 @@ void block_job_remove_all_bdrv(BlockJob *job)
>> * one to make sure that such a concurrent access does not attempt
>> * to process an already freed BdrvChild.
>> */
>> + aio_context_release(job->job.aio_context);
>> bdrv_graph_wrlock(NULL);
>> + aio_context_acquire(job->job.aio_context);
>> while (job->nodes) {
>> GSList *l = job->nodes;
>> BdrvChild *c = l->data;
>
> but it's not enough unfortunately. And I don't just mean with the later
> deadlock during bdrv_close() (via bdrv_cbw_drop()) as mentioned in the
> other mail.
>
> Even when I got lucky and that deadlock didn't trigger by chance or with
> an additional change to try and avoid that one
>
>> diff --git a/block.c b/block.c
>> index e7f349b25c..02d2c4e777 100644
>> --- a/block.c
>> +++ b/block.c
>> @@ -5165,7 +5165,7 @@ static void bdrv_close(BlockDriverState *bs)
>> bs->drv = NULL;
>> }
>>
>> - bdrv_graph_wrlock(NULL);
>> + bdrv_graph_wrlock(bs);
>> QLIST_FOREACH_SAFE(child, &bs->children, next, next) {
>> bdrv_unref_child(bs, child);
>> }
>
> often guest IO would get completely stuck after canceling the backup.
> There's nothing obvious to me in the backtraces at that point, but it
> seems the vCPU and main threads running like usual, while the IO thread
> is stuck in aio_poll(), i.e. never returns from the __ppoll() call. This
> would happen with both, a VirtIO SCSI and a VirtIO block disk and with
> both aio=io_uring and aio=threads.
When IO is stuck, it may be helpful to look at the bs->tracked_requests list in gdb. If there are requests, each one has a field .co, which points to the coroutine of the request.
The next step is to look at the coroutine stack.
Something like this (in gdb) may work ("may", because it has been a long time since I last used it):
source scripts/qemu-gdb.py
qemu coroutine <coroutine pointer>
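To get at a request and its coroutine pointer in the first place, something along these lines should do (from memory, untested; tracked_requests is a QLIST, so the first element is in .lh_first):
(gdb) p ((BlockDriverState *) <bs pointer>)->tracked_requests.lh_first
(gdb) p ((BlockDriverState *) <bs pointer>)->tracked_requests.lh_first->co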
>
> I should also mention I'm using
>
>> fio --name=file --size=4k --direct=1 --rw=randwrite --bs=4k --ioengine=psync --numjobs=5 --runtime=6000 --time_based
>
> inside the guest during canceling of the backup.
>
> I'd be glad for any pointers what to look for and happy to provide more
> information.
>
> Best Regards,
> Fiona
>
--
Best regards,
Vladimir
* Re: deadlock when using iothread during backup_clean()
2023-10-04 17:08 ` Vladimir Sementsov-Ogievskiy
@ 2023-10-06 12:18 ` Fiona Ebner
2023-10-17 10:18 ` Fiona Ebner
0 siblings, 1 reply; 16+ messages in thread
From: Fiona Ebner @ 2023-10-06 12:18 UTC (permalink / raw)
To: Vladimir Sementsov-Ogievskiy, Paolo Bonzini, QEMU Developers
Cc: Kevin Wolf, Hanna Reitz, John Snow,
open list:Network Block Dev..., Thomas Lamprecht
On 04.10.23 at 19:08, Vladimir Sementsov-Ogievskiy wrote:
> On 28.09.23 11:06, Fiona Ebner wrote:
>> For fixing the backup cancel deadlock, I tried the following:
>>
>>> diff --git a/blockjob.c b/blockjob.c
>>> index 58c5d64539..fd6132ebfe 100644
>>> --- a/blockjob.c
>>> +++ b/blockjob.c
>>> @@ -198,7 +198,9 @@ void block_job_remove_all_bdrv(BlockJob *job)
>>> * one to make sure that such a concurrent access does not attempt
>>> * to process an already freed BdrvChild.
>>> */
>>> + aio_context_release(job->job.aio_context);
>>> bdrv_graph_wrlock(NULL);
>>> + aio_context_acquire(job->job.aio_context);
>>> while (job->nodes) {
>>> GSList *l = job->nodes;
>>> BdrvChild *c = l->data;
>>
>> but it's not enough unfortunately. And I don't just mean with the later
>> deadlock during bdrv_close() (via bdrv_cbw_drop()) as mentioned in the
>> other mail.
>>
>> Even when I got lucky and that deadlock didn't trigger by chance or with
>> an additional change to try and avoid that one
>>
>>> diff --git a/block.c b/block.c
>>> index e7f349b25c..02d2c4e777 100644
>>> --- a/block.c
>>> +++ b/block.c
>>> @@ -5165,7 +5165,7 @@ static void bdrv_close(BlockDriverState *bs)
>>> bs->drv = NULL;
>>> }
>>> - bdrv_graph_wrlock(NULL);
>>> + bdrv_graph_wrlock(bs);
>>> QLIST_FOREACH_SAFE(child, &bs->children, next, next) {
>>> bdrv_unref_child(bs, child);
>>> }
>>
>> often guest IO would get completely stuck after canceling the backup.
>> There's nothing obvious to me in the backtraces at that point, but it
>> seems the vCPU and main threads running like usual, while the IO thread
>> is stuck in aio_poll(), i.e. never returns from the __ppoll() call. This
>> would happen with both, a VirtIO SCSI and a VirtIO block disk and with
>> both aio=io_uring and aio=threads.
>
> When IO is stuck, it may be helpful to look at bs->tracked_requests list
> in gdb. If there are requests, each one has field .co, which points to
> the coroutine of request.
After the IO was stuck in the guest, I used bdrv_next_all_states() to
iterate over the states and there were only the bdrv_raw and the
bdrv_host_device nodes. For both, tracked_requests was empty.
What is also very interesting is that the IO isn't always dead
immediately. Sometimes the fio command still runs at lower speed
for a while (sometimes even up to about a minute, but most often about
10-15 seconds or so). During that time, I can still see calls to
virtio_scsi_handle_cmd() and blk_aio_write_entry(). Then they suddenly stop.
>>
>> I should also mention I'm using
>>
>>> fio --name=file --size=4k --direct=1 --rw=randwrite --bs=4k
>>> --ioengine=psync --numjobs=5 --runtime=6000 --time_based
>>
>> inside the guest during canceling of the backup.
One single time, it got stuck polling while canceling [0]. I usually
need to do the backup+cancel a few times, because the IO doesn't get
stuck each time, so this was a subsequent invocation. The reentrancy to
virtio_scsi_handle_cmd()/etc. seems strange at first glance. Can that
be normal?
Best Regards,
Fiona
[0]:
> Thread 3 (Thread 0x7ff7f28946c0 (LWP 1815909) "qemu-system-x86"):
> #0 0x00007ff7f5f55136 in __ppoll (fds=0x7ff7e40030c0, nfds=8, timeout=<optimized out>, sigmask=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:42
> #1 0x00005587132615ab in qemu_poll_ns (fds=0x7ff7e40030c0, nfds=8, timeout=-1) at ../util/qemu-timer.c:339
> #2 0x000055871323e8b1 in fdmon_poll_wait (ctx=0x55871598d5e0, ready_list=0x7ff7f288ebe0, timeout=-1) at ../util/fdmon-poll.c:79
> #3 0x000055871323e1ed in aio_poll (ctx=0x55871598d5e0, blocking=true) at ../util/aio-posix.c:670
> #4 0x0000558713089efa in bdrv_poll_co (s=0x7ff7f288ec90) at /home/febner/repos/qemu/block/block-gen.h:43
> #5 0x000055871308c362 in blk_is_available (blk=0x55871599e2f0) at block/block-gen.c:1426
> #6 0x0000558712f6843b in virtio_scsi_ctx_check (s=0x558716c049c0, d=0x55871581cd30) at ../hw/scsi/virtio-scsi.c:290
> #7 0x0000558712f697e4 in virtio_scsi_handle_cmd_req_prepare (s=0x558716c049c0, req=0x7ff7e400b650) at ../hw/scsi/virtio-scsi.c:788
> #8 0x0000558712f699b0 in virtio_scsi_handle_cmd_vq (s=0x558716c049c0, vq=0x558716c0d2a8) at ../hw/scsi/virtio-scsi.c:831
> #9 0x0000558712f69bcb in virtio_scsi_handle_cmd (vdev=0x558716c049c0, vq=0x558716c0d2a8) at ../hw/scsi/virtio-scsi.c:867
> #10 0x0000558712f96812 in virtio_queue_notify_vq (vq=0x558716c0d2a8) at ../hw/virtio/virtio.c:2263
> #11 0x0000558712f99b75 in virtio_queue_host_notifier_read (n=0x558716c0d31c) at ../hw/virtio/virtio.c:3575
> #12 0x000055871323d8b5 in aio_dispatch_handler (ctx=0x55871598d5e0, node=0x558716771000) at ../util/aio-posix.c:372
> #13 0x000055871323d988 in aio_dispatch_ready_handlers (ctx=0x55871598d5e0, ready_list=0x7ff7f288eeb0) at ../util/aio-posix.c:401
> #14 0x000055871323e3ff in aio_poll (ctx=0x55871598d5e0, blocking=true) at ../util/aio-posix.c:723
> #15 0x0000558713089efa in bdrv_poll_co (s=0x7ff7f288ef60) at /home/febner/repos/qemu/block/block-gen.h:43
> #16 0x000055871308c362 in blk_is_available (blk=0x55871599e2f0) at block/block-gen.c:1426
> #17 0x0000558712f6843b in virtio_scsi_ctx_check (s=0x558716c049c0, d=0x55871581cd30) at ../hw/scsi/virtio-scsi.c:290
> #18 0x0000558712f697e4 in virtio_scsi_handle_cmd_req_prepare (s=0x558716c049c0, req=0x7ff7e401c800) at ../hw/scsi/virtio-scsi.c:788
> #19 0x0000558712f699b0 in virtio_scsi_handle_cmd_vq (s=0x558716c049c0, vq=0x558716c0d210) at ../hw/scsi/virtio-scsi.c:831
> #20 0x0000558712f69bcb in virtio_scsi_handle_cmd (vdev=0x558716c049c0, vq=0x558716c0d210) at ../hw/scsi/virtio-scsi.c:867
> #21 0x0000558712f96812 in virtio_queue_notify_vq (vq=0x558716c0d210) at ../hw/virtio/virtio.c:2263
> #22 0x0000558712f99b75 in virtio_queue_host_notifier_read (n=0x558716c0d284) at ../hw/virtio/virtio.c:3575
> #23 0x000055871323d8b5 in aio_dispatch_handler (ctx=0x55871598d5e0, node=0x5587162b5d80) at ../util/aio-posix.c:372
> #24 0x000055871323d988 in aio_dispatch_ready_handlers (ctx=0x55871598d5e0, ready_list=0x7ff7f288f180) at ../util/aio-posix.c:401
> #25 0x000055871323e3ff in aio_poll (ctx=0x55871598d5e0, blocking=true) at ../util/aio-posix.c:723
> #26 0x00005587130878bf in iothread_run (opaque=0x55871563d6b0) at ../iothread.c:63
> #27 0x00005587132434c0 in qemu_thread_start (args=0x55871598dc70) at ../util/qemu-thread-posix.c:541
> #28 0x00007ff7f5ee2044 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
> #29 0x00007ff7f5f625fc in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
>
> Thread 2 (Thread 0x7ff7f31966c0 (LWP 1815908) "qemu-system-x86"):
> #0 syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
> #1 0x0000558713243126 in qemu_futex_wait (f=0x558713c963f8 <rcu_call_ready_event>, val=4294967295) at /home/febner/repos/qemu/include/qemu/futex.h:29
> #2 0x000055871324330d in qemu_event_wait (ev=0x558713c963f8 <rcu_call_ready_event>) at ../util/qemu-thread-posix.c:464
> #3 0x00005587132506c4 in call_rcu_thread (opaque=0x0) at ../util/rcu.c:278
> #4 0x00005587132434c0 in qemu_thread_start (args=0x5587156aeaa0) at ../util/qemu-thread-posix.c:541
> #5 0x00007ff7f5ee2044 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
> #6 0x00007ff7f5f625fc in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
>
> Thread 1 (Thread 0x7ff7f33fa340 (LWP 1815907) "qemu-system-x86"):
> #0 futex_wait (private=0, expected=2, futex_word=0x55871598d640) at ../sysdeps/nptl/futex-internal.h:146
> #1 __GI___lll_lock_wait (futex=futex@entry=0x55871598d640, private=0) at ./nptl/lowlevellock.c:49
> #2 0x00007ff7f5ee532a in lll_mutex_lock_optimized (mutex=0x55871598d640) at ./nptl/pthread_mutex_lock.c:48
> #3 ___pthread_mutex_lock (mutex=0x55871598d640) at ./nptl/pthread_mutex_lock.c:128
> #4 0x0000558713242772 in qemu_mutex_lock_impl (mutex=0x55871598d640, file=0x55871351bfb9 "../util/async.c", line=728) at ../util/qemu-thread-posix.c:94
> #5 0x00005587132429a2 in qemu_rec_mutex_lock_impl (mutex=0x55871598d640, file=0x55871351bfb9 "../util/async.c", line=728) at ../util/qemu-thread-posix.c:149
> #6 0x000055871325c23d in aio_context_acquire (ctx=0x55871598d5e0) at ../util/async.c:728
> #7 0x00005587130d1956 in bdrv_drain_all_end () at ../block/io.c:567
> #8 0x00005587130cf1e8 in bdrv_graph_wrlock (bs=0x5587168255f0) at ../block/graph-lock.c:156
> #9 0x0000558713099c8f in bdrv_close (bs=0x5587168255f0) at ../block.c:5169
> #10 0x000055871309ad5a in bdrv_delete (bs=0x5587168255f0) at ../block.c:5609
> #11 0x000055871309df21 in bdrv_unref (bs=0x5587168255f0) at ../block.c:7178
> #12 0x00005587130ca195 in bdrv_cbw_drop (bs=0x5587168255f0) at ../block/copy-before-write.c:566
> #13 0x00005587130b8430 in backup_clean (job=0x558716d54210) at ../block/backup.c:105
> #14 0x00005587130a58b9 in job_clean (job=0x558716d54210) at ../job.c:836
> #15 0x00005587130a5962 in job_finalize_single_locked (job=0x558716d54210) at ../job.c:863
> #16 0x00005587130a5c7c in job_completed_txn_abort_locked (job=0x558716d54210) at ../job.c:970
> #17 0x00005587130a60f5 in job_completed_locked (job=0x558716d54210) at ../job.c:1080
> #18 0x00005587130a61aa in job_exit (opaque=0x558716d54210) at ../job.c:1103
> #19 0x000055871325b0cf in aio_bh_call (bh=0x7ff7e4004bd0) at ../util/async.c:169
> #20 0x000055871325b1ea in aio_bh_poll (ctx=0x55871571a8a0) at ../util/async.c:216
> #21 0x000055871323da97 in aio_dispatch (ctx=0x55871571a8a0) at ../util/aio-posix.c:423
> #22 0x000055871325b629 in aio_ctx_dispatch (source=0x55871571a8a0, callback=0x0, user_data=0x0) at ../util/async.c:358
> #23 0x00007ff7f72bf7a9 in g_main_context_dispatch () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
> #24 0x000055871325cde4 in glib_pollfds_poll () at ../util/main-loop.c:290
> #25 0x000055871325ce61 in os_host_main_loop_wait (timeout=1412902959) at ../util/main-loop.c:313
> #26 0x000055871325cf6f in main_loop_wait (nonblocking=0) at ../util/main-loop.c:592
> #27 0x0000558712d813de in qemu_main_loop () at ../softmmu/runstate.c:772
> #28 0x000055871303811b in qemu_default_main () at ../softmmu/main.c:37
> #29 0x0000558713038151 in main (argc=57, argv=0x7fffd393e828) at ../softmmu/main.c:48
> (gdb) p ((AioContext*)0x55871598d5e0)->lock
> $1 = {m = {lock = {__data = {__lock = 2, __count = 2, __owner = 1815909, __nusers = 1, __kind = 1,
> __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}},
> __size = "\002\000\000\000\002\000\000\000e\265\033\000\001\000\000\000\001", '\000' <repeats 22 times>, __align = 8589934594}, file = 0x0, line = 0, initialized = true}}
* Re: deadlock when using iothread during backup_clean()
2023-10-06 12:18 ` Fiona Ebner
@ 2023-10-17 10:18 ` Fiona Ebner
2023-10-17 12:12 ` Kevin Wolf
0 siblings, 1 reply; 16+ messages in thread
From: Fiona Ebner @ 2023-10-17 10:18 UTC (permalink / raw)
To: Vladimir Sementsov-Ogievskiy, Paolo Bonzini, QEMU Developers
Cc: Kevin Wolf, Hanna Reitz, John Snow,
open list:Network Block Dev..., Thomas Lamprecht
On 06.10.23 at 14:18, Fiona Ebner wrote:
> Am 04.10.23 um 19:08 schrieb Vladimir Sementsov-Ogievskiy:
>> On 28.09.23 11:06, Fiona Ebner wrote:
>>> For fixing the backup cancel deadlock, I tried the following:
>>>
>>>> diff --git a/blockjob.c b/blockjob.c
>>>> index 58c5d64539..fd6132ebfe 100644
>>>> --- a/blockjob.c
>>>> +++ b/blockjob.c
>>>> @@ -198,7 +198,9 @@ void block_job_remove_all_bdrv(BlockJob *job)
>>>> * one to make sure that such a concurrent access does not attempt
>>>> * to process an already freed BdrvChild.
>>>> */
>>>> + aio_context_release(job->job.aio_context);
>>>> bdrv_graph_wrlock(NULL);
>>>> + aio_context_acquire(job->job.aio_context);
>>>> while (job->nodes) {
>>>> GSList *l = job->nodes;
>>>> BdrvChild *c = l->data;
>>>
>>> but it's not enough unfortunately. And I don't just mean with the later
>>> deadlock during bdrv_close() (via bdrv_cbw_drop()) as mentioned in the
>>> other mail.
>>>
>>> Even when I got lucky and that deadlock didn't trigger by chance or with
>>> an additional change to try and avoid that one
>>>
>>>> diff --git a/block.c b/block.c
>>>> index e7f349b25c..02d2c4e777 100644
>>>> --- a/block.c
>>>> +++ b/block.c
>>>> @@ -5165,7 +5165,7 @@ static void bdrv_close(BlockDriverState *bs)
>>>> bs->drv = NULL;
>>>> }
>>>> - bdrv_graph_wrlock(NULL);
>>>> + bdrv_graph_wrlock(bs);
>>>> QLIST_FOREACH_SAFE(child, &bs->children, next, next) {
>>>> bdrv_unref_child(bs, child);
>>>> }
>>>
>>> often guest IO would get completely stuck after canceling the backup.
>>> There's nothing obvious to me in the backtraces at that point, but it
>>> seems the vCPU and main threads running like usual, while the IO thread
>>> is stuck in aio_poll(), i.e. never returns from the __ppoll() call. This
>>> would happen with both, a VirtIO SCSI and a VirtIO block disk and with
>>> both aio=io_uring and aio=threads.
>>
>> When IO is stuck, it may be helpful to look at bs->tracked_requests list
>> in gdb. If there are requests, each one has field .co, which points to
>> the coroutine of request.
>
> After the IO was stuck in the guest, I used bdrv_next_all_states() to
> iterate over the states and there's only the bdrv_raw and the
> bdrv_host_device. For both, tracked_requests was empty.
>
> What is also very interesting is that the IO isn't always dead
> immediately. It can be that the fio command still runs with lower speed
> for a while (sometimes even up to about a minute, but most often about
> 10-15 seconds or so). During that time, I still can see calls to
> virtio_scsi_handle_cmd() and blk_aio_write_entry(). Then they suddenly stop.
>
>>>
>>> I should also mention I'm using
>>>
>>>> fio --name=file --size=4k --direct=1 --rw=randwrite --bs=4k
>>>> --ioengine=psync --numjobs=5 --runtime=6000 --time_based
>>>
>>> inside the guest during canceling of the backup.
>
> One single time, it got stuck polling while canceling [0]. I usually
> need to do the backup+cancel a few times, because the IO doesn't get
> stuck each time, so this was a subsequent invocation. The reentrancy to
> virtio_scsi_handle_cmd()/etc. seems strange at a first glance. Can that
> be normal?
>
I now ran into similar issues with mirror (both deadlocks and stuck
guest IO at other times), and interestingly, also during job start.
I also had a backtrace similar to [0] once, so I took a closer look.
It was probably obvious to others already, but for the record:
1. the graph is locked by the main thread
2. the iothread holds the AioContext lock
3. the main thread waits on the AioContext lock
4. the iothread waits for coroutine spawned by blk_is_available()
As for why it doesn't progress: blk_co_is_available_entry() uses
bdrv_graph_co_rdlock() and can't get it, because the main thread holds the
write lock. This should be fixed once the AioContext locks are gone, but I'm
not sure what can be done to avoid it until then.
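Schematically, the generated wrapper does something like the following
(heavily simplified; the helper struct and its fields are made up here, the
real code in block/block-gen.c looks different):

typedef struct {
    BlockBackend *blk;
    bool done;
    bool ret;
} IsAvailableCo; /* made-up helper struct, just for illustration */

static void coroutine_fn blk_co_is_available_entry(void *opaque)
{
    IsAvailableCo *s = opaque;

    /* blocks as long as the main thread holds the graph write lock */
    bdrv_graph_co_rdlock();
    s->ret = blk_co_is_available(s->blk);
    bdrv_graph_co_rdunlock();

    s->done = true;
    aio_wait_kick();
}

bool blk_is_available(BlockBackend *blk)
{
    IsAvailableCo s = { .blk = blk, .done = false };
    Coroutine *co = qemu_coroutine_create(blk_co_is_available_entry, &s);

    aio_co_enter(blk_get_aio_context(blk), co);
    /* the nested event loop from frames #4/#15 in [0] */
    AIO_WAIT_WHILE(blk_get_aio_context(blk), !s.done);
    return s.ret;
}

So the IO thread's nested aio_poll() can only return once the coroutine gets
the read lock, completing the cycle described in 1.-4. above.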
I'm still trying to figure out what happens in the cases where the guest
IO gets stuck. I should mention that I'm not experiencing the issues
when disabling graph locking. It's rather obvious for the deadlocks, but
I also can't reproduce the issues with stuck guest IO. I did it on top of
QEMU 8.1.1, because stuff like bdrv_schedule_unref() came in later and I
didn't want to mess that up :)
> Best Regards,
> Fiona
>
> [0]:
>
>> Thread 3 (Thread 0x7ff7f28946c0 (LWP 1815909) "qemu-system-x86"):
>> #0 0x00007ff7f5f55136 in __ppoll (fds=0x7ff7e40030c0, nfds=8, timeout=<optimized out>, sigmask=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:42
>> #1 0x00005587132615ab in qemu_poll_ns (fds=0x7ff7e40030c0, nfds=8, timeout=-1) at ../util/qemu-timer.c:339
>> #2 0x000055871323e8b1 in fdmon_poll_wait (ctx=0x55871598d5e0, ready_list=0x7ff7f288ebe0, timeout=-1) at ../util/fdmon-poll.c:79
>> #3 0x000055871323e1ed in aio_poll (ctx=0x55871598d5e0, blocking=true) at ../util/aio-posix.c:670
>> #4 0x0000558713089efa in bdrv_poll_co (s=0x7ff7f288ec90) at /home/febner/repos/qemu/block/block-gen.h:43
>> #5 0x000055871308c362 in blk_is_available (blk=0x55871599e2f0) at block/block-gen.c:1426
>> #6 0x0000558712f6843b in virtio_scsi_ctx_check (s=0x558716c049c0, d=0x55871581cd30) at ../hw/scsi/virtio-scsi.c:290
>> #7 0x0000558712f697e4 in virtio_scsi_handle_cmd_req_prepare (s=0x558716c049c0, req=0x7ff7e400b650) at ../hw/scsi/virtio-scsi.c:788
>> #8 0x0000558712f699b0 in virtio_scsi_handle_cmd_vq (s=0x558716c049c0, vq=0x558716c0d2a8) at ../hw/scsi/virtio-scsi.c:831
>> #9 0x0000558712f69bcb in virtio_scsi_handle_cmd (vdev=0x558716c049c0, vq=0x558716c0d2a8) at ../hw/scsi/virtio-scsi.c:867
>> #10 0x0000558712f96812 in virtio_queue_notify_vq (vq=0x558716c0d2a8) at ../hw/virtio/virtio.c:2263
>> #11 0x0000558712f99b75 in virtio_queue_host_notifier_read (n=0x558716c0d31c) at ../hw/virtio/virtio.c:3575
>> #12 0x000055871323d8b5 in aio_dispatch_handler (ctx=0x55871598d5e0, node=0x558716771000) at ../util/aio-posix.c:372
>> #13 0x000055871323d988 in aio_dispatch_ready_handlers (ctx=0x55871598d5e0, ready_list=0x7ff7f288eeb0) at ../util/aio-posix.c:401
>> #14 0x000055871323e3ff in aio_poll (ctx=0x55871598d5e0, blocking=true) at ../util/aio-posix.c:723
>> #15 0x0000558713089efa in bdrv_poll_co (s=0x7ff7f288ef60) at /home/febner/repos/qemu/block/block-gen.h:43
>> #16 0x000055871308c362 in blk_is_available (blk=0x55871599e2f0) at block/block-gen.c:1426
>> #17 0x0000558712f6843b in virtio_scsi_ctx_check (s=0x558716c049c0, d=0x55871581cd30) at ../hw/scsi/virtio-scsi.c:290
>> #18 0x0000558712f697e4 in virtio_scsi_handle_cmd_req_prepare (s=0x558716c049c0, req=0x7ff7e401c800) at ../hw/scsi/virtio-scsi.c:788
>> #19 0x0000558712f699b0 in virtio_scsi_handle_cmd_vq (s=0x558716c049c0, vq=0x558716c0d210) at ../hw/scsi/virtio-scsi.c:831
>> #20 0x0000558712f69bcb in virtio_scsi_handle_cmd (vdev=0x558716c049c0, vq=0x558716c0d210) at ../hw/scsi/virtio-scsi.c:867
>> #21 0x0000558712f96812 in virtio_queue_notify_vq (vq=0x558716c0d210) at ../hw/virtio/virtio.c:2263
>> #22 0x0000558712f99b75 in virtio_queue_host_notifier_read (n=0x558716c0d284) at ../hw/virtio/virtio.c:3575
>> #23 0x000055871323d8b5 in aio_dispatch_handler (ctx=0x55871598d5e0, node=0x5587162b5d80) at ../util/aio-posix.c:372
>> #24 0x000055871323d988 in aio_dispatch_ready_handlers (ctx=0x55871598d5e0, ready_list=0x7ff7f288f180) at ../util/aio-posix.c:401
>> #25 0x000055871323e3ff in aio_poll (ctx=0x55871598d5e0, blocking=true) at ../util/aio-posix.c:723
>> #26 0x00005587130878bf in iothread_run (opaque=0x55871563d6b0) at ../iothread.c:63
>> #27 0x00005587132434c0 in qemu_thread_start (args=0x55871598dc70) at ../util/qemu-thread-posix.c:541
>> #28 0x00007ff7f5ee2044 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
>> #29 0x00007ff7f5f625fc in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
>>
>> Thread 2 (Thread 0x7ff7f31966c0 (LWP 1815908) "qemu-system-x86"):
>> #0 syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
>> #1 0x0000558713243126 in qemu_futex_wait (f=0x558713c963f8 <rcu_call_ready_event>, val=4294967295) at /home/febner/repos/qemu/include/qemu/futex.h:29
>> #2 0x000055871324330d in qemu_event_wait (ev=0x558713c963f8 <rcu_call_ready_event>) at ../util/qemu-thread-posix.c:464
>> #3 0x00005587132506c4 in call_rcu_thread (opaque=0x0) at ../util/rcu.c:278
>> #4 0x00005587132434c0 in qemu_thread_start (args=0x5587156aeaa0) at ../util/qemu-thread-posix.c:541
>> #5 0x00007ff7f5ee2044 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
>> #6 0x00007ff7f5f625fc in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
>>
>> Thread 1 (Thread 0x7ff7f33fa340 (LWP 1815907) "qemu-system-x86"):
>> #0 futex_wait (private=0, expected=2, futex_word=0x55871598d640) at ../sysdeps/nptl/futex-internal.h:146
>> #1 __GI___lll_lock_wait (futex=futex@entry=0x55871598d640, private=0) at ./nptl/lowlevellock.c:49
>> #2 0x00007ff7f5ee532a in lll_mutex_lock_optimized (mutex=0x55871598d640) at ./nptl/pthread_mutex_lock.c:48
>> #3 ___pthread_mutex_lock (mutex=0x55871598d640) at ./nptl/pthread_mutex_lock.c:128
>> #4 0x0000558713242772 in qemu_mutex_lock_impl (mutex=0x55871598d640, file=0x55871351bfb9 "../util/async.c", line=728) at ../util/qemu-thread-posix.c:94
>> #5 0x00005587132429a2 in qemu_rec_mutex_lock_impl (mutex=0x55871598d640, file=0x55871351bfb9 "../util/async.c", line=728) at ../util/qemu-thread-posix.c:149
>> #6 0x000055871325c23d in aio_context_acquire (ctx=0x55871598d5e0) at ../util/async.c:728
>> #7 0x00005587130d1956 in bdrv_drain_all_end () at ../block/io.c:567
>> #8 0x00005587130cf1e8 in bdrv_graph_wrlock (bs=0x5587168255f0) at ../block/graph-lock.c:156
>> #9 0x0000558713099c8f in bdrv_close (bs=0x5587168255f0) at ../block.c:5169
>> #10 0x000055871309ad5a in bdrv_delete (bs=0x5587168255f0) at ../block.c:5609
>> #11 0x000055871309df21 in bdrv_unref (bs=0x5587168255f0) at ../block.c:7178
>> #12 0x00005587130ca195 in bdrv_cbw_drop (bs=0x5587168255f0) at ../block/copy-before-write.c:566
>> #13 0x00005587130b8430 in backup_clean (job=0x558716d54210) at ../block/backup.c:105
>> #14 0x00005587130a58b9 in job_clean (job=0x558716d54210) at ../job.c:836
>> #15 0x00005587130a5962 in job_finalize_single_locked (job=0x558716d54210) at ../job.c:863
>> #16 0x00005587130a5c7c in job_completed_txn_abort_locked (job=0x558716d54210) at ../job.c:970
>> #17 0x00005587130a60f5 in job_completed_locked (job=0x558716d54210) at ../job.c:1080
>> #18 0x00005587130a61aa in job_exit (opaque=0x558716d54210) at ../job.c:1103
>> #19 0x000055871325b0cf in aio_bh_call (bh=0x7ff7e4004bd0) at ../util/async.c:169
>> #20 0x000055871325b1ea in aio_bh_poll (ctx=0x55871571a8a0) at ../util/async.c:216
>> #21 0x000055871323da97 in aio_dispatch (ctx=0x55871571a8a0) at ../util/aio-posix.c:423
>> #22 0x000055871325b629 in aio_ctx_dispatch (source=0x55871571a8a0, callback=0x0, user_data=0x0) at ../util/async.c:358
>> #23 0x00007ff7f72bf7a9 in g_main_context_dispatch () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
>> #24 0x000055871325cde4 in glib_pollfds_poll () at ../util/main-loop.c:290
>> #25 0x000055871325ce61 in os_host_main_loop_wait (timeout=1412902959) at ../util/main-loop.c:313
>> #26 0x000055871325cf6f in main_loop_wait (nonblocking=0) at ../util/main-loop.c:592
>> #27 0x0000558712d813de in qemu_main_loop () at ../softmmu/runstate.c:772
>> #28 0x000055871303811b in qemu_default_main () at ../softmmu/main.c:37
>> #29 0x0000558713038151 in main (argc=57, argv=0x7fffd393e828) at ../softmmu/main.c:48
>> (gdb) p ((AioContext*)0x55871598d5e0)->lock
>> $1 = {m = {lock = {__data = {__lock = 2, __count = 2, __owner = 1815909, __nusers = 1, __kind = 1,
>> __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}},
>> __size = "\002\000\000\000\002\000\000\000e\265\033\000\001\000\000\000\001", '\000' <repeats 22 times>, __align = 8589934594}, file = 0x0, line = 0, initialized = true}}
>
>
>
* Re: deadlock when using iothread during backup_clean()
2023-10-17 10:18 ` Fiona Ebner
@ 2023-10-17 12:12 ` Kevin Wolf
2023-10-17 13:37 ` Fiona Ebner
0 siblings, 1 reply; 16+ messages in thread
From: Kevin Wolf @ 2023-10-17 12:12 UTC (permalink / raw)
To: Fiona Ebner
Cc: Vladimir Sementsov-Ogievskiy, Paolo Bonzini, QEMU Developers,
Hanna Reitz, John Snow, open list:Network Block Dev...,
Thomas Lamprecht
On 17.10.2023 at 12:18, Fiona Ebner wrote:
> Am 06.10.23 um 14:18 schrieb Fiona Ebner:
> > Am 04.10.23 um 19:08 schrieb Vladimir Sementsov-Ogievskiy:
> >> On 28.09.23 11:06, Fiona Ebner wrote:
> >>> For fixing the backup cancel deadlock, I tried the following:
> >>>
> >>>> diff --git a/blockjob.c b/blockjob.c
> >>>> index 58c5d64539..fd6132ebfe 100644
> >>>> --- a/blockjob.c
> >>>> +++ b/blockjob.c
> >>>> @@ -198,7 +198,9 @@ void block_job_remove_all_bdrv(BlockJob *job)
> >>>> * one to make sure that such a concurrent access does not attempt
> >>>> * to process an already freed BdrvChild.
> >>>> */
> >>>> + aio_context_release(job->job.aio_context);
> >>>> bdrv_graph_wrlock(NULL);
> >>>> + aio_context_acquire(job->job.aio_context);
> >>>> while (job->nodes) {
> >>>> GSList *l = job->nodes;
> >>>> BdrvChild *c = l->data;
> >>>
> >>> but it's not enough unfortunately. And I don't just mean with the later
> >>> deadlock during bdrv_close() (via bdrv_cbw_drop()) as mentioned in the
> >>> other mail.
> >>>
> >>> Even when I got lucky and that deadlock didn't trigger by chance or with
> >>> an additional change to try and avoid that one
> >>>
> >>>> diff --git a/block.c b/block.c
> >>>> index e7f349b25c..02d2c4e777 100644
> >>>> --- a/block.c
> >>>> +++ b/block.c
> >>>> @@ -5165,7 +5165,7 @@ static void bdrv_close(BlockDriverState *bs)
> >>>> bs->drv = NULL;
> >>>> }
> >>>> - bdrv_graph_wrlock(NULL);
> >>>> + bdrv_graph_wrlock(bs);
> >>>> QLIST_FOREACH_SAFE(child, &bs->children, next, next) {
> >>>> bdrv_unref_child(bs, child);
> >>>> }
> >>>
> >>> often guest IO would get completely stuck after canceling the backup.
> >>> There's nothing obvious to me in the backtraces at that point, but it
> >>> seems the vCPU and main threads running like usual, while the IO thread
> >>> is stuck in aio_poll(), i.e. never returns from the __ppoll() call. This
> >>> would happen with both, a VirtIO SCSI and a VirtIO block disk and with
> >>> both aio=io_uring and aio=threads.
> >>
> >> When IO is stuck, it may be helpful to look at bs->tracked_requests list
> >> in gdb. If there are requests, each one has field .co, which points to
> >> the coroutine of request.
> >
> > After the IO was stuck in the guest, I used bdrv_next_all_states() to
> > iterate over the states and there's only the bdrv_raw and the
> > bdrv_host_device. For both, tracked_requests was empty.
> >
> > What is also very interesting is that the IO isn't always dead
> > immediately. It can be that the fio command still runs with lower speed
> > for a while (sometimes even up to about a minute, but most often about
> > 10-15 seconds or so). During that time, I still can see calls to
> > virtio_scsi_handle_cmd() and blk_aio_write_entry(). Then they suddenly stop.
> >
> >>>
> >>> I should also mention I'm using
> >>>
> >>>> fio --name=file --size=4k --direct=1 --rw=randwrite --bs=4k
> >>>> --ioengine=psync --numjobs=5 --runtime=6000 --time_based
> >>>
> >>> inside the guest during canceling of the backup.
> >
> > One single time, it got stuck polling while canceling [0]. I usually
> > need to do the backup+cancel a few times, because the IO doesn't get
> > stuck each time, so this was a subsequent invocation. The reentrancy to
> > virtio_scsi_handle_cmd()/etc. seems strange at a first glance. Can that
> > be normal?
> >
>
> I ran into similar issues now with mirror, (both deadlocks and stuck
> guest IO at other times), and interestingly, also during job start.
>
> Also had a backtrace similar to [0] once, so I took a closer look.
> Probably was obvious to others already, but for the record:
>
> 1. the graph is locked by the main thread
> 2. the iothread holds the AioContext lock
> 3. the main thread waits on the AioContext lock
> 4. the iothread waits for coroutine spawned by blk_is_available()
Where does this blk_is_available() in the iothread come from? Having it
wait without dropping the AioContext lock sounds like something that
we'd want to avoid. Ideally, devices using iothreads shouldn't use
synchronous requests at all, but I think scsi-disk might have some of
them.
> As for why it doesn't progress, blk_co_is_available_entry() uses
> bdrv_graph_co_rdlock() and can't get it, because the main thread has the
> write lock. Should be fixed once the AioContext locks are gone, but not
> sure what should be done to avoid it until then.
Then the nested event loop in blk_is_available() would probably be
enough to make progress, yes.
Maybe we could actually drop the lock (and immediately reacquire it) in
AIO_WAIT_WHILE() even if we're in the home thread? That should give the
main thread a chance to make progress.
But I think we're actually not very far from removing the AioContext
lock, so if we can't find an easy fix in the meantime, waiting for that
could be a realistic option.
> I'm still trying to figure out what happens in the cases where the guest
> IO gets stuck. I should mention that I'm not experiencing the issues
> when disabling graph locking. It's rather obvious for the deadlocks, but
> I also can't reproduce the issues with stuck guest IO. Did it on top of
> QEMU 8.1.1, because stuff like bdrv_schedule_unref() came in later and I
> didn't want to mess that up :)
Kevin
* Re: deadlock when using iothread during backup_clean()
2023-10-17 12:12 ` Kevin Wolf
@ 2023-10-17 13:37 ` Fiona Ebner
2023-10-17 14:20 ` Kevin Wolf
0 siblings, 1 reply; 16+ messages in thread
From: Fiona Ebner @ 2023-10-17 13:37 UTC (permalink / raw)
To: Kevin Wolf
Cc: Vladimir Sementsov-Ogievskiy, Paolo Bonzini, QEMU Developers,
Hanna Reitz, John Snow, open list:Network Block Dev...,
Thomas Lamprecht
On 17.10.23 at 14:12, Kevin Wolf wrote:
> Am 17.10.2023 um 12:18 hat Fiona Ebner geschrieben:
>> I ran into similar issues now with mirror, (both deadlocks and stuck
>> guest IO at other times), and interestingly, also during job start.
>>
>> Also had a backtrace similar to [0] once, so I took a closer look.
>> Probably was obvious to others already, but for the record:
>>
>> 1. the graph is locked by the main thread
>> 2. the iothread holds the AioContext lock
>> 3. the main thread waits on the AioContext lock
>> 4. the iothread waits for coroutine spawned by blk_is_available()
>
> Where does this blk_is_available() in the iothread come from? Having it
> wait without dropping the AioContext lock sounds like something that
> we'd want to avoid. Ideally, devices using iothreads shouldn't use
> synchronous requests at all, but I think scsi-disk might have some of
> them.
>
It's part of the request handling in virtio-scsi:
> #0 0x00007ff7f5f55136 in __ppoll (fds=0x7ff7e40030c0, nfds=8, timeout=<optimized out>, sigmask=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:42
> #1 0x00005587132615ab in qemu_poll_ns (fds=0x7ff7e40030c0, nfds=8, timeout=-1) at ../util/qemu-timer.c:339
> #2 0x000055871323e8b1 in fdmon_poll_wait (ctx=0x55871598d5e0, ready_list=0x7ff7f288ebe0, timeout=-1) at ../util/fdmon-poll.c:79
> #3 0x000055871323e1ed in aio_poll (ctx=0x55871598d5e0, blocking=true) at ../util/aio-posix.c:670
> #4 0x0000558713089efa in bdrv_poll_co (s=0x7ff7f288ec90) at /home/febner/repos/qemu/block/block-gen.h:43
> #5 0x000055871308c362 in blk_is_available (blk=0x55871599e2f0) at block/block-gen.c:1426
> #6 0x0000558712f6843b in virtio_scsi_ctx_check (s=0x558716c049c0, d=0x55871581cd30) at ../hw/scsi/virtio-scsi.c:290
> #7 0x0000558712f697e4 in virtio_scsi_handle_cmd_req_prepare (s=0x558716c049c0, req=0x7ff7e400b650) at ../hw/scsi/virtio-scsi.c:788
> #8 0x0000558712f699b0 in virtio_scsi_handle_cmd_vq (s=0x558716c049c0, vq=0x558716c0d2a8) at ../hw/scsi/virtio-scsi.c:831
> #9 0x0000558712f69bcb in virtio_scsi_handle_cmd (vdev=0x558716c049c0, vq=0x558716c0d2a8) at ../hw/scsi/virtio-scsi.c:867
> #10 0x0000558712f96812 in virtio_queue_notify_vq (vq=0x558716c0d2a8) at ../hw/virtio/virtio.c:2263
> #11 0x0000558712f99b75 in virtio_queue_host_notifier_read (n=0x558716c0d31c) at ../hw/virtio/virtio.c:3575
> #12 0x000055871323d8b5 in aio_dispatch_handler (ctx=0x55871598d5e0, node=0x558716771000) at ../util/aio-posix.c:372
> #13 0x000055871323d988 in aio_dispatch_ready_handlers (ctx=0x55871598d5e0, ready_list=0x7ff7f288eeb0) at ../util/aio-posix.c:401
>> As for why it doesn't progress, blk_co_is_available_entry() uses
>> bdrv_graph_co_rdlock() and can't get it, because the main thread has the
>> write lock. Should be fixed once the AioContext locks are gone, but not
>> sure what should be done to avoid it until then.
>
> Then the nested event loop in blk_is_available() would probably be
> enough to make progress, yes.
>
> Maybe we could actually drop the lock (and immediately reacquire it) in
> AIO_WAIT_WHILE() even if we're in the home thread? That should give the
> main thread a chance to make progress.
>
Seems to work :) I haven't run into the issue with the following change
anymore, but I have to say, running into that specific deadlock only
happened every 10-15 tries or so before. I did 30 tests now. But
unfortunately, the stuck IO issue is still there.
> diff --git a/include/block/aio-wait.h b/include/block/aio-wait.h
> index 5449b6d742..da159501ca 100644
> --- a/include/block/aio-wait.h
> +++ b/include/block/aio-wait.h
> @@ -88,7 +88,13 @@ extern AioWait global_aio_wait;
> smp_mb__after_rmw(); \
> if (ctx_ && in_aio_context_home_thread(ctx_)) { \
> while ((cond)) { \
> + if (unlock && ctx_) { \
> + aio_context_release(ctx_); \
> + } \
> aio_poll(ctx_, true); \
> + if (unlock && ctx_) { \
> + aio_context_acquire(ctx_); \
> + } \
> waited_ = true; \
> } \
> } else { \
>
> But I think we're actually not very far from removing the AioContext
> lock, so if we can't find an easy fix in the meantime, waiting for that
> could be a realistic option.
>
Glad to hear :) Do you think it will be in time for QEMU 8.2? Otherwise,
I'll go ahead and send what I have for fixing the deadlocks from this
mail thread in the following days. The stuck guest IO can happen even
without any of those changes (on current master, i.e.
ebca80bbdb5c1650e4b753a3d13b43634e7dfe05, at least when starting a
mirror job).
Best Regards,
Fiona
* Re: deadlock when using iothread during backup_clean()
2023-10-17 13:37 ` Fiona Ebner
@ 2023-10-17 14:20 ` Kevin Wolf
2023-10-18 9:42 ` Fiona Ebner
0 siblings, 1 reply; 16+ messages in thread
From: Kevin Wolf @ 2023-10-17 14:20 UTC (permalink / raw)
To: Fiona Ebner
Cc: Vladimir Sementsov-Ogievskiy, Paolo Bonzini, QEMU Developers,
Hanna Reitz, John Snow, open list:Network Block Dev...,
Thomas Lamprecht
On 17.10.2023 at 15:37, Fiona Ebner wrote:
> Am 17.10.23 um 14:12 schrieb Kevin Wolf:
> > Am 17.10.2023 um 12:18 hat Fiona Ebner geschrieben:
> >> I ran into similar issues now with mirror, (both deadlocks and stuck
> >> guest IO at other times), and interestingly, also during job start.
> >>
> >> Also had a backtrace similar to [0] once, so I took a closer look.
> >> Probably was obvious to others already, but for the record:
> >>
> >> 1. the graph is locked by the main thread
> >> 2. the iothread holds the AioContext lock
> >> 3. the main thread waits on the AioContext lock
> >> 4. the iothread waits for coroutine spawned by blk_is_available()
> >
> > Where does this blk_is_available() in the iothread come from? Having it
> > wait without dropping the AioContext lock sounds like something that
> > we'd want to avoid. Ideally, devices using iothreads shouldn't use
> > synchronous requests at all, but I think scsi-disk might have some of
> > them.
> >
>
> It's part of the request handling in virtio-scsi:
>
> > #0 0x00007ff7f5f55136 in __ppoll (fds=0x7ff7e40030c0, nfds=8, timeout=<optimized out>, sigmask=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:42
> > #1 0x00005587132615ab in qemu_poll_ns (fds=0x7ff7e40030c0, nfds=8, timeout=-1) at ../util/qemu-timer.c:339
> > #2 0x000055871323e8b1 in fdmon_poll_wait (ctx=0x55871598d5e0, ready_list=0x7ff7f288ebe0, timeout=-1) at ../util/fdmon-poll.c:79
> > #3 0x000055871323e1ed in aio_poll (ctx=0x55871598d5e0, blocking=true) at ../util/aio-posix.c:670
> > #4 0x0000558713089efa in bdrv_poll_co (s=0x7ff7f288ec90) at /home/febner/repos/qemu/block/block-gen.h:43
> > #5 0x000055871308c362 in blk_is_available (blk=0x55871599e2f0) at block/block-gen.c:1426
> > #6 0x0000558712f6843b in virtio_scsi_ctx_check (s=0x558716c049c0, d=0x55871581cd30) at ../hw/scsi/virtio-scsi.c:290
Oh... So essentially for an assertion.
I wonder if the blk_is_available() check introduced in 2a2d69f490c is
even necessary any more, because BlockBackend has its own AioContext
now. And if the check is necessary, I wonder whether blk_bs(blk) != NULL
isn't what we actually want to check instead, because calling
bdrv_is_inserted() doesn't seem to have been intended. blk_bs() wouldn't
have to poll.
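I.e. roughly something like this (from memory and completely untested,
assuming the rest of virtio_scsi_ctx_check() can stay as it is):

static void virtio_scsi_ctx_check(VirtIOSCSI *s, SCSIDevice *d)
{
    if (s->dataplane_started && d && blk_bs(d->conf.blk)) {
        /* blk_bs() is a plain accessor, so no nested event loop any more */
        assert(blk_get_aio_context(d->conf.blk) == s->ctx);
    }
}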
> > #7 0x0000558712f697e4 in virtio_scsi_handle_cmd_req_prepare (s=0x558716c049c0, req=0x7ff7e400b650) at ../hw/scsi/virtio-scsi.c:788
> > #8 0x0000558712f699b0 in virtio_scsi_handle_cmd_vq (s=0x558716c049c0, vq=0x558716c0d2a8) at ../hw/scsi/virtio-scsi.c:831
> > #9 0x0000558712f69bcb in virtio_scsi_handle_cmd (vdev=0x558716c049c0, vq=0x558716c0d2a8) at ../hw/scsi/virtio-scsi.c:867
> > #10 0x0000558712f96812 in virtio_queue_notify_vq (vq=0x558716c0d2a8) at ../hw/virtio/virtio.c:2263
> > #11 0x0000558712f99b75 in virtio_queue_host_notifier_read (n=0x558716c0d31c) at ../hw/virtio/virtio.c:3575
> > #12 0x000055871323d8b5 in aio_dispatch_handler (ctx=0x55871598d5e0, node=0x558716771000) at ../util/aio-posix.c:372
> > #13 0x000055871323d988 in aio_dispatch_ready_handlers (ctx=0x55871598d5e0, ready_list=0x7ff7f288eeb0) at ../util/aio-posix.c:401
>
>
> >> As for why it doesn't progress, blk_co_is_available_entry() uses
> >> bdrv_graph_co_rdlock() and can't get it, because the main thread has the
> >> write lock. Should be fixed once the AioContext locks are gone, but not
> >> sure what should be done to avoid it until then.
> >
> > Then the nested event loop in blk_is_available() would probably be
> > enough to make progress, yes.
> >
> > Maybe we could actually drop the lock (and immediately reacquire it) in
> > AIO_WAIT_WHILE() even if we're in the home thread? That should give the
> > main thread a chance to make progress.
>
> Seems to work :) I haven't run into the issue with the following change
> anymore, but I have to say, running into that specific deadlock only
> happened every 10-15 tries or so before. Did 30 tests now. But
> unfortunately, the stuck IO issue is still there.
>
> > diff --git a/include/block/aio-wait.h b/include/block/aio-wait.h
> > index 5449b6d742..da159501ca 100644
> > --- a/include/block/aio-wait.h
> > +++ b/include/block/aio-wait.h
> > @@ -88,7 +88,13 @@ extern AioWait global_aio_wait;
> > smp_mb__after_rmw(); \
> > if (ctx_ && in_aio_context_home_thread(ctx_)) { \
> > while ((cond)) { \
> > + if (unlock && ctx_) { \
> > + aio_context_release(ctx_); \
> > + } \
> > aio_poll(ctx_, true); \
> > + if (unlock && ctx_) { \
> > + aio_context_acquire(ctx_); \
> > + } \
> > waited_ = true; \
> > } \
> > } else { \
For reacquiring the lock, I really meant "immediately". Calling
aio_poll() without the lock is wrong.
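In other words, something like this in the loop body (sketch only,
keeping the unlock condition from your patch):

    while ((cond)) {
        if (unlock && ctx_) {
            /* Open a brief window for other threads waiting on the
             * AioContext lock, then take it back right away. */
            aio_context_release(ctx_);
            aio_context_acquire(ctx_);
        }
        /* aio_poll() itself still runs with the lock held. */
        aio_poll(ctx_, true);
        waited_ = true;
    }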
What does the stuck I/O look like? Is it stuck in the backend, i.e. the
device started requests that never complete? Or stuck from the guest
perspective, i.e. the device never checks for new requests?
I don't really have an idea immediately, we'd have to find out where the
stuck I/O stops being processed.
> > But I think we're actually not very far from removing the AioContext
> > lock, so if we can't find an easy fix in the meantime, waiting for that
> > could be a realistic option.
> >
>
> Glad to hear :) Do you think it will be in time for QEMU 8.2? Otherwise,
> I'll go ahead and send what I have for fixing the deadlocks from this
> mail thread in the following days. The stuck guest IO can happen even
> without any of those changes (on current master, i.e.
> ebca80bbdb5c1650e4b753a3d13b43634e7dfe05, at least when starting a
> mirror job).
Having it in 8.2 is certainly the plan, but plans don't always work out.
If you have fixes that aren't too ugly, we can still apply them.
Kevin
* Re: deadlock when using iothread during backup_clean()
2023-10-17 14:20 ` Kevin Wolf
@ 2023-10-18 9:42 ` Fiona Ebner
2023-10-19 12:14 ` Kevin Wolf
0 siblings, 1 reply; 16+ messages in thread
From: Fiona Ebner @ 2023-10-18 9:42 UTC (permalink / raw)
To: Kevin Wolf
Cc: Vladimir Sementsov-Ogievskiy, Paolo Bonzini, QEMU Developers,
Hanna Reitz, John Snow, open list:Network Block Dev...,
Thomas Lamprecht
On 17.10.23 at 16:20, Kevin Wolf wrote:
> On 17.10.2023 at 15:37, Fiona Ebner wrote:
>> On 17.10.23 at 14:12, Kevin Wolf wrote:
>>> On 17.10.2023 at 12:18, Fiona Ebner wrote:
>>>> I ran into similar issues now with mirror, (both deadlocks and stuck
>>>> guest IO at other times), and interestingly, also during job start.
>>>>
>>>> Also had a backtrace similar to [0] once, so I took a closer look.
>>>> Probably was obvious to others already, but for the record:
>>>>
>>>> 1. the graph is locked by the main thread
>>>> 2. the iothread holds the AioContext lock
>>>> 3. the main thread waits on the AioContext lock
>>>> 4. the iothread waits for coroutine spawned by blk_is_available()
>>>
>>> Where does this blk_is_available() in the iothread come from? Having it
>>> wait without dropping the AioContext lock sounds like something that
>>> we'd want to avoid. Ideally, devices using iothreads shouldn't use
>>> synchronous requests at all, but I think scsi-disk might have some of
>>> them.
>>>
>>
>> It's part of the request handling in virtio-scsi:
>>
>>> #0 0x00007ff7f5f55136 in __ppoll (fds=0x7ff7e40030c0, nfds=8, timeout=<optimized out>, sigmask=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:42
>>> #1 0x00005587132615ab in qemu_poll_ns (fds=0x7ff7e40030c0, nfds=8, timeout=-1) at ../util/qemu-timer.c:339
>>> #2 0x000055871323e8b1 in fdmon_poll_wait (ctx=0x55871598d5e0, ready_list=0x7ff7f288ebe0, timeout=-1) at ../util/fdmon-poll.c:79
>>> #3 0x000055871323e1ed in aio_poll (ctx=0x55871598d5e0, blocking=true) at ../util/aio-posix.c:670
>>> #4 0x0000558713089efa in bdrv_poll_co (s=0x7ff7f288ec90) at /home/febner/repos/qemu/block/block-gen.h:43
>>> #5 0x000055871308c362 in blk_is_available (blk=0x55871599e2f0) at block/block-gen.c:1426
>>> #6 0x0000558712f6843b in virtio_scsi_ctx_check (s=0x558716c049c0, d=0x55871581cd30) at ../hw/scsi/virtio-scsi.c:290
>
> Oh... So essentially for an assertion.
>
> I wonder if the blk_is_available() check introduced in 2a2d69f490c is
> even necessary any more, because BlockBackend has its own AioContext
> now. And if blk_bs(blk) != NULL isn't what we actually want to check if
> the check is necessary, because calling bdrv_is_inserted() doesn't seem
> to have been intended. blk_bs() wouldn't have to poll.
>
Could virtio_scsi_hotunplug() be an issue with removing or modifying the
check? There's a call there which sets the blk's AioContext to
qemu_get_aio_context(). Or are we sure that the assert in
virtio_scsi_ctx_check() can't be reached after that?
>>> #7 0x0000558712f697e4 in virtio_scsi_handle_cmd_req_prepare (s=0x558716c049c0, req=0x7ff7e400b650) at ../hw/scsi/virtio-scsi.c:788
>>> #8 0x0000558712f699b0 in virtio_scsi_handle_cmd_vq (s=0x558716c049c0, vq=0x558716c0d2a8) at ../hw/scsi/virtio-scsi.c:831
>>> #9 0x0000558712f69bcb in virtio_scsi_handle_cmd (vdev=0x558716c049c0, vq=0x558716c0d2a8) at ../hw/scsi/virtio-scsi.c:867
>>> #10 0x0000558712f96812 in virtio_queue_notify_vq (vq=0x558716c0d2a8) at ../hw/virtio/virtio.c:2263
>>> #11 0x0000558712f99b75 in virtio_queue_host_notifier_read (n=0x558716c0d31c) at ../hw/virtio/virtio.c:3575
>>> #12 0x000055871323d8b5 in aio_dispatch_handler (ctx=0x55871598d5e0, node=0x558716771000) at ../util/aio-posix.c:372
>>> #13 0x000055871323d988 in aio_dispatch_ready_handlers (ctx=0x55871598d5e0, ready_list=0x7ff7f288eeb0) at ../util/aio-posix.c:401
>>
>>
>>>> As for why it doesn't progress, blk_co_is_available_entry() uses
>>>> bdrv_graph_co_rdlock() and can't get it, because the main thread has the
>>>> write lock. Should be fixed once the AioContext locks are gone, but not
>>>> sure what should be done to avoid it until then.
>>>
>>> Then the nested event loop in blk_is_available() would probably be
>>> enough to make progress, yes.
>>>
>>> Maybe we could actually drop the lock (and immediately reacquire it) in
>>> AIO_WAIT_WHILE() even if we're in the home thread? That should give the
>>> main thread a chance to make progress.
>>
>> Seems to work :) I haven't run into the issue with the following change
>> anymore, but I have to say, running into that specific deadlock only
>> happened every 10-15 tries or so before. Did 30 tests now. But
>> unfortunately, the stuck IO issue is still there.
>>
>>> diff --git a/include/block/aio-wait.h b/include/block/aio-wait.h
>>> index 5449b6d742..da159501ca 100644
>>> --- a/include/block/aio-wait.h
>>> +++ b/include/block/aio-wait.h
>>> @@ -88,7 +88,13 @@ extern AioWait global_aio_wait;
>>> smp_mb__after_rmw(); \
>>> if (ctx_ && in_aio_context_home_thread(ctx_)) { \
>>> while ((cond)) { \
>>> + if (unlock && ctx_) { \
>>> + aio_context_release(ctx_); \
>>> + } \
>>> aio_poll(ctx_, true); \
>>> + if (unlock && ctx_) { \
>>> + aio_context_acquire(ctx_); \
>>> + } \
>>> waited_ = true; \
>>> } \
>>> } else { \
>
> For reacquiring the lock, I really meant "immediately". Calling
> aio_poll() without the lock is wrong.
Unfortunately, then it's not enough, because the call to aio_poll() is
blocking and because the lock is held during that call, the very same
deadlock can still happen.
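To spell out what I mean (same sketch as above, just annotated):

    if (unlock && ctx_) {
        aio_context_release(ctx_);
        aio_context_acquire(ctx_);   /* reacquired right away */
    }
    /* Blocks while ctx_ is still held: the main thread keeps waiting for
     * ctx_, the graph write lock is never released, and the coroutine we
     * poll for can't take the reader lock - same deadlock as before. */
    aio_poll(ctx_, true);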
>
> What does the stuck I/O look like? Is it stuck in the backend, i.e. the
> device started requests that never complete? Or stuck from the guest
> perspective, i.e. the device never checks for new requests?
>
AFAICT, from the guest perspective.
> I don't really have an idea immediately, we'd have to find out where the
> stuck I/O stops being processed.
>
I've described it in an earlier mail in this thread:
https://lists.nongnu.org/archive/html/qemu-devel/2023-10/msg01900.html
Quoting from there:
> After the IO was stuck in the guest, I used bdrv_next_all_states() to
> iterate over the states and there's only the bdrv_raw and the
> bdrv_host_device. For both, tracked_requests was empty.
>
> What is also very interesting is that the IO isn't always dead
> immediately. It can be that the fio command still runs with lower speed
> for a while (sometimes even up to about a minute, but most often about
> 10-15 seconds or so). During that time, I still can see calls to
> virtio_scsi_handle_cmd() and blk_aio_write_entry(). Then they suddenly stop.
Noting again that (at least for backup) it happens with both virtio-blk
and virtio-scsi and with both aio=io_uring and aio=threads. I also tried
different host kernels 5.15, 6.2 and 6.5 and guest kernels 5.10 and 6.1.
>>> But I think we're actually not very far from removing the AioContext
>>> lock, so if we can't find an easy fix in the meantime, waiting for that
>>> could be a realistic option.
>>>
>>
>> Glad to hear :) Do you think it will be in time for QEMU 8.2? Otherwise,
>> I'll go ahead and send what I have for fixing the deadlocks from this
>> mail thread in the following days. The stuck guest IO can happen even
>> without any of those changes (on current master, i.e.
>> ebca80bbdb5c1650e4b753a3d13b43634e7dfe05, at least when starting a
>> mirror job).
>
> Having it in 8.2 is certainly the plan, but plans don't always work out.
> If you have fixes that aren't too ugly, we can still apply them.
>
Okay, I'll give the ones for the deadlocks via
block_job_remove_all_bdrv() and bdrv_close() another look/spin and try
to send them in the next few days.
Best Regards,
Fiona
* Re: deadlock when using iothread during backup_clean()
2023-10-18 9:42 ` Fiona Ebner
@ 2023-10-19 12:14 ` Kevin Wolf
2023-10-19 13:53 ` Fiona Ebner
0 siblings, 1 reply; 16+ messages in thread
From: Kevin Wolf @ 2023-10-19 12:14 UTC (permalink / raw)
To: Fiona Ebner
Cc: Vladimir Sementsov-Ogievskiy, Paolo Bonzini, QEMU Developers,
Hanna Reitz, John Snow, open list:Network Block Dev...,
Thomas Lamprecht
On 18.10.2023 at 11:42, Fiona Ebner wrote:
> On 17.10.23 at 16:20, Kevin Wolf wrote:
> > On 17.10.2023 at 15:37, Fiona Ebner wrote:
> >> On 17.10.23 at 14:12, Kevin Wolf wrote:
> >>> On 17.10.2023 at 12:18, Fiona Ebner wrote:
> >>>> I ran into similar issues now with mirror, (both deadlocks and stuck
> >>>> guest IO at other times), and interestingly, also during job start.
> >>>>
> >>>> Also had a backtrace similar to [0] once, so I took a closer look.
> >>>> Probably was obvious to others already, but for the record:
> >>>>
> >>>> 1. the graph is locked by the main thread
> >>>> 2. the iothread holds the AioContext lock
> >>>> 3. the main thread waits on the AioContext lock
> >>>> 4. the iothread waits for coroutine spawned by blk_is_available()
> >>>
> >>> Where does this blk_is_available() in the iothread come from? Having it
> >>> wait without dropping the AioContext lock sounds like something that
> >>> we'd want to avoid. Ideally, devices using iothreads shouldn't use
> >>> synchronous requests at all, but I think scsi-disk might have some of
> >>> them.
> >>>
> >>
> >> It's part of the request handling in virtio-scsi:
> >>
> >>> #0 0x00007ff7f5f55136 in __ppoll (fds=0x7ff7e40030c0, nfds=8, timeout=<optimized out>, sigmask=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:42
> >>> #1 0x00005587132615ab in qemu_poll_ns (fds=0x7ff7e40030c0, nfds=8, timeout=-1) at ../util/qemu-timer.c:339
> >>> #2 0x000055871323e8b1 in fdmon_poll_wait (ctx=0x55871598d5e0, ready_list=0x7ff7f288ebe0, timeout=-1) at ../util/fdmon-poll.c:79
> >>> #3 0x000055871323e1ed in aio_poll (ctx=0x55871598d5e0, blocking=true) at ../util/aio-posix.c:670
> >>> #4 0x0000558713089efa in bdrv_poll_co (s=0x7ff7f288ec90) at /home/febner/repos/qemu/block/block-gen.h:43
> >>> #5 0x000055871308c362 in blk_is_available (blk=0x55871599e2f0) at block/block-gen.c:1426
> >>> #6 0x0000558712f6843b in virtio_scsi_ctx_check (s=0x558716c049c0, d=0x55871581cd30) at ../hw/scsi/virtio-scsi.c:290
> >
> > Oh... So essentially for an assertion.
> >
> > I wonder if the blk_is_available() check introduced in 2a2d69f490c is
> > even necessary any more, because BlockBackend has its own AioContext
> > now. And if blk_bs(blk) != NULL isn't what we actually want to check if
> > the check is necessary, because calling bdrv_is_inserted() doesn't seem
> > to have been intended. blk_bs() wouldn't have to poll.
> >
>
> Could virtio_scsi_hotunplug() be an issue with removing or modifying
> the check? There's a call there which sets the blk's AioContext to
> qemu_get_aio_context(). Or are we sure that the assert in
> virtio_scsi_ctx_check() can't be reached after that?
I think that would be the kind of bug that the assertion tries to
catch, because then we would be sending requests to blk from a thread
that doesn't match its AioContext (which will be allowed soon, but not
quite yet).
Before resetting the AioContext, virtio_scsi_hotunplug() calls
qdev_simple_device_unplug_cb(), which unrealizes the SCSI device. This
calls scsi_qdev_unrealize() -> scsi_device_purge_requests(), which in
turn drains blk and cancels all pending requests. So there should be
nothing left that could call into virtio_scsi_ctx_check() any more.
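Schematically (not the literal code, just the ordering that matters
here; error handling omitted):

static void virtio_scsi_hotunplug(HotplugHandler *hotplug_dev,
                                  DeviceState *dev, Error **errp)
{
    SCSIDevice *sd = SCSI_DEVICE(dev);

    /* Unrealizes the SCSI device:
     *   scsi_qdev_unrealize() -> scsi_device_purge_requests()
     * which drains the BlockBackend and cancels pending requests. */
    qdev_simple_device_unplug_cb(hotplug_dev, dev, errp);

    /* Only afterwards is the BlockBackend moved back to the main loop,
     * so nothing should reach virtio_scsi_ctx_check() in between. */
    blk_set_aio_context(sd->conf.blk, qemu_get_aio_context(), NULL);
}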
The other argument is that after unrealize, virtio_scsi_device_get()
would return NULL anyway, so even if a request were still pending, it
would just fail instead of accessing the unplugged device.
> >>> #7 0x0000558712f697e4 in virtio_scsi_handle_cmd_req_prepare (s=0x558716c049c0, req=0x7ff7e400b650) at ../hw/scsi/virtio-scsi.c:788
> >>> #8 0x0000558712f699b0 in virtio_scsi_handle_cmd_vq (s=0x558716c049c0, vq=0x558716c0d2a8) at ../hw/scsi/virtio-scsi.c:831
> >>> #9 0x0000558712f69bcb in virtio_scsi_handle_cmd (vdev=0x558716c049c0, vq=0x558716c0d2a8) at ../hw/scsi/virtio-scsi.c:867
> >>> #10 0x0000558712f96812 in virtio_queue_notify_vq (vq=0x558716c0d2a8) at ../hw/virtio/virtio.c:2263
> >>> #11 0x0000558712f99b75 in virtio_queue_host_notifier_read (n=0x558716c0d31c) at ../hw/virtio/virtio.c:3575
> >>> #12 0x000055871323d8b5 in aio_dispatch_handler (ctx=0x55871598d5e0, node=0x558716771000) at ../util/aio-posix.c:372
> >>> #13 0x000055871323d988 in aio_dispatch_ready_handlers (ctx=0x55871598d5e0, ready_list=0x7ff7f288eeb0) at ../util/aio-posix.c:401
> >>
> >>
> >>>> As for why it doesn't progress, blk_co_is_available_entry() uses
> >>>> bdrv_graph_co_rdlock() and can't get it, because the main thread has the
> >>>> write lock. Should be fixed once the AioContext locks are gone, but not
> >>>> sure what should be done to avoid it until then.
> >>>
> >>> Then the nested event loop in blk_is_available() would probably be
> >>> enough to make progress, yes.
> >>>
> >>> Maybe we could actually drop the lock (and immediately reacquire it) in
> >>> AIO_WAIT_WHILE() even if we're in the home thread? That should give the
> >>> main thread a chance to make progress.
> >>
> >> Seems to work :) I haven't run into the issue with the following change
> >> anymore, but I have to say, running into that specific deadlock only
> >> happened every 10-15 tries or so before. Did 30 tests now. But
> >> unfortunately, the stuck IO issue is still there.
> >>
> >>> diff --git a/include/block/aio-wait.h b/include/block/aio-wait.h
> >>> index 5449b6d742..da159501ca 100644
> >>> --- a/include/block/aio-wait.h
> >>> +++ b/include/block/aio-wait.h
> >>> @@ -88,7 +88,13 @@ extern AioWait global_aio_wait;
> >>> smp_mb__after_rmw(); \
> >>> if (ctx_ && in_aio_context_home_thread(ctx_)) { \
> >>> while ((cond)) { \
> >>> + if (unlock && ctx_) { \
> >>> + aio_context_release(ctx_); \
> >>> + } \
> >>> aio_poll(ctx_, true); \
> >>> + if (unlock && ctx_) { \
> >>> + aio_context_acquire(ctx_); \
> >>> + } \
> >>> waited_ = true; \
> >>> } \
> >>> } else { \
> >
> > For reacquiring the lock, I really meant "immediately". Calling
> > aio_poll() without the lock is wrong.
>
> Unfortunately, then it's not enough, because the call to aio_poll() is
> blocking and because the lock is held during that call, the very same
> deadlock can still happen.
Oh, right. What we would need is an aio_poll() that drops the lock
while it blocks, pretty much like the main loop does with the BQL. I
seem to remember that this was an intentional difference, but I don't
remember why.
Paolo, I'm sure you remember?
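Roughly this, inside aio_poll() (hand-wavy sketch based on the frames
visible in the backtrace above, not a patch; the notify_me/adaptive
polling handling around it would of course need care):

    /* ... ctx->notify_me / adaptive polling handling as today ... */

    /* Drop the AioContext lock only around the blocking wait itself,
     * like the main loop drops the BQL around its poll. */
    aio_context_release(ctx);
    ret = ctx->fdmon_ops->wait(ctx, &ready_list, timeout);
    aio_context_acquire(ctx);

    /* Dispatch ready handlers with the lock held again, as before. */
    progress |= aio_dispatch_ready_handlers(ctx, &ready_list);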
> >
> > What does the stuck I/O look like? Is it stuck in the backend, i.e. the
> > device started requests that never complete? Or stuck from the guest
> > perspective, i.e. the device never checks for new requests?
> >
>
> AFAICT, from the guest perspective.
>
> > I don't really have an idea immediately, we'd have to find out where the
> > stuck I/O stops being processed.
> >
>
> I've described it in an earlier mail in this thread:
> https://lists.nongnu.org/archive/html/qemu-devel/2023-10/msg01900.html
>
> Quoting from there:
>
> > After the IO was stuck in the guest, I used bdrv_next_all_states() to
> > iterate over the states and there's only the bdrv_raw and the
> > bdrv_host_device. For both, tracked_requests was empty.
And bs->in_flight and blk->in_flight are 0, too?
Is anything quiesced?
> > What is also very interesting is that the IO isn't always dead
> > immediately. It can be that the fio command still runs with lower speed
> > for a while (sometimes even up to about a minute, but most often about
> > 10-15 seconds or so). During that time, I still can see calls to
> > virtio_scsi_handle_cmd() and blk_aio_write_entry(). Then they suddenly stop.
>
> Noting again that (at least for backup) it happens with both virtio-blk
> and virtio-scsi and with both aio=io_uring and aio=threads. I also tried
> different host kernels 5.15, 6.2 and 6.5 and guest kernels 5.10 and 6.1.
You say "at least for backup". Did you see the bug for other job types,
too?
Kevin
* Re: deadlock when using iothread during backup_clean()
2023-10-19 12:14 ` Kevin Wolf
@ 2023-10-19 13:53 ` Fiona Ebner
2023-10-20 13:52 ` Fiona Ebner
0 siblings, 1 reply; 16+ messages in thread
From: Fiona Ebner @ 2023-10-19 13:53 UTC (permalink / raw)
To: Kevin Wolf
Cc: Vladimir Sementsov-Ogievskiy, Paolo Bonzini, QEMU Developers,
Hanna Reitz, John Snow, open list:Network Block Dev...,
Thomas Lamprecht
On 19.10.23 at 14:14, Kevin Wolf wrote:
> On 18.10.2023 at 11:42, Fiona Ebner wrote:
>> On 17.10.23 at 16:20, Kevin Wolf wrote:
>>> On 17.10.2023 at 15:37, Fiona Ebner wrote:
>>>> On 17.10.23 at 14:12, Kevin Wolf wrote:
>>>>> On 17.10.2023 at 12:18, Fiona Ebner wrote:
>>>>>> I ran into similar issues now with mirror, (both deadlocks and stuck
>>>>>> guest IO at other times), and interestingly, also during job start.
>>>>>>
>>>>>> Also had a backtrace similar to [0] once, so I took a closer look.
>>>>>> Probably was obvious to others already, but for the record:
>>>>>>
>>>>>> 1. the graph is locked by the main thread
>>>>>> 2. the iothread holds the AioContext lock
>>>>>> 3. the main thread waits on the AioContext lock
>>>>>> 4. the iothread waits for coroutine spawned by blk_is_available()
>>>>>
>>>>> Where does this blk_is_available() in the iothread come from? Having it
>>>>> wait without dropping the AioContext lock sounds like something that
>>>>> we'd want to avoid. Ideally, devices using iothreads shouldn't use
>>>>> synchronous requests at all, but I think scsi-disk might have some of
>>>>> them.
>>>>>
>>>>
>>>> It's part of the request handling in virtio-scsi:
>>>>
>>>>> #0 0x00007ff7f5f55136 in __ppoll (fds=0x7ff7e40030c0, nfds=8, timeout=<optimized out>, sigmask=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:42
>>>>> #1 0x00005587132615ab in qemu_poll_ns (fds=0x7ff7e40030c0, nfds=8, timeout=-1) at ../util/qemu-timer.c:339
>>>>> #2 0x000055871323e8b1 in fdmon_poll_wait (ctx=0x55871598d5e0, ready_list=0x7ff7f288ebe0, timeout=-1) at ../util/fdmon-poll.c:79
>>>>> #3 0x000055871323e1ed in aio_poll (ctx=0x55871598d5e0, blocking=true) at ../util/aio-posix.c:670
>>>>> #4 0x0000558713089efa in bdrv_poll_co (s=0x7ff7f288ec90) at /home/febner/repos/qemu/block/block-gen.h:43
>>>>> #5 0x000055871308c362 in blk_is_available (blk=0x55871599e2f0) at block/block-gen.c:1426
>>>>> #6 0x0000558712f6843b in virtio_scsi_ctx_check (s=0x558716c049c0, d=0x55871581cd30) at ../hw/scsi/virtio-scsi.c:290
>>>
>>> Oh... So essentially for an assertion.
>>>
>>> I wonder if the blk_is_available() check introduced in 2a2d69f490c is
>>> even necessary any more, because BlockBackend has its own AioContext
>>> now. And if blk_bs(blk) != NULL isn't what we actually want to check if
>>> the check is necessary, because calling bdrv_is_inserted() doesn't seem
>>> to have been intended. blk_bs() wouldn't have to poll.
>>>
>>
>> Could virtio_scsi_hotunplug() be an issue with removing or modifying
>> the check? There's a call there which sets the blk's AioContext to
>> qemu_get_aio_context(). Or are we sure that the assert in
>> virtio_scsi_ctx_check() can't be reached after that?
>
> I think that would be the kind of bug that the assertion tries to
> catch, because then we would be sending requests to blk from a thread
> that doesn't match its AioContext (which will be allowed soon, but not
> quite yet).
>
> Before resetting the AioContext, virtio_scsi_hotunplug() calls
> qdev_simple_device_unplug_cb(), which unrealizes the SCSI device. This
> calls scsi_qdev_unrealize() -> scsi_device_purge_requests(), which in
> turn drains blk and cancels all pending requests. So there should be
> nothing left that could call into virtio_scsi_ctx_check() any more.
>
> The other argument is that after unrealize, virtio_scsi_device_get()
> would return NULL anyway, so even if a request were still pending, it
> would just fail instead of accessing the unplugged device.
>
Okay, sounds like a way to get around that deadlock issue then :)
(...)
>>>
>>> What does the stuck I/O look like? Is it stuck in the backend, i.e. the
>>> device started requests that never complete? Or stuck from the guest
>>> perspective, i.e. the device never checks for new requests?
>>>
>>
>> AFAICT, from the guest perspective.
>>
>>> I don't really have an idea immediately, we'd have to find out where the
>>> stuck I/O stops being processed.
>>>
>>
>> I've described it in an earlier mail in this thread:
>> https://lists.nongnu.org/archive/html/qemu-devel/2023-10/msg01900.html
>>
>> Quoting from there:
>>
>>> After the IO was stuck in the guest, I used bdrv_next_all_states() to
>>> iterate over the states and there's only the bdrv_raw and the
>>> bdrv_host_device. For both, tracked_requests was empty.
>
> And bs->in_flight and blk->in_flight are 0, too?
>
Yes. And queued_requests in the BlockBackend is also empty.
> Is anything quiesced?
No. quiesce_counter is 0 for both BlockDriverState instances as well as
for the BlockBackend. quiesced_parent is false for both parents (i.e.
child_root for the bdrv_raw and child_of_bds for the bdrv_file (this
time I used VirtIO SCSI, in the quote it was VirtIO block)).
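(For reference, the BlockDriverState side of such an inspection can be
done with a throwaway helper along these lines - just a sketch, not
meant for merging; the BlockBackend fields are private to
block-backend.c and need to be inspected there or via a debugger.)

void dbg_dump_block_graph_state(void)
{
    BlockDriverState *bs = NULL;

    while ((bs = bdrv_next_all_states(bs))) {
        fprintf(stderr, "%s: in_flight=%u quiesce_counter=%d tracked_requests=%s\n",
                bdrv_get_node_name(bs), bs->in_flight, bs->quiesce_counter,
                QLIST_EMPTY(&bs->tracked_requests) ? "empty" : "non-empty");
    }
}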
>>> What is also very interesting is that the IO isn't always dead
>>> immediately. It can be that the fio command still runs with lower speed
>>> for a while (sometimes even up to about a minute, but most often about
>>> 10-15 seconds or so). During that time, I still can see calls to
>>> virtio_scsi_handle_cmd() and blk_aio_write_entry(). Then they suddenly stop.
>>
>> Noting again that (at least for backup) it happens with both virtio-blk
>> and virtio-scsi and with both aio=io_uring and aio=threads. I also tried
>> different host kernels 5.15, 6.2 and 6.5 and guest kernels 5.10 and 6.1.
>
> You say "at least for backup". Did you see the bug for other job types,
> too?
>
Yes, sorry. I meant to say that I only tested canceling backup with all
those configurations. I did run into the stuck guest IO issue in three
different scenarios: canceling backup, canceling mirror and starting
mirror:
>>>>> On 17.10.2023 at 12:18, Fiona Ebner wrote:
>>>>>> I ran into similar issues now with mirror, (both deadlocks and stuck
>>>>>> guest IO at other times), and interestingly, also during job start.
But for mirror, I only used VirtIO SCSI and aio=io_uring and didn't test
other configurations.
Best Regards,
Fiona
* Re: deadlock when using iothread during backup_clean()
2023-10-19 13:53 ` Fiona Ebner
@ 2023-10-20 13:52 ` Fiona Ebner
2023-11-03 13:20 ` Fiona Ebner
0 siblings, 1 reply; 16+ messages in thread
From: Fiona Ebner @ 2023-10-20 13:52 UTC (permalink / raw)
To: Kevin Wolf
Cc: Vladimir Sementsov-Ogievskiy, Paolo Bonzini, QEMU Developers,
Hanna Reitz, John Snow, open list:Network Block Dev...,
Thomas Lamprecht
On 19.10.23 at 15:53, Fiona Ebner wrote:
> On 19.10.23 at 14:14, Kevin Wolf wrote:
>> On 18.10.2023 at 11:42, Fiona Ebner wrote:
>>> On 17.10.23 at 16:20, Kevin Wolf wrote:
>>>> On 17.10.2023 at 15:37, Fiona Ebner wrote:
>>>>> On 17.10.23 at 14:12, Kevin Wolf wrote:
>>>>>> On 17.10.2023 at 12:18, Fiona Ebner wrote:
>>>>>>> I ran into similar issues now with mirror, (both deadlocks and stuck
>>>>>>> guest IO at other times), and interestingly, also during job start.
>>>>>>>
>>>>>>> Also had a backtrace similar to [0] once, so I took a closer look.
>>>>>>> Probably was obvious to others already, but for the record:
>>>>>>>
>>>>>>> 1. the graph is locked by the main thread
>>>>>>> 2. the iothread holds the AioContext lock
>>>>>>> 3. the main thread waits on the AioContext lock
>>>>>>> 4. the iothread waits for coroutine spawned by blk_is_available()
>>>>>>
>>>>>> Where does this blk_is_available() in the iothread come from? Having it
>>>>>> wait without dropping the AioContext lock sounds like something that
>>>>>> we'd want to avoid. Ideally, devices using iothreads shouldn't use
>>>>>> synchronous requests at all, but I think scsi-disk might have some of
>>>>>> them.
>>>>>>
>>>>>
>>>>> It's part of the request handling in virtio-scsi:
>>>>>
>>>>>> #0 0x00007ff7f5f55136 in __ppoll (fds=0x7ff7e40030c0, nfds=8, timeout=<optimized out>, sigmask=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:42
>>>>>> #1 0x00005587132615ab in qemu_poll_ns (fds=0x7ff7e40030c0, nfds=8, timeout=-1) at ../util/qemu-timer.c:339
>>>>>> #2 0x000055871323e8b1 in fdmon_poll_wait (ctx=0x55871598d5e0, ready_list=0x7ff7f288ebe0, timeout=-1) at ../util/fdmon-poll.c:79
>>>>>> #3 0x000055871323e1ed in aio_poll (ctx=0x55871598d5e0, blocking=true) at ../util/aio-posix.c:670
>>>>>> #4 0x0000558713089efa in bdrv_poll_co (s=0x7ff7f288ec90) at /home/febner/repos/qemu/block/block-gen.h:43
>>>>>> #5 0x000055871308c362 in blk_is_available (blk=0x55871599e2f0) at block/block-gen.c:1426
>>>>>> #6 0x0000558712f6843b in virtio_scsi_ctx_check (s=0x558716c049c0, d=0x55871581cd30) at ../hw/scsi/virtio-scsi.c:290
>>>>
>>>> Oh... So essentially for an assertion.
>>>>
>>>> I wonder if the blk_is_available() check introduced in 2a2d69f490c is
>>>> even necessary any more, because BlockBackend has its own AioContext
>>>> now. And if blk_bs(blk) != NULL isn't what we actually want to check if
>>>> the check is necessary, because calling bdrv_is_inserted() doesn't seem
>>>> to have been intended. blk_bs() wouldn't have to poll.
>>>>
>>>
>>> Could virtio_scsi_hotunplug() be an issue with removing or modifying
>>> the check? There's a call there which sets the blk's AioContext to
>>> qemu_get_aio_context(). Or are we sure that the assert in
>>> virtio_scsi_ctx_check() can't be reached after that?
>>
>> I think that would be the kind of bug that the assertion tries to
>> catch, because then we would be sending requests to blk from a thread
>> that doesn't match its AioContext (which will be allowed soon, but not
>> quite yet).
>>
>> Before resetting the AioContext, virtio_scsi_hotunplug() calls
>> qdev_simple_device_unplug_cb(), which unrealizes the SCSI device. This
>> calls scsi_qdev_unrealize() -> scsi_device_purge_requests(), which in
>> turn drains blk and cancels all pending requests. So there should be
>> nothing left that could call into virtio_scsi_ctx_check() any more.
>>
>> The other argument is that after unrealize, virtio_scsi_device_get()
>> would return NULL anyway, so even if a request were still pending, it
>> would just fail instead of accessing the unplugged device.
>>
>
> Okay, sounds like a way to get around that deadlock issue then :)
>
Unfortunately, scsi_dma_command() also has a blk_is_available() call and
I ran into a similar deadlock with that once.
> (...)
>
>>>>
>>>> What does the stuck I/O look like? Is it stuck in the backend, i.e. the
>>>> device started requests that never complete? Or stuck from the guest
>>>> perspective, i.e. the device never checks for new requests?
>>>>
>>>
>>> AFAICT, from the guest perspective.
>>>
>>>> I don't really have an idea immediately, we'd have to find out where the
>>>> stuck I/O stops being processed.
>>>>
>>>
>>> I've described it in an earlier mail in this thread:
>>> https://lists.nongnu.org/archive/html/qemu-devel/2023-10/msg01900.html
>>>
>>> Quoting from there:
>>>
>>>> After the IO was stuck in the guest, I used bdrv_next_all_states() to
>>>> iterate over the states and there's only the bdrv_raw and the
>>>> bdrv_host_device. For both, tracked_requests was empty.
>>
>> And bs->in_flight and blk->in_flight are 0, too?
>>
>
> Yes. And queued_requests in the BlockBackend is also empty.
>
>> Is anything quiesced?
>
> No. quiesce_counter is 0 for both BlockDriverState instances as well as
> for the BlockBackend. quiesced_parent is false for both parents (i.e.
> child_root for the bdrv_raw and child_of_bds for the bdrv_file (this
> time I used VirtIO SCSI, in the quote it was VirtIO block)).
>
>>>> What is also very interesting is that the IO isn't always dead
>>>> immediately. It can be that the fio command still runs with lower speed
>>>> for a while (sometimes even up to about a minute, but most often about
>>>> 10-15 seconds or so). During that time, I still can see calls to
>>>> virtio_scsi_handle_cmd() and blk_aio_write_entry(). Then they suddenly stop.
>>>
>>> Noting again that (at least for backup) it happens with both virtio-blk
>>> and virtio-scsi and with both aio=io_uring and aio=threads. I also tried
>>> different host kernels 5.15, 6.2 and 6.5 and guest kernels 5.10 and 6.1.
>>
>> You say "at least for backup". Did you see the bug for other job types,
>> too?
>>
>
> Yes, sorry. I meant to say that I only tested it for the backup
> canceling with all those configurations. I did run into the stuck guest
> IO issue in three different scenarios: canceling backup, canceling
> mirror and starting mirror:
>
>>>>>> On 17.10.2023 at 12:18, Fiona Ebner wrote:
>>>>>>> I ran into similar issues now with mirror, (both deadlocks and stuck
>>>>>>> guest IO at other times), and interestingly, also during job start.
>
> But for mirror, I only used VirtIO SCSI and aio=io_uring and didn't test
> other configurations.
>
I should also mention that I used drive-mirror and drive-backup. I can
also trigger the issue with blockdev-mirror when canceling, but didn't
run into it when starting with blockdev-mirror yet.
And I found that with drive-mirror, the issue during starting seems to
manifest with the bdrv_open() call. Adding a return before it, the guest
IO didn't get stuck in my testing, but adding a return after it, it can
get stuck. I'll try to see if I can further narrow it down next week,
but maybe that's already a useful hint?
> diff --git a/blockdev.c b/blockdev.c
> index a01c62596b..156915d15d 100644
> --- a/blockdev.c
> +++ b/blockdev.c
> @@ -3126,6 +3126,8 @@ void qmp_drive_mirror(DriveMirror *arg, Error **errp)
> }
> aio_context_release(aio_context);
>
> + // return; // GOOD
> +
> /* Mirroring takes care of copy-on-write using the source's backing
> * file.
> */
> @@ -3133,6 +3135,8 @@ void qmp_drive_mirror(DriveMirror *arg, Error **errp)
> target_bs = bdrv_open(arg->target, NULL, options, flags, errp);
> aio_context_release(qemu_get_aio_context());
>
> + // return; // BAD
> +
> if (!target_bs) {
> return;
> }
Best Regards,
Fiona
* Re: deadlock when using iothread during backup_clean()
2023-10-20 13:52 ` Fiona Ebner
@ 2023-11-03 13:20 ` Fiona Ebner
0 siblings, 0 replies; 16+ messages in thread
From: Fiona Ebner @ 2023-11-03 13:20 UTC (permalink / raw)
To: Kevin Wolf
Cc: Vladimir Sementsov-Ogievskiy, Paolo Bonzini, QEMU Developers,
Hanna Reitz, John Snow, open list:Network Block Dev...,
Thomas Lamprecht
On 20.10.23 at 15:52, Fiona Ebner wrote:
> And I found that with drive-mirror, the issue during starting seems to
> manifest with the bdrv_open() call. Adding a return before it, the guest
> IO didn't get stuck in my testing, but adding a return after it, it can
> get stuck. I'll try to see if I can further narrow it down next week,
> but maybe that's already a useful hint?
>
In the end, I was able to find a reproducer that just does draining and
bisected the issue (doesn't seem related to the graph lock after all). I
replied there, to avoid all the overhead from this thread:
https://lists.nongnu.org/archive/html/qemu-devel/2023-11/msg00681.html
Best Regards,
Fiona
Thread overview (16 messages):
2023-09-05 10:01 deadlock when using iothread during backup_clean() Fiona Ebner
2023-09-05 11:25 ` Fiona Ebner
2023-10-04 16:51 ` Vladimir Sementsov-Ogievskiy
2023-09-05 11:42 ` Paolo Bonzini
2023-09-28 8:06 ` Fiona Ebner
2023-10-04 17:08 ` Vladimir Sementsov-Ogievskiy
2023-10-06 12:18 ` Fiona Ebner
2023-10-17 10:18 ` Fiona Ebner
2023-10-17 12:12 ` Kevin Wolf
2023-10-17 13:37 ` Fiona Ebner
2023-10-17 14:20 ` Kevin Wolf
2023-10-18 9:42 ` Fiona Ebner
2023-10-19 12:14 ` Kevin Wolf
2023-10-19 13:53 ` Fiona Ebner
2023-10-20 13:52 ` Fiona Ebner
2023-11-03 13:20 ` Fiona Ebner