qemu-devel.nongnu.org archive mirror
* Deadlock with SATA CD I/O and eject
@ 2023-09-18 17:28 John Levon
  2023-09-19 10:54 ` Kevin Wolf
  0 siblings, 1 reply; 5+ messages in thread
From: John Levon @ 2023-09-18 17:28 UTC (permalink / raw)
  To: Kevin Wolf, qemu-block, qemu-devel


Observed with base of qemu 6.2.0, but from code inspection it looks to me like
it's still current in upstream master. Apologies if I have missed a fix in this
area.

Symptom: run a UEFI-booted SATA CD Windows installer. When it hits the "Loading
files..." screen, run an eject, e.g.:

virsh qemu-monitor-command 64c6e190-ea7f-49e2-b2d5-6ba1814b00ae '{"execute":"eject", "arguments": { "id": "sata0-0-0" } }'

qemu will get stuck like so:

(gdb) bt
#0  0x00007f8ba4b16036 in ppoll () from /lib64/libc.so.6
#1  0x0000561813c48ed5 in ppoll (__ss=0x0, __timeout=0x7ffcbd981a70, __nfds=<optimized out>, __fds=<optimized out>) at /usr/include/bits/poll2.h:62
#2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, timeout=timeout@entry=999896128) at ../util/qemu-timer.c:348
#3  0x0000561813c29be9 in fdmon_poll_wait (ctx=0x56181516e070, ready_list=0x7ffcbd981af0, timeout=999896128) at ../util/fdmon-poll.c:80
#4  0x0000561813c297e1 in aio_poll (ctx=ctx@entry=0x56181516e070, blocking=blocking@entry=true) at ../util/aio-posix.c:607
#5  0x0000561813ae2fad in bdrv_do_drained_begin (poll=true, ignore_bds_parents=false, parent=0x0, recursive=false, bs=0x56181533fcc0) at ../block/io.c:483
#6  bdrv_do_drained_begin (bs=0x56181533fcc0, recursive=<optimized out>, parent=0x0, ignore_bds_parents=<optimized out>, poll=<optimized out>) at ../block/io.c:446
#7  0x0000561813ad9982 in blk_drain (blk=0x5618161c1f10) at ../block/block-backend.c:1741
#8  0x0000561813ad9b8c in blk_remove_bs (blk=blk@entry=0x5618161c1f10) at ../block/block-backend.c:852
#9  0x000056181382b8ab in blockdev_remove_medium (has_device=<optimized out>, device=<optimized out>, has_id=<optimized out>, id=<optimized out>, errp=0x7ffcbd981c78) at ../block/qapi-sysemu.c:232
#10 0x000056181382bfb1 in qmp_eject (has_device=<optimized out>, device=0x0, has_id=<optimized out>, id=0x561815e6efe0 "sata0-0-0", has_force=<optimized out>, force=<optimized out>, errp=0x7ffcbd981c78) at ../block/qapi-sysemu.c:45

We are stuck forever here:

 351 static void bdrv_do_drained_begin(BlockDriverState *bs, BdrvChild *parent,
 352                                   bool poll)
...
 380     if (poll) {
 381         BDRV_POLL_WHILE(bs, bdrv_drain_poll_top_level(bs, parent));
 382     }

This is because the blk root's ->in_flight is 1, as tested by the poll
condition in blk_root_drained_poll().


Our blk->in_flight user is stuck here:

1298 static void coroutine_fn blk_wait_while_drained(BlockBackend *blk)
...
1310         blk_dec_in_flight(blk);
1311         qemu_co_queue_wait(&blk->queued_requests, &blk->queued_requests_lock);
1312         blk_inc_in_flight(blk);

Note that before entering this stanza, blk->in_flight was 2. This turns out to
be due to the ide atapi code. In particular, the guest VM is generating lots of
read I/O. The "first IO" arrives into blk via:

cd_read_sector()->ide_buffered_readv()->blk_aio_preadv()

This initial IO completes via:

blk_aio_read_entry()->blk_aio_complete()

1560 static void blk_aio_complete(BlkAioEmAIOCB *acb)
1561 {
1562     if (acb->has_returned) {
1563         acb->common.cb(acb->common.opaque, acb->rwco.ret);
1564         blk_dec_in_flight(acb->rwco.blk);
1565         qemu_aio_unref(acb);
1566     }
1567 }

Line 1564 is what would move blk->in_flight down to zero, but it is never
reached. This is because of what happens at line 1563:

acb->common.cb()->cd_read_sector_cb()->ide_atapi_cmd_reply_end()->cd_read_sector_sync()->blk_pread()

That is, the IO callback in the atapi code itself triggers more (synchronous) IO.

In the meantime, we start processing the blk_drain() code, so by the time this
blk_pread() actually gets handled, quiesce is set, and we get stuck in the
blk_wait_while_drained().

I don't know the qemu block stack well enough to propose an actual fix.

Experimentally, waiting for ->in_flight to drop to zero *before* we quiesce in
blk_remove_bs(), via an AIO_WAIT_WHILE(), avoids the symptom, but I'm pretty
sure that's just a band-aid rather than a fix for the underlying deadlock.

Any suggestions/clues/thoughts?

thanks
john


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Deadlock with SATA CD I/O and eject
  2023-09-18 17:28 Deadlock with SATA CD I/O and eject John Levon
@ 2023-09-19 10:54 ` Kevin Wolf
  2023-09-19 11:23   ` Hanna Czenczek
  2023-09-19 12:57   ` John Levon
  0 siblings, 2 replies; 5+ messages in thread
From: Kevin Wolf @ 2023-09-19 10:54 UTC (permalink / raw)
  To: John Levon
  Cc: qemu-block, qemu-devel, Hanna Czenczek, Fiona Ebner, John Snow,
	Paolo Bonzini

On 18.09.2023 at 19:28, John Levon wrote:
> 
> Observed with base of qemu 6.2.0, but from code inspection it looks to me like
> it's still current in upstream master. Apologies if I have missed a fix in this
> area.
> 
> Symptom: run a UEFI-booted SATA CD Windows installer. When it hits "Loading
> files.." screen, run an eject e.g.
> 
> virsh qemu-monitor-command 64c6e190-ea7f-49e2-b2d5-6ba1814b00ae '{"execute":"eject", "arguments": { "id": "sata0-0-0" } }'
> 
> qemu will get stuck like so:
> 
> gdb) bt
> #0  0x00007f8ba4b16036 in ppoll () from /lib64/libc.so.6
> #1  0x0000561813c48ed5 in ppoll (__ss=0x0, __timeout=0x7ffcbd981a70, __nfds=<optimized out>, __fds=<optimized out>) at /usr/include/bits/poll2.h:62
> #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, timeout=timeout@entry=999896128) at ../util/qemu-timer.c:348
> #3  0x0000561813c29be9 in fdmon_poll_wait (ctx=0x56181516e070, ready_list=0x7ffcbd981af0, timeout=999896128) at ../util/fdmon-poll.c:80
> #4  0x0000561813c297e1 in aio_poll (ctx=ctx@entry=0x56181516e070, blocking=blocking@entry=true) at ../util/aio-posix.c:607
> #5  0x0000561813ae2fad in bdrv_do_drained_begin (poll=true, ignore_bds_parents=false, parent=0x0, recursive=false, bs=0x56181533fcc0) at ../block/io.c:483
> #6  bdrv_do_drained_begin (bs=0x56181533fcc0, recursive=<optimized out>, parent=0x0, ignore_bds_parents=<optimized out>, poll=<optimized out>) at ../block/io.c:446
> #7  0x0000561813ad9982 in blk_drain (blk=0x5618161c1f10) at ../block/block-backend.c:1741
> #8  0x0000561813ad9b8c in blk_remove_bs (blk=blk@entry=0x5618161c1f10) at ../block/block-backend.c:852
> #9  0x000056181382b8ab in blockdev_remove_medium (has_device=<optimized out>, device=<optimized out>, has_id=<optimized out>, id=<optimized out>, errp=0x7ffcbd981c78) at ../block/qapi-sysemu.c:232
> #10 0x000056181382bfb1 in qmp_eject (has_device=<optimized out>, device=0x0, has_id=<optimized out>, id=0x561815e6efe0 "sata0-0-0", has_force=<optimized out>, force=<optimized out>, errp=0x7ffcbd981c78) at ../block/qapi-sysemu.c:45
> 
> We are stuck forever here:
> 
>  351 static void bdrv_do_drained_begin(BlockDriverState *bs, BdrvChild *parent,
>  352                                   bool poll)
> ...
>  380     if (poll) {
>  381         BDRV_POLL_WHILE(bs, bdrv_drain_poll_top_level(bs, parent));
>  382     }
> 
> Because the blk root's ->in_flight is 1, as tested by the condition
> blk_root_drained_poll().
> 
> 
> Our blk->in_flight user is stuck here:
> 
> 1298 static void coroutine_fn blk_wait_while_drained(BlockBackend *blk)
> ...
> 1310         blk_dec_in_flight(blk);
> 1311         qemu_co_queue_wait(&blk->queued_requests, &blk->queued_requests_lock);
> 1312         blk_inc_in_flight(blk);
> 
> Note that before entering this stanza, blk->in_flight was 2. This turns out to
> be due to the ide atapi code. In particular, the guest VM is generating lots of
> read I/O. The "first IO" arrives into blk via:
> 
> cd_read_sector()->ide_buffered_readv()->blk_aio_preadv()
> 
> This initial IO completes:
> 
> blk_aio_read_entry()->blk_aio_complete()
> 
> 1560 static void blk_aio_complete(BlkAioEmAIOCB *acb)
> 1561 {
> 1562     if (acb->has_returned) {
> 1563         acb->common.cb(acb->common.opaque, acb->rwco.ret);
> 1564         blk_dec_in_flight(acb->rwco.blk);
> 1565         qemu_aio_unref(acb);
> 1566     }
> 1567 }
> 
> Line 1564 is what we need to move blk->in_flight down to zero, but that is never
> reached! This is because of what happens at :1563
> 
> acm->common.cb()->cd_read_sector_cb()->ide_atapi_cmd_reply_end()->cd_read_sector_sync()->blk_pread()
> 
> That is, the IO callback in the atapi code itself triggers more - synchronous - IO.
> 
> In the meantime, we start processing the blk_drain() code, so by the time this
> blk_pread() actually gets handled, quiesce is set, and we get stuck in the
> blk_wait_while_drained().
> 
> I don't know the qemu block stack well enough to propose an actual fix.
> 
> Experimentally, waiting for ->in_flight to drop to zero *before* we quiesce in
> blk_remove_bs() via an AIO_WAIT_WHILE() avoids the symptom, but I'm pretty sure
> that's just a band-aid instead of fixing the deadlock.
> 
> Any suggestions/clues/thoughts?

Related discussion: https://lists.gnu.org/archive/html/qemu-block/2023-03/msg00284.html

Actually, it seems we never fixed that problem either?

Back then I suggested that blk_drain*() should disable request queuing
because its callers don't want to quiesce the BlockBackend, but rather
get their own requests completed. I think this approach would solve this
one as well.

Your experiment with moving the queuing later is basically what Paolo
suggested, though I'd still argue it's not the right thing to do because
while it seems to mostly work with both use cases of drain (give me
exclusive access vs. wait for my requests to complete), it's not optimal
for either one.

Kevin




* Re: Deadlock with SATA CD I/O and eject
  2023-09-19 10:54 ` Kevin Wolf
@ 2023-09-19 11:23   ` Hanna Czenczek
  2023-09-19 12:57   ` John Levon
  1 sibling, 0 replies; 5+ messages in thread
From: Hanna Czenczek @ 2023-09-19 11:23 UTC (permalink / raw)
  To: Kevin Wolf, John Levon
  Cc: qemu-block, qemu-devel, Fiona Ebner, John Snow, Paolo Bonzini

On 19.09.23 12:54, Kevin Wolf wrote:
> Am 18.09.2023 um 19:28 hat John Levon geschrieben:
>> Observed with base of qemu 6.2.0, but from code inspection it looks to me like
>> it's still current in upstream master. Apologies if I have missed a fix in this
>> area.
>>
>> Symptom: run a UEFI-booted SATA CD Windows installer. When it hits "Loading
>> files.." screen, run an eject e.g.
>>
>> virsh qemu-monitor-command 64c6e190-ea7f-49e2-b2d5-6ba1814b00ae '{"execute":"eject", "arguments": { "id": "sata0-0-0" } }'
>>
>> qemu will get stuck like so:
>>
>> gdb) bt
>> #0  0x00007f8ba4b16036 in ppoll () from /lib64/libc.so.6
>> #1  0x0000561813c48ed5 in ppoll (__ss=0x0, __timeout=0x7ffcbd981a70, __nfds=<optimized out>, __fds=<optimized out>) at /usr/include/bits/poll2.h:62
>> #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, timeout=timeout@entry=999896128) at ../util/qemu-timer.c:348
>> #3  0x0000561813c29be9 in fdmon_poll_wait (ctx=0x56181516e070, ready_list=0x7ffcbd981af0, timeout=999896128) at ../util/fdmon-poll.c:80
>> #4  0x0000561813c297e1 in aio_poll (ctx=ctx@entry=0x56181516e070, blocking=blocking@entry=true) at ../util/aio-posix.c:607
>> #5  0x0000561813ae2fad in bdrv_do_drained_begin (poll=true, ignore_bds_parents=false, parent=0x0, recursive=false, bs=0x56181533fcc0) at ../block/io.c:483
>> #6  bdrv_do_drained_begin (bs=0x56181533fcc0, recursive=<optimized out>, parent=0x0, ignore_bds_parents=<optimized out>, poll=<optimized out>) at ../block/io.c:446
>> #7  0x0000561813ad9982 in blk_drain (blk=0x5618161c1f10) at ../block/block-backend.c:1741
>> #8  0x0000561813ad9b8c in blk_remove_bs (blk=blk@entry=0x5618161c1f10) at ../block/block-backend.c:852
>> #9  0x000056181382b8ab in blockdev_remove_medium (has_device=<optimized out>, device=<optimized out>, has_id=<optimized out>, id=<optimized out>, errp=0x7ffcbd981c78) at ../block/qapi-sysemu.c:232
>> #10 0x000056181382bfb1 in qmp_eject (has_device=<optimized out>, device=0x0, has_id=<optimized out>, id=0x561815e6efe0 "sata0-0-0", has_force=<optimized out>, force=<optimized out>, errp=0x7ffcbd981c78) at ../block/qapi-sysemu.c:45
>>
>> We are stuck forever here:
>>
>>   351 static void bdrv_do_drained_begin(BlockDriverState *bs, BdrvChild *parent,
>>   352                                   bool poll)
>> ...
>>   380     if (poll) {
>>   381         BDRV_POLL_WHILE(bs, bdrv_drain_poll_top_level(bs, parent));
>>   382     }
>>
>> Because the blk root's ->in_flight is 1, as tested by the condition
>> blk_root_drained_poll().
>>
>>
>> Our blk->in_flight user is stuck here:
>>
>> 1298 static void coroutine_fn blk_wait_while_drained(BlockBackend *blk)
>> ...
>> 1310         blk_dec_in_flight(blk);
>> 1311         qemu_co_queue_wait(&blk->queued_requests, &blk->queued_requests_lock);
>> 1312         blk_inc_in_flight(blk);
>>
>> Note that before entering this stanza, blk->in_flight was 2. This turns out to
>> be due to the ide atapi code. In particular, the guest VM is generating lots of
>> read I/O. The "first IO" arrives into blk via:
>>
>> cd_read_sector()->ide_buffered_readv()->blk_aio_preadv()
>>
>> This initial IO completes:
>>
>> blk_aio_read_entry()->blk_aio_complete()
>>
>> 1560 static void blk_aio_complete(BlkAioEmAIOCB *acb)
>> 1561 {
>> 1562     if (acb->has_returned) {
>> 1563         acb->common.cb(acb->common.opaque, acb->rwco.ret);
>> 1564         blk_dec_in_flight(acb->rwco.blk);
>> 1565         qemu_aio_unref(acb);
>> 1566     }
>> 1567 }
>>
>> Line 1564 is what we need to move blk->in_flight down to zero, but that is never
>> reached! This is because of what happens at :1563
>>
>> acm->common.cb()->cd_read_sector_cb()->ide_atapi_cmd_reply_end()->cd_read_sector_sync()->blk_pread()
>>
>> That is, the IO callback in the atapi code itself triggers more - synchronous - IO.
>>
>> In the meantime, we start processing the blk_drain() code, so by the time this
>> blk_pread() actually gets handled, quiesce is set, and we get stuck in the
>> blk_wait_while_drained().
>>
>> I don't know the qemu block stack well enough to propose an actual fix.
>>
>> Experimentally, waiting for ->in_flight to drop to zero *before* we quiesce in
>> blk_remove_bs() via an AIO_WAIT_WHILE() avoids the symptom, but I'm pretty sure
>> that's just a band-aid instead of fixing the deadlock.
>>
>> Any suggestions/clues/thoughts?
> Related discussion: https://lists.gnu.org/archive/html/qemu-block/2023-03/msg00284.html
>
> Actually, it seems we never fixed that problem either?

We didn’t, from my POV mainly because everyone had different suggestions 
and none of them looked nice enough that it was clear which one to pursue.

I.e., there are also
https://lists.gnu.org/archive/html/qemu-block/2023-03/msg00709.html and
https://lists.gnu.org/archive/html/qemu-block/2023-04/msg00118.html.

> Back then I suggested that blk_drain*() should disable request queuing
> because its callers don't want to quiesce the BlockBackend, but rather
> get their own requests completed. I think this approach would solve this
> one as well.
>
> Your experiment with moving the queuing later is basically what Paolo
> suggested, though I'd still argue it's not the right thing to do because
> while it seems to mostly work with both use cases of drain (give me
> exclusive access vs. wait for my requests to complete), it's not optimal
> for either one.
>
> Kevin
>




* Re: Deadlock with SATA CD I/O and eject
  2023-09-19 10:54 ` Kevin Wolf
  2023-09-19 11:23   ` Hanna Czenczek
@ 2023-09-19 12:57   ` John Levon
  2023-09-19 14:03     ` Kevin Wolf
  1 sibling, 1 reply; 5+ messages in thread
From: John Levon @ 2023-09-19 12:57 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: qemu-block, qemu-devel, Hanna Czenczek, Fiona Ebner, John Snow,
	Paolo Bonzini

On Tue, Sep 19, 2023 at 12:54:59PM +0200, Kevin Wolf wrote:

> > In the meantime, we start processing the blk_drain() code, so by the time this
> > blk_pread() actually gets handled, quiesce is set, and we get stuck in the
> > blk_wait_while_drained().
> > 
> > I don't know the qemu block stack well enough to propose an actual fix.
> > 
> > Experimentally, waiting for ->in_flight to drop to zero *before* we quiesce in
> > blk_remove_bs() via an AIO_WAIT_WHILE() avoids the symptom, but I'm pretty sure
> > that's just a band-aid instead of fixing the deadlock.
> 
> Related discussion: https://lists.gnu.org/archive/html/qemu-block/2023-03/msg00284.html
> 
> Actually, it seems we never fixed that problem either?
> 
> Back then I suggested that blk_drain*() should disable request queuing
> because its callers don't want to quiesce the BlockBackend, but rather
> get their own requests completed. I think this approach would solve this
> one as well.

In this case though, it's not their own requests, right? We have incoming I/O
from the guest, and the eject is a separate operation.

So why would it be OK to disable queuing and have ongoing I/Os (i.e.
blk->in_flight > 0) racing against the block removal?

> Your experiment with moving the queuing later is basically what Paolo

I think it's flakier than that, isn't it? It's just clearing out the
*current* pending queue, but nothing would stop new I/Os from being started
before we dropped into the poll for blk->in_flight to be zero again. In my case,
this just happens to work because the prior tray-open notification has stopped
new I/O from being issued, right?

thanks
john



* Re: Deadlock with SATA CD I/O and eject
  2023-09-19 12:57   ` John Levon
@ 2023-09-19 14:03     ` Kevin Wolf
  0 siblings, 0 replies; 5+ messages in thread
From: Kevin Wolf @ 2023-09-19 14:03 UTC (permalink / raw)
  To: John Levon
  Cc: qemu-block, qemu-devel, Hanna Czenczek, Fiona Ebner, John Snow,
	Paolo Bonzini

On 19.09.2023 at 14:57, John Levon wrote:
> On Tue, Sep 19, 2023 at 12:54:59PM +0200, Kevin Wolf wrote:
> 
> > > In the meantime, we start processing the blk_drain() code, so by the time this
> > > blk_pread() actually gets handled, quiesce is set, and we get stuck in the
> > > blk_wait_while_drained().
> > > 
> > > I don't know the qemu block stack well enough to propose an actual fix.
> > > 
> > > Experimentally, waiting for ->in_flight to drop to zero *before* we quiesce in
> > > blk_remove_bs() via an AIO_WAIT_WHILE() avoids the symptom, but I'm pretty sure
> > > that's just a band-aid instead of fixing the deadlock.
> > 
> > Related discussion: https://lists.gnu.org/archive/html/qemu-block/2023-03/msg00284.html
> > 
> > Actually, it seems we never fixed that problem either?
> > 
> > Back then I suggested that blk_drain*() should disable request queuing
> > because its callers don't want to quiesce the BlockBackend, but rather
> > get their own requests completed. I think this approach would solve this
> > one as well.
> 
> In this case though, it's not their own requests right? We have incoming I/O
> from the guest and the eject is a separate operation.

It's not the same code path, but we're operating in the context of the
IDE device (or more specifically, its BlockBackend), so in that sense it
is "its own requests".

Anyway, the main difference is between just stopping activity (even if
it's in the middle of a higher-level operation of the device) and
getting requests fully completed. We want the latter here.

> So why it would be OK to disable queuing and have ongoing I/Os (i.e.
> blk->in_flight > 0) racing against the block remove?

With eject, the case is simple for IDE: We hold the BQL, so the guest
won't be able to submit new requests anyway.

In more complicated cases like virtio-blk, bdrv_drained_begin() and
friends take care to stop new requests from coming in from the guest by
not running notifiers while the device is drained.

We just need to take care of the requests that have already started.

> > Your experiment with moving the queuing later is basically what Paolo
> 
> I think it's much more flaky than that, isn't it? It's just clearing out the
> *current* pending queue, but nothing would stop new I/Os from being started
> before we dropped into the poll for blk->in_flight to be zero again. In my case,
> this just happens to work because the prior tray open notification has stopped
> new I/O being filed, right?

I think it's the same as above, holding the BQL and calling drain would
already take care of that.

Kevin




end of thread, other threads:[~2023-09-19 14:05 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-09-18 17:28 Deadlock with SATA CD I/O and eject John Levon
2023-09-19 10:54 ` Kevin Wolf
2023-09-19 11:23   ` Hanna Czenczek
2023-09-19 12:57   ` John Levon
2023-09-19 14:03     ` Kevin Wolf

This is a public inbox; see mirroring instructions
for how to clone and mirror all data and code used for this inbox,
as well as URLs for NNTP newsgroup(s).