* [Qemu-devel] question: a dead loop in qemu when do blockJobAbort and vm suspend coinstantaneously
@ 2018-06-09 9:10 l00284672
2018-06-10 7:43 ` Fam Zheng
0 siblings, 1 reply; 7+ messages in thread
From: l00284672 @ 2018-06-09 9:10 UTC (permalink / raw)
To: stefanha, kwolf, famz; +Cc: qemu-devel
Hi, I found a dead loop in qemu when blockJobAbort and a vm suspend happen
simultaneously.
The qemu bt is below:
#0 0x00007ff58b53af1f in ppoll () from /lib64/libc.so.6
#1 0x00000000007fdbd9 in ppoll (__ss=0x0, __timeout=0x7ffcf7055390,
__nfds=<optimized out>, __fds=<optimized out>) at
/usr/include/bits/poll2.h:77
#2 qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
timeout=timeout@entry=0) at util/qemu-timer.c:334
#3 0x00000000007ff83a in aio_poll (ctx=ctx@entry=0x269d800,
blocking=blocking@entry=false) at util/aio-posix.c:629
#4 0x0000000000776e91 in bdrv_drain_recurse (bs=bs@entry=0x26d9cb0) at
block/io.c:198
#5 0x0000000000776ef2 in bdrv_drain_recurse (bs=bs@entry=0x3665990) at
block/io.c:215
#6 0x00000000007774b8 in bdrv_do_drained_begin (bs=0x3665990,
recursive=<optimized out>, parent=0x0) at block/io.c:291
#7 0x000000000076a79e in blk_drain (blk=0x2780fc0) at
block/block-backend.c:1586
#8 0x000000000072d2a9 in block_job_drain (job=0x29df040) at blockjob.c:123
#9 0x000000000072d228 in block_job_detach_aio_context
(opaque=0x29df040) at blockjob.c:139
#10 0x00000000007298b1 in bdrv_detach_aio_context
(bs=bs@entry=0x3665990) at block.c:4885
#11 0x0000000000729a46 in bdrv_set_aio_context (bs=0x3665990,
new_context=0x268e140) at block.c:4946
#12 0x0000000000499743 in virtio_blk_data_plane_stop (vdev=<optimized
out>) at
/mnt/sdb/lzg/code/shequ_code/5_29/qemu/hw/block/dataplane/virtio-blk.c:285
#13 0x00000000006bce30 in virtio_bus_stop_ioeventfd (bus=0x3de5378) at
hw/virtio/virtio-bus.c:246
#14 0x00000000004c654d in virtio_vmstate_change (opaque=0x3de53f0,
running=<optimized out>, state=<optimized out>)
at /mnt/sdb/lzg/code/shequ_code/5_29/qemu/hw/virtio/virtio.c:2222
#15 0x0000000000561b52 in vm_state_notify (running=running@entry=0,
state=state@entry=RUN_STATE_PAUSED) at vl.c:1514
#16 0x000000000045d67a in do_vm_stop
(state=state@entry=RUN_STATE_PAUSED, send_stop=send_stop@entry=true)
at /mnt/sdb/lzg/code/shequ_code/5_29/qemu/cpus.c:1012
#17 0x000000000045dafd in vm_stop (state=state@entry=RUN_STATE_PAUSED)
at /mnt/sdb/lzg/code/shequ_code/5_29/qemu/cpus.c:2035
#18 0x000000000057301b in qmp_stop (errp=errp@entry=0x7ffcf70556f0) at
qmp.c:106
#19 0x000000000056bf7a in qmp_marshal_stop (args=<optimized out>,
ret=<optimized out>, errp=0x7ffcf7055738) at qapi/qapi-commands-misc.c:784
#20 0x00000000007f2d27 in do_qmp_dispatch (errp=0x7ffcf7055730,
request=0x3e121e0, cmds=<optimized out>) at qapi/qmp-dispatch.c:119
#21 qmp_dispatch (cmds=<optimized out>, request=request@entry=0x26f2800)
at qapi/qmp-dispatch.c:168
#22 0x00000000004655be in monitor_qmp_dispatch_one
(req_obj=req_obj@entry=0x39abff0) at
/mnt/sdb/lzg/code/shequ_code/5_29/qemu/monitor.c:4088
#23 0x0000000000465894 in monitor_qmp_bh_dispatcher (data=<optimized
out>) at /mnt/sdb/lzg/code/shequ_code/5_29/qemu/monitor.c:4146
#24 0x00000000007fc571 in aio_bh_call (bh=0x26de7e0) at util/async.c:90
#25 aio_bh_poll (ctx=ctx@entry=0x268dd50) at util/async.c:118
#26 0x00000000007ff6f0 in aio_dispatch (ctx=0x268dd50) at
util/aio-posix.c:436
#27 0x00000000007fc44e in aio_ctx_dispatch (source=<optimized out>,
callback=<optimized out>, user_data=<optimized out>) at util/async.c:261
#28 0x00007ff58bc7c99a in g_main_context_dispatch () from
/lib64/libglib-2.0.so.0
#29 0x00000000007fea3a in glib_pollfds_poll () at util/main-loop.c:215
#30 os_host_main_loop_wait (timeout=<optimized out>) at util/main-loop.c:238
#31 main_loop_wait (nonblocking=nonblocking@entry=0) at util/main-loop.c:497
#32 0x0000000000561cad in main_loop () at vl.c:1848
#33 0x000000000041995c in main (argc=<optimized out>, argv=<optimized
out>, envp=<optimized out>) at vl.c:4605
The disk is a virtio-blk dataplane disk with a mirror job running. The
dead loop is here:
static void block_job_detach_aio_context(void *opaque)
{
    BlockJob *job = opaque;

    /* In case the job terminates during aio_poll()... */
    job_ref(&job->job);

    job_pause(&job->job);

    while (!job->job.paused && !job_is_completed(&job->job)) {
        job_drain(&job->job);
    }

    job->job.aio_context = NULL;
    job_unref(&job->job);
}
The job has already been deferred to the main loop at this point, but
job_drain() only processes the AioContext of the bs, which has no more work
to do, while the main-loop BH that was scheduled to set the job->completed
flag is never processed, so the loop spins forever.
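To make the hang concrete, here is a minimal sketch (not the actual QEMU code;
the helper name is made up for illustration) of the deferral mechanism: the
completion work is handed to the main loop as a BH, so only polling
qemu_get_aio_context() can ever run it and set job->completed.

static void job_defer_completion_sketch(BlockJob *job, QEMUBHFunc *complete_cb)
{
    /* Mark the job as having handed its completion to the main loop. */
    job->deferred_to_main_loop = true;
    /* The BH runs on the main-loop AioContext, not on the job's IOThread
     * context, so job_drain() in the loop above can never make it run. */
    aio_bh_schedule_oneshot(qemu_get_aio_context(), complete_cb, job);
}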
I have tried several approaches to solve it, but none of them solves it
completely. Do you have any good ideas? Thanks for your reply!
* Re: [Qemu-devel] question: a dead loop in qemu when do blockJobAbort and vm suspend coinstantaneously
2018-06-09 9:10 [Qemu-devel] question: a dead loop in qemu when do blockJobAbort and vm suspend coinstantaneously l00284672
@ 2018-06-10 7:43 ` Fam Zheng
2018-06-11 3:01 ` l00284672
0 siblings, 1 reply; 7+ messages in thread
From: Fam Zheng @ 2018-06-10 7:43 UTC (permalink / raw)
To: l00284672; +Cc: stefanha, kwolf, qemu-devel
On Sat, 06/09 17:10, l00284672 wrote:
> Hi, I found a dead loop in qemu when do blockJobAbort and vm suspend
> coinstantaneously.
>
> The qemu bt is below:
>
> #0 0x00007ff58b53af1f in ppoll () from /lib64/libc.so.6
> #1 0x00000000007fdbd9 in ppoll (__ss=0x0, __timeout=0x7ffcf7055390,
> __nfds=<optimized out>, __fds=<optimized out>) at
> /usr/include/bits/poll2.h:77
> #2 qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
> timeout=timeout@entry=0) at util/qemu-timer.c:334
> #3 0x00000000007ff83a in aio_poll (ctx=ctx@entry=0x269d800,
> blocking=blocking@entry=false) at util/aio-posix.c:629
> #4 0x0000000000776e91 in bdrv_drain_recurse (bs=bs@entry=0x26d9cb0) at
> block/io.c:198
> #5 0x0000000000776ef2 in bdrv_drain_recurse (bs=bs@entry=0x3665990) at
> block/io.c:215
> #6 0x00000000007774b8 in bdrv_do_drained_begin (bs=0x3665990,
> recursive=<optimized out>, parent=0x0) at block/io.c:291
> #7 0x000000000076a79e in blk_drain (blk=0x2780fc0) at
> block/block-backend.c:1586
> #8 0x000000000072d2a9 in block_job_drain (job=0x29df040) at blockjob.c:123
> #9 0x000000000072d228 in block_job_detach_aio_context (opaque=0x29df040) at
> blockjob.c:139
> #10 0x00000000007298b1 in bdrv_detach_aio_context (bs=bs@entry=0x3665990) at
> block.c:4885
> #11 0x0000000000729a46 in bdrv_set_aio_context (bs=0x3665990,
> new_context=0x268e140) at block.c:4946
> #12 0x0000000000499743 in virtio_blk_data_plane_stop (vdev=<optimized out>)
> at
> /mnt/sdb/lzg/code/shequ_code/5_29/qemu/hw/block/dataplane/virtio-blk.c:285
> #13 0x00000000006bce30 in virtio_bus_stop_ioeventfd (bus=0x3de5378) at
> hw/virtio/virtio-bus.c:246
> #14 0x00000000004c654d in virtio_vmstate_change (opaque=0x3de53f0,
> running=<optimized out>, state=<optimized out>)
> at /mnt/sdb/lzg/code/shequ_code/5_29/qemu/hw/virtio/virtio.c:2222
> #15 0x0000000000561b52 in vm_state_notify (running=running@entry=0,
> state=state@entry=RUN_STATE_PAUSED) at vl.c:1514
> #16 0x000000000045d67a in do_vm_stop (state=state@entry=RUN_STATE_PAUSED,
> send_stop=send_stop@entry=true)
> at /mnt/sdb/lzg/code/shequ_code/5_29/qemu/cpus.c:1012
> #17 0x000000000045dafd in vm_stop (state=state@entry=RUN_STATE_PAUSED) at
> /mnt/sdb/lzg/code/shequ_code/5_29/qemu/cpus.c:2035
> #18 0x000000000057301b in qmp_stop (errp=errp@entry=0x7ffcf70556f0) at
> qmp.c:106
> #19 0x000000000056bf7a in qmp_marshal_stop (args=<optimized out>,
> ret=<optimized out>, errp=0x7ffcf7055738) at qapi/qapi-commands-misc.c:784
> #20 0x00000000007f2d27 in do_qmp_dispatch (errp=0x7ffcf7055730,
> request=0x3e121e0, cmds=<optimized out>) at qapi/qmp-dispatch.c:119
> #21 qmp_dispatch (cmds=<optimized out>, request=request@entry=0x26f2800) at
> qapi/qmp-dispatch.c:168
> #22 0x00000000004655be in monitor_qmp_dispatch_one
> (req_obj=req_obj@entry=0x39abff0) at
> /mnt/sdb/lzg/code/shequ_code/5_29/qemu/monitor.c:4088
> #23 0x0000000000465894 in monitor_qmp_bh_dispatcher (data=<optimized out>)
> at /mnt/sdb/lzg/code/shequ_code/5_29/qemu/monitor.c:4146
> #24 0x00000000007fc571 in aio_bh_call (bh=0x26de7e0) at util/async.c:90
> #25 aio_bh_poll (ctx=ctx@entry=0x268dd50) at util/async.c:118
> #26 0x00000000007ff6f0 in aio_dispatch (ctx=0x268dd50) at
> util/aio-posix.c:436
> #27 0x00000000007fc44e in aio_ctx_dispatch (source=<optimized out>,
> callback=<optimized out>, user_data=<optimized out>) at util/async.c:261
> #28 0x00007ff58bc7c99a in g_main_context_dispatch () from
> /lib64/libglib-2.0.so.0
> #29 0x00000000007fea3a in glib_pollfds_poll () at util/main-loop.c:215
> #30 os_host_main_loop_wait (timeout=<optimized out>) at util/main-loop.c:238
> #31 main_loop_wait (nonblocking=nonblocking@entry=0) at util/main-loop.c:497
> #32 0x0000000000561cad in main_loop () at vl.c:1848
> #33 0x000000000041995c in main (argc=<optimized out>, argv=<optimized out>,
> envp=<optimized out>) at vl.c:4605
>
> The disk is a virtio-blk dataplane disk with a mirror job running. The dead
> loop is here:
>
> static void block_job_detach_aio_context(void *opaque)
> {
> BlockJob *job = opaque;
>
> /* In case the job terminates during aio_poll()... */
> job_ref(&job->job);
>
> job_pause(&job->job);
>
> while (!job->job.paused && !job_is_completed(&job->job)) {
> job_drain(&job->job);
> }
>
> job->job.aio_context = NULL;
> job_unref(&job->job);
> }
>
> The job is deferred to main loop now, but the job_drain only processes the
> AIO context of bs which has no more work to do,
>
> while the main loop BH is scheduled for setting the job->completed flag is
> never processed.
In that case, main loop's AioContext should be driven like in job_finish_sync().
Could you try this patch?
diff --git a/blockjob.c b/blockjob.c
index 0306533a2e..72aa82ac2d 100644
--- a/blockjob.c
+++ b/blockjob.c
@@ -135,7 +135,15 @@ static void block_job_detach_aio_context(void *opaque)
 
     job_pause(&job->job);
 
-    while (!job->job.paused && !job_is_completed(&job->job)) {
+
+    while (true) {
+        if (job->job.paused || job_is_completed(&job->job)) {
+            break;
+        }
+        if (job->deferred_to_main_loop) {
+            aio_poll(qemu_get_aio_context(), true);
+            continue;
+        }
         job_drain(&job->job);
     }
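For reference, the wait loop in job_finish_sync() that this imitates looks
roughly like the following (a simplified sketch from memory; the real function
also handles errors and cancellation):

    /* Simplified sketch of the job_finish_sync() wait pattern: drive the
     * main-loop AioContext until the deferred completion BH has run. */
    while (!job_is_completed(job)) {
        aio_poll(qemu_get_aio_context(), true);
    }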
* Re: [Qemu-devel] question: a dead loop in qemu when do blockJobAbort and vm suspend coinstantaneously
2018-06-10 7:43 ` Fam Zheng
@ 2018-06-11 3:01 ` l00284672
2018-06-11 3:31 ` l00284672
0 siblings, 1 reply; 7+ messages in thread
From: l00284672 @ 2018-06-11 3:01 UTC (permalink / raw)
To: Fam Zheng; +Cc: stefanha, kwolf, lizhengui, qemu-devel
Thanks for your reply.

I tried your patch, but it did not work: qemu crashed. The qemu crash
backtrace is below:
(gdb) bt
#0 bdrv_detach_aio_context (bs=bs@entry=0x55a96b79ca30)
#1 0x000055a9688249ae in bdrv_set_aio_context (bs=bs@entry=0x55a96b79ca30,
new_context=new_context@entry=0x55a96b766920)
#2 0x000055a96885f721 in blk_set_aio_context (blk=0x55a96b792820,
new_context=0x55a96b766920)
#3 0x000055a9685ab797 in virtio_blk_data_plane_stop (vdev=<optimized out>)
#4 0x000055a9687bf705 in virtio_bus_stop_ioeventfd (bus=0x55a96cc42220)
#5 0x000055a9685d9d94 in virtio_vmstate_change (opaque=0x55a96cc42290,
running=<optimized out>,
state=<optimized out>)
#6 0x000055a96866e1a2 in vm_state_notify (running=running@entry=0,
state=state@entry=RUN_STATE_PAUSED)
#7 0x000055a96857b4c5 in do_vm_stop (state=RUN_STATE_PAUSED)
#8 vm_stop (state=state@entry=RUN_STATE_PAUSED)
#9 0x000055a96867d52b in qmp_stop (errp=errp@entry=0x7fff4e54a0d8)
#10 0x000055a96867b6ab in qmp_marshal_stop (args=<optimized out>,
ret=<optimized out>, errp=0x7fff4e54a
#11 0x000055a9688c2267 in do_qmp_dispatch (errp=0x7fff4e54a118,
request=0x55a96b7b4740)
#12 qmp_dispatch (request=request@entry=0x55a96b7ae490)
#13 0x000055a96857dd42 in handle_qmp_command (parser=<optimized out>,
tokens=<optimized out>)
#14 0x000055a9688c7534 in json_message_process_token
(lexer=0x55a96b776a68, input=0x55a96b70cae0, type=
x=36, y=91)
#15 0x000055a9688e960b in json_lexer_feed_char
(lexer=lexer@entry=0x55a96b776a68, ch=125 '}',
flush=flush@entry=false)
#16 0x000055a9688e96ce in json_lexer_feed (lexer=0x55a96b776a68,
buffer=<optimized out>, size=<optimize
#17 0x000055a9688c75f9 in json_message_parser_feed (parser=<optimized
out>, buffer=<optimized out>,
#18 0x000055a96857c5fb in monitor_qmp_read (opaque=<optimized out>,
buf=<optimized out>, size=<optimized out>)
#19 0x000055a968667596 in tcp_chr_read (chan=<optimized out>,
cond=<optimized out>, opaque=0x55a96b7748
#20 0x00007f8a9447899a in g_main_context_dispatch () from
/lib64/libglib-2.0.so.0
#21 0x000055a968828c3c in glib_pollfds_poll ()
#22 os_host_main_loop_wait (timeout=<optimized out>)
#23 main_loop_wait (nonblocking=<optimized out>)
#24 0x000055a96854351f in main_loop () at vl.c:2095
#25 main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>)
(gdb) p *bs
$1 = {total_sectors = 94186141054112, open_flags = 1811887680, read_only
= 169, encrypted = 85, valid_k
sg = false, probed = false, copy_on_read = 0, flush_queue = {entries
= {sqh_first = 0x0,
sqh_last = 0x55a96b79ca48}}, active_flush_req = false,
flushed_gen = 68727, drv = 0x0, opaque = 0
aio_context = 0x55a96b778cd0, aio_notifiers = {lh_first = 0x0},
walking_aio_notifiers = false,
filename = "/mnt/sdb/lzg/disk_10G.son", '\000' <repeats 4070 times>,
backing_file = "\000mnt/sdb/lzg/disk_10G.raw", '\000' <repeats 4070
times>,
backing_format = "\000aw", '\000' <repeats 12 times>,
full_open_options = 0x0,
exact_filename = "/mnt/sdb/lzg/disk_10G.son", '\000' <repeats 4070
times>, backing = 0x0, file = 0x0,
before_write_notifiers = {notifiers = {lh_first = 0x0}}, in_flight =
0, serialising_in_flight = 0,
wakeup = false, wr_highest_offset = 35188224, bl = {request_alignment
= 0, max_pdiscard = 0,
pdiscard_alignment = 0, max_pwrite_zeroes = 0,
pwrite_zeroes_alignment = 0, opt_transfer = 0, max_t
min_mem_alignment = 0, opt_mem_alignment = 0, max_iov = 0},
supported_write_flags = 0,
supported_zero_flags = 4, node_name = "#block349", '\000' <repeats 22
times>, node_list = {
tqe_next = 0x55a96b7b14f0, tqe_prev = 0x0}, bs_list = {tqe_next =
0x55a96b7ab240, tqe_prev = 0x0},
monitor_list = {tqe_next = 0x0, tqe_prev = 0x0}, dirty_bitmaps =
{lh_first = 0x0}, refcnt = 0,
tracked_requests = {lh_first = 0x0}, op_blockers = {{lh_first = 0x0}
<repeats 18 times>}, job = 0x0,
inherits_from = 0x0, children = {lh_first = 0x0}, parents = {lh_first
= 0x0}, options = 0x0,
explicit_options = 0x0, detect_zeroes =
BLOCKDEV_DETECT_ZEROES_OPTIONS_OFF, backing_blocker = 0x0,
write_threshold_offset = 0, write_threshold_notifier = {notify = 0x0,
node = {le_next = 0x0, le_prev
io_plugged = 0, quiesce_counter = 0, write_gen = 68727}
(gdb) p *bs->drv
Cannot access memory at address 0x0
From the backtrace we can see that qemu hit a NULL pointer dereference in
bdrv_detach_aio_context(). The code is below:
void bdrv_detach_aio_context(BlockDriverState *bs)
{
    .........

    QLIST_FOREACH_SAFE(baf, &bs->aio_notifiers, list, baf_tmp) {
        if (baf->deleted) {
            bdrv_do_remove_aio_context_notifier(baf);
        } else {
            baf->detach_aio_context(baf->opaque);
        }
    }
    /* Never mind iterating again to check for ->deleted. bdrv_close() will
     * remove remaining aio notifiers if we aren't called again.
     */
    bs->walking_aio_notifiers = false;

    if (bs->drv->bdrv_detach_aio_context) {
        bs->drv->bdrv_detach_aio_context(bs);
    }

    ..................
}
With your patch, mirror_exit() runs inside aio_poll(qemu_get_aio_context(),
true). In mirror_exit(), the top bs is freed by bdrv_unref(), so the
follow-up code ends up with a NULL pointer access.
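One defensive pattern against this kind of use-after-free is to pin the node
across any call that may poll and run the job completion callback; a minimal
sketch, assuming bdrv_ref()/bdrv_unref() manage the BlockDriverState
reference count:

    bdrv_ref(bs);                            /* pin bs so mirror_exit() cannot free it */
    bdrv_set_aio_context(bs, new_context);   /* may poll and run mirror_exit() */
    bdrv_unref(bs);                          /* drop the pin; bs may be freed now */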
On 2018/6/10 15:43, Fam Zheng wrote:
> On Sat, 06/09 17:10, l00284672 wrote:
>> Hi, I found a dead loop in qemu when do blockJobAbort and vm suspend
>> coinstantaneously.
>>
>> The qemu bt is below:
>>
>> #0 0x00007ff58b53af1f in ppoll () from /lib64/libc.so.6
>> #1 0x00000000007fdbd9 in ppoll (__ss=0x0, __timeout=0x7ffcf7055390,
>> __nfds=<optimized out>, __fds=<optimized out>) at
>> /usr/include/bits/poll2.h:77
>> #2 qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
>> timeout=timeout@entry=0) at util/qemu-timer.c:334
>> #3 0x00000000007ff83a in aio_poll (ctx=ctx@entry=0x269d800,
>> blocking=blocking@entry=false) at util/aio-posix.c:629
>> #4 0x0000000000776e91 in bdrv_drain_recurse (bs=bs@entry=0x26d9cb0) at
>> block/io.c:198
>> #5 0x0000000000776ef2 in bdrv_drain_recurse (bs=bs@entry=0x3665990) at
>> block/io.c:215
>> #6 0x00000000007774b8 in bdrv_do_drained_begin (bs=0x3665990,
>> recursive=<optimized out>, parent=0x0) at block/io.c:291
>> #7 0x000000000076a79e in blk_drain (blk=0x2780fc0) at
>> block/block-backend.c:1586
>> #8 0x000000000072d2a9 in block_job_drain (job=0x29df040) at blockjob.c:123
>> #9 0x000000000072d228 in block_job_detach_aio_context (opaque=0x29df040) at
>> blockjob.c:139
>> #10 0x00000000007298b1 in bdrv_detach_aio_context (bs=bs@entry=0x3665990) at
>> block.c:4885
>> #11 0x0000000000729a46 in bdrv_set_aio_context (bs=0x3665990,
>> new_context=0x268e140) at block.c:4946
>> #12 0x0000000000499743 in virtio_blk_data_plane_stop (vdev=<optimized out>)
>> at
>> /mnt/sdb/lzg/code/shequ_code/5_29/qemu/hw/block/dataplane/virtio-blk.c:285
>> #13 0x00000000006bce30 in virtio_bus_stop_ioeventfd (bus=0x3de5378) at
>> hw/virtio/virtio-bus.c:246
>> #14 0x00000000004c654d in virtio_vmstate_change (opaque=0x3de53f0,
>> running=<optimized out>, state=<optimized out>)
>> at /mnt/sdb/lzg/code/shequ_code/5_29/qemu/hw/virtio/virtio.c:2222
>> #15 0x0000000000561b52 in vm_state_notify (running=running@entry=0,
>> state=state@entry=RUN_STATE_PAUSED) at vl.c:1514
>> #16 0x000000000045d67a in do_vm_stop (state=state@entry=RUN_STATE_PAUSED,
>> send_stop=send_stop@entry=true)
>> at /mnt/sdb/lzg/code/shequ_code/5_29/qemu/cpus.c:1012
>> #17 0x000000000045dafd in vm_stop (state=state@entry=RUN_STATE_PAUSED) at
>> /mnt/sdb/lzg/code/shequ_code/5_29/qemu/cpus.c:2035
>> #18 0x000000000057301b in qmp_stop (errp=errp@entry=0x7ffcf70556f0) at
>> qmp.c:106
>> #19 0x000000000056bf7a in qmp_marshal_stop (args=<optimized out>,
>> ret=<optimized out>, errp=0x7ffcf7055738) at qapi/qapi-commands-misc.c:784
>> #20 0x00000000007f2d27 in do_qmp_dispatch (errp=0x7ffcf7055730,
>> request=0x3e121e0, cmds=<optimized out>) at qapi/qmp-dispatch.c:119
>> #21 qmp_dispatch (cmds=<optimized out>, request=request@entry=0x26f2800) at
>> qapi/qmp-dispatch.c:168
>> #22 0x00000000004655be in monitor_qmp_dispatch_one
>> (req_obj=req_obj@entry=0x39abff0) at
>> /mnt/sdb/lzg/code/shequ_code/5_29/qemu/monitor.c:4088
>> #23 0x0000000000465894 in monitor_qmp_bh_dispatcher (data=<optimized out>)
>> at /mnt/sdb/lzg/code/shequ_code/5_29/qemu/monitor.c:4146
>> #24 0x00000000007fc571 in aio_bh_call (bh=0x26de7e0) at util/async.c:90
>> #25 aio_bh_poll (ctx=ctx@entry=0x268dd50) at util/async.c:118
>> #26 0x00000000007ff6f0 in aio_dispatch (ctx=0x268dd50) at
>> util/aio-posix.c:436
>> #27 0x00000000007fc44e in aio_ctx_dispatch (source=<optimized out>,
>> callback=<optimized out>, user_data=<optimized out>) at util/async.c:261
>> #28 0x00007ff58bc7c99a in g_main_context_dispatch () from
>> /lib64/libglib-2.0.so.0
>> #29 0x00000000007fea3a in glib_pollfds_poll () at util/main-loop.c:215
>> #30 os_host_main_loop_wait (timeout=<optimized out>) at util/main-loop.c:238
>> #31 main_loop_wait (nonblocking=nonblocking@entry=0) at util/main-loop.c:497
>> #32 0x0000000000561cad in main_loop () at vl.c:1848
>> #33 0x000000000041995c in main (argc=<optimized out>, argv=<optimized out>,
>> envp=<optimized out>) at vl.c:4605
>>
>> The disk is a virtio-blk dataplane disk with a mirror job running. The dead
>> loop is here:
>>
>> static void block_job_detach_aio_context(void *opaque)
>> {
>> BlockJob *job = opaque;
>>
>> /* In case the job terminates during aio_poll()... */
>> job_ref(&job->job);
>>
>> job_pause(&job->job);
>>
>> while (!job->job.paused && !job_is_completed(&job->job)) {
>> job_drain(&job->job);
>> }
>>
>> job->job.aio_context = NULL;
>> job_unref(&job->job);
>> }
>>
>> The job is deferred to main loop now, but the job_drain only processes the
>> AIO context of bs which has no more work to do,
>>
>> while the main loop BH is scheduled for setting the job->completed flag is
>> never processed.
> In that case, main loop's AioContext should be driven like in job_finish_sync().
> Could you try this patch?
>
> diff --git a/blockjob.c b/blockjob.c
> index 0306533a2e..72aa82ac2d 100644
> --- a/blockjob.c
> +++ b/blockjob.c
> @@ -135,7 +135,15 @@ static void block_job_detach_aio_context(void *opaque)
>
> job_pause(&job->job);
>
> - while (!job->job.paused && !job_is_completed(&job->job)) {
> +
> + while (true) {
> + if (job->job.paused || job_is_completed(&job->job)) {
> + break;
> + }
> + if (job->deferred_to_main_loop) {
> + aio_poll(qemu_get_aio_context(), true);
> + continue;
> + }
> job_drain(&job->job);
> }
>
> .
>
* Re: [Qemu-devel] question: a dead loop in qemu when do blockJobAbort and vm suspend coinstantaneously
2018-06-11 3:01 ` l00284672
@ 2018-06-11 3:31 ` l00284672
2018-06-12 1:19 ` l00284672
2018-06-12 1:45 ` Fam Zheng
0 siblings, 2 replies; 7+ messages in thread
From: l00284672 @ 2018-06-11 3:31 UTC (permalink / raw)
To: Fam Zheng; +Cc: stefanha, kwolf, qemu-devel
I tried your patch with my modification below, which solves this problem.
void blk_set_aio_context(BlockBackend *blk, AioContext *new_context)
{
    BlockDriverState *bs = blk_bs(blk);
    ThrottleGroupMember *tgm = &blk->public.throttle_group_member;

    if (bs) {
        if (tgm->throttle_state) {
            bdrv_drained_begin(bs);
            throttle_group_detach_aio_context(tgm);
            throttle_group_attach_aio_context(tgm, new_context);
            bdrv_drained_end(bs);
        }
        bdrv_ref(bs);
        bdrv_set_aio_context(bs, new_context);
        bdrv_unref(bs);
    }
}
I added bdrv_ref() before bdrv_set_aio_context() to keep bs from being freed
in mirror_exit(). Do you agree with my modification?
On 2018/6/11 11:01, l00284672 wrote:
>
> Thanks for your reply.
>
> I tried your patch but it didn't work for qemu crashed. The qemu
> crash bt is below:
>
> (gdb) bt
> #0 bdrv_detach_aio_context (bs=bs@entry=0x55a96b79ca30)
> #1 0x000055a9688249ae in bdrv_set_aio_context
> (bs=bs@entry=0x55a96b79ca30,
> new_context=new_context@entry=0x55a96b766920)
> #2 0x000055a96885f721 in blk_set_aio_context (blk=0x55a96b792820,
> new_context=0x55a96b766920)
> #3 0x000055a9685ab797 in virtio_blk_data_plane_stop (vdev=<optimized
> out>)
> #4 0x000055a9687bf705 in virtio_bus_stop_ioeventfd (bus=0x55a96cc42220)
> #5 0x000055a9685d9d94 in virtio_vmstate_change
> (opaque=0x55a96cc42290, running=<optimized out>,
> state=<optimized out>)
> #6 0x000055a96866e1a2 in vm_state_notify (running=running@entry=0,
> state=state@entry=RUN_STATE_PAUSED)
> #7 0x000055a96857b4c5 in do_vm_stop (state=RUN_STATE_PAUSED)
> #8 vm_stop (state=state@entry=RUN_STATE_PAUSED)
> #9 0x000055a96867d52b in qmp_stop (errp=errp@entry=0x7fff4e54a0d8)
> #10 0x000055a96867b6ab in qmp_marshal_stop (args=<optimized out>,
> ret=<optimized out>, errp=0x7fff4e54a
> #11 0x000055a9688c2267 in do_qmp_dispatch (errp=0x7fff4e54a118,
> request=0x55a96b7b4740)
> #12 qmp_dispatch (request=request@entry=0x55a96b7ae490)
> #13 0x000055a96857dd42 in handle_qmp_command (parser=<optimized out>,
> tokens=<optimized out>)
> #14 0x000055a9688c7534 in json_message_process_token
> (lexer=0x55a96b776a68, input=0x55a96b70cae0, type=
> x=36, y=91)
> #15 0x000055a9688e960b in json_lexer_feed_char
> (lexer=lexer@entry=0x55a96b776a68, ch=125 '}',
> flush=flush@entry=false)
> #16 0x000055a9688e96ce in json_lexer_feed (lexer=0x55a96b776a68,
> buffer=<optimized out>, size=<optimize
> #17 0x000055a9688c75f9 in json_message_parser_feed (parser=<optimized
> out>, buffer=<optimized out>,
> #18 0x000055a96857c5fb in monitor_qmp_read (opaque=<optimized out>,
> buf=<optimized out>, size=<optimized out>)
> #19 0x000055a968667596 in tcp_chr_read (chan=<optimized out>,
> cond=<optimized out>, opaque=0x55a96b7748
> #20 0x00007f8a9447899a in g_main_context_dispatch () from
> /lib64/libglib-2.0.so.0
> #21 0x000055a968828c3c in glib_pollfds_poll ()
> #22 os_host_main_loop_wait (timeout=<optimized out>)
> #23 main_loop_wait (nonblocking=<optimized out>)
> #24 0x000055a96854351f in main_loop () at vl.c:2095
> #25 main (argc=<optimized out>, argv=<optimized out>, envp=<optimized
> out>)
>
> (gdb) p *bs
> $1 = {total_sectors = 94186141054112, open_flags = 1811887680,
> read_only = 169, encrypted = 85, valid_k
> sg = false, probed = false, copy_on_read = 0, flush_queue = {entries
> = {sqh_first = 0x0,
> sqh_last = 0x55a96b79ca48}}, active_flush_req = false,
> flushed_gen = 68727, drv = 0x0, opaque = 0
> aio_context = 0x55a96b778cd0, aio_notifiers = {lh_first = 0x0},
> walking_aio_notifiers = false,
> filename = "/mnt/sdb/lzg/disk_10G.son", '\000' <repeats 4070 times>,
> backing_file = "\000mnt/sdb/lzg/disk_10G.raw", '\000' <repeats 4070
> times>,
> backing_format = "\000aw", '\000' <repeats 12 times>,
> full_open_options = 0x0,
> exact_filename = "/mnt/sdb/lzg/disk_10G.son", '\000' <repeats 4070
> times>, backing = 0x0, file = 0x0,
> before_write_notifiers = {notifiers = {lh_first = 0x0}}, in_flight =
> 0, serialising_in_flight = 0,
> wakeup = false, wr_highest_offset = 35188224, bl =
> {request_alignment = 0, max_pdiscard = 0,
> pdiscard_alignment = 0, max_pwrite_zeroes = 0,
> pwrite_zeroes_alignment = 0, opt_transfer = 0, max_t
> min_mem_alignment = 0, opt_mem_alignment = 0, max_iov = 0},
> supported_write_flags = 0,
> supported_zero_flags = 4, node_name = "#block349", '\000' <repeats
> 22 times>, node_list = {
> tqe_next = 0x55a96b7b14f0, tqe_prev = 0x0}, bs_list = {tqe_next =
> 0x55a96b7ab240, tqe_prev = 0x0},
> monitor_list = {tqe_next = 0x0, tqe_prev = 0x0}, dirty_bitmaps =
> {lh_first = 0x0}, refcnt = 0,
> tracked_requests = {lh_first = 0x0}, op_blockers = {{lh_first = 0x0}
> <repeats 18 times>}, job = 0x0,
> inherits_from = 0x0, children = {lh_first = 0x0}, parents =
> {lh_first = 0x0}, options = 0x0,
> explicit_options = 0x0, detect_zeroes =
> BLOCKDEV_DETECT_ZEROES_OPTIONS_OFF, backing_blocker = 0x0,
> write_threshold_offset = 0, write_threshold_notifier = {notify =
> 0x0, node = {le_next = 0x0, le_prev
> io_plugged = 0, quiesce_counter = 0, write_gen = 68727}
> (gdb) p *bs->drv
> Cannot access memory at address 0x0
>
> From the bt we can see, qemu met a Null pointer reference in
> bdrv_detach_aio_context. The code is below:
>
> void bdrv_detach_aio_context(BlockDriverState *bs)
> {
> .........
>
> QLIST_FOREACH_SAFE(baf, &bs->aio_notifiers, list, baf_tmp) {
> if (baf->deleted) {
> bdrv_do_remove_aio_context_notifier(baf);
> } else {
> baf->detach_aio_context(baf->opaque);
> }
> }
> /* Never mind iterating again to check for ->deleted.
> bdrv_close() will
> * remove remaining aio notifiers if we aren't called again.
> */
> bs->walking_aio_notifiers = false;
>
> if (bs->drv->bdrv_detach_aio_context) {
> bs->drv->bdrv_detach_aio_context(bs);
> }
>
> ..................
> }
>
> In your patch, the mirror_exit is done in
> aio_poll(qemu_get_aio_context(), true). In
> mirror_exit, the top bs willl be free by bdrv_unref. So it will make a
> Null pointer access in the follow-up procedure.
>
>
>
>
> On 2018/6/10 15:43, Fam Zheng wrote:
>> On Sat, 06/09 17:10, l00284672 wrote:
>>> Hi, I found a dead loop in qemu when do blockJobAbort and vm suspend
>>> coinstantaneously.
>>>
>>> The qemu bt is below:
>>>
>>> #0 0x00007ff58b53af1f in ppoll () from /lib64/libc.so.6
>>> #1 0x00000000007fdbd9 in ppoll (__ss=0x0, __timeout=0x7ffcf7055390,
>>> __nfds=<optimized out>, __fds=<optimized out>) at
>>> /usr/include/bits/poll2.h:77
>>> #2 qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
>>> timeout=timeout@entry=0) at util/qemu-timer.c:334
>>> #3 0x00000000007ff83a in aio_poll (ctx=ctx@entry=0x269d800,
>>> blocking=blocking@entry=false) at util/aio-posix.c:629
>>> #4 0x0000000000776e91 in bdrv_drain_recurse (bs=bs@entry=0x26d9cb0) at
>>> block/io.c:198
>>> #5 0x0000000000776ef2 in bdrv_drain_recurse (bs=bs@entry=0x3665990) at
>>> block/io.c:215
>>> #6 0x00000000007774b8 in bdrv_do_drained_begin (bs=0x3665990,
>>> recursive=<optimized out>, parent=0x0) at block/io.c:291
>>> #7 0x000000000076a79e in blk_drain (blk=0x2780fc0) at
>>> block/block-backend.c:1586
>>> #8 0x000000000072d2a9 in block_job_drain (job=0x29df040) at blockjob.c:123
>>> #9 0x000000000072d228 in block_job_detach_aio_context (opaque=0x29df040) at
>>> blockjob.c:139
>>> #10 0x00000000007298b1 in bdrv_detach_aio_context (bs=bs@entry=0x3665990) at
>>> block.c:4885
>>> #11 0x0000000000729a46 in bdrv_set_aio_context (bs=0x3665990,
>>> new_context=0x268e140) at block.c:4946
>>> #12 0x0000000000499743 in virtio_blk_data_plane_stop (vdev=<optimized out>)
>>> at
>>> /mnt/sdb/lzg/code/shequ_code/5_29/qemu/hw/block/dataplane/virtio-blk.c:285
>>> #13 0x00000000006bce30 in virtio_bus_stop_ioeventfd (bus=0x3de5378) at
>>> hw/virtio/virtio-bus.c:246
>>> #14 0x00000000004c654d in virtio_vmstate_change (opaque=0x3de53f0,
>>> running=<optimized out>, state=<optimized out>)
>>> at /mnt/sdb/lzg/code/shequ_code/5_29/qemu/hw/virtio/virtio.c:2222
>>> #15 0x0000000000561b52 in vm_state_notify (running=running@entry=0,
>>> state=state@entry=RUN_STATE_PAUSED) at vl.c:1514
>>> #16 0x000000000045d67a in do_vm_stop (state=state@entry=RUN_STATE_PAUSED,
>>> send_stop=send_stop@entry=true)
>>> at /mnt/sdb/lzg/code/shequ_code/5_29/qemu/cpus.c:1012
>>> #17 0x000000000045dafd in vm_stop (state=state@entry=RUN_STATE_PAUSED) at
>>> /mnt/sdb/lzg/code/shequ_code/5_29/qemu/cpus.c:2035
>>> #18 0x000000000057301b in qmp_stop (errp=errp@entry=0x7ffcf70556f0) at
>>> qmp.c:106
>>> #19 0x000000000056bf7a in qmp_marshal_stop (args=<optimized out>,
>>> ret=<optimized out>, errp=0x7ffcf7055738) at qapi/qapi-commands-misc.c:784
>>> #20 0x00000000007f2d27 in do_qmp_dispatch (errp=0x7ffcf7055730,
>>> request=0x3e121e0, cmds=<optimized out>) at qapi/qmp-dispatch.c:119
>>> #21 qmp_dispatch (cmds=<optimized out>, request=request@entry=0x26f2800) at
>>> qapi/qmp-dispatch.c:168
>>> #22 0x00000000004655be in monitor_qmp_dispatch_one
>>> (req_obj=req_obj@entry=0x39abff0) at
>>> /mnt/sdb/lzg/code/shequ_code/5_29/qemu/monitor.c:4088
>>> #23 0x0000000000465894 in monitor_qmp_bh_dispatcher (data=<optimized out>)
>>> at /mnt/sdb/lzg/code/shequ_code/5_29/qemu/monitor.c:4146
>>> #24 0x00000000007fc571 in aio_bh_call (bh=0x26de7e0) at util/async.c:90
>>> #25 aio_bh_poll (ctx=ctx@entry=0x268dd50) at util/async.c:118
>>> #26 0x00000000007ff6f0 in aio_dispatch (ctx=0x268dd50) at
>>> util/aio-posix.c:436
>>> #27 0x00000000007fc44e in aio_ctx_dispatch (source=<optimized out>,
>>> callback=<optimized out>, user_data=<optimized out>) at util/async.c:261
>>> #28 0x00007ff58bc7c99a in g_main_context_dispatch () from
>>> /lib64/libglib-2.0.so.0
>>> #29 0x00000000007fea3a in glib_pollfds_poll () at util/main-loop.c:215
>>> #30 os_host_main_loop_wait (timeout=<optimized out>) at util/main-loop.c:238
>>> #31 main_loop_wait (nonblocking=nonblocking@entry=0) at util/main-loop.c:497
>>> #32 0x0000000000561cad in main_loop () at vl.c:1848
>>> #33 0x000000000041995c in main (argc=<optimized out>, argv=<optimized out>,
>>> envp=<optimized out>) at vl.c:4605
>>>
>>> The disk is a virtio-blk dataplane disk with a mirror job running. The dead
>>> loop is here:
>>>
>>> static void block_job_detach_aio_context(void *opaque)
>>> {
>>> BlockJob *job = opaque;
>>>
>>> /* In case the job terminates during aio_poll()... */
>>> job_ref(&job->job);
>>>
>>> job_pause(&job->job);
>>>
>>> while (!job->job.paused && !job_is_completed(&job->job)) {
>>> job_drain(&job->job);
>>> }
>>>
>>> job->job.aio_context = NULL;
>>> job_unref(&job->job);
>>> }
>>>
>>> The job is deferred to main loop now, but the job_drain only processes the
>>> AIO context of bs which has no more work to do,
>>>
>>> while the main loop BH is scheduled for setting the job->completed flag is
>>> never processed.
>> In that case, main loop's AioContext should be driven like in job_finish_sync().
>> Could you try this patch?
>>
>> diff --git a/blockjob.c b/blockjob.c
>> index 0306533a2e..72aa82ac2d 100644
>> --- a/blockjob.c
>> +++ b/blockjob.c
>> @@ -135,7 +135,15 @@ static void block_job_detach_aio_context(void *opaque)
>>
>> job_pause(&job->job);
>>
>> - while (!job->job.paused && !job_is_completed(&job->job)) {
>> +
>> + while (true) {
>> + if (job->job.paused || job_is_completed(&job->job)) {
>> + break;
>> + }
>> + if (job->deferred_to_main_loop) {
>> + aio_poll(qemu_get_aio_context(), true);
>> + continue;
>> + }
>> job_drain(&job->job);
>> }
>>
>> .
>>
>
* Re: [Qemu-devel] question: a dead loop in qemu when do blockJobAbort and vm suspend coinstantaneously
2018-06-11 3:31 ` l00284672
@ 2018-06-12 1:19 ` l00284672
2018-06-12 1:45 ` Fam Zheng
1 sibling, 0 replies; 7+ messages in thread
From: l00284672 @ 2018-06-12 1:19 UTC (permalink / raw)
To: Fam Zheng; +Cc: stefanha, kwolf, lizhengui, qemu-devel
ping
On 2018/6/11 11:31, l00284672 wrote:
>
> I tried your patch with my modification below can slove this problem.
>
> void blk_set_aio_context(BlockBackend *blk, AioContext *new_context)
> {
> BlockDriverState *bs = blk_bs(blk);
> ThrottleGroupMember *tgm = &blk->public.throttle_group_member;
>
> if (bs) {
> if (tgm->throttle_state) {
> bdrv_drained_begin(bs);
> throttle_group_detach_aio_context(tgm);
> throttle_group_attach_aio_context(tgm, new_context);
> bdrv_drained_end(bs);
> }
> bdrv_ref(bs);
> bdrv_set_aio_context(bs, new_context);
> bdrv_unref(bs);
> }
> }
>
> I add bdrv_ref before bdrv_set_aio_context to avoid bs freed in
> mirror_exit. Do you agree with
> my modification ?
>
> On 2018/6/11 11:01, l00284672 wrote:
>>
>> Thanks for your reply.
>>
>> I tried your patch but it didn't work for qemu crashed. The qemu
>> crash bt is below:
>>
>> (gdb) bt
>> #0 bdrv_detach_aio_context (bs=bs@entry=0x55a96b79ca30)
>> #1 0x000055a9688249ae in bdrv_set_aio_context
>> (bs=bs@entry=0x55a96b79ca30,
>> new_context=new_context@entry=0x55a96b766920)
>> #2 0x000055a96885f721 in blk_set_aio_context (blk=0x55a96b792820,
>> new_context=0x55a96b766920)
>> #3 0x000055a9685ab797 in virtio_blk_data_plane_stop (vdev=<optimized
>> out>)
>> #4 0x000055a9687bf705 in virtio_bus_stop_ioeventfd (bus=0x55a96cc42220)
>> #5 0x000055a9685d9d94 in virtio_vmstate_change
>> (opaque=0x55a96cc42290, running=<optimized out>,
>> state=<optimized out>)
>> #6 0x000055a96866e1a2 in vm_state_notify (running=running@entry=0,
>> state=state@entry=RUN_STATE_PAUSED)
>> #7 0x000055a96857b4c5 in do_vm_stop (state=RUN_STATE_PAUSED)
>> #8 vm_stop (state=state@entry=RUN_STATE_PAUSED)
>> #9 0x000055a96867d52b in qmp_stop (errp=errp@entry=0x7fff4e54a0d8)
>> #10 0x000055a96867b6ab in qmp_marshal_stop (args=<optimized out>,
>> ret=<optimized out>, errp=0x7fff4e54a
>> #11 0x000055a9688c2267 in do_qmp_dispatch (errp=0x7fff4e54a118,
>> request=0x55a96b7b4740)
>> #12 qmp_dispatch (request=request@entry=0x55a96b7ae490)
>> #13 0x000055a96857dd42 in handle_qmp_command (parser=<optimized out>,
>> tokens=<optimized out>)
>> #14 0x000055a9688c7534 in json_message_process_token
>> (lexer=0x55a96b776a68, input=0x55a96b70cae0, type=
>> x=36, y=91)
>> #15 0x000055a9688e960b in json_lexer_feed_char
>> (lexer=lexer@entry=0x55a96b776a68, ch=125 '}',
>> flush=flush@entry=false)
>> #16 0x000055a9688e96ce in json_lexer_feed (lexer=0x55a96b776a68,
>> buffer=<optimized out>, size=<optimize
>> #17 0x000055a9688c75f9 in json_message_parser_feed (parser=<optimized
>> out>, buffer=<optimized out>,
>> #18 0x000055a96857c5fb in monitor_qmp_read (opaque=<optimized out>,
>> buf=<optimized out>, size=<optimized out>)
>> #19 0x000055a968667596 in tcp_chr_read (chan=<optimized out>,
>> cond=<optimized out>, opaque=0x55a96b7748
>> #20 0x00007f8a9447899a in g_main_context_dispatch () from
>> /lib64/libglib-2.0.so.0
>> #21 0x000055a968828c3c in glib_pollfds_poll ()
>> #22 os_host_main_loop_wait (timeout=<optimized out>)
>> #23 main_loop_wait (nonblocking=<optimized out>)
>> #24 0x000055a96854351f in main_loop () at vl.c:2095
>> #25 main (argc=<optimized out>, argv=<optimized out>, envp=<optimized
>> out>)
>>
>> (gdb) p *bs
>> $1 = {total_sectors = 94186141054112, open_flags = 1811887680,
>> read_only = 169, encrypted = 85, valid_k
>> sg = false, probed = false, copy_on_read = 0, flush_queue =
>> {entries = {sqh_first = 0x0,
>> sqh_last = 0x55a96b79ca48}}, active_flush_req = false,
>> flushed_gen = 68727, drv = 0x0, opaque = 0
>> aio_context = 0x55a96b778cd0, aio_notifiers = {lh_first = 0x0},
>> walking_aio_notifiers = false,
>> filename = "/mnt/sdb/lzg/disk_10G.son", '\000' <repeats 4070 times>,
>> backing_file = "\000mnt/sdb/lzg/disk_10G.raw", '\000' <repeats 4070
>> times>,
>> backing_format = "\000aw", '\000' <repeats 12 times>,
>> full_open_options = 0x0,
>> exact_filename = "/mnt/sdb/lzg/disk_10G.son", '\000' <repeats 4070
>> times>, backing = 0x0, file = 0x0,
>> before_write_notifiers = {notifiers = {lh_first = 0x0}}, in_flight
>> = 0, serialising_in_flight = 0,
>> wakeup = false, wr_highest_offset = 35188224, bl =
>> {request_alignment = 0, max_pdiscard = 0,
>> pdiscard_alignment = 0, max_pwrite_zeroes = 0,
>> pwrite_zeroes_alignment = 0, opt_transfer = 0, max_t
>> min_mem_alignment = 0, opt_mem_alignment = 0, max_iov = 0},
>> supported_write_flags = 0,
>> supported_zero_flags = 4, node_name = "#block349", '\000' <repeats
>> 22 times>, node_list = {
>> tqe_next = 0x55a96b7b14f0, tqe_prev = 0x0}, bs_list = {tqe_next =
>> 0x55a96b7ab240, tqe_prev = 0x0},
>> monitor_list = {tqe_next = 0x0, tqe_prev = 0x0}, dirty_bitmaps =
>> {lh_first = 0x0}, refcnt = 0,
>> tracked_requests = {lh_first = 0x0}, op_blockers = {{lh_first =
>> 0x0} <repeats 18 times>}, job = 0x0,
>> inherits_from = 0x0, children = {lh_first = 0x0}, parents =
>> {lh_first = 0x0}, options = 0x0,
>> explicit_options = 0x0, detect_zeroes =
>> BLOCKDEV_DETECT_ZEROES_OPTIONS_OFF, backing_blocker = 0x0,
>> write_threshold_offset = 0, write_threshold_notifier = {notify =
>> 0x0, node = {le_next = 0x0, le_prev
>> io_plugged = 0, quiesce_counter = 0, write_gen = 68727}
>> (gdb) p *bs->drv
>> Cannot access memory at address 0x0
>>
>> From the bt we can see, qemu met a Null pointer reference in
>> bdrv_detach_aio_context. The code is below:
>>
>> void bdrv_detach_aio_context(BlockDriverState *bs)
>> {
>> .........
>>
>> QLIST_FOREACH_SAFE(baf, &bs->aio_notifiers, list, baf_tmp) {
>> if (baf->deleted) {
>> bdrv_do_remove_aio_context_notifier(baf);
>> } else {
>> baf->detach_aio_context(baf->opaque);
>> }
>> }
>> /* Never mind iterating again to check for ->deleted.
>> bdrv_close() will
>> * remove remaining aio notifiers if we aren't called again.
>> */
>> bs->walking_aio_notifiers = false;
>>
>> if (bs->drv->bdrv_detach_aio_context) {
>> bs->drv->bdrv_detach_aio_context(bs);
>> }
>>
>> ..................
>> }
>>
>> In your patch, the mirror_exit is done in
>> aio_poll(qemu_get_aio_context(), true). In
>> mirror_exit, the top bs willl be free by bdrv_unref. So it will make a
>> Null pointer access in the follow-up procedure.
>>
>>
>>
>>
>> On 2018/6/10 15:43, Fam Zheng wrote:
>>> On Sat, 06/09 17:10, l00284672 wrote:
>>>> Hi, I found a dead loop in qemu when do blockJobAbort and vm suspend
>>>> coinstantaneously.
>>>>
>>>> The qemu bt is below:
>>>>
>>>> #0 0x00007ff58b53af1f in ppoll () from /lib64/libc.so.6
>>>> #1 0x00000000007fdbd9 in ppoll (__ss=0x0, __timeout=0x7ffcf7055390,
>>>> __nfds=<optimized out>, __fds=<optimized out>) at
>>>> /usr/include/bits/poll2.h:77
>>>> #2 qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
>>>> timeout=timeout@entry=0) at util/qemu-timer.c:334
>>>> #3 0x00000000007ff83a in aio_poll (ctx=ctx@entry=0x269d800,
>>>> blocking=blocking@entry=false) at util/aio-posix.c:629
>>>> #4 0x0000000000776e91 in bdrv_drain_recurse (bs=bs@entry=0x26d9cb0) at
>>>> block/io.c:198
>>>> #5 0x0000000000776ef2 in bdrv_drain_recurse (bs=bs@entry=0x3665990) at
>>>> block/io.c:215
>>>> #6 0x00000000007774b8 in bdrv_do_drained_begin (bs=0x3665990,
>>>> recursive=<optimized out>, parent=0x0) at block/io.c:291
>>>> #7 0x000000000076a79e in blk_drain (blk=0x2780fc0) at
>>>> block/block-backend.c:1586
>>>> #8 0x000000000072d2a9 in block_job_drain (job=0x29df040) at blockjob.c:123
>>>> #9 0x000000000072d228 in block_job_detach_aio_context (opaque=0x29df040) at
>>>> blockjob.c:139
>>>> #10 0x00000000007298b1 in bdrv_detach_aio_context (bs=bs@entry=0x3665990) at
>>>> block.c:4885
>>>> #11 0x0000000000729a46 in bdrv_set_aio_context (bs=0x3665990,
>>>> new_context=0x268e140) at block.c:4946
>>>> #12 0x0000000000499743 in virtio_blk_data_plane_stop (vdev=<optimized out>)
>>>> at
>>>> /mnt/sdb/lzg/code/shequ_code/5_29/qemu/hw/block/dataplane/virtio-blk.c:285
>>>> #13 0x00000000006bce30 in virtio_bus_stop_ioeventfd (bus=0x3de5378) at
>>>> hw/virtio/virtio-bus.c:246
>>>> #14 0x00000000004c654d in virtio_vmstate_change (opaque=0x3de53f0,
>>>> running=<optimized out>, state=<optimized out>)
>>>> at /mnt/sdb/lzg/code/shequ_code/5_29/qemu/hw/virtio/virtio.c:2222
>>>> #15 0x0000000000561b52 in vm_state_notify (running=running@entry=0,
>>>> state=state@entry=RUN_STATE_PAUSED) at vl.c:1514
>>>> #16 0x000000000045d67a in do_vm_stop (state=state@entry=RUN_STATE_PAUSED,
>>>> send_stop=send_stop@entry=true)
>>>> at /mnt/sdb/lzg/code/shequ_code/5_29/qemu/cpus.c:1012
>>>> #17 0x000000000045dafd in vm_stop (state=state@entry=RUN_STATE_PAUSED) at
>>>> /mnt/sdb/lzg/code/shequ_code/5_29/qemu/cpus.c:2035
>>>> #18 0x000000000057301b in qmp_stop (errp=errp@entry=0x7ffcf70556f0) at
>>>> qmp.c:106
>>>> #19 0x000000000056bf7a in qmp_marshal_stop (args=<optimized out>,
>>>> ret=<optimized out>, errp=0x7ffcf7055738) at qapi/qapi-commands-misc.c:784
>>>> #20 0x00000000007f2d27 in do_qmp_dispatch (errp=0x7ffcf7055730,
>>>> request=0x3e121e0, cmds=<optimized out>) at qapi/qmp-dispatch.c:119
>>>> #21 qmp_dispatch (cmds=<optimized out>, request=request@entry=0x26f2800) at
>>>> qapi/qmp-dispatch.c:168
>>>> #22 0x00000000004655be in monitor_qmp_dispatch_one
>>>> (req_obj=req_obj@entry=0x39abff0) at
>>>> /mnt/sdb/lzg/code/shequ_code/5_29/qemu/monitor.c:4088
>>>> #23 0x0000000000465894 in monitor_qmp_bh_dispatcher (data=<optimized out>)
>>>> at /mnt/sdb/lzg/code/shequ_code/5_29/qemu/monitor.c:4146
>>>> #24 0x00000000007fc571 in aio_bh_call (bh=0x26de7e0) at util/async.c:90
>>>> #25 aio_bh_poll (ctx=ctx@entry=0x268dd50) at util/async.c:118
>>>> #26 0x00000000007ff6f0 in aio_dispatch (ctx=0x268dd50) at
>>>> util/aio-posix.c:436
>>>> #27 0x00000000007fc44e in aio_ctx_dispatch (source=<optimized out>,
>>>> callback=<optimized out>, user_data=<optimized out>) at util/async.c:261
>>>> #28 0x00007ff58bc7c99a in g_main_context_dispatch () from
>>>> /lib64/libglib-2.0.so.0
>>>> #29 0x00000000007fea3a in glib_pollfds_poll () at util/main-loop.c:215
>>>> #30 os_host_main_loop_wait (timeout=<optimized out>) at util/main-loop.c:238
>>>> #31 main_loop_wait (nonblocking=nonblocking@entry=0) at util/main-loop.c:497
>>>> #32 0x0000000000561cad in main_loop () at vl.c:1848
>>>> #33 0x000000000041995c in main (argc=<optimized out>, argv=<optimized out>,
>>>> envp=<optimized out>) at vl.c:4605
>>>>
>>>> The disk is a virtio-blk dataplane disk with a mirror job running. The dead
>>>> loop is here:
>>>>
>>>> static void block_job_detach_aio_context(void *opaque)
>>>> {
>>>> BlockJob *job = opaque;
>>>>
>>>> /* In case the job terminates during aio_poll()... */
>>>> job_ref(&job->job);
>>>>
>>>> job_pause(&job->job);
>>>>
>>>> while (!job->job.paused && !job_is_completed(&job->job)) {
>>>> job_drain(&job->job);
>>>> }
>>>>
>>>> job->job.aio_context = NULL;
>>>> job_unref(&job->job);
>>>> }
>>>>
>>>> The job is deferred to main loop now, but the job_drain only processes the
>>>> AIO context of bs which has no more work to do,
>>>>
>>>> while the main loop BH is scheduled for setting the job->completed flag is
>>>> never processed.
>>> In that case, main loop's AioContext should be driven like in job_finish_sync().
>>> Could you try this patch?
>>>
>>> diff --git a/blockjob.c b/blockjob.c
>>> index 0306533a2e..72aa82ac2d 100644
>>> --- a/blockjob.c
>>> +++ b/blockjob.c
>>> @@ -135,7 +135,15 @@ static void block_job_detach_aio_context(void *opaque)
>>>
>>> job_pause(&job->job);
>>>
>>> - while (!job->job.paused && !job_is_completed(&job->job)) {
>>> +
>>> + while (true) {
>>> + if (job->job.paused || job_is_completed(&job->job)) {
>>> + break;
>>> + }
>>> + if (job->deferred_to_main_loop) {
>>> + aio_poll(qemu_get_aio_context(), true);
>>> + continue;
>>> + }
>>> job_drain(&job->job);
>>> }
>>>
>>> .
>>>
>>
>
* Re: [Qemu-devel] question: a dead loop in qemu when do blockJobAbort and vm suspend coinstantaneously
2018-06-11 3:31 ` l00284672
2018-06-12 1:19 ` l00284672
@ 2018-06-12 1:45 ` Fam Zheng
2018-06-12 2:11 ` l00284672
1 sibling, 1 reply; 7+ messages in thread
From: Fam Zheng @ 2018-06-12 1:45 UTC (permalink / raw)
To: l00284672; +Cc: stefanha, kwolf, qemu-devel
On Mon, 06/11 11:31, l00284672 wrote:
> I tried your patch with my modification below can slove this problem.
>
> void blk_set_aio_context(BlockBackend *blk, AioContext *new_context)
> {
> BlockDriverState *bs = blk_bs(blk);
> ThrottleGroupMember *tgm = &blk->public.throttle_group_member;
>
> if (bs) {
> if (tgm->throttle_state) {
> bdrv_drained_begin(bs);
> throttle_group_detach_aio_context(tgm);
> throttle_group_attach_aio_context(tgm, new_context);
> bdrv_drained_end(bs);
> }
> bdrv_ref(bs);
> bdrv_set_aio_context(bs, new_context);
> bdrv_unref(bs);
> }
> }
>
> I add bdrv_ref before bdrv_set_aio_context to avoid bs freed in
> mirror_exit. Do you agree with
> my modification ?
TBH I don't understand this change. @blk should have a reference to @bs here,
no? Why is an extra reference making any difference?
Fam
>
> On 2018/6/11 11:01, l00284672 wrote:
> >
> > Thanks for your reply.
> >
> > I tried your patch but it didn't work for qemu crashed. The qemu crash
> > bt is below:
> >
> > (gdb) bt
> > #0 bdrv_detach_aio_context (bs=bs@entry=0x55a96b79ca30)
> > #1 0x000055a9688249ae in bdrv_set_aio_context
> > (bs=bs@entry=0x55a96b79ca30,
> > new_context=new_context@entry=0x55a96b766920)
> > #2 0x000055a96885f721 in blk_set_aio_context (blk=0x55a96b792820,
> > new_context=0x55a96b766920)
> > #3 0x000055a9685ab797 in virtio_blk_data_plane_stop (vdev=<optimized
> > out>)
> > #4 0x000055a9687bf705 in virtio_bus_stop_ioeventfd (bus=0x55a96cc42220)
> > #5 0x000055a9685d9d94 in virtio_vmstate_change (opaque=0x55a96cc42290,
> > running=<optimized out>,
> > state=<optimized out>)
> > #6 0x000055a96866e1a2 in vm_state_notify (running=running@entry=0,
> > state=state@entry=RUN_STATE_PAUSED)
> > #7 0x000055a96857b4c5 in do_vm_stop (state=RUN_STATE_PAUSED)
> > #8 vm_stop (state=state@entry=RUN_STATE_PAUSED)
> > #9 0x000055a96867d52b in qmp_stop (errp=errp@entry=0x7fff4e54a0d8)
> > #10 0x000055a96867b6ab in qmp_marshal_stop (args=<optimized out>,
> > ret=<optimized out>, errp=0x7fff4e54a
> > #11 0x000055a9688c2267 in do_qmp_dispatch (errp=0x7fff4e54a118,
> > request=0x55a96b7b4740)
> > #12 qmp_dispatch (request=request@entry=0x55a96b7ae490)
> > #13 0x000055a96857dd42 in handle_qmp_command (parser=<optimized out>,
> > tokens=<optimized out>)
> > #14 0x000055a9688c7534 in json_message_process_token
> > (lexer=0x55a96b776a68, input=0x55a96b70cae0, type=
> > x=36, y=91)
> > #15 0x000055a9688e960b in json_lexer_feed_char
> > (lexer=lexer@entry=0x55a96b776a68, ch=125 '}',
> > flush=flush@entry=false)
> > #16 0x000055a9688e96ce in json_lexer_feed (lexer=0x55a96b776a68,
> > buffer=<optimized out>, size=<optimize
> > #17 0x000055a9688c75f9 in json_message_parser_feed (parser=<optimized
> > out>, buffer=<optimized out>,
> > #18 0x000055a96857c5fb in monitor_qmp_read (opaque=<optimized out>,
> > buf=<optimized out>, size=<optimized out>)
> > #19 0x000055a968667596 in tcp_chr_read (chan=<optimized out>,
> > cond=<optimized out>, opaque=0x55a96b7748
> > #20 0x00007f8a9447899a in g_main_context_dispatch () from
> > /lib64/libglib-2.0.so.0
> > #21 0x000055a968828c3c in glib_pollfds_poll ()
> > #22 os_host_main_loop_wait (timeout=<optimized out>)
> > #23 main_loop_wait (nonblocking=<optimized out>)
> > #24 0x000055a96854351f in main_loop () at vl.c:2095
> > #25 main (argc=<optimized out>, argv=<optimized out>, envp=<optimized
> > out>)
> >
> > (gdb) p *bs
> > $1 = {total_sectors = 94186141054112, open_flags = 1811887680, read_only
> > = 169, encrypted = 85, valid_k
> > sg = false, probed = false, copy_on_read = 0, flush_queue = {entries =
> > {sqh_first = 0x0,
> > sqh_last = 0x55a96b79ca48}}, active_flush_req = false, flushed_gen
> > = 68727, drv = 0x0, opaque = 0
> > aio_context = 0x55a96b778cd0, aio_notifiers = {lh_first = 0x0},
> > walking_aio_notifiers = false,
> > filename = "/mnt/sdb/lzg/disk_10G.son", '\000' <repeats 4070 times>,
> > backing_file = "\000mnt/sdb/lzg/disk_10G.raw", '\000' <repeats 4070
> > times>,
> > backing_format = "\000aw", '\000' <repeats 12 times>,
> > full_open_options = 0x0,
> > exact_filename = "/mnt/sdb/lzg/disk_10G.son", '\000' <repeats 4070
> > times>, backing = 0x0, file = 0x0,
> > before_write_notifiers = {notifiers = {lh_first = 0x0}}, in_flight =
> > 0, serialising_in_flight = 0,
> > wakeup = false, wr_highest_offset = 35188224, bl = {request_alignment
> > = 0, max_pdiscard = 0,
> > pdiscard_alignment = 0, max_pwrite_zeroes = 0,
> > pwrite_zeroes_alignment = 0, opt_transfer = 0, max_t
> > min_mem_alignment = 0, opt_mem_alignment = 0, max_iov = 0},
> > supported_write_flags = 0,
> > supported_zero_flags = 4, node_name = "#block349", '\000' <repeats 22
> > times>, node_list = {
> > tqe_next = 0x55a96b7b14f0, tqe_prev = 0x0}, bs_list = {tqe_next =
> > 0x55a96b7ab240, tqe_prev = 0x0},
> > monitor_list = {tqe_next = 0x0, tqe_prev = 0x0}, dirty_bitmaps =
> > {lh_first = 0x0}, refcnt = 0,
> > tracked_requests = {lh_first = 0x0}, op_blockers = {{lh_first = 0x0}
> > <repeats 18 times>}, job = 0x0,
> > inherits_from = 0x0, children = {lh_first = 0x0}, parents = {lh_first
> > = 0x0}, options = 0x0,
> > explicit_options = 0x0, detect_zeroes =
> > BLOCKDEV_DETECT_ZEROES_OPTIONS_OFF, backing_blocker = 0x0,
> > write_threshold_offset = 0, write_threshold_notifier = {notify = 0x0,
> > node = {le_next = 0x0, le_prev
> > io_plugged = 0, quiesce_counter = 0, write_gen = 68727}
> > (gdb) p *bs->drv
> > Cannot access memory at address 0x0
> >
> > From the bt we can see, qemu met a Null pointer reference in
> > bdrv_detach_aio_context. The code is below:
> >
> > void bdrv_detach_aio_context(BlockDriverState *bs)
> > {
> > .........
> >
> > QLIST_FOREACH_SAFE(baf, &bs->aio_notifiers, list, baf_tmp) {
> > if (baf->deleted) {
> > bdrv_do_remove_aio_context_notifier(baf);
> > } else {
> > baf->detach_aio_context(baf->opaque);
> > }
> > }
> > /* Never mind iterating again to check for ->deleted. bdrv_close()
> > will
> > * remove remaining aio notifiers if we aren't called again.
> > */
> > bs->walking_aio_notifiers = false;
> >
> > if (bs->drv->bdrv_detach_aio_context) {
> > bs->drv->bdrv_detach_aio_context(bs);
> > }
> >
> > ..................
> > }
> >
> > In your patch, the mirror_exit is done in
> > aio_poll(qemu_get_aio_context(), true). In
> > mirror_exit, the top bs willl be free by bdrv_unref. So it will make a
> > Null pointer access in the follow-up procedure.
> >
> >
> >
> >
> > On 2018/6/10 15:43, Fam Zheng wrote:
> > > On Sat, 06/09 17:10, l00284672 wrote:
> > > > Hi, I found a dead loop in qemu when do blockJobAbort and vm suspend
> > > > coinstantaneously.
> > > >
> > > > The qemu bt is below:
> > > >
> > > > #0 0x00007ff58b53af1f in ppoll () from /lib64/libc.so.6
> > > > #1 0x00000000007fdbd9 in ppoll (__ss=0x0, __timeout=0x7ffcf7055390,
> > > > __nfds=<optimized out>, __fds=<optimized out>) at
> > > > /usr/include/bits/poll2.h:77
> > > > #2 qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
> > > > timeout=timeout@entry=0) at util/qemu-timer.c:334
> > > > #3 0x00000000007ff83a in aio_poll (ctx=ctx@entry=0x269d800,
> > > > blocking=blocking@entry=false) at util/aio-posix.c:629
> > > > #4 0x0000000000776e91 in bdrv_drain_recurse (bs=bs@entry=0x26d9cb0) at
> > > > block/io.c:198
> > > > #5 0x0000000000776ef2 in bdrv_drain_recurse (bs=bs@entry=0x3665990) at
> > > > block/io.c:215
> > > > #6 0x00000000007774b8 in bdrv_do_drained_begin (bs=0x3665990,
> > > > recursive=<optimized out>, parent=0x0) at block/io.c:291
> > > > #7 0x000000000076a79e in blk_drain (blk=0x2780fc0) at
> > > > block/block-backend.c:1586
> > > > #8 0x000000000072d2a9 in block_job_drain (job=0x29df040) at blockjob.c:123
> > > > #9 0x000000000072d228 in block_job_detach_aio_context (opaque=0x29df040) at
> > > > blockjob.c:139
> > > > #10 0x00000000007298b1 in bdrv_detach_aio_context (bs=bs@entry=0x3665990) at
> > > > block.c:4885
> > > > #11 0x0000000000729a46 in bdrv_set_aio_context (bs=0x3665990,
> > > > new_context=0x268e140) at block.c:4946
> > > > #12 0x0000000000499743 in virtio_blk_data_plane_stop (vdev=<optimized out>)
> > > > at
> > > > /mnt/sdb/lzg/code/shequ_code/5_29/qemu/hw/block/dataplane/virtio-blk.c:285
> > > > #13 0x00000000006bce30 in virtio_bus_stop_ioeventfd (bus=0x3de5378) at
> > > > hw/virtio/virtio-bus.c:246
> > > > #14 0x00000000004c654d in virtio_vmstate_change (opaque=0x3de53f0,
> > > > running=<optimized out>, state=<optimized out>)
> > > > at /mnt/sdb/lzg/code/shequ_code/5_29/qemu/hw/virtio/virtio.c:2222
> > > > #15 0x0000000000561b52 in vm_state_notify (running=running@entry=0,
> > > > state=state@entry=RUN_STATE_PAUSED) at vl.c:1514
> > > > #16 0x000000000045d67a in do_vm_stop (state=state@entry=RUN_STATE_PAUSED,
> > > > send_stop=send_stop@entry=true)
> > > > at /mnt/sdb/lzg/code/shequ_code/5_29/qemu/cpus.c:1012
> > > > #17 0x000000000045dafd in vm_stop (state=state@entry=RUN_STATE_PAUSED) at
> > > > /mnt/sdb/lzg/code/shequ_code/5_29/qemu/cpus.c:2035
> > > > #18 0x000000000057301b in qmp_stop (errp=errp@entry=0x7ffcf70556f0) at
> > > > qmp.c:106
> > > > #19 0x000000000056bf7a in qmp_marshal_stop (args=<optimized out>,
> > > > ret=<optimized out>, errp=0x7ffcf7055738) at qapi/qapi-commands-misc.c:784
> > > > #20 0x00000000007f2d27 in do_qmp_dispatch (errp=0x7ffcf7055730,
> > > > request=0x3e121e0, cmds=<optimized out>) at qapi/qmp-dispatch.c:119
> > > > #21 qmp_dispatch (cmds=<optimized out>, request=request@entry=0x26f2800) at
> > > > qapi/qmp-dispatch.c:168
> > > > #22 0x00000000004655be in monitor_qmp_dispatch_one
> > > > (req_obj=req_obj@entry=0x39abff0) at
> > > > /mnt/sdb/lzg/code/shequ_code/5_29/qemu/monitor.c:4088
> > > > #23 0x0000000000465894 in monitor_qmp_bh_dispatcher (data=<optimized out>)
> > > > at /mnt/sdb/lzg/code/shequ_code/5_29/qemu/monitor.c:4146
> > > > #24 0x00000000007fc571 in aio_bh_call (bh=0x26de7e0) at util/async.c:90
> > > > #25 aio_bh_poll (ctx=ctx@entry=0x268dd50) at util/async.c:118
> > > > #26 0x00000000007ff6f0 in aio_dispatch (ctx=0x268dd50) at
> > > > util/aio-posix.c:436
> > > > #27 0x00000000007fc44e in aio_ctx_dispatch (source=<optimized out>,
> > > > callback=<optimized out>, user_data=<optimized out>) at util/async.c:261
> > > > #28 0x00007ff58bc7c99a in g_main_context_dispatch () from
> > > > /lib64/libglib-2.0.so.0
> > > > #29 0x00000000007fea3a in glib_pollfds_poll () at util/main-loop.c:215
> > > > #30 os_host_main_loop_wait (timeout=<optimized out>) at util/main-loop.c:238
> > > > #31 main_loop_wait (nonblocking=nonblocking@entry=0) at util/main-loop.c:497
> > > > #32 0x0000000000561cad in main_loop () at vl.c:1848
> > > > #33 0x000000000041995c in main (argc=<optimized out>, argv=<optimized out>,
> > > > envp=<optimized out>) at vl.c:4605
> > > >
> > > > The disk is a virtio-blk dataplane disk with a mirror job running. The dead
> > > > loop is here:
> > > >
> > > > static void block_job_detach_aio_context(void *opaque)
> > > > {
> > > >     BlockJob *job = opaque;
> > > >
> > > >     /* In case the job terminates during aio_poll()... */
> > > >     job_ref(&job->job);
> > > >
> > > >     job_pause(&job->job);
> > > >
> > > >     while (!job->job.paused && !job_is_completed(&job->job)) {
> > > >         job_drain(&job->job);
> > > >     }
> > > >
> > > >     job->job.aio_context = NULL;
> > > >     job_unref(&job->job);
> > > > }
> > > >
> > > > The job has been deferred to the main loop now, but job_drain() only
> > > > processes the AIO context of the bs, which has no more work to do,
> > > >
> > > > while the main-loop BH that is scheduled to set the job->completed flag
> > > > is never processed.
> > > In that case, main loop's AioContext should be driven like in job_finish_sync().
> > > Could you try this patch?
> > >
> > > diff --git a/blockjob.c b/blockjob.c
> > > index 0306533a2e..72aa82ac2d 100644
> > > --- a/blockjob.c
> > > +++ b/blockjob.c
> > > @@ -135,7 +135,15 @@ static void block_job_detach_aio_context(void *opaque)
> > >
> > >      job_pause(&job->job);
> > >
> > > -    while (!job->job.paused && !job_is_completed(&job->job)) {
> > > +
> > > +    while (true) {
> > > +        if (job->job.paused || job_is_completed(&job->job)) {
> > > +            break;
> > > +        }
> > > +        if (job->deferred_to_main_loop) {
> > > +            aio_poll(qemu_get_aio_context(), true);
> > > +            continue;
> > > +        }
> > >          job_drain(&job->job);
> > >      }
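(For readability, the loop that results from applying this hunk -- my own
rendering of the diff above, not code copied from a tree -- looks like:)

    job_pause(&job->job);

    while (true) {
        if (job->job.paused || job_is_completed(&job->job)) {
            break;
        }
        if (job->deferred_to_main_loop) {
            /* Completion was deferred to the main loop: drive the main
             * AioContext so the completion BH can run, as job_finish_sync()
             * does. */
            aio_poll(qemu_get_aio_context(), true);
            continue;
        }
        job_drain(&job->job);
    }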
> > >
> > > .
> > >
> >
>
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [Qemu-devel] question: a dead loop in qemu when do blockJobAbort and vm suspend coinstantaneously
2018-06-12 1:45 ` Fam Zheng
@ 2018-06-12 2:11 ` l00284672
0 siblings, 0 replies; 7+ messages in thread
From: l00284672 @ 2018-06-12 2:11 UTC (permalink / raw)
To: Fam Zheng; +Cc: stefanha, kwolf, qemu-devel
[-- Attachment #1: Type: text/plain, Size: 13463 bytes --]
As my previous reply said, mirror_exit() runs inside
aio_poll(qemu_get_aio_context(), true). In mirror_exit(), the bs is freed by
bdrv_unref(), so the follow-up code in bdrv_set_aio_context() dereferences a
null pointer. We should therefore take an extra reference with bdrv_ref() so
that the bs is not freed by mirror_exit().
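To make the intent concrete, here is a small self-contained sketch of the
pattern ("take an extra reference across a call that may indirectly drop the
last one"). Everything below is invented for illustration -- it is plain C,
not QEMU code; only the obj_ref()/obj_unref() pair is meant as an analogy to
bdrv_ref()/bdrv_unref():

#include <stdio.h>
#include <stdlib.h>

/* Toy refcounted object, standing in for a BlockDriverState. */
typedef struct Obj {
    int refcnt;
} Obj;

static void obj_ref(Obj *o)
{
    o->refcnt++;
}

static void obj_unref(Obj *o)
{
    if (--o->refcnt == 0) {
        printf("last reference dropped, object freed\n");
        free(o);
    }
}

/* Stands in for the deferred mirror_exit() BH: it drops the reference that
 * used to keep the object alive. */
static void completion_callback(Obj *o)
{
    obj_unref(o);
}

/* Stands in for bdrv_set_aio_context(): it indirectly runs the completion
 * callback and then keeps using the object. Without an extra reference held
 * by the caller, this would be a use-after-free. */
static void switch_context(Obj *o)
{
    completion_callback(o);
    printf("still alive, refcnt = %d\n", o->refcnt);
}

int main(void)
{
    Obj *o = calloc(1, sizeof(*o));
    o->refcnt = 1;

    obj_ref(o);         /* the extra reference (bdrv_ref in my change)   */
    switch_context(o);  /* the callback can no longer free the object    */
    obj_unref(o);       /* drop the extra reference (bdrv_unref)         */
    return 0;
}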
On 2018/6/12 9:45, Fam Zheng wrote:
> On Mon, 06/11 11:31, l00284672 wrote:
>> I tried your patch; with the modification below it solves this problem.
>>
>> void blk_set_aio_context(BlockBackend *blk, AioContext *new_context)
>> {
>>     BlockDriverState *bs = blk_bs(blk);
>>     ThrottleGroupMember *tgm = &blk->public.throttle_group_member;
>>
>>     if (bs) {
>>         if (tgm->throttle_state) {
>>             bdrv_drained_begin(bs);
>>             throttle_group_detach_aio_context(tgm);
>>             throttle_group_attach_aio_context(tgm, new_context);
>>             bdrv_drained_end(bs);
>>         }
>>         bdrv_ref(bs);
>>         bdrv_set_aio_context(bs, new_context);
>>         bdrv_unref(bs);
>>     }
>> }
>>
>> I added bdrv_ref() before bdrv_set_aio_context() so that the bs is not freed
>> by mirror_exit(). Do you agree with this modification?
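(For clarity, the change relative to the unmodified function, shown
diff-style -- my own rendering based on the function quoted above, not a
patch that was actually posted to the list:)

             bdrv_drained_end(bs);
         }
+        bdrv_ref(bs);
         bdrv_set_aio_context(bs, new_context);
+        bdrv_unref(bs);
     }
 }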
> TBH I don't understand this change. @blk should have a reference to @bs here,
> no? Why is an extra reference making any difference?
>
> Fam
>
>> On 2018/6/11 11:01, l00284672 wrote:
>>> Thanks for your reply.
>>>
>>> I tried your patch, but it didn't work: qemu crashed. The crash
>>> backtrace is below:
>>>
>>> (gdb) bt
>>> #0 bdrv_detach_aio_context (bs=bs@entry=0x55a96b79ca30)
>>> #1 0x000055a9688249ae in bdrv_set_aio_context
>>> (bs=bs@entry=0x55a96b79ca30,
>>> new_context=new_context@entry=0x55a96b766920)
>>> #2 0x000055a96885f721 in blk_set_aio_context (blk=0x55a96b792820,
>>> new_context=0x55a96b766920)
>>> #3 0x000055a9685ab797 in virtio_blk_data_plane_stop (vdev=<optimized
>>> out>)
>>> #4 0x000055a9687bf705 in virtio_bus_stop_ioeventfd (bus=0x55a96cc42220)
>>> #5 0x000055a9685d9d94 in virtio_vmstate_change (opaque=0x55a96cc42290,
>>> running=<optimized out>,
>>> state=<optimized out>)
>>> #6 0x000055a96866e1a2 in vm_state_notify (running=running@entry=0,
>>> state=state@entry=RUN_STATE_PAUSED)
>>> #7 0x000055a96857b4c5 in do_vm_stop (state=RUN_STATE_PAUSED)
>>> #8 vm_stop (state=state@entry=RUN_STATE_PAUSED)
>>> #9 0x000055a96867d52b in qmp_stop (errp=errp@entry=0x7fff4e54a0d8)
>>> #10 0x000055a96867b6ab in qmp_marshal_stop (args=<optimized out>,
>>> ret=<optimized out>, errp=0x7fff4e54a
>>> #11 0x000055a9688c2267 in do_qmp_dispatch (errp=0x7fff4e54a118,
>>> request=0x55a96b7b4740)
>>> #12 qmp_dispatch (request=request@entry=0x55a96b7ae490)
>>> #13 0x000055a96857dd42 in handle_qmp_command (parser=<optimized out>,
>>> tokens=<optimized out>)
>>> #14 0x000055a9688c7534 in json_message_process_token
>>> (lexer=0x55a96b776a68, input=0x55a96b70cae0, type=
>>> x=36, y=91)
>>> #15 0x000055a9688e960b in json_lexer_feed_char
>>> (lexer=lexer@entry=0x55a96b776a68, ch=125 '}',
>>> flush=flush@entry=false)
>>> #16 0x000055a9688e96ce in json_lexer_feed (lexer=0x55a96b776a68,
>>> buffer=<optimized out>, size=<optimize
>>> #17 0x000055a9688c75f9 in json_message_parser_feed (parser=<optimized
>>> out>, buffer=<optimized out>,
>>> #18 0x000055a96857c5fb in monitor_qmp_read (opaque=<optimized out>,
>>> buf=<optimized out>, size=<optimized out>)
>>> #19 0x000055a968667596 in tcp_chr_read (chan=<optimized out>,
>>> cond=<optimized out>, opaque=0x55a96b7748
>>> #20 0x00007f8a9447899a in g_main_context_dispatch () from
>>> /lib64/libglib-2.0.so.0
>>> #21 0x000055a968828c3c in glib_pollfds_poll ()
>>> #22 os_host_main_loop_wait (timeout=<optimized out>)
>>> #23 main_loop_wait (nonblocking=<optimized out>)
>>> #24 0x000055a96854351f in main_loop () at vl.c:2095
>>> #25 main (argc=<optimized out>, argv=<optimized out>, envp=<optimized
>>> out>)
>>>
>>> (gdb) p *bs
>>> $1 = {total_sectors = 94186141054112, open_flags = 1811887680, read_only
>>> = 169, encrypted = 85, valid_k
>>> sg = false, probed = false, copy_on_read = 0, flush_queue = {entries =
>>> {sqh_first = 0x0,
>>> sqh_last = 0x55a96b79ca48}}, active_flush_req = false, flushed_gen
>>> = 68727, drv = 0x0, opaque = 0
>>> aio_context = 0x55a96b778cd0, aio_notifiers = {lh_first = 0x0},
>>> walking_aio_notifiers = false,
>>> filename = "/mnt/sdb/lzg/disk_10G.son", '\000' <repeats 4070 times>,
>>> backing_file = "\000mnt/sdb/lzg/disk_10G.raw", '\000' <repeats 4070
>>> times>,
>>> backing_format = "\000aw", '\000' <repeats 12 times>,
>>> full_open_options = 0x0,
>>> exact_filename = "/mnt/sdb/lzg/disk_10G.son", '\000' <repeats 4070
>>> times>, backing = 0x0, file = 0x0,
>>> before_write_notifiers = {notifiers = {lh_first = 0x0}}, in_flight =
>>> 0, serialising_in_flight = 0,
>>> wakeup = false, wr_highest_offset = 35188224, bl = {request_alignment
>>> = 0, max_pdiscard = 0,
>>> pdiscard_alignment = 0, max_pwrite_zeroes = 0,
>>> pwrite_zeroes_alignment = 0, opt_transfer = 0, max_t
>>> min_mem_alignment = 0, opt_mem_alignment = 0, max_iov = 0},
>>> supported_write_flags = 0,
>>> supported_zero_flags = 4, node_name = "#block349", '\000' <repeats 22
>>> times>, node_list = {
>>> tqe_next = 0x55a96b7b14f0, tqe_prev = 0x0}, bs_list = {tqe_next =
>>> 0x55a96b7ab240, tqe_prev = 0x0},
>>> monitor_list = {tqe_next = 0x0, tqe_prev = 0x0}, dirty_bitmaps =
>>> {lh_first = 0x0}, refcnt = 0,
>>> tracked_requests = {lh_first = 0x0}, op_blockers = {{lh_first = 0x0}
>>> <repeats 18 times>}, job = 0x0,
>>> inherits_from = 0x0, children = {lh_first = 0x0}, parents = {lh_first
>>> = 0x0}, options = 0x0,
>>> explicit_options = 0x0, detect_zeroes =
>>> BLOCKDEV_DETECT_ZEROES_OPTIONS_OFF, backing_blocker = 0x0,
>>> write_threshold_offset = 0, write_threshold_notifier = {notify = 0x0,
>>> node = {le_next = 0x0, le_prev
>>> io_plugged = 0, quiesce_counter = 0, write_gen = 68727}
>>> (gdb) p *bs->drv
>>> Cannot access memory at address 0x0
>>>
>>> From the backtrace we can see that qemu hit a null pointer dereference in
>>> bdrv_detach_aio_context(). The code is below:
>>>
>>> void bdrv_detach_aio_context(BlockDriverState *bs)
>>> {
>>>     .........
>>>
>>>     QLIST_FOREACH_SAFE(baf, &bs->aio_notifiers, list, baf_tmp) {
>>>         if (baf->deleted) {
>>>             bdrv_do_remove_aio_context_notifier(baf);
>>>         } else {
>>>             baf->detach_aio_context(baf->opaque);
>>>         }
>>>     }
>>>     /* Never mind iterating again to check for ->deleted. bdrv_close() will
>>>      * remove remaining aio notifiers if we aren't called again.
>>>      */
>>>     bs->walking_aio_notifiers = false;
>>>
>>>     if (bs->drv->bdrv_detach_aio_context) {
>>>         bs->drv->bdrv_detach_aio_context(bs);
>>>     }
>>>
>>>     ..................
>>> }
>>>
>>> With your patch, mirror_exit() runs inside
>>> aio_poll(qemu_get_aio_context(), true). In mirror_exit(), the top bs is
>>> freed by bdrv_unref(), so the follow-up code dereferences a null pointer.
>>>
>>>
>>>
>>>
>>> On 2018/6/10 15:43, Fam Zheng wrote:
>>>> On Sat, 06/09 17:10, l00284672 wrote:
>>>>> Hi, I found a dead loop in qemu when do blockJobAbort and vm suspend
>>>>> coinstantaneously.
>>>>>
>>>>> The qemu bt is below:
>>>>>
>>>>> #0 0x00007ff58b53af1f in ppoll () from /lib64/libc.so.6
>>>>> #1 0x00000000007fdbd9 in ppoll (__ss=0x0, __timeout=0x7ffcf7055390,
>>>>> __nfds=<optimized out>, __fds=<optimized out>) at
>>>>> /usr/include/bits/poll2.h:77
>>>>> #2 qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
>>>>> timeout=timeout@entry=0) at util/qemu-timer.c:334
>>>>> #3 0x00000000007ff83a in aio_poll (ctx=ctx@entry=0x269d800,
>>>>> blocking=blocking@entry=false) at util/aio-posix.c:629
>>>>> #4 0x0000000000776e91 in bdrv_drain_recurse (bs=bs@entry=0x26d9cb0) at
>>>>> block/io.c:198
>>>>> #5 0x0000000000776ef2 in bdrv_drain_recurse (bs=bs@entry=0x3665990) at
>>>>> block/io.c:215
>>>>> #6 0x00000000007774b8 in bdrv_do_drained_begin (bs=0x3665990,
>>>>> recursive=<optimized out>, parent=0x0) at block/io.c:291
>>>>> #7 0x000000000076a79e in blk_drain (blk=0x2780fc0) at
>>>>> block/block-backend.c:1586
>>>>> #8 0x000000000072d2a9 in block_job_drain (job=0x29df040) at blockjob.c:123
>>>>> #9 0x000000000072d228 in block_job_detach_aio_context (opaque=0x29df040) at
>>>>> blockjob.c:139
>>>>> #10 0x00000000007298b1 in bdrv_detach_aio_context (bs=bs@entry=0x3665990) at
>>>>> block.c:4885
>>>>> #11 0x0000000000729a46 in bdrv_set_aio_context (bs=0x3665990,
>>>>> new_context=0x268e140) at block.c:4946
>>>>> #12 0x0000000000499743 in virtio_blk_data_plane_stop (vdev=<optimized out>)
>>>>> at
>>>>> /mnt/sdb/lzg/code/shequ_code/5_29/qemu/hw/block/dataplane/virtio-blk.c:285
>>>>> #13 0x00000000006bce30 in virtio_bus_stop_ioeventfd (bus=0x3de5378) at
>>>>> hw/virtio/virtio-bus.c:246
>>>>> #14 0x00000000004c654d in virtio_vmstate_change (opaque=0x3de53f0,
>>>>> running=<optimized out>, state=<optimized out>)
>>>>> at /mnt/sdb/lzg/code/shequ_code/5_29/qemu/hw/virtio/virtio.c:2222
>>>>> #15 0x0000000000561b52 in vm_state_notify (running=running@entry=0,
>>>>> state=state@entry=RUN_STATE_PAUSED) at vl.c:1514
>>>>> #16 0x000000000045d67a in do_vm_stop (state=state@entry=RUN_STATE_PAUSED,
>>>>> send_stop=send_stop@entry=true)
>>>>> at /mnt/sdb/lzg/code/shequ_code/5_29/qemu/cpus.c:1012
>>>>> #17 0x000000000045dafd in vm_stop (state=state@entry=RUN_STATE_PAUSED) at
>>>>> /mnt/sdb/lzg/code/shequ_code/5_29/qemu/cpus.c:2035
>>>>> #18 0x000000000057301b in qmp_stop (errp=errp@entry=0x7ffcf70556f0) at
>>>>> qmp.c:106
>>>>> #19 0x000000000056bf7a in qmp_marshal_stop (args=<optimized out>,
>>>>> ret=<optimized out>, errp=0x7ffcf7055738) at qapi/qapi-commands-misc.c:784
>>>>> #20 0x00000000007f2d27 in do_qmp_dispatch (errp=0x7ffcf7055730,
>>>>> request=0x3e121e0, cmds=<optimized out>) at qapi/qmp-dispatch.c:119
>>>>> #21 qmp_dispatch (cmds=<optimized out>, request=request@entry=0x26f2800) at
>>>>> qapi/qmp-dispatch.c:168
>>>>> #22 0x00000000004655be in monitor_qmp_dispatch_one
>>>>> (req_obj=req_obj@entry=0x39abff0) at
>>>>> /mnt/sdb/lzg/code/shequ_code/5_29/qemu/monitor.c:4088
>>>>> #23 0x0000000000465894 in monitor_qmp_bh_dispatcher (data=<optimized out>)
>>>>> at /mnt/sdb/lzg/code/shequ_code/5_29/qemu/monitor.c:4146
>>>>> #24 0x00000000007fc571 in aio_bh_call (bh=0x26de7e0) at util/async.c:90
>>>>> #25 aio_bh_poll (ctx=ctx@entry=0x268dd50) at util/async.c:118
>>>>> #26 0x00000000007ff6f0 in aio_dispatch (ctx=0x268dd50) at
>>>>> util/aio-posix.c:436
>>>>> #27 0x00000000007fc44e in aio_ctx_dispatch (source=<optimized out>,
>>>>> callback=<optimized out>, user_data=<optimized out>) at util/async.c:261
>>>>> #28 0x00007ff58bc7c99a in g_main_context_dispatch () from
>>>>> /lib64/libglib-2.0.so.0
>>>>> #29 0x00000000007fea3a in glib_pollfds_poll () at util/main-loop.c:215
>>>>> #30 os_host_main_loop_wait (timeout=<optimized out>) at util/main-loop.c:238
>>>>> #31 main_loop_wait (nonblocking=nonblocking@entry=0) at util/main-loop.c:497
>>>>> #32 0x0000000000561cad in main_loop () at vl.c:1848
>>>>> #33 0x000000000041995c in main (argc=<optimized out>, argv=<optimized out>,
>>>>> envp=<optimized out>) at vl.c:4605
>>>>>
>>>>> The disk is a virtio-blk dataplane disk with a mirror job running. The dead
>>>>> loop is here:
>>>>>
>>>>> static void block_job_detach_aio_context(void *opaque)
>>>>> {
>>>>>     BlockJob *job = opaque;
>>>>>
>>>>>     /* In case the job terminates during aio_poll()... */
>>>>>     job_ref(&job->job);
>>>>>
>>>>>     job_pause(&job->job);
>>>>>
>>>>>     while (!job->job.paused && !job_is_completed(&job->job)) {
>>>>>         job_drain(&job->job);
>>>>>     }
>>>>>
>>>>>     job->job.aio_context = NULL;
>>>>>     job_unref(&job->job);
>>>>> }
>>>>>
>>>>> The job has been deferred to the main loop now, but job_drain() only
>>>>> processes the AIO context of the bs, which has no more work to do,
>>>>>
>>>>> while the main-loop BH that is scheduled to set the job->completed flag
>>>>> is never processed.
>>>> In that case, main loop's AioContext should be driven like in job_finish_sync().
>>>> Could you try this patch?
>>>>
>>>> diff --git a/blockjob.c b/blockjob.c
>>>> index 0306533a2e..72aa82ac2d 100644
>>>> --- a/blockjob.c
>>>> +++ b/blockjob.c
>>>> @@ -135,7 +135,15 @@ static void block_job_detach_aio_context(void *opaque)
>>>>
>>>>      job_pause(&job->job);
>>>>
>>>> -    while (!job->job.paused && !job_is_completed(&job->job)) {
>>>> +
>>>> +    while (true) {
>>>> +        if (job->job.paused || job_is_completed(&job->job)) {
>>>> +            break;
>>>> +        }
>>>> +        if (job->deferred_to_main_loop) {
>>>> +            aio_poll(qemu_get_aio_context(), true);
>>>> +            continue;
>>>> +        }
>>>>          job_drain(&job->job);
>>>>      }
>>>>
>>>> .
>>>>
> .
>
^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2018-06-12 2:13 UTC | newest]
Thread overview: 7+ messages
2018-06-09 9:10 [Qemu-devel] question: a dead loop in qemu when do blockJobAbort and vm suspend coinstantaneously l00284672
2018-06-10 7:43 ` Fam Zheng
2018-06-11 3:01 ` l00284672
2018-06-11 3:31 ` l00284672
2018-06-12 1:19 ` l00284672
2018-06-12 1:45 ` Fam Zheng
2018-06-12 2:11 ` l00284672