From: Fam Zheng <fam@euphon.net>
To: Zhenyu Ye <yezhenyu2@huawei.com>
Cc: Kevin Wolf <kwolf@redhat.com>,
	qemu-block@nongnu.org, qemu-devel@nongnu.org,
	xiexiangyou@huawei.com, armbru@redhat.com,
	Stefan Hajnoczi <stefanha@redhat.com>,
	pbonzini@redhat.com, mreitz@redhat.com
Subject: Re: [PATCH v1 0/2] Add timeout mechanism to qmp actions
Date: Thu, 17 Sep 2020 10:10:56 +0000
Message-ID: <20200917101056.GA2423524@dev>
In-Reply-To: <7ad220bd-7ee5-1f66-b2e5-7dc57d72eb2e@huawei.com>

On 2020-09-17 15:36, Zhenyu Ye wrote:
> Hi Stefan,
> 
> On 2020/9/14 21:27, Stefan Hajnoczi wrote:
> >>
> >> Theoretically, everything running in an iothread is asynchronous. However,
> >> some 'asynchronous' actions are not entirely non-blocking; io_submit(),
> >> for example, can block when the iodepth is too large and the I/O pressure
> >> is too high.  Issuing QMP actions such as 'info block' at that point may
> >> cause a VM soft lockup.  This series makes these QMP actions safer.
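> >>
> >> For illustration, the helper introduced in patch 1/2 could look roughly
> >> like the sketch below (the names, return convention, and backoff are
> >> assumptions for illustration, not the actual patch code):
> >>
> >> 	/* Hypothetical sketch, not the actual patch: try to take the
> >> 	 * AioContext's recursive mutex, but give up after timeout_ms so
> >> 	 * a QMP command can fail cleanly instead of hanging the main
> >> 	 * loop behind a blocked iothread. */
> >> 	bool aio_context_acquire_timeout(AioContext *ctx, int64_t timeout_ms)
> >> 	{
> >> 	    int64_t deadline = qemu_clock_get_ms(QEMU_CLOCK_REALTIME) + timeout_ms;
> >>
> >> 	    while (qemu_rec_mutex_trylock(&ctx->lock) != 0) {
> >> 	        if (qemu_clock_get_ms(QEMU_CLOCK_REALTIME) >= deadline) {
> >> 	            return false;   /* caller reports a timeout error */
> >> 	        }
> >> 	        g_usleep(1000);     /* back off for 1ms, then retry */
> >> 	    }
> >> 	    return true;
> >> 	}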
> >>
> >> I reproduced the problem as follows:
> >> 1. Create a VM with 4 disks, using an iothread.
> >> 2. Put load on the host CPU; in my scenario, the CPU usage exceeds 95%.
> >> 3. Stress all 4 disks in the VM at the same time, using fio with these
> >>    parameters:
> >>
> >> 	 fio -rw=randrw -bs=1M -size=1G -iodepth=512 -ioengine=libaio -numjobs=4
> >>
> >> 4. Perform block query actions, for example via virsh:
> >>
> >> 	virsh qemu-monitor-command [vm name] --hmp info block
> >>
> >> Then the VM will soft-lock up; the call trace is:
> >>
> >> [  192.311393] watchdog: BUG: soft lockup - CPU#1 stuck for 42s! [kworker/1:1:33]
> > 
> > Hi,
> > Sorry I haven't had time to investigate this myself yet.
> > 
> > Do you also have a QEMU backtrace when the hang occurs?
> > 
> > Let's find out if QEMU is stuck in the io_submit(2) syscall or whether
> > there's an issue in QEMU itself that causes the softlockup (for example,
> > aio_poll() with the global mutex held).
> 
> Sorry for the delay.
> 
> I traced io_submit() inside the guest. The maximum execution time exceeds 16s:
> 
> 	guest# perf trace -p `pidof -s fio` --duration 1
> 	    14.808 (3843.082 ms): io_submit(ctx_id: 281472924459008, nr: 1, iocbpp: 0x19a87160          ) = 1
> 	  3861.336 (11470.608 ms): io_submit(ctx_id: 281472924459008, nr: 1, iocbpp: 0x19a87160          ) = 1
> 	 15341.998 (479.283 ms): io_submit(ctx_id: 281472924459008, nr: 1, iocbpp: 0x19a87160          ) = 1
> 	 15831.380 (3704.997 ms): io_submit(ctx_id: 281472924459008, nr: 1, iocbpp: 0x19a87160          ) = 1
> 	 19547.869 (3412.839 ms): io_submit(ctx_id: 281472924459008, nr: 1, iocbpp: 0x19a87160          ) = 1
> 	 22966.953 (1080.854 ms): io_submit(ctx_id: 281472924459008, nr: 1, iocbpp: 0x19a87160          ) = 1
> 	 24062.689 (6939.813 ms): io_submit(ctx_id: 281472924459008, nr: 1, iocbpp: 0x19a87160          ) = 1
> 	 31019.219 (532.147 ms): io_submit(ctx_id: 281472924459008, nr: 1, iocbpp: 0x19a87160          ) = 1
> 	 31556.610 (3469.920 ms): io_submit(ctx_id: 281472924459008, nr: 1, iocbpp: 0x19a87160          ) = 1
> 	 35038.885 (9007.175 ms): io_submit(ctx_id: 281472924459008, nr: 1, iocbpp: 0x19a87160          ) = 1
> 	 44053.578 (16006.405 ms): io_submit(ctx_id: 281472924459008, nr: 1, iocbpp: 0x19a87160          ) = 1
> 	 60068.944 (3068.748 ms): io_submit(ctx_id: 281472924459008, nr: 1, iocbpp: 0x19a87160          ) = 1
> 	 63138.474 (13012.499 ms): io_getevents(ctx_id: 281472924459008, min_nr: 511, nr: 511, events: 0x19a83150) = 511
> 	 76191.073 (2872.657 ms): io_submit(ctx_id: 281472924459008, nr: 1, iocbpp: 0x19a87160          ) = 1
> 	...
> 
> And on the host, the perf report for sys_enter_io_submit() is:
> 
> 	Samples: 3K of event 'syscalls:sys_enter_io_submit', Event count (approx.): 3150
> 	Children      Self  Trace output
> 	-   66.70%    66.70%  ctx_id: 0xffff9c044000, nr: 0x00000001, iocbpp: 0xffff9f7fad28
> 	0xffffae7f871c
> 	0xffffae8a27c4
> 	qemu_thread_start
> 	iothread_run
> 	aio_poll
> 	aio_dispatch_ready_handlers
> 	aio_dispatch_handler
> 	virtio_queue_host_notifier_aio_read
> 	virtio_queue_notify_aio_vq
> 	virtio_blk_data_plane_handle_output
> 	virtio_blk_handle_vq
> 	blk_io_unplug
> 	bdrv_io_unplug
> 	bdrv_io_unplug
> 	raw_aio_unplug
> 	laio_io_unplug
> 	syscall
> 
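> This matches the stack: raw_aio_unplug() ends the request plug/unplug
> window, and flushing the queued requests bottoms out in io_submit(2).
> A simplified sketch of that submission loop, paraphrased from QEMU's
> block/linux-aio.c (collect_pending_iocbs() and dequeue_submitted() are
> hypothetical helpers standing in for the real queue handling):
> 
> 	/* Simplified sketch: once the queue is unplugged, all pending
> 	 * iocbs are handed to the kernel in io_submit() calls, each of
> 	 * which can block for a long time under heavy host I/O load --
> 	 * and this runs in the iothread's aio_poll() loop, with the
> 	 * AioContext lock held. */
> 	static void ioq_submit(LinuxAioState *s)
> 	{
> 	    while (s->io_q.in_queue > 0) {
> 	        struct iocb *iocbs[MAX_EVENTS];
> 	        int len = collect_pending_iocbs(s, iocbs);   /* hypothetical */
> 
> 	        int ret = io_submit(s->ctx, len, iocbs);     /* can block here */
> 	        if (ret < 0) {
> 	            break;                    /* error handling omitted */
> 	        }
> 	        dequeue_submitted(s, ret);                   /* hypothetical */
> 	    }
> 	}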
> 
> When the hang occurs, QEMU is blocked at:
> 
> 	#0  0x0000ffff95762b64 in ?? () from target:/usr/lib64/libpthread.so.0
> 	#1  0x0000ffff9575bd88 in pthread_mutex_lock () from target:/usr/lib64/libpthread.so.0
> 	#2  0x0000aaaabb1f5948 in qemu_mutex_lock_impl (mutex=0xaaaacc8e1860,
> 	    file=0xaaaabb4e1bd0 "/Images/eillon/CODE/5-opensource/qemu/util/async.c", line=605)
> 	#3  0x0000aaaabb20acd4 in aio_context_acquire (ctx=0xaaaacc8e1800)
> 	#4  0x0000aaaabb105e90 in bdrv_query_image_info (bs=0xaaaacc934620,
> 	    p_info=0xaaaaccc41e18, errp=0xffffca669118)
> 	#5  0x0000aaaabb105968 in bdrv_block_device_info (blk=0xaaaacdca19f0, bs=0xaaaacc934620,
> 	    flat=false, errp=0xffffca6692b8)
> 	#6  0x0000aaaabb1063dc in bdrv_query_info (blk=0xaaaacdca19f0, p_info=0xaaaacd29c9a8,
> 	    errp=0xffffca6692b8)
> 	#7  0x0000aaaabb106c14 in qmp_query_block (errp=0x0)
> 	#8  0x0000aaaabacb8e6c in hmp_info_block (mon=0xffffca6693d0, qdict=0xaaaacd089790)
> 	#9  0x0000aaaabb0068f0 in handle_hmp_command (mon=0xffffca6693d0,
> 	    cmdline=0xaaaacc9013ca "")
> 	#10 0x0000aaaabaab5034 in qmp_human_monitor_command (
> 	    command_line=0xaaaacc9013c0 "info block", has_cpu_index=false, cpu_index=0,
> 	    errp=0xffffca6694d8)
> 	#11 0x0000aaaabb03c720 in qmp_marshal_human_monitor_command (args=0xffff70008070,
> 	    ret=0xffffca669570, errp=0xffffca669568) at qapi/qapi-commands-misc.c:653
> 	#12 0x0000aaaabb18fdbc in qmp_dispatch (cmds=0xaaaabbb197c0 <qmp_commands>,
> 	    request=0xffff70005cc0, allow_oob=false)
> 	#13 0x0000aaaabb003074 in monitor_qmp_dispatch (mon=0xaaaacc962320, req=0xffff70005cc0)
> 	#14 0x0000aaaabb003488 in monitor_qmp_bh_dispatcher (data=0x0)
> 	#15 0x0000aaaabb209d44 in aio_bh_call (bh=0xaaaacc5f0810)
> 	#16 0x0000aaaabb209e50 in aio_bh_poll (ctx=0xaaaacc5ef320)
> 	#17 0x0000aaaabb1f1494 in aio_dispatch (ctx=0xaaaacc5ef320)
> 	#18 0x0000aaaabb20a330 in aio_ctx_dispatch (source=0xaaaacc5ef320, callback=0x0,
> 	    user_data=0x0)
> 	#19 0x0000ffff95f00a7c in g_main_context_dispatch ()
> 	   from target:/usr/lib64/libglib-2.0.so.0
> 	#20 0x0000aaaabb2128e8 in glib_pollfds_poll ()
> 	#21 0x0000aaaabb212970 in os_host_main_loop_wait (timeout=64805420)
> 	#22 0x0000aaaabb212a90 in main_loop_wait (nonblocking=0)
> 	#23 0x0000aaaabaad63d0 in qemu_main_loop ()
> 	#24 0x0000aaaabb188b54 in main (argc=117, argv=0xffffca669a68, envp=0xffffca669e18)
> 
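> Frame #2 shows the main loop waiting on the AioContext lock.
> aio_context_acquire() is (paraphrased) nothing more than taking the
> context's recursive mutex, with no timeout, so the monitor blocks for as
> long as the iothread holds the lock inside aio_poll():
> 
> 	/* util/async.c, paraphrased: an unconditional lock, so the
> 	 * caller sleeps on the futex until the iothread releases
> 	 * ctx->lock. */
> 	void aio_context_acquire(AioContext *ctx)
> 	{
> 	    qemu_rec_mutex_lock(&ctx->lock);
> 	}
> 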
> And the kernel-side backtrace of the QEMU process is:
> 	[<0>] __switch_to+0xdc/0x140
> 	[<0>] futex_wait_queue_me+0xd4/0x158
> 	[<0>] futex_wait+0xf4/0x230
> 	[<0>] do_futex+0x46c/0x608
> 	[<0>] __arm64_sys_futex+0x13c/0x188
> 	[<0>] el0_svc_common+0x80/0x1b8
> 	[<0>] el0_svc_handler+0x38/0x78
> 	[<0>] el0_svc+0x10/0x14
> 	[<0>] 0xffffffffffffffff

Hi Zhenyu,

Did you by any chance capture the kernel stack where io_submit is stuck?

Thanks,
Fam


Thread overview: 28+ messages
2020-08-10 14:52 [PATCH v1 0/2] Add timeout mechanism to qmp actions Zhenyu Ye
2020-08-10 14:52 ` [PATCH v1 1/2] util: introduce aio_context_acquire_timeout Zhenyu Ye
2020-08-10 14:52 ` [PATCH v1 2/2] qmp: use aio_context_acquire_timeout replace aio_context_acquire Zhenyu Ye
2020-08-10 15:38 ` [PATCH v1 0/2] Add timeout mechanism to qmp actions Kevin Wolf
2020-08-11 13:54   ` Zhenyu Ye
2020-08-21 12:52     ` Stefan Hajnoczi
2020-09-14 13:27     ` Stefan Hajnoczi
2020-09-17  7:36       ` Zhenyu Ye
2020-09-17 10:10         ` Fam Zheng [this message]
2020-09-17 15:44         ` Stefan Hajnoczi
2020-09-17 16:01           ` Fam Zheng
2020-09-18 11:23             ` Zhenyu Ye
2020-09-18 14:06               ` Fam Zheng
2020-09-19  2:22                 ` Zhenyu Ye
2020-09-21 11:14                   ` Fam Zheng
2020-10-13 10:00                     ` Stefan Hajnoczi
2020-10-19 12:40                       ` Zhenyu Ye
2020-10-19 13:25                         ` Paolo Bonzini
2020-10-20  1:34                           ` Zhenyu Ye
2020-10-22 16:29                             ` Fam Zheng
2020-12-08 13:10                               ` Stefan Hajnoczi
2020-12-08 13:47                                 ` Glauber Costa
2020-12-14 16:33                                   ` Stefan Hajnoczi
2020-12-21 11:30                                     ` Zhenyu Ye
2020-09-14 14:42     ` Daniel P. Berrangé
2020-09-17  8:12       ` Zhenyu Ye
2020-08-12 13:51 ` Stefan Hajnoczi
2020-08-13  1:51   ` Zhenyu Ye
