Re: [Qemu-devel] [Bug?] BQL about live migration

All of lore.kernel.org
 help / color / mirror / Atom feed

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
To: "Gonglei (Arei)" <arei.gonglei@huawei.com>, pbonzini@redhat.com
Cc: "quintela@redhat.com" <quintela@redhat.com>,
	"qemu-devel@nongnu.org" <qemu-devel@nongnu.org>,
	yanghongyang <yanghongyang@huawei.com>,
	Huangzhichao <huangzhichao@huawei.com>
Subject: Re: [Qemu-devel] [Bug?] BQL about live migration
Date: Fri, 3 Mar 2017 12:00:55 +0000	[thread overview]
Message-ID: <20170303120054.GB2439@work-vm> (raw)
In-Reply-To: <33183CC9F5247A488A2544077AF19020DA1D01C2@DGGEMA505-MBX.china.huawei.com>

* Gonglei (Arei) (arei.gonglei@huawei.com) wrote:
> Hello Juan & Dave,

cc'ing in pbonzini since it's magic involving cpu_synrhonize_all_states()

> We hit a bug in our test:
> Network error occurs when migrating a guest, libvirt then rollback the
> migration, causes qemu coredump
> qemu log:
> 2017-03-01T12:54:33.904949+08:00|info|qemu[17672]|[33614]|monitor_qapi_event_emit[479]|: {"timestamp": {"seconds": 1488344073, "microseconds": 904914}, "event": "STOP"}
> 2017-03-01T12:54:37.522500+08:00|info|qemu[17672]|[17672]|handle_qmp_command[3930]|: qmp_cmd_name: migrate_cancel
> 2017-03-01T12:54:37.522607+08:00|info|qemu[17672]|[17672]|monitor_qapi_event_emit[479]|: {"timestamp": {"seconds": 1488344077, "microseconds": 522556}, "event": "MIGRATION", "data": {"status": "cancelling"}}
> 2017-03-01T12:54:37.524671+08:00|info|qemu[17672]|[17672]|handle_qmp_command[3930]|: qmp_cmd_name: cont
> 2017-03-01T12:54:37.524733+08:00|info|qemu[17672]|[17672]|virtio_set_status[725]|: virtio-balloon device status is 7 that means DRIVER OK
> 2017-03-01T12:54:37.525434+08:00|info|qemu[17672]|[17672]|virtio_set_status[725]|: virtio-net device status is 7 that means DRIVER OK
> 2017-03-01T12:54:37.525484+08:00|info|qemu[17672]|[17672]|virtio_set_status[725]|: virtio-blk device status is 7 that means DRIVER OK
> 2017-03-01T12:54:37.525562+08:00|info|qemu[17672]|[17672]|virtio_set_status[725]|: virtio-serial device status is 7 that means DRIVER OK
> 2017-03-01T12:54:37.527653+08:00|info|qemu[17672]|[17672]|vm_start[981]|: vm_state-notify:3ms
> 2017-03-01T12:54:37.528523+08:00|info|qemu[17672]|[17672]|monitor_qapi_event_emit[479]|: {"timestamp": {"seconds": 1488344077, "microseconds": 527699}, "event": "RESUME"}
> 2017-03-01T12:54:37.530680+08:00|info|qemu[17672]|[33614]|migration_bitmap_sync[720]|: this iteration cycle takes 3s, new dirtied data:0MB
> 2017-03-01T12:54:37.530909+08:00|info|qemu[17672]|[33614]|monitor_qapi_event_emit[479]|: {"timestamp": {"seconds": 1488344077, "microseconds": 530733}, "event": "MIGRATION_PASS", "data": {"pass": 3}}
> 2017-03-01T04:54:37.530997Z qemu-kvm: socket_writev_buffer: Got err=32 for (131583/18446744073709551615)
> qemu-kvm: /home/abuild/rpmbuild/BUILD/qemu-kvm-2.6.0/hw/net/virtio_net.c:1519: virtio_net_save: Assertion `!n->vhost_started' failed.
> 2017-03-01 12:54:43.028: shutting down
> 
> From qemu log, qemu received and processed migrate_cancel/cont qmp commands
> after guest been stopped and entered the last round of migration. Then
> migration thread try to save device state when guest is running(started by
> cont command), causes assert and coredump.
> This is because in last iter, we call cpu_synchronize_all_states() to
> synchronize vcpu states, this call will release qemu_global_mutex and wait
> for do_kvm_cpu_synchronize_state() to be executed on target vcpu:
> (gdb) bt
> #0  0x00007f763d1046d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
> #1  0x00007f7643e51d7f in qemu_cond_wait (cond=0x7f764445eca0 <qemu_work_cond>, mutex=0x7f764445eba0 <qemu_global_mutex>) at util/qemu-thread-posix.c:132
> #2  0x00007f7643a2e154 in run_on_cpu (cpu=0x7f7644e06d80, func=0x7f7643a46413 <do_kvm_cpu_synchronize_state>, data=0x7f7644e06d80) at /mnt/public/yanghy/qemu-kvm/cpus.c:995
> #3  0x00007f7643a46487 in kvm_cpu_synchronize_state (cpu=0x7f7644e06d80) at /mnt/public/yanghy/qemu-kvm/kvm-all.c:1805
> #4  0x00007f7643a2c700 in cpu_synchronize_state (cpu=0x7f7644e06d80) at /mnt/public/yanghy/qemu-kvm/include/sysemu/kvm.h:457
> #5  0x00007f7643a2db0c in cpu_synchronize_all_states () at /mnt/public/yanghy/qemu-kvm/cpus.c:766
> #6  0x00007f7643a67b5b in qemu_savevm_state_complete_precopy (f=0x7f76462f2d30, iterable_only=false) at /mnt/public/yanghy/qemu-kvm/migration/savevm.c:1051
> #7  0x00007f7643d121e9 in migration_completion (s=0x7f76443e78c0 <current_migration.37571>, current_active_state=4, old_vm_running=0x7f74343fda00, start_time=0x7f74343fda08) at migration/migration.c:1753
> #8  0x00007f7643d126c5 in migration_thread (opaque=0x7f76443e78c0 <current_migration.37571>) at migration/migration.c:1922
> #9  0x00007f763d100dc5 in start_thread () from /lib64/libpthread.so.0
> #10 0x00007f763ce2e71d in clone () from /lib64/libc.so.6
> (gdb) p iothread_locked
> $1 = true
> 
> and then, qemu main thread been executed, it won't block because migration
> thread released the qemu_global_mutex:
> (gdb) thr 1
> [Switching to thread 1 (Thread 0x7fe298e08bc0 (LWP 30767))]
> #0  os_host_main_loop_wait (timeout=931565) at main-loop.c:270
> 270                 QEMU_LOG(LOG_INFO,"***** after qemu_pool_ns: timeout %d\n", timeout);
> (gdb) p iothread_locked
> $2 = true
> (gdb) l 268
> 263
> 264         ret = qemu_poll_ns((GPollFD *)gpollfds->data, gpollfds->len, timeout);
> 265
> 266
> 267         if (timeout) {
> 268             qemu_mutex_lock_iothread();
> 269             if (runstate_check(RUN_STATE_FINISH_MIGRATE)) {
> 270                 QEMU_LOG(LOG_INFO,"***** after qemu_pool_ns: timeout %d\n", timeout);
> 271             }
> 272         }
> (gdb)
> 
> So, although we've hold iothread_lock in stop&copy phase of migration, we
> can't guarantee the iothread been locked all through the stop & copy phase,
> any thoughts on how to solve this problem?

Ouch that's pretty nasty; I remember Paolo explaining to me a while ago that
their were times when run_on_cpu would have to drop the BQL and I worried about it,
but this is the 1st time I've seen an error due to it.

Do you know what the migration state was at that point? Was it MIGRATION_STATUS_CANCELLING?
I'm thinking perhaps we should stop 'cont' from continuing while migration is in
MIGRATION_STATUS_CANCELLING.  Do we send an event when we hit CANCELLED - so that
perhaps libvirt could avoid sending the 'cont' until then?

Dave


> 
> Thanks,
> -Gonglei
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

next prev parent reply	other threads:[~2017-03-03 12:01 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-03-03  9:29 [Qemu-devel] [Bug?] BQL about live migration Gonglei (Arei)
2017-03-03 10:42 ` Fam Zheng
2017-03-06  2:07   ` yanghongyang
2017-03-03 12:00 ` Dr. David Alan Gilbert [this message]
2017-03-03 12:48   ` Paolo Bonzini
2017-03-03 13:11     ` Dr. David Alan Gilbert
2017-03-03 13:14       ` Paolo Bonzini
2017-03-03 13:26         ` Dr. David Alan Gilbert
2017-03-03 13:33           ` Paolo Bonzini
2017-03-03 14:15             ` Yang Hongyang
2017-03-03 15:03             ` Dr. David Alan Gilbert
2017-03-03 13:57           ` Yang Hongyang

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20170303120054.GB2439@work-vm \
    --to=dgilbert@redhat.com \
    --cc=arei.gonglei@huawei.com \
    --cc=huangzhichao@huawei.com \
    --cc=pbonzini@redhat.com \
    --cc=qemu-devel@nongnu.org \
    --cc=quintela@redhat.com \
    --cc=yanghongyang@huawei.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.