From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:60700)
    by lists.gnu.org with esmtp (Exim 4.71) (envelope-from )
    id 1XUFXD-0007G2-Vo for qemu-devel@nongnu.org;
    Wed, 17 Sep 2014 09:44:47 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
    (envelope-from ) id 1XUFX6-0002yQ-GI for qemu-devel@nongnu.org;
    Wed, 17 Sep 2014 09:44:39 -0400
Received: from mail-pd0-f180.google.com ([209.85.192.180]:65405)
    by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from )
    id 1XUFX6-0002y4-Ah for qemu-devel@nongnu.org;
    Wed, 17 Sep 2014 09:44:32 -0400
Received: by mail-pd0-f180.google.com with SMTP id ft15so2119434pdb.25 for ;
    Wed, 17 Sep 2014 06:44:26 -0700 (PDT)
Message-ID: <5419902B.1030309@ozlabs.ru>
Date: Wed, 17 Sep 2014 23:44:11 +1000
From: Alexey Kardashevskiy
MIME-Version: 1.0
References: <5416C46D.7040105@ozlabs.ru> <541826CA.7050607@ozlabs.ru>
    <541828BF.8090301@redhat.com>
    <20140917090615.GB10699@stefanha-thinkpad.redhat.com>
    <54195395.9010201@redhat.com>
In-Reply-To: <54195395.9010201@redhat.com>
Content-Type: text/plain; charset=koi8-r
Content-Transfer-Encoding: 7bit
Subject: Re: [Qemu-devel] migration: qemu-coroutine-lock.c:141:
    qemu_co_mutex_unlock: Assertion `mutex->locked == 1' failed
To: Paolo Bonzini, Stefan Hajnoczi
Cc: Kevin Wolf, "qemu-devel@nongnu.org", Max Reitz, Stefan Hajnoczi,
    "Dr. David Alan Gilbert"

On 09/17/2014 07:25 PM, Paolo Bonzini wrote:
> On 17/09/2014 11:06, Stefan Hajnoczi wrote:
>> I think the fundamental problem here is that the mirror block job
>> on the source host does not synchronize with live migration.
>
>> Remember the mirror block job iterates on the dirty bitmap
>> whenever it feels like.
>
>> There is no guarantee that the mirror block job has quiesced before
>> migration handover takes place, right?
>
> Libvirt does that.
> Migration is started only once storage mirroring
> is out of the bulk phase, and the handover looks like:
>
> 1) migration completes
>
> 2) because the source VM is stopped, the disk has quiesced on the source
>
> 3) libvirt sends block-job-complete
>
> 4) libvirt receives BLOCK_JOB_COMPLETED. The disk has now quiesced on
> the destination as well.
>
> 5) the VM is started on the destination
>
> 6) the NBD server is stopped on the destination and the source VM is quit.
>
> It is actually a feature that storage migration is completed
> asynchronously with respect to RAM migration. The problem is that
> qcow2_invalidate_cache happens between (3) and (5), and it doesn't
> like the concurrent I/O received by the NBD server.

How can that happen at all? I thought there were two channels/sockets, one
for live migration and one for NBD, and that they run concurrently, no?

By the way, is there a better idea for a hack to try? Testers are pushing
me: they want to upgrade the broken setup and I am blocking them :)

Thanks!

--
Alexey
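
For reference, the quoted handover ordering can be written down as a small toy model. This is only a sketch, not QEMU or libvirt code: every name in it is hypothetical, and it assumes that mirror writes can still reach the destination NBD server until step (4) has finished, since that is the point where the quoted text says the destination disk has quiesced.

```python
# Toy model of the handover ordering quoted above; not QEMU or libvirt code.
# All names here are hypothetical.

STEPS = {
    1: "migration completes",
    2: "source VM stopped; disk quiesced on the source",
    3: "libvirt sends block-job-complete",
    4: "libvirt receives BLOCK_JOB_COMPLETED; disk quiesced on the destination",
    5: "VM started on the destination",
    6: "NBD server stopped on the destination; source VM quit",
}

# Assumption: mirror writes may reach the destination NBD server until
# step 4 has finished, because only then is the destination disk quiesced.
DEST_QUIESCED_AFTER_STEP = 4


def nbd_io_possible(after_step: int) -> bool:
    """True if NBD writes may still arrive once `after_step` has finished."""
    return after_step < DEST_QUIESCED_AFTER_STEP


def racy_invalidate_points(first_step: int = 3, last_step: int = 5) -> list:
    """Steps after which a cache invalidation would overlap with NBD I/O.

    The invalidation is assumed to run somewhere after `first_step` and
    before `last_step`, as the quoted text says for steps (3) and (5).
    """
    return [s for s in range(first_step, last_step) if nbd_io_possible(s)]


if __name__ == "__main__":
    for step in racy_invalidate_points():
        print(f"invalidate after step {step} ({STEPS[step]}): "
              "NBD I/O may still arrive")
```

Under that assumption, the only risky point in the (3)..(5) window is an invalidation that runs after block-job-complete has been sent but before BLOCK_JOB_COMPLETED has been received.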