From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:60700)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <aik@ozlabs.ru>) id 1XUFXD-0007G2-Vo
	for qemu-devel@nongnu.org; Wed, 17 Sep 2014 09:44:47 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <aik@ozlabs.ru>) id 1XUFX6-0002yQ-GI
	for qemu-devel@nongnu.org; Wed, 17 Sep 2014 09:44:39 -0400
Received: from mail-pd0-f180.google.com ([209.85.192.180]:65405)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <aik@ozlabs.ru>) id 1XUFX6-0002y4-Ah
	for qemu-devel@nongnu.org; Wed, 17 Sep 2014 09:44:32 -0400
Received: by mail-pd0-f180.google.com with SMTP id ft15so2119434pdb.25
	for <qemu-devel@nongnu.org>; Wed, 17 Sep 2014 06:44:26 -0700 (PDT)
Message-ID: <5419902B.1030309@ozlabs.ru>
Date: Wed, 17 Sep 2014 23:44:11 +1000
From: Alexey Kardashevskiy <aik@ozlabs.ru>
MIME-Version: 1.0
References: <5416C46D.7040105@ozlabs.ru> <541826CA.7050607@ozlabs.ru>
	<541828BF.8090301@redhat.com>
	<20140917090615.GB10699@stefanha-thinkpad.redhat.com>
	<54195395.9010201@redhat.com>
In-Reply-To: <54195395.9010201@redhat.com>
Content-Type: text/plain; charset=koi8-r
Content-Transfer-Encoding: 7bit
Subject: Re: [Qemu-devel] migration: qemu-coroutine-lock.c:141:
 qemu_co_mutex_unlock: Assertion `mutex->locked == 1' failed
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: Paolo Bonzini <pbonzini@redhat.com>, Stefan Hajnoczi <stefanha@gmail.com>
Cc: Kevin Wolf <kwolf@redhat.com>, "qemu-devel@nongnu.org" <qemu-devel@nongnu.org>, Max Reitz <mreitz@redhat.com>, Stefan Hajnoczi <stefanha@redhat.com>, "Dr. David Alan Gilbert" <dgilbert@redhat.com>

On 09/17/2014 07:25 PM, Paolo Bonzini wrote:
> Il 17/09/2014 11:06, Stefan Hajnoczi ha scritto:
>> I think the fundamental problem here is that the mirror block job 
>> on the source host does not synchronize with live migration.
> 
>> Remember the mirror block job iterates on the dirty bitmap
>> whenever it feels like.
> 
>> There is no guarantee that the mirror block job has quiesced before
>> migration handover takes place, right?
> 
> Libvirt does that.  Migration is started only once storage mirroring
> is out of the bulk phase, and the handover looks like:
> 
> 1) migration completes
> 
> 2) because the source VM is stopped, the disk has quiesced on the source
> 
> 3) libvirt sends block-job-complete
> 
> 4) libvirt receives BLOCK_JOB_COMPLETED.  The disk has now quiesced on
> the destination as well.
> 
> 5) the VM is started on the destination
> 
> 6) the NBD server is stopped on the destination and the source VM is quit.
> 
> It is actually a feature that storage migration is completed
> asynchronously with respect to RAM migration.  The problem is that
> qcow2_invalidate_cache happens between (3) and (5), and it doesn't
> like the concurrent I/O received by the NBD server.

How can that happen at all? I thought there were two channels/sockets,
one for live migration and one for NBD, and that they run concurrently.
Isn't that the case?

By the way, any better idea for a hack to try? The testers are pushing
me: they want to upgrade the broken setup and I am blocking them :) Thanks!


-- 
Alexey