From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:54251)
 by lists.gnu.org with esmtp (Exim 4.71) (envelope-from )
 id 1XUGz8-0001xn-3q for qemu-devel@nongnu.org;
 Wed, 17 Sep 2014 11:17:39 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
 (envelope-from ) id 1XUGz1-0004FW-Lf for qemu-devel@nongnu.org;
 Wed, 17 Sep 2014 11:17:34 -0400
Received: from mx1.redhat.com ([209.132.183.28]:11851)
 by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from )
 id 1XUGz1-0004F9-F6 for qemu-devel@nongnu.org;
 Wed, 17 Sep 2014 11:17:27 -0400
Message-ID: <5419A600.1070209@redhat.com>
Date: Wed, 17 Sep 2014 09:17:20 -0600
From: Eric Blake
MIME-Version: 1.0
References: <5416C46D.7040105@ozlabs.ru> <541826CA.7050607@ozlabs.ru>
 <541828BF.8090301@redhat.com>
 <20140917090615.GB10699@stefanha-thinkpad.redhat.com>
 <54195395.9010201@redhat.com>
In-Reply-To:
Content-Type: multipart/signed; micalg=pgp-sha256;
 protocol="application/pgp-signature";
 boundary="NBqonN38768tPTNjNumTbnVThfpqf4qAV"
Subject: Re: [Qemu-devel] migration: qemu-coroutine-lock.c:141:
 qemu_co_mutex_unlock: Assertion `mutex->locked == 1' failed
To: Stefan Hajnoczi , Paolo Bonzini
Cc: Kevin Wolf , Alexey Kardashevskiy ,
 "qemu-devel@nongnu.org" , Max Reitz ,
 "libvir-list@redhat.com" , Stefan Hajnoczi ,
 "Dr. David Alan Gilbert"

This is an OpenPGP/MIME signed message (RFC 4880 and 3156)
--NBqonN38768tPTNjNumTbnVThfpqf4qAV
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable

[adding libvirt list]

On 09/17/2014 09:04 AM, Stefan Hajnoczi wrote:
> On Wed, Sep 17, 2014 at 10:25 AM, Paolo Bonzini wrote:
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA1
>>
>> On 17/09/2014 11:06, Stefan Hajnoczi wrote:
>>> I think the fundamental problem here is that the mirror block job
>>> on the source host does not synchronize with live migration.
>>>
>>> Remember the mirror block job iterates on the dirty bitmap
>>> whenever it feels like.
>>>
>>> There is no guarantee that the mirror block job has quiesced before
>>> migration handover takes place, right?
>>
>> Libvirt does that.  Migration is started only once storage mirroring
>> is out of the bulk phase, and the handover looks like:
>>
>> 1) migration completes
>>
>> 2) because the source VM is stopped, the disk has quiesced on the source
>
> But the mirror block job might still be writing out dirty blocks.
>
>> 3) libvirt sends block-job-complete
>
> No, it sends block-job-cancel after the source QEMU's migration has
> completed.  See the qemuMigrationCancelDriveMirror() call in
> src/qemu/qemu_migration.c:qemuMigrationRun().
>
>> 4) libvirt receives BLOCK_JOB_COMPLETED.  The disk has now quiesced on
>> the destination as well.
>
> I don't see where this happens in the libvirt source code.  Libvirt
> doesn't care about block job events for drive-mirror during migration.
>
> And that's why there could still be I/O going on (since
> block-job-cancel is asynchronous).
>
>> 5) the VM is started on the destination
>>
>> 6) the NBD server is stopped on the destination and the source VM is quit.
>>
>> It is actually a feature that storage migration is completed
>> asynchronously with respect to RAM migration.  The problem is that
>> qcow2_invalidate_cache happens between (3) and (5), and it doesn't
>> like the concurrent I/O received by the NBD server.
>
> I agree that qcow2_invalidate_cache() (and any other invalidate cache
> implementations) need to allow concurrent I/O requests.
>
> Either I'm misreading the libvirt code or libvirt is not actually
> ensuring that the block job on the source has cancelled/completed
> before the guest is resumed on the destination.  So I think there is
> still a bug, maybe Eric can verify this?
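
To make the ordering concern quoted above concrete, here is a minimal
Python sketch (a simulation, not libvirt or QEMU code) of a handover
that waits for the source block job to be gone before resuming the
destination.  FakeSourceQemu, wait_for_job_gone() and handover() are
hypothetical names invented for illustration; only the command name
block-job-cancel and the event names BLOCK_JOB_CANCELLED and
BLOCK_JOB_COMPLETED come from QMP, and accepting either event as "job
gone" is an assumption of this sketch.

```python
import queue
import threading
import time


class FakeSourceQemu:
    """Hypothetical stand-in for the source QEMU's monitor connection."""

    def __init__(self):
        self.events = queue.Queue()
        self.job_active = True

    def block_job_cancel(self, device):
        # Like the real command, this returns immediately; the job
        # only ends later, when the event is delivered.
        def finish():
            time.sleep(0.05)  # job may still be writing dirty blocks
            self.job_active = False
            self.events.put({"event": "BLOCK_JOB_CANCELLED",
                             "data": {"device": device}})

        threading.Thread(target=finish, daemon=True).start()


def wait_for_job_gone(src, device, timeout=5.0):
    """Block until the job on `device` reports it has ended."""
    done = ("BLOCK_JOB_CANCELLED", "BLOCK_JOB_COMPLETED")
    deadline = time.monotonic() + timeout
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise TimeoutError("block job still active on " + device)
        try:
            ev = src.events.get(timeout=remaining)
        except queue.Empty:
            raise TimeoutError("block job still active on " + device)
        if ev["event"] in done and ev["data"]["device"] == device:
            return ev


def handover(src, device, resume_destination):
    """Cancel the mirror job, wait for it to end, then resume."""
    src.block_job_cancel(device)
    wait_for_job_gone(src, device)   # the synchronization point
    resume_destination()
```

Without the wait_for_job_gone() call, resume_destination() could run
while the job is still issuing writes, which is the window described
in the quoted text.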

You may indeed be correct that libvirt is not waiting long enough for
the block job to be gone on the source before resuming on the
destination.  I didn't write that particular code, so I'm cc'ing the
libvirt list, but I can try and take a look into it, since it's related
to code I've recently touched in getting libvirt to support active layer
block commit.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


--NBqonN38768tPTNjNumTbnVThfpqf4qAV
Content-Type: application/pgp-signature; name="signature.asc"
Content-Description: OpenPGP digital signature
Content-Disposition: attachment; filename="signature.asc"

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
Comment: Public key at http://people.redhat.com/eblake/eblake.gpg

iQEcBAEBCAAGBQJUGaYAAAoJEKeha0olJ0Nqn/sH/ivI6/Yosxr8yIsF1N06Ruip
vKd4/tRTtw7I6ghyFU/yyOtYLJDo+xw51wWDXdDEvTKUWMsmjfdKePg2wASQySwF
ubWxRkIPYsKyVde00X2zVPYU+xCSXK4vd7QEUARuYTtsTGWPrXA+KNgLQtdjuUZZ
ur6fNG3ThitbQnWTsDDU874o0I7ko2Vy2o0H8f073ryQMKFSfLSK+AnIsTXwIefm
SITLQydlD3PFDmTA6/g3GraOoqzRLvA+fHeYQJKe3mjgwtmFBYUd/35KS9FbKPZv
g7Qca7J9XNgLYnbYwVyqUYe91gOlUgR4QyMmKNw+0nMi9gZpqpRfj4TqFHMOh1M=
=NxsK
-----END PGP SIGNATURE-----

--NBqonN38768tPTNjNumTbnVThfpqf4qAV--