From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:58221)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <stefanha@redhat.com>) id 1XSOih-0005RT-Ur
	for qemu-devel@nongnu.org; Fri, 12 Sep 2014 07:08:56 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <stefanha@redhat.com>) id 1XSOic-0001JS-UH
	for qemu-devel@nongnu.org; Fri, 12 Sep 2014 07:08:51 -0400
Received: from mx1.redhat.com ([209.132.183.28]:9531)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <stefanha@redhat.com>) id 1XSOic-0001JK-Nd
	for qemu-devel@nongnu.org; Fri, 12 Sep 2014 07:08:46 -0400
Date: Fri, 12 Sep 2014 12:07:35 +0100
From: Stefan Hajnoczi <stefanha@redhat.com>
Message-ID: <20140912110735.GG1614@stefanha-thinkpad.redhat.com>
References: <1406947532.2680.11.camel@usa> <53E0AA60.9030404@gmail.com>
	<1407376929.21497.2.camel@usa> <53E60F34.1070607@gmail.com>
	<1407587152.24027.5.camel@usa> <53E8FBBD.7050703@gmail.com>
	<53E92470.60806@linux.vnet.ibm.com> <53F07B73.60407@redhat.com>
	<54107187.8040706@gmail.com> <20140911174407.GP2353@work-vm>
MIME-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha1;
	protocol="application/pgp-signature"; boundary="YH9Qf6Fh2G5kB/85"
Content-Disposition: inline
In-Reply-To: <20140911174407.GP2353@work-vm>
Subject: Re: [Qemu-devel] Microcheckpointing: Memory-VCPU / Disk State
 consistency
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: kwolf@redhat.com, eddie.dong@intel.com, qemu-devel@nongnu.org, "Michael R. Hines" <mrhines@linux.vnet.ibm.com>, Paolo Bonzini <pbonzini@redhat.com>, Walid Nouri <walid.nouri@gmail.com>, yanghy@cn.fujitsu.com


--YH9Qf6Fh2G5kB/85
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Thu, Sep 11, 2014 at 06:44:08PM +0100, Dr. David Alan Gilbert wrote:
> (I've cc'd in Fam, Stefan, and Kevin for Block stuff, and
>               Yang and Eddie for Colo)
>
> * Walid Nouri (walid.nouri@gmail.com) wrote:
> > Hello Michael, Hello Paolo
> > I have "studied" the available documentation/information and tried
> > to get an idea of the QEMU live block operation possibilities.
> >
> > I think the MC protocol doesn't need synchronous block device
> > replication because primary and secondary VM are not synchronous. The
> > state of the primary is always ahead of the state of the secondary.
> > When the primary is in epoch(n) the secondary is in epoch(n-1).

Note that I haven't followed the microcheckpointing or COLO discussions
so I'm not aware of those designs...

> > What MC needs is a block device agnostic, controlled and asynchronous
> > approach for replicating the contents of block devices and their state
> > changes to the secondary VM while the primary VM is running.
> > Asynchronous block transfer is important to allow maximum performance
> > for the primary VM, while keeping the secondary VM updated with state
> > changes.
> >
> > The block device replication should be possible in two stages or modes.
> >
> > The first stage is the live copy of all block devices of the primary to
> > the secondary. This is necessary if the secondary doesn't have an
> > existing image which is in sync with the primary at the time MC has
> > started. This is not very convenient but as far as I know there is
> > currently no mechanism for persistent dirty bitmaps in QEMU.

I think you are trying to address the non-shared storage case where the
secondary needs to acquire the initial state of the primary.

drive-mirror copies the contents of a source disk image to a
destination.  If the guest is running while copying takes place then new
writes will also be mirrored.

drive-mirror should be sufficient for the initial phase where primary
and secondary get in sync.

Fam Zheng sent a patch series earlier this year to add dirty bitmaps for
block devices to QEMU.  It only supported in-memory bitmaps but
persistent bitmaps are fairly straightforward to implement.  I'm
interested in these patches for the incremental backup use case.
https://lists.gnu.org/archive/html/qemu-devel/2014-03/msg05250.html

I guess the reason you mention persistent bitmaps is to save time when
adding a host that previously participated and has an older version of
the disk image?

> > The second stage (mode) is the replication of block device state changes
> > (modified blocks) to keep the image on the secondary in sync with the
> > primary. The mirrored blocks must be buffered in RAM (block buffer)
> > until the complete checkpoint (RAM, vCPU, device state) can be
> > committed.
> >
> > For keeping the complete system state consistent on the secondary system
> > there must be a possibility for MC to commit/discard block device state
> > changes. In normal operation the mirrored block device state changes
> > (block buffer) are committed to disk when the complete checkpoint is
> > committed. In case of a crash of the primary system while transferring
> > a checkpoint the data in the block buffer corresponding to the failed
> > checkpoint must be discarded.

Thoughts:

Writing data safely to disk can take milliseconds.  Not sure how that
figures into your commit step, but I guess commit needs to be fast.

I/O requests happen in parallel with CPU execution, so could an I/O
request be pending across a checkpoint commit?  Live migration does not
migrate inflight requests, although it has special case code for
migrating requests that have failed at the host level and need to be
retried.  Another way of putting this is that live migration uses
bdrv_drain_all() to quiesce disks before migrating device state - I
don't think you have that luxury since bdrv_drain_all() can take a long
time and is not suitable for microcheckpointing.

Block devices have the following semantics:
1. There is no ordering between parallel in-flight I/O requests.
2. The guest sees the disk state for completed writes but it may not see
   disk state of in-flight writes (due to #1).
3. Completed writes are only guaranteed to be persistent across power
   failure if a disk cache flush was submitted and completed after the
   writes completed.
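
To make the three rules concrete, here is a toy model in Python.  It is
only a sketch and not QEMU code: the class and method names are made up
for illustration, flush is treated as instantaneous, and power failure is
modelled as the worst case where every unflushed write is lost (a real
disk may keep some of them).

```python
from dataclasses import dataclass, field


@dataclass
class DiskModel:
    """Toy model of the three rules above (hypothetical, not QEMU)."""
    durable: dict = field(default_factory=dict)   # survives power loss
    cache: dict = field(default_factory=dict)     # completed, unflushed
    inflight: dict = field(default_factory=dict)  # submitted, pending

    def submit_write(self, req_id, sector, data):
        self.inflight[req_id] = (sector, data)

    def complete_write(self, req_id):
        # Rule 1: in-flight requests may complete in any order.
        sector, data = self.inflight.pop(req_id)
        self.cache[sector] = data

    def read(self, sector):
        # Rule 2: only completed writes are certain to be visible.
        if sector in self.cache:
            return self.cache[sector]
        return self.durable.get(sector)

    def flush(self):
        # Rule 3: a flush persists the writes completed before it.
        self.durable.update(self.cache)
        self.cache.clear()

    def power_failure(self):
        # Worst case: everything that was not flushed is gone.
        self.cache.clear()
        self.inflight.clear()
```

A checkpoint commit that wants the secondary's disk to match the
checkpointed device state has to decide what to do with whatever is still
in "inflight" and "cache" at that moment.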

> > I think this can be achieved by drive-mirror and a filter block driver.
> > Another approach could be to exploit the block migration functionality
> > of live migration with a filter block driver.

block-migration.c should be avoided because it may be dropped from QEMU.
It is unloved code and has been replaced by drive-mirror.

> > The drive-mirror (and live migration) does not rely on shared storage
> > and allows live block device copy and incremental syncing.
> >
> > A block buffer can be implemented with a QEMU filter block driver. It
> > should sit at the same position as the Quorum driver in the block
> > driver hierarchy. When using the block filter approach MC will be
> > transparent and block device agnostic.
> >
> > The block buffer filter must have an interface which allows MC to
> > control the commits or discards of block device state changes. I have
> > no idea where to put such an interface to stay consistent with QEMU
> > coding style.
> >
> >
> > I'm sure there are alternative and better approaches and I'm open for
> > any ideas

You can use drive-mirror and the run-time NBD server in QEMU without
modification:

  Primary (drive-mirror)   ---writes--->   Secondary (NBD export in QEMU)
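
Roughly, the QMP commands for that setup could look like the sketch
below.  The Python only builds the JSON messages; it does not talk to
QEMU.  The host name, port and the drive name "drive0" are placeholders
I made up, and the exact arguments (including format="raw" for the NBD
target) are assumptions that should be checked against the QMP
documentation for your QEMU version.

```python
import json


def qmp(command, **arguments):
    """Build one QMP command as a JSON string."""
    return json.dumps({"execute": command, "arguments": arguments})


# Secondary: start the built-in NBD server and export the disk
# read-write.  "drive0" and port 10809 are placeholders.
secondary = [
    qmp("nbd-server-start",
        addr={"type": "inet",
              "data": {"host": "0.0.0.0", "port": "10809"}}),
    qmp("nbd-server-add", device="drive0", writable=True),
]

# Primary: mirror the local disk into the secondary's NBD export.
# mode="existing" because the target image already exists there.
primary = [
    qmp("drive-mirror",
        device="drive0",
        target="nbd:secondary.example.org:10809:exportname=drive0",
        sync="full",
        mode="existing",
        format="raw"),
]

for line in secondary + primary:
    print(line)
```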

Your block filter idea can work and must have the logic so that a commit
operation sent via the microcheckpointing protocol causes the block
filter to write buffered data to disk and flush the host disk cache.
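
As a minimal sketch of that commit/discard logic (plain Python on a file
descriptor, not a QEMU block driver; the class name and its methods are
hypothetical and stand in for whatever interface the filter would really
expose):

```python
import os


class CheckpointWriteBuffer:
    """Hypothetical stand-in for the block filter on the secondary."""

    def __init__(self, path):
        self.fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
        self.pending = []  # (offset, data) in arrival order

    def write(self, offset, data):
        # Mirrored writes are only buffered, never applied directly.
        self.pending.append((offset, bytes(data)))

    def commit(self):
        # Apply in arrival order so that later writes win on overlap.
        for offset, data in self.pending:
            view = memoryview(data)
            while len(view) > 0:
                written = os.pwrite(self.fd, view, offset)
                view = view[written:]
                offset += written
        os.fsync(self.fd)  # flush the host disk cache
        self.pending.clear()

    def discard(self):
        # The checkpoint failed: drop its writes without touching disk.
        self.pending.clear()

    def close(self):
        os.close(self.fd)
```

The microcheckpointing protocol would call commit() once the whole
checkpoint has arrived and discard() if the primary dies halfway
through.  Note that this sketch says nothing about what the image looks
like if the secondary itself loses power in the middle of commit().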

To ensure that the disk image on the secondary is always in a crash
consistent state (i.e. the state you get from power failure), the
secondary needs to know when disk cache flush requests were sent and the
write ordering.  That way, even if there is a power failure while the
secondary is committing, the disk will be in a crash consistent state.
After the secondary (or primary) is booted again, file systems or
databases will be able to fsck and resume.

(In other words, in a catastrophic failure you won't be any worse off
than with a power failure on an unprotected single machine.)

Stefan

--YH9Qf6Fh2G5kB/85
Content-Type: application/pgp-signature

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1

iQEcBAEBAgAGBQJUEtP3AAoJEJykq7OBq3PIFY4H/2eZyHIRTKKBf5XI+H+QKm8P
XL/dXrH0Wlwqx9Xlud0nUSP/tmx72qB5UDLDugX7Q0Uq9CbhO+18QZDLC6F+HxwT
5OiOmaaoYFybmTM796QVQpWGPvtv2/TEVk0v2y2VmbAicFVjAoU+C5cz0rFEDN3V
+PTfPe6OcAH5eURbLvOpW/QR7jrv8UGakg4j2rEuKa/w3tqVDivvNoqBbZbIefb0
aysq4T+kN8eWvekFyL488aBLrBCR5BZ20rEQquASRDkiQClC1pn3hP2bKlout8D7
A7yt9RhjIUnac9Oz6lHyo7u24Xpi25u6a+dfKyMZG1g2gqBVo0iex3QSEqLJzNc=
=l8Bv
-----END PGP SIGNATURE-----

--YH9Qf6Fh2G5kB/85--