From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:39541)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <stefanha@gmail.com>) id 1XWiEo-0004HC-Q0
	for qemu-devel@nongnu.org; Wed, 24 Sep 2014 04:48:00 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <stefanha@gmail.com>) id 1XWiEf-0001xR-KU
	for qemu-devel@nongnu.org; Wed, 24 Sep 2014 04:47:50 -0400
Received: from mail-we0-x22b.google.com ([2a00:1450:400c:c03::22b]:40747)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <stefanha@gmail.com>) id 1XWiEf-0001wb-9P
	for qemu-devel@nongnu.org; Wed, 24 Sep 2014 04:47:41 -0400
Received: by mail-we0-f171.google.com with SMTP id k48so5777313wev.30
	for <qemu-devel@nongnu.org>; Wed, 24 Sep 2014 01:47:35 -0700 (PDT)
Date: Wed, 24 Sep 2014 09:47:32 +0100
From: Stefan Hajnoczi <stefanha@gmail.com>
Message-ID: <20140924084732.GB21137@stefanha-thinkpad.redhat.com>
References: <1407587152.24027.5.camel@usa> <53E8FBBD.7050703@gmail.com>
	<53E92470.60806@linux.vnet.ibm.com> <53F07B73.60407@redhat.com>
	<54107187.8040706@gmail.com> <20140911174407.GP2353@work-vm>
	<20140912110735.GG1614@stefanha-thinkpad.redhat.com>
	<5419F4CC.2060903@gmail.com>
	<20140918135604.GB16227@stefanha-thinkpad.redhat.com>
	<5421A19A.9090201@gmail.com>
MIME-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha1;
	protocol="application/pgp-signature"; boundary="jho1yZJdad60DJr+"
Content-Disposition: inline
In-Reply-To: <5421A19A.9090201@gmail.com>
Subject: Re: [Qemu-devel] Microcheckpointing: Memory-VCPU / Disk State
 consistency
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: Walid Nouri <walid.nouri@gmail.com>
Cc: kwolf@redhat.com, eddie.dong@intel.com, "Dr. David Alan Gilbert" <dgilbert@redhat.com>, "Michael R. Hines" <mrhines@linux.vnet.ibm.com>, qemu-devel@nongnu.org, Stefan Hajnoczi <stefanha@redhat.com>, Paolo Bonzini <pbonzini@redhat.com>, yanghy@cn.fujitsu.com


--jho1yZJdad60DJr+
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Tue, Sep 23, 2014 at 06:36:42PM +0200, Walid Nouri wrote:
> Am 18.09.2014 15:56, schrieb Stefan Hajnoczi:
> >There is the issue of request ordering (using write cache flushes).  The
> >secondary probably needs to perform requests in the same order and
> >interleave cache flushes in the same way as the primary.  Otherwise a
> >power failure on the secondary could leave the disk in an invalid state
> >that is impossible on the primary.  So I'm just pointing out that cache
> >flush operations matter, not just read/write.
>
>
> To be honest, my thought was that drive-mirror handles all block device
> specific problems especially the cache flush requests for write ordering. So
> my naive approach was to use an existing functionality as a kind of black
> box transport mechanism and build on top of it. But that seems to be not
> possible for the subtle tricky part of the game.

I think the assumption with drive-mirror is that you throw away the
destination image if something fails.  That's the exact opposite of MC
where we want to fail over to the destination :).

> >There are fancier solutions using either a journal or snapshots that
> >provide data integrity without posing a performance bottleneck during
> >the commit phase.
> >
> >The trick is to apply write requests as they come off the wire on the
> >secondary but use a journal or snapshot mechanism to enforce commit
> >semantics.  That way the commit doesn't have to wait for writing out all
> >the data to disk.
> >
> Wouldn't that mean sending a kind of protocol information with the modified
> blocks, a barrier or something like that?
> Can you please explain a little more what you meant?

Here is one example of a mechanism like this:

QEMU has a block job called drive-backup which copies sectors that are
about to be overwritten to an external file.  Once the data has been
copied into the external file, the sectors in the original image file
can be overwritten safely.

The Secondary runs drive-backup so that writes coming from the Primary
stash rollback data into an external qcow2 file.  When the Primary
wishes to commit we drop the qcow2 rollback file since we no longer need
the ability to roll back - this is cheap and not a lot of I/O needs to
be performed for the commit operation.

If the Secondary needs to take over it can use the rollback qcow2 file
as its disk image and the guest will see the state of the disk at the
last commit point.  The sectors that were modified since commit in the
original image file are covered by the data in the rollback qcow2 file.

There are a bunch of details on making this efficient but in principle
this approach makes both commit and rollback fairly lightweight.
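
To make the rollback idea concrete, here is a toy Python model (not QEMU
code).  The class and method names are made up for illustration, sectors
are modelled as dict entries, and the "rollback" dict stands in for the
external qcow2 file:

```python
# Toy model of copy-before-write rollback on the Secondary.
# All names are hypothetical; none of this is a QEMU API.

class SecondaryDisk:
    def __init__(self, sectors):
        self.image = dict(sectors)  # active image, receives the Primary's writes
        self.rollback = {}          # old contents of sectors written since last commit

    def write(self, sector, data):
        # Stash the old contents once per checkpoint epoch, then overwrite.
        if sector not in self.rollback:
            self.rollback[sector] = self.image.get(sector)
        self.image[sector] = data

    def commit(self):
        # Checkpoint reached: drop the rollback data, no bulk I/O needed.
        self.rollback = {}

    def read_at_last_commit(self, sector):
        # Failover view: stashed data takes precedence over the active image.
        if sector in self.rollback:
            return self.rollback[sector]
        return self.image.get(sector)
```

Writes between commits only cost one extra copy per touched sector, and
both commit() and the failover read path avoid rewriting the image.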

> >The details depend on the code and I don't remember everything well
> >enough.  Anyway, my mental model is:
> >
> >1. The dirty bit is set *after* the primary has completed the write.
> >    See bdrv_aligned_pwritev().  Therefore you cannot use the dirty
> >    bitmap to query in-flight requests, instead you have to look at
> >    bs->tracked_requests.
> >
> >2. The mirror block job periodically scans the dirty bitmap (when there
> >    is no rate-limit set it does this with no artificial delays) and
> >    writes the dirty blocks.
> >
> >Given that cache flush requests probably need to be tracked too, maybe
> >you need an MC-specific block driver on the primary to monitor and control
> >I/O requests.
> >
> >But I haven't thought this through and it's non-trivial so we need to
> >break this down more.
> >
>
> As drive-mirror lacks this functionality, a way (without changing the
> drive-mirror code) might be an MC-specific mechanism on the primary. This
> mechanism must respect write ordering requests (like forced cache flush and
> Force Unit Access requests) and send corresponding information for a stream
> of blocks to the secondary.
>
> From what I have learned, I'm assuming most guest OS filesystem/block layers
> follow an ordering interface based on SCSI? As those kinds of requests
> must be flagged in an I/O request by the guest operating system, this should
> be possible. Do we have the chance to access that information in a guest
> request?
>
> If this is possible, does this information survive the journey through the
> nbd-server, or must there be another communication channel like the QEMUFile
> approach of “block-migration.c”?

There isn't much information beyond the ordering of writes and cache
flushes, even in SCSI.  But that's okay, we just need to honor the
semantics of block devices.
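
As a rough illustration of what honoring those semantics means for the
secondary, here is a toy Python model.  The names are hypothetical, and
it assumes (simplistically) that a power failure loses everything
written since the last flush; a real disk may persist any subset of the
unflushed writes:

```python
# Toy model: replay the primary's request stream in order, flushes included.
# All names are hypothetical; none of this is a QEMU API.

class CachedDisk:
    def __init__(self):
        self.stable = {}  # contents that survive a power failure
        self.cache = {}   # volatile write cache

    def write(self, sector, data):
        self.cache[sector] = data

    def flush(self):
        # A cache flush makes all earlier writes durable.
        self.stable.update(self.cache)
        self.cache.clear()

    def power_fail(self):
        # Simplification: every unflushed write is lost.
        self.cache.clear()
        return dict(self.stable)


def replay(disk, log):
    # Apply requests in exactly the order the primary issued them.
    for op in log:
        if op[0] == "write":
            disk.write(op[1], op[2])
        elif op[0] == "flush":
            disk.flush()
        else:
            raise ValueError("unknown request: %r" % (op,))
```

If the secondary dropped or reordered the flush in a stream such as
write, flush, write, a crash could leave the second write on disk
without the first, which the primary could never have exposed.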

Stefan

--jho1yZJdad60DJr+
Content-Type: application/pgp-signature

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1

iQEcBAEBAgAGBQJUIoUkAAoJEJykq7OBq3PIS5cH/1iAjArryroqglq3GEg0+7Wx
lNV0DfwFH8+3sfkeNzZrtGuxjanOsD7bTZCpa6fwPO6XIupYGXsLcdgEHzsYkMiO
xxXPTRdSjC657XFX4Y4t+mTWS2BGhdS8wcqdaMizZGGelbPiIpz6b5iSjegj3WVF
FzTMiAy96XGlOo7fSF+/8rFFBITmvJCdoQAoAtko3CUh69kkqKpn+BGyXHoUw8fX
91rMIBicmaAfuInkylx2ygQafFmFHll3oFAvkylgrca1ltBzidkOyt1pNE/WK7cG
5HpjUfe2/7Td8nJnKBwBUcxOTyBk0uUSUmiudxuO7TN6ix66oUl8lEKVKDkoh+o=
=SaFn
-----END PGP SIGNATURE-----

--jho1yZJdad60DJr+--