Message-ID: <5410FFE2.40308@linux.vnet.ibm.com>
Date: Thu, 11 Sep 2014 09:50:26 +0800
From: "Michael R. Hines"
To: Walid Nouri, Paolo Bonzini, qemu-devel@nongnu.org,
 michael@hinespot.com, hinesmr@cn.ibm.com, "Dr. David Alan Gilbert",
 Hongyang Yang, Dong Eddie, FNST-Gui Jianfeng, wency@cn.fujitsu.com
Subject: Re: [Qemu-devel] Microcheckpointing: Memory-VCPU / Disk State consistency
In-Reply-To: <54107187.8040706@gmail.com>
References: <53D8FF52.9000104@gmail.com> <1406820870.2680.3.camel@usa>
 <53DBE726.4050102@gmail.com> <1406947532.2680.11.camel@usa>
 <53E0AA60.9030404@gmail.com> <1407376929.21497.2.camel@usa>
 <53E60F34.1070607@gmail.com> <1407587152.24027.5.camel@usa>
 <53E8FBBD.7050703@gmail.com> <53E92470.60806@linux.vnet.ibm.com>
 <53F07B73.60407@redhat.com> <54107187.8040706@gmail.com>

On 09/10/2014 11:43 PM, Walid Nouri wrote:
> Hello Michael, Hello Paolo,
> I have "studied" the available documentation and tried to
> get an idea of QEMU's live block operation capabilities.
>
> I think the MC protocol doesn’t need synchronous block device
> replication because the primary and secondary VMs are not synchronous.
> The state of the primary is always ahead of the state of the secondary:
> when the primary is in epoch(n), the secondary is in epoch(n-1).
>
> What MC needs is a block-device-agnostic, controlled, asynchronous
> approach for replicating the contents of block devices and their state
> changes to the secondary VM while the primary VM is running.
> Asynchronous block transfer is important to allow maximum performance
> for the primary VM while keeping the secondary VM updated with state
> changes.
>
> The block device replication should be possible in two stages or modes.
>
> The first stage is the live copy of all block devices of the primary
> to the secondary.
> This is necessary if the secondary doesn’t have an
> existing image that is in sync with the primary at the time MC
> starts. This is not very convenient, but as far as I know there is
> currently no mechanism for a persistent dirty bitmap in QEMU.
>
> The second stage (mode) is the replication of block device state
> changes (modified blocks) to keep the image on the secondary in sync
> with the primary. The mirrored blocks must be buffered in RAM (block
> buffer) until the complete checkpoint (RAM, vCPU, device state) can be
> committed.
>
> To keep the complete system state consistent on the secondary
> system, MC must be able to commit or discard block device state
> changes. In normal operation the mirrored block device state
> changes (block buffer) are committed to disk when the complete
> checkpoint is committed. If the primary system crashes while
> transferring a checkpoint, the data in the block buffer
> corresponding to the failed checkpoint must be discarded.
>
> The storage architecture should be “shared nothing” so that no shared
> storage is required and primary/secondary can have separate block
> device images.
>
> I think this can be achieved by drive-mirror and a filter block
> driver. Another approach could be to exploit the block migration
> functionality of live migration with a filter block driver.
>
> drive-mirror (and live migration) does not rely on shared storage
> and allows live block device copy and incremental syncing.
>
> A block buffer can be implemented with a QEMU filter block driver. It
> should sit at the same position as the Quorum driver in the block
> driver hierarchy. With the block filter approach, MC will be
> transparent and block device agnostic.
>
> The block buffer filter must have an interface that allows MC to
> control the commits or discards of block device state changes. I have
> no idea where to put such an interface to conform to QEMU coding style.
>
>
> I’m sure there are alternative and better approaches, and I’m open to
> any ideas.
>
>
> Walid
>
> On 17.08.2014 11:52, Paolo Bonzini wrote:
>> On 11/08/2014 22:15, Michael R. Hines wrote:
>>> Excellent question: QEMU does have a feature called "drive-mirror"
>>> in block/mirror.c that was introduced a couple of years ago. I'm not
>>> sure what the adoption rate of the feature is, but I would start
>>> with that one.
>>
>> block/mirror.c is asynchronous, and there's no support for communicating
>> checkpoints back to the master. However, the quorum disk driver could
>> be what you need.
>>
>> There's also a series on the mailing list that lets quorum read only
>> from the primary, so that quorum can still do replication and fault
>> tolerance, but skip fault detection.
>>
>> Paolo
>>
>>> There is also a second fault tolerance implementation that works a
>>> little differently called "COLO" - you may have seen those emails
>>> on the list too, but their method does not require a disk
>>> replication solution, if I recall correctly.
>>
>

Nice description of the problem - would you like to put this
information on the MC wiki page? (Just send an email to the list that
says "request for wiki account, please" in the subject, and they will
make an account for you.)

A drive-mirror + filter driver solution sounds like a good plan
overall; of course, the devil is in the details =)

I don't know how much time you have to spend on actual code, but even
a description of what a "theoretical" interface between MC and
drive-mirror would look like would go a long way, even without code.

Your investigations would also help "drive" a solution to this problem
for the COLO team as well - I believe they need the same thing....

- Michael
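
For illustration, the commit/discard semantics of the block buffer
described in the quoted proposal can be modelled in a few lines. This
is only a sketch of the idea, written in Python rather than QEMU C, and
it makes several assumptions: the class BlockBuffer and the methods
buffer_write, commit, discard and read are hypothetical names, not
existing QEMU or drive-mirror APIs, and the "disk" is a plain
in-memory dictionary standing in for the secondary's image.

```python
class BlockBuffer:
    """Hypothetical model of the secondary-side block buffer.

    Not a QEMU API: it only shows that mirrored writes stay in RAM,
    grouped by checkpoint epoch, until MC commits or discards them.
    """

    def __init__(self, block_size=512):
        self.block_size = block_size
        self.disk = {}     # committed image: block number -> bytes
        self.pending = {}  # epoch -> {block number -> bytes}

    def buffer_write(self, epoch, block, data):
        """Hold a mirrored write in RAM under its checkpoint epoch."""
        if len(data) != self.block_size:
            raise ValueError("write must be exactly one block")
        self.pending.setdefault(epoch, {})[block] = data

    def commit(self, epoch):
        """Checkpoint fully received: apply its writes to the image."""
        writes = self.pending.pop(epoch, {})
        self.disk.update(writes)
        return len(writes)

    def discard(self, epoch):
        """Primary failed mid-checkpoint: drop that epoch's writes."""
        return len(self.pending.pop(epoch, {}))

    def read(self, block):
        """Read committed state only; unwritten blocks read as zeros."""
        return self.disk.get(block, bytes(self.block_size))


# Usage: epoch 1 completes, epoch 2 is interrupted by a primary crash.
buf = BlockBuffer(block_size=4)
buf.buffer_write(1, 0, b"AAAA")
buf.commit(1)
buf.buffer_write(2, 0, b"BBBB")
buf.buffer_write(2, 1, b"CCCC")
buf.discard(2)
assert buf.read(0) == b"AAAA"  # image still reflects epoch 1 only
```

Keying the buffer by epoch is a design choice of this sketch: it keeps
the writes of an in-flight checkpoint separate from committed state, so
a discard never has to undo anything on disk. Where such an interface
would actually live in the QEMU block layer is left open, as in the
proposal itself.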