From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <54124B41.90508@cn.fujitsu.com>
Date: Fri, 12 Sep 2014 09:24:17 +0800
From: Hongyang Yang
References: <53DBE726.4050102@gmail.com> <1406947532.2680.11.camel@usa>
 <53E0AA60.9030404@gmail.com> <1407376929.21497.2.camel@usa>
 <53E60F34.1070607@gmail.com> <1407587152.24027.5.camel@usa>
 <53E8FBBD.7050703@gmail.com> <53E92470.60806@linux.vnet.ibm.com>
 <53F07B73.60407@redhat.com> <54107187.8040706@gmail.com>
 <20140911174407.GP2353@work-vm>
In-Reply-To: <20140911174407.GP2353@work-vm>
Subject: Re: [Qemu-devel] Microcheckpointing: Memory-VCPU / Disk State consistency
To: "Dr. David Alan Gilbert", Walid Nouri
Cc: kwolf@redhat.com, eddie.dong@intel.com, qemu-devel@nongnu.org,
 "Michael R. Hines", stefanha@redhat.com, Paolo Bonzini

On 09/12/2014 01:44 AM, Dr. David Alan Gilbert wrote:
> (I've cc'd in Fam, Stefan, and Kevin for Block stuff, and
> Yang and Eddie for Colo)
>
> * Walid Nouri (walid.nouri@gmail.com) wrote:
>> Hello Michael, Hello Paolo
>> I have "studied" the available documentation/information and tried to
>> get an idea of the QEMU live block operation possibilities.
>>
>> I think the MC protocol doesn't need synchronous block device replication
>> because primary and secondary VM are not synchronous. The state of the
>> primary is always ahead of the state of the secondary. When the primary is
>> in epoch(n) the secondary is in epoch(n-1).
>>
>> What MC needs is a block device agnostic, controlled and asynchronous
>> approach for replicating the contents of block devices and their state changes
>> to the secondary VM while the primary VM is running. Asynchronous block
>> transfer is important to allow maximum performance for the primary VM, while
>> keeping the secondary VM updated with state changes.
>>
>> The block device replication should be possible in two stages or modes.
>>
>> The first stage is the live copy of all block devices of the primary to the
>> secondary. This is necessary if the secondary doesn't have an existing
>> image which is in sync with the primary at the time MC is started. This is
>> not very convenient but as far as I know there is currently no mechanism for
>> a persistent dirty bitmap in QEMU.
>>
>> The second stage (mode) is the replication of block device state changes
>> (modified blocks) to keep the image on the secondary in sync with the
>> primary. The mirrored blocks must be buffered in RAM (block buffer) until
>> the complete checkpoint (RAM, vCPU, device state) can be committed.
>>
>> For keeping the complete system state consistent on the secondary system
>> there must be a possibility for MC to commit/discard block device state
>> changes. In normal operation the mirrored block device state changes (block
>> buffer) are committed to disk when the complete checkpoint is committed. In
>> case of a crash of the primary system while transferring a checkpoint the
>> data in the block buffer corresponding to the failed checkpoint must be
>> discarded.
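
To make the commit/discard behaviour described above concrete, here is a
minimal Python sketch. It is only an illustration under stated assumptions,
not QEMU code: the class BlockBuffer, its method names, and the use of a
dict as a stand-in for the secondary's disk image are all hypothetical.

```python
class BlockBuffer:
    """Hypothetical in-RAM buffer for mirrored block writes on the secondary."""

    def __init__(self, disk):
        # 'disk' is a dict mapping sector number -> bytes; it stands in
        # for the secondary's block device image (an assumption).
        self.disk = disk
        # Buffered writes, grouped by checkpoint epoch, in arrival order.
        self.pending = {}

    def buffer_write(self, epoch, sector, data):
        """Hold a mirrored write in RAM until its checkpoint is resolved."""
        self.pending.setdefault(epoch, []).append((sector, data))

    def commit(self, epoch):
        """Checkpoint fully received: apply its writes to disk in order."""
        for sector, data in self.pending.pop(epoch, []):
            self.disk[sector] = data

    def discard(self, epoch):
        """Checkpoint failed (e.g. primary crashed): drop its writes."""
        self.pending.pop(epoch, None)


disk = {0: b"old"}
buf = BlockBuffer(disk)

# Epoch 1 arrives completely and is committed.
buf.buffer_write(1, 0, b"new")
buf.buffer_write(1, 7, b"data")
buf.commit(1)

# Epoch 2 is interrupted, so its buffered writes never reach the disk.
buf.buffer_write(2, 0, b"lost")
buf.discard(2)
```

After this runs, the disk holds only the writes from epoch 1, which is the
consistency property the paragraph above asks for.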
>
> I think for COLO there's a requirement that the secondary can do reads/writes
> in parallel with the primary, and the secondary can discard those reads/writes
> - and that doesn't happen in MC (Yang or Eddie should be able to confirm that).

Exactly, COLO needs this functionality to ensure consistency.

>
>> The storage architecture should be "shared nothing" so that no shared
>> storage is required and primary/secondary can have separate block device
>> images.
>
> MC/COLO with shared storage still needs some stuff like this; but it's subtly
> different. They still need to be able to buffer/release modifications
> to the shared storage; if any of this code can also be used in the
> shared-storage configurations it would be good.

Shared storage is more complicated; we don't support shared storage currently...

>
>> I think this can be achieved by drive-mirror and a filter block driver.
>> Another approach could be to exploit the block migration functionality of
>> live migration with a filter block driver.
>>
>> The drive-mirror (and live migration) does not rely on shared storage and
>> allows live block device copy and incremental syncing.
>>
>> A block buffer can be implemented with a QEMU filter block driver. It should
>> sit at the same position as the Quorum driver in the block driver hierarchy.
>> When using the block filter approach MC will be transparent and block device
>> agnostic.
>>
>> The block buffer filter must have an interface which allows MC to control the
>> commits or discards of block device state changes. I have no idea where to
>> put such an interface to stay in conformance with QEMU coding style.
>>
>>
>> I'm sure there are alternative and better approaches and I'm open to
>> any ideas
>>
>>
>> Walid
>>
>> On 17.08.2014 11:52, Paolo Bonzini wrote:
>>> On 11/08/2014 22:15, Michael R. Hines wrote:
>>>> Excellent question: QEMU does have a feature called "drive-mirror"
>>>> in block/mirror.c that was introduced a couple of years ago. I'm not
>>>> sure what the adoption rate of the feature is, but I would start with
>>>> that one.
>>>
>>> block/mirror.c is asynchronous, and there's no support for communicating
>>> checkpoints back to the master. However, the quorum disk driver could
>>> be what you need.
>>>
>>> There's also a series on the mailing list that lets quorum read only
>>> from the primary, so that quorum can still do replication and fault
>>> tolerance, but skip fault detection.
>>>
>>> Paolo
>>>
>>>> There is also a second fault tolerance implementation that works a
>>>> little differently called "COLO" - you may have seen those emails on
>>>> the list too, but their method does not require a disk replication
>>>> solution, if I recall correctly.
>>>
>>
>>
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> .
>

-- 
Thanks,
Yang.