From: Hongyang Yang
Date: Fri, 12 Sep 2014 09:34:39 +0800
Message-ID: <54124DAF.4090700@cn.fujitsu.com>
In-Reply-To: <5410FFE2.40308@linux.vnet.ibm.com>
Subject: Re: [Qemu-devel] Microcheckpointing: Memory-VCPU / Disk State consistency
To: "Michael R. Hines", Walid Nouri, Paolo Bonzini, qemu-devel@nongnu.org, michael@hinespot.com, hinesmr@cn.ibm.com, "Dr. David Alan Gilbert", Dong Eddie, FNST-Gui Jianfeng, wency@cn.fujitsu.com

On 09/11/2014 09:50 AM, Michael R. Hines wrote:
> On 09/10/2014 11:43 PM, Walid Nouri wrote:
>> Hello Michael, Hello Paolo
>> I have "studied" the available documentation/information and tried to get an
>> idea of the QEMU live block operation possibilities.
>>
>> I think the MC protocol doesn't need synchronous block device replication
>> because the primary and secondary VMs are not synchronous. The state of the
>> primary is always ahead of the state of the secondary. When the primary is in
>> epoch(n), the secondary is in epoch(n-1).
>>
>> What MC needs is a block-device-agnostic, controlled and asynchronous approach
>> for replicating the contents of block devices and their state changes to the
>> secondary VM while the primary VM is running. Asynchronous block transfer is
>> important to allow maximum performance for the primary VM, while keeping the
>> secondary VM updated with state changes.
>>
>> The block device replication should be possible in two stages or modes.
>>
>> The first stage is the live copy of all block devices of the primary to the
>> secondary. This is necessary if the secondary doesn't have an existing image
>> which is in sync with the primary at the time MC has started. This is not very
>> convenient, but as far as I know there is currently no mechanism for a
>> persistent dirty bitmap in QEMU.
>>
>> The second stage (mode) is the replication of block device state changes
>> (modified blocks) to keep the image on the secondary in sync with the primary.
>> The mirrored blocks must be buffered in RAM (block buffer) until the complete
>> checkpoint (RAM, vCPU, device state) can be committed.
>>
>> To keep the complete system state consistent on the secondary system, there
>> must be a way for MC to commit/discard block device state changes. In
>> normal operation the mirrored block device state changes (block buffer) are
>> committed to disk when the complete checkpoint is committed. In case of a
>> crash of the primary system while transferring a checkpoint, the data in the
>> block buffer corresponding to the failed checkpoint must be discarded.
>>
>> The storage architecture should be "shared nothing" so that no shared storage
>> is required and primary/secondary can have separate block device images.
>>
>> I think this can be achieved by drive-mirror and a filter block driver.
>> Another approach could be to exploit the block migration functionality of live
>> migration with a filter block driver.
>>
>> drive-mirror (and live migration) does not rely on shared storage and
>> allows live block device copy and incremental syncing.
>>
>> A block buffer can be implemented with a QEMU filter block driver. It should
>> sit at the same position as the Quorum driver in the block driver hierarchy.
>> With the block filter approach, MC will be transparent and block device
>> agnostic.
>>
>> The block buffer filter must have an interface which allows MC to control the
>> commits or discards of block device state changes. I have no idea where to put
>> such an interface to conform to QEMU coding style.
>>
>>
>> I'm sure there are alternative and better approaches and I'm open to any ideas
>>
>>
>> Walid
>>
>> On 17.08.2014 11:52, Paolo Bonzini wrote:
>>> On 11/08/2014 22:15, Michael R. Hines wrote:
>>>> Excellent question: QEMU does have a feature called "drive-mirror"
>>>> in block/mirror.c that was introduced a couple of years ago. I'm not
>>>> sure what the
>>>> adoption rate of the feature is, but I would start with that one.
>>>
>>> block/mirror.c is asynchronous, and there's no support for communicating
>>> checkpoints back to the master. However, the quorum disk driver could
>>> be what you need.
>>>
>>> There's also a series on the mailing list that lets quorum read only
>>> from the primary, so that quorum can still do replication and fault
>>> tolerance, but skip fault detection.
>>>
>>> Paolo
>>>
>>>> There is also a second fault tolerance implementation that works a
>>>> little differently called
>>>> "COLO" - you may have seen those emails on the list too, but their
>>>> method does not require a disk replication solution, if I recall correctly.
>>>
>>
>
> Nice description of the problem - would you like to put this information on the
> MC wiki page? (Just send an email to the list that says "request for wiki
> account, please" in the subject - and they will make an account for you.)
>
> A drive-mirror + filter driver solution sounds like a good plan overall,
> of course the devil is in the details =)

If I understand correctly, this is similar to our approach. Disk replication
is definitely in our plan, and we will post RFC patches that include disk
replication; you can keep an eye on the COLO patches :)

>
> I don't know how much time you have to spend on actual code, but even a
> description of what a "theoretical" interface between MC and drive-mirror would
> look like would go a long way even without code.
>
> Your investigations would also help "drive" a solution to this problem for the
> COLO team as well - I believe they need the same thing....
>
> - Michael
>
> .
>

-- 
Thanks,
Yang.
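
For illustration only, the commit/discard behaviour of the block buffer
described in the quoted proposal could be sketched roughly as follows. This is
a minimal model under stated assumptions, not QEMU code: the class name
BlockBuffer and its write/commit/discard methods are hypothetical, the
secondary's disk image is modelled as an in-memory bytearray, and only one
checkpoint epoch is buffered at a time.

```python
from typing import Dict


class BlockBuffer:
    """Hypothetical secondary-side buffer for mirrored block writes.

    Not a QEMU interface. Writes received during a checkpoint epoch are
    held in RAM and only reach the disk image when the whole checkpoint
    (RAM, vCPU, device state) has been committed.
    """

    def __init__(self, disk: bytearray, block_size: int = 4) -> None:
        self.disk = disk                # stand-in for the secondary's image
        self.block_size = block_size
        self.pending = {}               # type: Dict[int, bytes]

    def write(self, block_no: int, data: bytes) -> None:
        """Buffer one mirrored block; a later write to the same block wins."""
        if len(data) != self.block_size:
            raise ValueError("data must be exactly one block long")
        self.pending[block_no] = data

    def commit(self) -> int:
        """Checkpoint complete: apply buffered blocks to the disk image."""
        for block_no, data in self.pending.items():
            offset = block_no * self.block_size
            self.disk[offset:offset + self.block_size] = data
        applied = len(self.pending)
        self.pending.clear()
        return applied

    def discard(self) -> int:
        """Checkpoint failed (e.g. primary crashed): drop buffered blocks."""
        dropped = len(self.pending)
        self.pending.clear()
        return dropped


if __name__ == "__main__":
    image = bytearray(16)               # four 4-byte blocks, all zero
    buf = BlockBuffer(image, block_size=4)

    buf.write(1, b"AAAA")               # epoch n: buffered, disk untouched
    buf.commit()                        # checkpoint n committed
    buf.write(2, b"BBBB")               # epoch n+1: primary fails mid-transfer
    buf.discard()                       # the failed epoch never reaches disk
    print(bytes(image))
```

In this model the disk image only ever reflects fully committed checkpoints,
which is the consistency property the proposal asks for. How a real filter
block driver would expose commit/discard to MC is left open here, as it is in
the thread.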