From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Thu, 12 Feb 2015 17:46:21 +0800
From: Fam Zheng
Message-ID: <20150212094621.GB21253@ad.nay.redhat.com>
In-Reply-To: <54DC7407.9010500@cn.fujitsu.com>
Subject: Re: [Qemu-devel] [RFC PATCH 01/14] docs: block replication's description
To: Hongyang Yang
Cc: Kevin Wolf, Lai Jiangshan, Jiang Yunhong, Dong Eddie, qemu devel,
 "Dr. David Alan Gilbert", Gonglei, Stefan Hajnoczi, Paolo Bonzini,
 jsnow@redhat.com, zhanghailiang

On Thu, 02/12 17:36, Hongyang Yang wrote:
> Hi Fam,
> 
> On 02/12/2015 04:44 PM, Fam Zheng wrote:
> >On Thu, 02/12 15:40, Wen Congyang wrote:
> >>On 02/12/2015 03:21 PM, Fam Zheng wrote:
> >>>Hi Congyang,
> >>>
> >>>On Thu, 02/12 11:07, Wen Congyang wrote:
> >>>>+== Workflow ==
> >>>>+The following is the image of block replication workflow:
> >>>>+
> >>>>+        +----------------------+            +------------------------+
> >>>>+        |Primary Write Requests|            |Secondary Write Requests|
> >>>>+        +----------------------+            +------------------------+
> >>>>+                  |                                       |
> >>>>+                  |                                      (4)
> >>>>+                  |                                       V
> >>>>+                  |                              /-------------\
> >>>>+                  |      Copy and Forward        |             |
> >>>>+                  |---------(1)----------+       | Disk Buffer |
> >>>>+                  |                      |       |             |
> >>>>+                  |                     (3)      \-------------/
> >>>>+                  |                 speculative      ^
> >>>>+                  |                write through    (2)
> >>>>+                  |                      |           |
> >>>>+                  V                      V           |
> >>>>+           +--------------+       +----------------+
> >>>>+           | Primary Disk |       | Secondary Disk |
> >>>>+           +--------------+       +----------------+
> >>>>+
> >>>>+    1) Primary write requests will be copied and forwarded to Secondary
> >>>>+       QEMU.
> >>>>+    2) Before Primary write requests are written to Secondary disk, the
> >>>>+       original sector content will be read from Secondary disk and
> >>>>+       buffered in the Disk buffer, but it will not overwrite the existing
> >>>>+       sector content in the Disk buffer.
> >>>
> >>>I'm a little confused by the tenses ("will be" versus "are") and terms. I am
> >>>reading them as "s/will be/are/g".
> >>>
> >>>Why do you need this buffer?
> >>
> >>We only sync the disk up to the next checkpoint. Before the next checkpoint,
> >>the secondary VM writes to the buffer.
> >>
> >>>
> >>>If both primary and secondary write to the same sector, what is saved in the
> >>>buffer?
> >>
> >>The primary content will be written to the secondary disk, and the secondary
> >>content is saved in the buffer.
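To make the buffer semantics above concrete, here is a minimal Python model.
The sector-number -> bytes dicts and all names are purely illustrative, not
the actual driver code:

```python
# Minimal model of the Disk Buffer semantics in steps 2) and 4) above:
# a primary write saves the *original* secondary-disk content into the
# buffer (never overwriting an entry that is already there), while a
# secondary write goes straight into the buffer and does overwrite.

class DiskBufferModel:
    def __init__(self, secondary_disk):
        self.disk = dict(secondary_disk)  # content of the Secondary disk
        self.buffer = {}                  # content of the Disk Buffer

    def primary_write(self, sector, data):
        # Step 2: buffer the original content, unless already buffered.
        if sector not in self.buffer and sector in self.disk:
            self.buffer[sector] = self.disk[sector]
        # Step 3: the primary write reaches the Secondary disk.
        self.disk[sector] = data

    def secondary_write(self, sector, data):
        # Step 4: secondary writes live only in the buffer, and overwrite.
        self.buffer[sector] = data

    def secondary_read(self, sector):
        # The secondary VM sees buffered content first, then the disk.
        return self.buffer.get(sector, self.disk.get(sector))
```

So when both sides write the same sector, the primary content ends up on the
secondary disk and the secondary content stays in the buffer, matching the
answer above.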
> >
> >I wonder if alternatively this is possible with an imaginary "writable backing
> >image" feature, as described below.
> >
> >When we have a normal backing chain,
> >
> >               {virtio-blk dev 'foo'}
> >                         |
> >                         |
> >                         |
> >    [base] <- [mid] <- (foo)
> >
> >where [base] and [mid] are read only and (foo) is writable. When we add an
> >overlay to an existing image on top,
> >
> >               {virtio-blk dev 'foo'}          {virtio-blk dev 'bar'}
> >                         |                               |
> >                         |                               |
> >                         |                               |
> >    [base] <- [mid] <- (foo) <---------------------- (bar)
> >
> >it's important to make sure that writes to 'foo' don't break the data for
> >'bar'. We can utilize an automatic hidden drive-backup target:
> >
> >       {virtio-blk dev 'foo'}                           {virtio-blk dev 'bar'}
> >                 |                                                |
> >                 |                                                |
> >                 v                                                v
> >
> >    [base] <- [mid] <- (foo) <----------------- (hidden target) <------------- (bar)
> >
> >                         v                              ^
> >                         v                              ^
> >                         v                              ^
> >                         v                              ^
> >                          >>>> drive-backup sync=none >>>>
> >
> >So when the guest writes to 'foo', the old data is moved to (hidden target),
> >which remains unchanged from (bar)'s PoV.
> >
> >The drive in the middle is called hidden because QEMU creates it
> >automatically; the naming is arbitrary.
> >
> >It is interesting because it is a more generalized case of image fleecing,
> >where the (hidden target) is exposed via an NBD server for data scanning
> >(read only) purposes.
> >
> >More interestingly, with the above facility, it is also possible to create a
> >guest visible live snapshot (disk 'bar') of an existing device (disk 'foo')
> >very cheaply. Or call it a shadow copy if you will.
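The copy-before-write behaviour behind this can be modelled the same way.
A dict-based sketch; the image names and the {sector: data} representation
are illustrative only:

```python
# Reads on 'bar' resolve top-down through its chain
# (bar -> hidden -> foo -> mid -> base), and each guest write to 'foo'
# first copies the old sector content into 'hidden' (the effect of
# drive-backup sync=none), so bar's view of the data never changes.

def chain_read(chain, sector):
    """Resolve a read through a backing chain, topmost image first."""
    for image in chain:
        if sector in image:
            return image[sector]
    return None

base, mid = {0: b"base0", 1: b"base1"}, {1: b"mid1"}
foo, hidden, bar = {}, {}, {}  # 'hidden' is created automatically by QEMU

def write_foo(sector, data):
    # Copy-before-write: preserve the old view for (hidden target)/bar.
    if sector not in hidden:
        old = chain_read([foo, mid, base], sector)
        if old is not None:
            hidden[sector] = old
    foo[sector] = data
```

After write_foo(1, b"new1"), a read through foo's chain sees b"new1" while a
read through bar's chain still sees b"mid1".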
> >
> >Back to the COLO case, the configuration will be very similar:
> >
> >
> >    {primary wr}                                        {secondary vm}
> >         |                                                     |
> >         |                                                     |
> >         |                                                     |
> >         v                                                     v
> >
> >    [what] <- [ever] <- (nbd target) <------------ (hidden buf disk) <------------- (active disk)
> >
> >                            v                              ^
> >                            v                              ^
> >                            v                              ^
> >                            v                              ^
> >                             >>>> drive-backup sync=none >>>>
> >
> >The workflow analogue is:
> >
> >>>>+    1) Primary write requests will be copied and forwarded to Secondary
> >>>>+       QEMU.
> >
> >Primary write requests are forwarded to secondary QEMU as well.
> >
> >>>>+    2) Before Primary write requests are written to Secondary disk, the
> >>>>+       original sector content will be read from Secondary disk and
> >>>>+       buffered in the Disk buffer, but it will not overwrite the existing
> >>>>+       sector content in the Disk buffer.
> >
> >Before Primary write requests are written to (nbd target), aka the Secondary
> >disk, the original sector content is read from it and copied to (hidden buf
> >disk) by drive-backup. It obviously will not overwrite the data in (active
> >disk).
> >
> >>>>+    3) Primary write requests will be written to Secondary disk.
> >
> >Primary write requests are written to (nbd target).
> >
> >>>>+    4) Secondary write requests will be buffered in the Disk buffer and it
> >>>>+       will overwrite the existing sector content in the buffer.
> >
> >Secondary write requests are written to (active disk) as usual.
> >
> >Finally, when a checkpoint arrives, if you want to sync with the primary, just
> >drop the data in (hidden buf disk) and (active disk); when failover happens,
> >if you want to promote the secondary VM, you can commit (active disk) to (nbd
> >target), and drop the data in (hidden buf disk).
> 
> If I understand correctly, you split the Disk Buffer into a hidden buf disk +
> an active disk.
> What we need to do is only to implement a buf disk (to be used as the hidden
> buf disk and the active disk as mentioned); apart from this, we can use the
> existing mechanism like backing-file/drive-backup?
> 

Yes, but you need a separate driver to take care of the buffer logic as
introduced in this series, which is less generic, but does the same thing we
will need in the image fleecing use case.

Fam
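P.S. The checkpoint / failover handling described earlier in the thread, as a
dict-based sketch following that description (illustrative only, not the real
QMP flow):

```python
# Checkpoint: the secondary resyncs to the primary, so both the hidden
# buffer and the active disk are simply dropped.  Failover: the secondary
# VM is promoted, so the active disk is committed down into the nbd target
# and the hidden buffer is dropped.  {sector: data} dicts stand in for
# the real images, as in the models above.

def checkpoint(nbd_target, hidden_buf, active_disk):
    hidden_buf.clear()
    active_disk.clear()

def failover(nbd_target, hidden_buf, active_disk):
    nbd_target.update(active_disk)  # commit secondary writes downward
    hidden_buf.clear()
    active_disk.clear()
```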