From: Hongyang Yang
Date: Thu, 12 Feb 2015 17:36:07 +0800
Subject: Re: [Qemu-devel] [RFC PATCH 01/14] docs: block replication's description
To: Fam Zheng, Wen Congyang
Cc: Kevin Wolf, Lai Jiangshan, Jiang Yunhong, Dong Eddie, qemu devel, "Dr. David Alan Gilbert", Gonglei, Stefan Hajnoczi, Paolo Bonzini, jsnow@redhat.com, zhanghailiang

Hi Fam,

On 02/12/2015 04:44 PM, Fam Zheng wrote:
> On Thu, 02/12 15:40, Wen Congyang wrote:
>> On 02/12/2015 03:21 PM, Fam Zheng wrote:
>>> Hi Congyang,
>>>
>>> On Thu, 02/12 11:07, Wen Congyang wrote:
>>>> +== Workflow ==
>>>> +The following is the image of block replication workflow:
>>>> +
>>>> +         +----------------------+            +------------------------+
>>>> +         |Primary Write Requests|            |Secondary Write Requests|
>>>> +         +----------------------+            +------------------------+
>>>> +                   |                                       |
>>>> +                   |                                      (4)
>>>> +                   |                                       V
>>>> +                   |                              /-------------\
>>>> +                   |      Copy and Forward        |             |
>>>> +                   |---------(1)----------+       | Disk Buffer |
>>>> +                   |                      |       |             |
>>>> +                   |                     (3)      \-------------/
>>>> +                   |                 speculative      ^
>>>> +                   |                write through    (2)
>>>> +                   |                      |           |
>>>> +                   V                      V           |
>>>> +            +--------------+           +----------------+
>>>> +            | Primary Disk |           | Secondary Disk |
>>>> +            +--------------+           +----------------+
>>>> +
>>>> +    1) Primary write requests will be copied and forwarded to Secondary
>>>> +       QEMU.
>>>> +    2) Before Primary write requests are written to Secondary disk, the
>>>> +       original sector content will be read from Secondary disk and
>>>> +       buffered in the Disk buffer, but it will not overwrite the existing
>>>> +       sector content in the Disk buffer.
>>>
>>> I'm a little confused by the tenses ("will be" versus "are") and terms. I am
>>> reading them as "s/will be/are/g"
>>>
>>> Why do you need this buffer?
>>
>> We only sync the disk at checkpoints. Before the next checkpoint, the
>> secondary vm writes to the buffer.
>>
>>>
>>> If both primary and secondary write to the same sector, what is saved in the
>>> buffer?
>>
>> The primary content is written to the secondary disk, and the secondary
>> content is saved in the buffer.
>
> I wonder if alternatively this is possible with an imaginary "writable backing
> image" feature, as described below.
>
> When we have a normal backing chain,
>
>                {virtio-blk dev 'foo'}
>                          |
>                          |
>                          |
>    [base] <- [mid] <- (foo)
>
> where [base] and [mid] are read only, and (foo) is writable. When we add an
> overlay on top of an existing image,
>
>                {virtio-blk dev 'foo'}  {virtio-blk dev 'bar'}
>                          |                     |
>                          |                     |
>                          |                     |
>    [base] <- [mid] <- (foo) <---------------------- (bar)
>
> it's important to make sure that writes to 'foo' don't break the data for
> 'bar'. We can utilize an automatic hidden drive-backup target:
>
>                {virtio-blk dev 'foo'}                      {virtio-blk dev 'bar'}
>                          |                                           |
>                          |                                           |
>                          v                                           v
>
>    [base] <- [mid] <- (foo) <-------------- (hidden target) <-------------- (bar)
>
>                          v                          ^
>                          v                          ^
>                          v                          ^
>                          v                          ^
>                          >>>> drive-backup sync=none >>>>
>
> So when the guest writes to 'foo', the old data is moved to (hidden target),
> which remains unchanged from (bar)'s PoV.
>
> The drive in the middle is called hidden because QEMU creates it
> automatically; the naming is arbitrary.
>
> It is interesting because it is a more generalized case of image fleecing,
> where the (hidden target) is exposed via an NBD server for (read only) data
> scanning purposes.
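(If I understand the fleecing setup correctly, it would be started with
something like the following QMP command; this is an untested sketch, the
device name and target path are made up, and it assumes the hidden target
file was created in advance, hence mode=existing:

     { "execute": "drive-backup",
       "arguments": { "device": "virtio0",
                      "target": "/images/hidden-target.qcow2",
                      "sync": "none",
                      "mode": "existing" } }

From then on, every guest write to 'foo' first copies the old sector
content into (hidden target).)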
> More interestingly, with the above facility, it is also possible to create a
> guest visible live snapshot (disk 'bar') of an existing device (disk 'foo')
> very cheaply. Or call it a shadow copy if you will.
>
> Back to the COLO case, the configuration will be very similar:
>
>       {primary wr}                                         {secondary vm}
>             |                                                     |
>             |                                                     |
>             |                                                     |
>             v                                                     v
>
>    [what] <- [ever] <- (nbd target) <-------- (hidden buf disk) <-------- (active disk)
>
>                            v                         ^
>                            v                         ^
>                            v                         ^
>                            v                         ^
>                            >>>> drive-backup sync=none >>>>
>
> The workflow analogue is:
>
>>>> +    1) Primary write requests will be copied and forwarded to Secondary
>>>> +       QEMU.
>
> Primary write requests are forwarded to secondary QEMU as well.
>
>>>> +    2) Before Primary write requests are written to Secondary disk, the
>>>> +       original sector content will be read from Secondary disk and
>>>> +       buffered in the Disk buffer, but it will not overwrite the existing
>>>> +       sector content in the Disk buffer.
>
> Before Primary write requests are written to (nbd target), aka the Secondary
> disk, the original sector content is read from it and copied to (hidden buf
> disk) by drive-backup. It obviously will not overwrite the data in (active
> disk).
>
>>>> +    3) Primary write requests will be written to Secondary disk.
>
> Primary write requests are written to (nbd target).
>
>>>> +    4) Secondary write requests will be buffered in the Disk buffer and it
>>>> +       will overwrite the existing sector content in the buffer.
>
> Secondary write requests are written to (active disk) as usual.
>
> Finally, when a checkpoint arrives, if you want to sync with the primary, just
> drop the data in (hidden buf disk) and (active disk); when failover happens,
> if you want to promote the secondary vm, you can commit (active disk) to (nbd
> target) and drop the data in (hidden buf disk).

If I understand correctly, you split the Disk Buffer into a hidden buf disk
plus an active disk. Then the only thing we need to implement is the buf disk
(used as both the hidden buf disk and the active disk mentioned above); apart
from that, we can reuse existing mechanisms like backing files and
drive-backup?

> Fam

-- 
Thanks,
Yang.
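P.S. To double check my understanding of the checkpoint/failover part: in QMP
terms, the failover path would be something like the following untested
sketch (the device name and file names are made up):

     { "execute": "block-commit",
       "arguments": { "device": "virtio0",
                      "top": "/images/active-disk.qcow2",
                      "base": "/images/nbd-target.qcow2" } }

i.e. the secondary vm's writes in (active disk) are merged down into (nbd
target), followed by block-job-complete once the job reaches the ready state,
since this commits the active layer; whatever is in (hidden buf disk) is
simply discarded. At checkpoint time there is no commit at all: both (hidden
buf disk) and (active disk) are just emptied.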