Message-ID: <5512DD1D.9020704@redhat.com>
Date: Wed, 25 Mar 2015 10:06:53 -0600
From: Eric Blake
References: <1419564708-17714-1-git-send-email-yanghy@cn.fujitsu.com> <1419564708-17714-2-git-send-email-yanghy@cn.fujitsu.com>
In-Reply-To: <1419564708-17714-2-git-send-email-yanghy@cn.fujitsu.com>
Subject: Re: [Qemu-devel] [PATCH RESEND 1/2] Block: Block replication design for COLO
To: Yang Hongyang, qemu-devel@nongnu.org
Cc: kwolf@redhat.com, Lai Jiangshan, quintela@redhat.com, GuiJianfeng@cn.fujitsu.com, yunhong.jiang@intel.com, eddie.dong@intel.com, dgilbert@redhat.com, mrhines@linux.vnet.ibm.com, stefanha@redhat.com, Amit Shah, pbonzini@redhat.com, walid.nouri@gmail.com

On 12/25/2014 08:31 PM, Yang Hongyang wrote:
> This is the initial design of block replication.
> The blkcolo block driver enables disk replication for continuous
> checkpoints. It is designed for COLO that Secondary VM is running.
> It can also be applied for FT/HA scene that Secondary VM is not
> running.
>
> Signed-off-by: Wen Congyang
> Signed-off-by: Lai Jiangshan
> Signed-off-by: Yang Hongyang
> ---
>  docs/blkcolo.txt | 85 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 85 insertions(+)
>  create mode 100644 docs/blkcolo.txt

Grammar review only (I'll leave the technical review to others)

>
> diff --git a/docs/blkcolo.txt b/docs/blkcolo.txt
> new file mode 100644
> index 0000000..41c2a05
> --- /dev/null
> +++ b/docs/blkcolo.txt
> @@ -0,0 +1,85 @@
> +Disk replication using blkcolo
> +----------------------------------------
> +Copyright Fujitsu, Corp. 2014

Visually, the separator line should match the length of the line above,
and maybe have a blank line after.

> +
> +This work is licensed under the terms of the GNU GPL, version 2 or later.
> +See the COPYING file in the top-level directory.
> +
> +The blkcolo block driver enables disk replication for continuous checkpoints.
> +It is designed for COLO that Secondary VM is running. It can also be applied

similar comments as for Wen's RFC COLO v2 series for
docs/block-replication.txt (in fact, do we need two files, or should all
this information be merged into a single file?):

s/for COLO that/for COLO (COarse-grain LOck-stepping replication), where/

> +for FT/HA scene that Secondary VM is not running.

s/for FT/HA scene that/to FT/HA (Fault-tolerance/High availability) scenarios, where/

> +
> +This document gives an overview of blkcolo's design.
> +
> +== Background ==
> +High availability solutions such as micro checkpoint and COLO will do
> +consecutive checkpoint. The VM state of Primary VM and Secondary VM is

s/checkpoint/checkpoints/

> +identical right after a VM checkpoint, but becomes different as the VM
> +executes till the next checkpoint. To support disk contents checkpoint,
> +the modified disk contents in the Secondary VM must be buffered, and are
> +only dropped at next checkpoint time.
> +To reduce the network transportation
> +effort at the time of checkpoint, the disk modification operations of
> +Primary disk are asynchronously forwarded to the Secondary node.
> +
> +== Disk Buffer ==
> +The following is the image of Disk buffer:
> +
> +    +----------------------+            +------------------------+
> +    |Primary Write Requests|            |Secondary Write Requests|
> +    +----------------------+            +------------------------+
> +              |                                       |
> +              |                                      (4)
> +              |                                       V
> +              |                              /-------------\
> +              |      Copy and Forward        |             |
> +              |---------(1)----------+       | Disk Buffer |
> +              |                      |       |             |
> +              |                     (3)      \-------------/
> +              |                 speculative      ^
> +              |                write through    (2)
> +              |                      |           |
> +              V                      V           |
> +       +--------------+           +----------------+
> +       | Primary Disk |           | Secondary Disk |
> +       +--------------+           +----------------+
> +    1) Primary write requests will be copied and forwarded to Secondary
> +       QEMU.
> +    2) Before Primary write requests are written to Secondary disk, the
> +       original sector content will be read from Secondary disk and
> +       buffered in the Disk buffer, but it will not overwrite the existing
> +       sector content in the Disk buffer.
> +    3) Primary write requests will be written to Secondary disk.
> +    4) Secondary write requests will be bufferd in the Disk buffer and it

s/bufferd/buffered/

> +       will overwrite the existing sector content in the buffer.
> +
> +== Capture I/O request ==
> +The blkcolo is a new block driver protocol, so all I/O requests can be
> +captured in the driver interface bdrv_co_readv()/bdrv_co_writev().
> +
> +== Checkpoint & failover ==
> +The blkcolo buffers the write requests in Secondary QEMU. And the buffer
> +should be dropped at a checkpoint, or be flushed to Secondary disk when

s/when/on/

> +failover. We add four block driver interfaces to do this:
> +a. bdrv_prepare_checkpoint()
> +   This interface may block, and return when all Primary write

s/return/returns/

> +   requests are forwarded to Secondary QEMU.
> +b. bdrv_do_checkpoint()
> +   This interface is called after all VM state is transfered to

s/transfered/transferred/

> +   Secondary QEMU. The Disk buffer will be dropped in this interface.
> +c. bdrv_get_sent_data_size()
> +   This is used on Primary node.
> +   It should be called by migration/checkpoint thread in order
> +   to decide whether to start a new checkpoint or not. If the data
> +   amount being sent is too large, we should start a new checkpoint.
> +d. bdrv_stop_replication()
> +   It is called when failover. We will flush the Disk buffer into

s/when/on/

> +   Secondary Disk and stop disk replication.
> +
> +== Usage ==
> +On both Primary/Secondary host, invoke QEMU with the following parameters:
> +    "-drive file=blkcolo:host:port:/path/to/image"
> +a. host
> +   Hostname or IP of the Secondary host.
> +b. port
> +   The Secondary QEMU will listen on this port, and the Primary QEMU
> +   will connect to this port.
>

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org
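
For readers following the Disk buffer rules 1) through 4) in the quoted
patch, here is a minimal sketch of how those rules could interact with a
checkpoint and a failover. It is a toy model in Python, not the blkcolo
driver and not QEMU code: the class name, the method names, and the
in-memory list standing in for the Secondary disk are all hypothetical,
and the read path (buffer first, then disk) is an assumption that the
quoted text does not spell out.

```python
class SecondaryDiskModel:
    """Toy model of the Secondary side: a disk plus the Disk buffer.

    Hypothetical names throughout; this only illustrates the rules
    quoted from docs/blkcolo.txt and is not the QEMU implementation.
    """

    def __init__(self, sectors):
        self.disk = list(sectors)  # Secondary disk, indexed by sector number
        self.buffer = {}           # Disk buffer: sector number -> content

    def primary_write(self, sector, data):
        """Apply a write request forwarded from the Primary (rule 1)."""
        # Rule 2: save the original sector content first, but never
        # overwrite content that is already in the Disk buffer.
        if sector not in self.buffer:
            self.buffer[sector] = self.disk[sector]
        # Rule 3: write the Primary's data through to the Secondary disk.
        self.disk[sector] = data

    def secondary_write(self, sector, data):
        """Rule 4: Secondary writes go to the buffer and do overwrite."""
        self.buffer[sector] = data

    def secondary_read(self, sector):
        """Assumed read path: buffered content wins over the disk."""
        return self.buffer.get(sector, self.disk[sector])

    def do_checkpoint(self):
        """At a checkpoint the Disk buffer is dropped."""
        self.buffer.clear()

    def stop_replication(self):
        """On failover the Disk buffer is flushed to the Secondary disk."""
        for sector, data in self.buffer.items():
            self.disk[sector] = data
        self.buffer.clear()
```

In this model the Secondary disk always tracks the Primary's writes, while
the buffer holds whatever the Secondary VM should see instead. Dropping
the buffer therefore leaves the Primary's view, and flushing it leaves the
Secondary's view.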