From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:45244)
    by lists.gnu.org with esmtp (Exim 4.71) (envelope-from )
    id 1byxWR-0004C8-UW for qemu-devel@nongnu.org;
    Tue, 25 Oct 2016 04:55:54 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
    (envelope-from ) id 1byxWP-0000CN-SY for qemu-devel@nongnu.org;
    Tue, 25 Oct 2016 04:55:52 -0400
Message-ID: <580F1FDE.8050401@cn.fujitsu.com>
Date: Tue, 25 Oct 2016 17:03:26 +0800
From: Changlong Xie
MIME-Version: 1.0
References: <1476971860-20860-1-git-send-email-zhang.zhanghailiang@huawei.com>
    <1476971860-20860-2-git-send-email-zhang.zhanghailiang@huawei.com>
In-Reply-To: <1476971860-20860-2-git-send-email-zhang.zhanghailiang@huawei.com>
Content-Type: text/plain; charset="utf-8"; format=flowed
Content-Transfer-Encoding: 7bit
Subject: Re: [Qemu-devel] [PATCH RFC 1/7] docs/block-replication: Add description for shared-disk case
To: zhanghailiang , qemu-devel@nongnu.org, qemu-block@nongnu.org
Cc: stefanha@redhat.com, kwolf@redhat.com, mreitz@redhat.com,
    pbonzini@redhat.com, wency@cn.fujitsu.com, Zhang Chen

On 10/20/2016 09:57 PM, zhanghailiang wrote:
> Introuduce the scenario of shared-disk block replication
> and how to use it.
>
> Signed-off-by: zhanghailiang
> Signed-off-by: Wen Congyang
> Signed-off-by: Zhang Chen
> ---
>  docs/block-replication.txt | 131 +++++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 127 insertions(+), 4 deletions(-)
>
> diff --git a/docs/block-replication.txt b/docs/block-replication.txt
> index 6bde673..97fcfc1 100644
> --- a/docs/block-replication.txt
> +++ b/docs/block-replication.txt
> @@ -24,7 +24,7 @@ only dropped at next checkpoint time. To reduce the network transportation
>  effort during a vmstate checkpoint, the disk modification operations of
>  the Primary disk are asynchronously forwarded to the Secondary node.
>
> -== Workflow ==
> +== Non-shared disk workflow ==
>  The following is the image of block replication workflow:
>
>          +----------------------+            +------------------------+
> @@ -57,7 +57,7 @@ The following is the image of block replication workflow:
>      4) Secondary write requests will be buffered in the Disk buffer and it
>         will overwrite the existing sector content in the buffer.
>
> -== Architecture ==
> +== None-shared disk architecture ==

s/None-shared/Non-shared/g

>  We are going to implement block replication from many basic
>  blocks that are already in QEMU.
>
> @@ -106,6 +106,74 @@ any state that would otherwise be lost by the speculative write-through
>  of the NBD server into the secondary disk. So before block replication,
>  the primary disk and secondary disk should contain the same data.
>
> +== Shared Disk Mode Workflow ==
> +The following is the image of block replication workflow:
> +
> +        +----------------------+            +------------------------+
> +        |Primary Write Requests|            |Secondary Write Requests|
> +        +----------------------+            +------------------------+
> +                  |                                       |
> +                  |                                      (4)
> +                  |                                       V
> +                  |                              /-------------\
> +                  | (2)Forward and write through |             |
> +                  | +--------------------------> | Disk Buffer |
> +                  | |                            |             |
> +                  | |                            \-------------/
> +                  | |(1)read                           |
> +                  | |                                  |
> +       (3)write   | |                                  | backing file
> +                  V |                                  |
> +                 +-----------------------------+       |
> +                 | Shared Disk                 | <-----+
> +                 +-----------------------------+
> +
> +    1) Primary writes will read original data and forward it to Secondary
> +       QEMU.
> +    2) Before Primary write requests are written to Shared disk, the
> +       original sector content will be read from Shared disk and
> +       forwarded and buffered in the Disk buffer on the secondary site,
> +       but it will not overwrite the existing

extra spaces at the end of line

> +       sector content(it could be from either "Secondary Write Requests" or

Need a space before "(" for better style.

> +       previous COW of "Primary Write Requests") in the Disk buffer.
> +    3) Primary write requests will be written to Shared disk.
> +    4) Secondary write requests will be buffered in the Disk buffer and it
> +       will overwrite the existing sector content in the buffer.
> +
> +== Shared Disk Mode Architecture ==
> +We are going to implement block replication from many basic
> +blocks that are already in QEMU.
> +               virtio-blk              ||                                .----------
> +              /                        ||                                | Secondary
> +             /                         ||                                '----------
> +            /                          ||                              virtio-blk
> +           /                           ||                                   |
> +          |                            ||                            replication(5)
> +          |                 NBD  --------> NBD            (2)               |
> +          |                client      || server ---> hidden disk <-- active disk(4)
> +          |                   ^        ||                     |
> +          |             replication(1) ||                     |
> +          |                   |        ||                     |
> +          | +-----------------'        ||                     |
> +        (3) |drive-backup sync=none    ||                     |
> +--------. | +-----------------+        ||                     |
> +Primary | |                   |        ||           backing   |
> +--------' |                   |        ||                     |
> +          V                   |                               |
> +     +-------------------------------------------+            |
> +     |                shared disk                | <----------+
> +     +-------------------------------------------+
> +
> +
> +    1) Primary writes will read original data and forward it to Secondary
> +       QEMU.
> +    2) The hidden-disk buffers the original content that is modified by the
> +       primary VM. It should also be an empty disk, and

extra spaces at end of line

> +       the driver supports bdrv_make_empty() and backing file.
> +    3) Primary write requests will be written to Shared disk.
> +    4) Secondary write requests will be buffered in the active disk and it
> +       will overwrite the existing sector content in the buffer.
> +
>  == Failure Handling ==
>  There are 7 internal errors when block replication is running:
>  1. I/O error on primary disk
> @@ -145,7 +213,7 @@ d. replication_stop_all()
>  things except failover. The caller must hold the I/O mutex lock if it is
>  in migration/checkpoint thread.
>
> -== Usage ==
> +== Non-shared disk usage ==
>  Primary:
>    -drive if=xxx,driver=quorum,read-pattern=fifo,id=colo1,vote-threshold=1,\
>           children.0.file.filename=1.raw,\
> @@ -234,6 +302,61 @@ Secondary:
>    The primary host is down, so we should do the following thing:
>    { 'execute': 'nbd-server-stop' }
>
> +== Shared disk usage ==

Keeping the same coding style as the "== Non-shared disk usage ==" part
would be good to me.

> +Primary:
> + -drive if=virtio,id=primary_disk0,file.filename=1.raw,driver=raw
> +
> +Issue qmp command:
> + {'execute': 'human-monitor-command',

Use two-space indentation for the whole "{...}" part.

> +   'arguments': {
> +     'command-line': 'drive_add-nbuddydriver=replication,

Missing spaces; this should read "drive_add -n buddy driver=replication,".

> +                      mode=primary,
> +                      file.driver=nbd,
> +                      file.host=9.42.3.17,
> +                      file.port=9998,
> +                      file.export=hidden_disk0,
> +                      shared-disk-id=primary_disk0,
> +                      shared-disk=on,
> +                      node-name=rep'

Keep the whole command after "command-line" on one line; otherwise it
can't be executed correctly, IIRC.

> +   }
> + }
> +Secondary:
> + -drive if=none,driver=qcow2,file.filename=/mnt/ramfs/hidden_disk.img,id=hidden_disk0,\
> +        backing.driver=raw,backing.file.filename=1.raw \
> + -drive if=virtio,id=active-disk0,driver=replication,mode=secondary,\
> +        file.driver=qcow2,top-id=active-disk0,\
> +        file.file.filename=/mnt/ramfs/active_disk.img,\
> +        file.backing=hidden_disk0,shared-disk=on
> +
> +Issue qmp command:
> +1. {'execute': 'nbd-server-start',
> +    'arguments': {
> +      'addr': {
> +        'type': 'inet',
> +        'data': {
> +          'host': '0',

s/0/9.42.3.17/g, since you use a designated IP address above.

> +          'port': '9998'
> +        }
> +      }
> +    }
> +   }
> +2. {
> +     'execute': 'nbd-server-add',
> +     'arguments': {
> +       'device': 'hidden_disk0',
> +       'writable': true
> +     }
> +   }
> +
> +After Failover:
> +Primary:
> +{'execute': 'human-monitor-command',
> +  'arguments': {
> +    'command-line': 'drive_delrep'

drive_del rep

> +  }
> +}
> +
> +Secondary:
> + {'execute': 'nbd-server-stop' }
> +
>  TODO:
>  1. Continuous block replication
> -2. Shared disk
>
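
To illustrate the "keep it on one line" comment on the primary-side
drive_add command, here is a minimal Python sketch. It is not part of
the patch. The helper names build_drive_add() and wrap_hmp() are made up
for illustration, the option values are copied from the quoted example,
and shared-disk-id/shared-disk are the options this RFC proposes, so
they are assumptions about a QEMU that carries this series. The sketch
joins the options into a single line before wrapping them in the
human-monitor-command message:

```python
import json


def build_drive_add(options, drive_id="buddy"):
    """Join drive_add options into one HMP command line (hypothetical helper)."""
    # The options are joined with commas and no embedded newlines, so the
    # resulting 'command-line' value is a single line.
    opts = ",".join(f"{key}={value}" for key, value in options.items())
    return f"drive_add -n {drive_id} {opts}"


def wrap_hmp(command_line):
    """Wrap an HMP command line in a QMP human-monitor-command message."""
    return {
        "execute": "human-monitor-command",
        "arguments": {"command-line": command_line},
    }


# Values copied from the quoted "Shared disk usage" example. The two
# shared-disk options only exist with this RFC series applied (assumption).
primary_options = {
    "driver": "replication",
    "mode": "primary",
    "file.driver": "nbd",
    "file.host": "9.42.3.17",
    "file.port": "9998",
    "file.export": "hidden_disk0",
    "shared-disk-id": "primary_disk0",
    "shared-disk": "on",
    "node-name": "rep",
}

message = wrap_hmp(build_drive_add(primary_options))
print(json.dumps(message))
```

Generating the command this way also avoids the missing-space problem
noted above, because the "drive_add -n buddy " prefix is written once in
the helper instead of being retyped in each example.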