From: Hailiang Zhang
Date: Mon, 28 Nov 2016 13:58:40 +0800
Subject: Re: [Qemu-devel] [PATCH RFC 1/7] docs/block-replication: Add description for shared-disk case
To: Changlong Xie, qemu-devel@nongnu.org, qemu-block@nongnu.org
Cc: stefanha@redhat.com, kwolf@redhat.com, mreitz@redhat.com, pbonzini@redhat.com, wency@cn.fujitsu.com, Zhang Chen, Markus Armbruster
Message-ID: <583BC790.9050206@huawei.com>
In-Reply-To: <583BC7FC.2040002@cn.fujitsu.com>
References: <1476971860-20860-1-git-send-email-zhang.zhanghailiang@huawei.com> <1476971860-20860-2-git-send-email-zhang.zhanghailiang@huawei.com> <580F1FDE.8050401@cn.fujitsu.com> <583BBCE6.5070307@huawei.com> <583BC7FC.2040002@cn.fujitsu.com>

On 2016/11/28 14:00, Changlong Xie wrote:
> On 11/28/2016 01:13 PM, Hailiang Zhang wrote:
>>
>> On 2016/10/25 17:03, Changlong Xie wrote:
>>> On 10/20/2016 09:57 PM, zhanghailiang wrote:
>>>> Introduce the scenario of shared-disk block replication
>>>> and how to use it.
>>>>
>>>> Signed-off-by: zhanghailiang
>>>> Signed-off-by: Wen Congyang
>>>> Signed-off-by: Zhang Chen
>>>> ---
>>>>  docs/block-replication.txt | 131 +++++++++++++++++++++++++++++++++++++++++++--
>>>>  1 file changed, 127 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/docs/block-replication.txt b/docs/block-replication.txt
>>>> index 6bde673..97fcfc1 100644
>>>> --- a/docs/block-replication.txt
>>>> +++ b/docs/block-replication.txt
>>>> @@ -24,7 +24,7 @@ only dropped at next checkpoint time. To reduce the network transportation
>>>>  effort during a vmstate checkpoint, the disk modification operations of
>>>>  the Primary disk are asynchronously forwarded to the Secondary node.
>>>>
>>>> -== Workflow ==
>>>> +== Non-shared disk workflow ==
>>>>  The following is the image of block replication workflow:
>>>>
>>>>      +----------------------+            +------------------------+
>>>> @@ -57,7 +57,7 @@ The following is the image of block replication workflow:
>>>>      4) Secondary write requests will be buffered in the Disk buffer and it
>>>>         will overwrite the existing sector content in the buffer.
>>>>
>>>> -== Architecture ==
>>>> +== None-shared disk architecture ==
>>>
>>> s/None-shared/Non-shared/g
>>>
>>
>>>>  We are going to implement block replication from many basic
>>>>  blocks that are already in QEMU.
>>>>
>>>> @@ -106,6 +106,74 @@ any state that would otherwise be lost by the speculative write-through
>>>>  of the NBD server into the secondary disk. So before block replication,
>>>>  the primary disk and secondary disk should contain the same data.
>>>>
>>>> +== Shared Disk Mode Workflow ==
>>>> +The following is the image of block replication workflow:
>>>> +
>>>> +    +----------------------+            +------------------------+
>>>> +    |Primary Write Requests|            |Secondary Write Requests|
>>>> +    +----------------------+            +------------------------+
>>>> +              |                                       |
>>>> +              |                                      (4)
>>>> +              |                                       V
>>>> +              |                              /-------------\
>>>> +              | (2)Forward and write through |             |
>>>> +              | +--------------------------> | Disk Buffer |
>>>> +              | |                            |             |
>>>> +              | |                            \-------------/
>>>> +              | |(1)read                            |
>>>> +              | |                                   |
>>>> +     (3)write | |                                   | backing file
>>>> +              V |                                   |
>>>> +     +-----------------------------+                |
>>>> +     |         Shared Disk         |          <-----+
>>>> +     +-----------------------------+
>>>> +
>>>> +    1) Primary writes will read original data and forward it to Secondary
>>>> +       QEMU.
>>>> +    2) Before Primary write requests are written to Shared disk, the
>>>> +       original sector content will be read from Shared disk and
>>>> +       forwarded and buffered in the Disk buffer on the secondary site,
>>>> +       but it will not overwrite the existing
>>>
>>> extra spaces at the end of line
>>>
>>
>>>> +       sector content(it could be from either "Secondary Write Requests" or
>>>
>>> Need a space before "(" for better style.
>>>
>>
>>>> +       previous COW of "Primary Write Requests") in the Disk buffer.
>>>> +    3) Primary write requests will be written to Shared disk.
>>>> +    4) Secondary write requests will be buffered in the Disk buffer and it
>>>> +       will overwrite the existing sector content in the buffer.
>>>> +
>>>> +== Shared Disk Mode Architecture ==
>>>> +We are going to implement block replication from many basic
>>>> +blocks that are already in QEMU.
>>>> +            virtio-blk                    ||                                  .----------
>>>> +              /                           ||                                  | Secondary
>>>> +             /                            ||                                  '----------
>>>> +            /                             ||                               virtio-blk
>>>> +           /                              ||                                   |
>>>> +          |                               ||                             replication(5)
>>>> +          |                    NBD  -------->  NBD            (2)              |
>>>> +          |                  client       ||  server ---> hidden disk <-- active disk(4)
>>>> +          |                     ^         ||                   |
>>>> +          |              replication(1)   ||                   |
>>>> +          |                     |         ||                   |
>>>> +          |   +-----------------'         ||                   |
>>>> +         (3)  |drive-backup sync=none     ||                   |
>>>> +--------. |   +-----------------+         ||                   |
>>>> +Primary | |                     |         ||          backing  |
>>>> +--------' |                     |         ||                   |
>>>> +          V                     |                              |
>>>> +     +-------------------------------------------+             |
>>>> +     |                shared disk                |  <----------+
>>>> +     +-------------------------------------------+
>>>> +
>>>> +
>>>> +    1) Primary writes will read original data and forward it to Secondary
>>>> +       QEMU.
>>>> +    2) The hidden-disk buffers the original content that is modified by the
>>>> +       primary VM. It should also be an empty disk, and
>>>
>>> extra spaces at end of line
>>>
>>
>>>> +       the driver supports bdrv_make_empty() and backing file.
>>>> +    3) Primary write requests will be written to Shared disk.
>>>> +    4) Secondary write requests will be buffered in the active disk and it
>>>> +       will overwrite the existing sector content in the buffer.
>>>> +
>>>>  == Failure Handling ==
>>>>  There are 7 internal errors when block replication is running:
>>>>  1. I/O error on primary disk
>>>> @@ -145,7 +213,7 @@ d. replication_stop_all()
>>>>  things except failover. The caller must hold the I/O mutex lock if it is
>>>>  in migration/checkpoint thread.
>>>>
>>>> -== Usage ==
>>>> +== Non-shared disk usage ==
>>>>  Primary:
>>>>    -drive if=xxx,driver=quorum,read-pattern=fifo,id=colo1,vote-threshold=1,\
>>>>           children.0.file.filename=1.raw,\
>>>> @@ -234,6 +302,61 @@ Secondary:
>>>>    The primary host is down, so we should do the following thing:
>>>>    { 'execute': 'nbd-server-stop' }
>>>>
>>>> +== Shared disk usage ==
>>>
>>> Keeping the same coding style as the "== Non-shared disk usage ==" part
>>> would be good to me.
>>>
>>
>>>> +Primary:
>>>> + -drive if=virtio,id=primary_disk0,file.filename=1.raw,driver=raw
>>>> +
>>>> +Issue qmp command:
>>>> + {'execute': 'human-monitor-command',
>>>
>>> two space indentation for the whole "{...}" part
>>>
>>>> +  'arguments': {
>>>> +   'command-line': 'drive_add-nbuddydriver=replication,
>>>
>>> missing spaces
>>>
>>>> +   mode=primary,
>>>> +   file.driver=nbd,
>>>> +   file.host=9.42.3.17,
>>>> +   file.port=9998,
>>>> +   file.export=hidden_disk0,
>>>> +   shared-disk-id=primary_disk0,
>>>> +   shared-disk=on,
>>>> +   node-name=rep'
>>>
>>
>>> Keep the whole command after "command-line" on one line, or you can't
>>> execute it correctly, IIRC.
>>>
>>
>> Hmm, I will change this hmp command to the qmp 'blockdev-add' command in the
>> next version, because it is supported now, though it is not ready for production.
>>
>
> It's a good start, but I'm not sure here.
>
> http://lists.nongnu.org/archive/html/qemu-devel/2016-11/msg01062.html
>

Yes, I noticed that, but COLO is not ready for production either.
So I think it is OK to use it here ...

> Thanks
>     -Xie
>>>> +  }
>>>> + }
>>>
>>> Secondary:
>>>
>>>> + -drive if=none,driver=qcow2,file.filename=/mnt/ramfs/hidden_disk.img,id=hidden_disk0,\
>>>> +        backing.driver=raw,backing.file.filename=1.raw \
>>>> + -drive if=virtio,id=active-disk0,driver=replication,mode=secondary,\
>>>> +        file.driver=qcow2,top-id=active-disk0,\
>>>> +        file.file.filename=/mnt/ramfs/active_disk.img,\
>>>> +        file.backing=hidden_disk0,shared-disk=on
>>>> +
>>>> +Issue qmp command:
>>>> +1. {'execute': 'nbd-server-start',
>>>> +    'arguments': {
>>>> +      'addr': {
>>>> +        'type': 'inet',
>>>> +        'data': {
>>>> +          'host': '0',
>>>
>>> s/0/9.42.3.17/g, since you use designated ip address above
>>>
>>
>>>> +          'port': '9998'
>>>> +        }
>>>> +      }
>>>> +    }
>>>> +   }
>>>> +2. {
>>>> +    'execute': 'nbd-server-add',
>>>> +    'arguments': {
>>>> +      'device': 'hidden_disk0',
>>>> +      'writable': true
>>>> +    }
>>>> +   }
>>>> +
>>>> +After Failover:
>>>> +Primary:
>>>> +{'execute': 'human-monitor-command',
>>>> +  'arguments': {
>>>> +   'command-line': 'drive_delrep'
>>>
>>> drive_del rep
>>>
>>
>> I'll use the qmp command instead here.
>>
>>>> +  }
>>>> +}
>>>> +
>>>> +Secondary:
>>>> + {'execute': 'nbd-server-stop' }
>>>> +
>>>>  TODO:
>>>>  1. Continuous block replication
>>>> -2. Shared disk
>>>>
>>
>> I will fix all the above problems in next version, thanks.
>>
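
For reference, here is a minimal sketch (not part of the patch) of how the primary-side command from the review could be assembled, so that everything after "command-line" stays on one line and the spaces flagged as missing are in place. The helper name build_drive_add is hypothetical, and the "shared-disk-id" and "shared-disk" options are the ones proposed by this RFC series, so they are assumptions here and may not exist in any released QEMU.

```python
import json

# Options copied from the quoted patch. "shared-disk-id" and "shared-disk"
# are proposed by this RFC series; treat them as assumptions, since they may
# not exist in any released QEMU.
PRIMARY_OPTIONS = [
    ("driver", "replication"),
    ("mode", "primary"),
    ("file.driver", "nbd"),
    ("file.host", "9.42.3.17"),
    ("file.port", "9998"),
    ("file.export", "hidden_disk0"),
    ("shared-disk-id", "primary_disk0"),
    ("shared-disk", "on"),
    ("node-name", "rep"),
]


def build_drive_add(options, bus="buddy"):
    """Hypothetical helper: wrap a one-line HMP drive_add in a QMP command."""
    # Join with commas only, so no newline or stray space ends up inside
    # the option string.
    opts = ",".join(f"{key}={value}" for key, value in options)
    return {
        "execute": "human-monitor-command",
        "arguments": {"command-line": f"drive_add -n {bus} {opts}"},
    }


if __name__ == "__main__":
    # Emit the command as a single line of JSON, ready to send over QMP.
    print(json.dumps(build_drive_add(PRIMARY_OPTIONS)))
```

Building the string from a list of pairs keeps each option on its own source line for readability while still producing a single-line command.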