Message-ID: <583BC7FC.2040002@cn.fujitsu.com>
Date: Mon, 28 Nov 2016 14:00:28 +0800
From: Changlong Xie
MIME-Version: 1.0
References: <1476971860-20860-1-git-send-email-zhang.zhanghailiang@huawei.com>
 <1476971860-20860-2-git-send-email-zhang.zhanghailiang@huawei.com>
 <580F1FDE.8050401@cn.fujitsu.com> <583BBCE6.5070307@huawei.com>
In-Reply-To: <583BBCE6.5070307@huawei.com>
Content-Type: text/plain; charset="utf-8"; format=flowed
Content-Transfer-Encoding: 7bit
Subject: Re: [Qemu-devel] [PATCH RFC 1/7] docs/block-replication: Add description for shared-disk case
To: Hailiang Zhang, qemu-devel@nongnu.org, qemu-block@nongnu.org
Cc: stefanha@redhat.com, kwolf@redhat.com, mreitz@redhat.com,
 pbonzini@redhat.com, wency@cn.fujitsu.com, Zhang Chen, Markus Armbruster

On 11/28/2016 01:13 PM, Hailiang Zhang wrote:
>
> On 2016/10/25 17:03, Changlong Xie wrote:
>> On 10/20/2016 09:57 PM, zhanghailiang wrote:
>>> Introduce the scenario of shared-disk block replication
>>> and how to use it.
>>>
>>> Signed-off-by: zhanghailiang
>>> Signed-off-by: Wen Congyang
>>> Signed-off-by: Zhang Chen
>>> ---
>>>  docs/block-replication.txt | 131
>>> +++++++++++++++++++++++++++++++++++++++++++--
>>>  1 file changed, 127 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/docs/block-replication.txt b/docs/block-replication.txt
>>> index 6bde673..97fcfc1 100644
>>> --- a/docs/block-replication.txt
>>> +++ b/docs/block-replication.txt
>>> @@ -24,7 +24,7 @@ only dropped at next checkpoint time.
>>> To reduce the
>>> network transportation
>>> effort during a vmstate checkpoint, the disk modification
>>> operations of
>>> the Primary disk are asynchronously forwarded to the Secondary node.
>>>
>>> -== Workflow ==
>>> +== Non-shared disk workflow ==
>>> The following is the image of block replication workflow:
>>>
>>> +----------------------+
>>> +------------------------+
>>> @@ -57,7 +57,7 @@ The following is the image of block replication
>>> workflow:
>>> 4) Secondary write requests will be buffered in the Disk
>>> buffer and it
>>> will overwrite the existing sector content in the buffer.
>>>
>>> -== Architecture ==
>>> +== None-shared disk architecture ==
>>
>> s/None-shared/Non-shared/g
>>
>
>>> We are going to implement block replication from many basic
>>> blocks that are already in QEMU.
>>>
>>> @@ -106,6 +106,74 @@ any state that would otherwise be lost by the
>>> speculative write-through
>>> of the NBD server into the secondary disk. So before block
>>> replication,
>>> the primary disk and secondary disk should contain the same data.
>>>
>>> +== Shared Disk Mode Workflow ==
>>> +The following is the image of block replication workflow:
>>> +
>>> +        +----------------------+            +------------------------+
>>> +        |Primary Write Requests|            |Secondary Write Requests|
>>> +        +----------------------+            +------------------------+
>>> +                   |                                     |
>>> +                   |                                    (4)
>>> +                   |                                     V
>>> +                   |                              /-------------\
>>> +                   | (2)Forward and write through |             |
>>> +                   | +--------------------------> | Disk Buffer |
>>> +                   | |                            |             |
>>> +                   | |                            \-------------/
>>> +                   | |(1)read                            |
>>> +                   | |                                   |
>>> +        (3)write   | |                                   | backing file
>>> +                   V |                                   |
>>> +      +-----------------------------+                    |
>>> +      |         Shared Disk         | <------------------+
>>> +      +-----------------------------+
>>> +
>>> +    1) Primary writes will read original data and forward it to
>>> Secondary
>>> +       QEMU.
>>> +    2) Before Primary write requests are written to Shared disk, the
>>> +       original sector content will be read from Shared disk and
>>> +       forwarded and buffered in the Disk buffer on the secondary site,
>>> +       but it will not overwrite the existing
>>
>> extra spaces at the end of line
>>
>
>>> +       sector content(it could be from either "Secondary Write
>>> Requests" or
>>
>> Need a space before "(" for better style.
>>
>
>>> +       previous COW of "Primary Write Requests") in the Disk buffer.
>>> +    3) Primary write requests will be written to Shared disk.
>>> +    4) Secondary write requests will be buffered in the Disk buffer
>>> and it
>>> +       will overwrite the existing sector content in the buffer.
>>> +
>>> +== Shared Disk Mode Architecture ==
>>> +We are going to implement block replication from many basic
>>> +blocks that are already in QEMU.
>>> +           virtio-blk                     ||    .----------
>>> +              /                           ||    | Secondary
>>> +             /                            ||    '----------
>>> +            /                             ||                               virtio-blk
>>> +           /                              ||                                    |
>>> +          |                               ||                             replication(5)
>>> +          |                   NBD   -------->  NBD            (2)               |
>>> +          |                 client        ||  server ---> hidden disk <-- active disk(4)
>>> +          |                    ^          ||                   |
>>> +          |             replication(1)    ||                   |
>>> +          |                    |          ||                   |
>>> +          |  +-----------------'          ||                   |
>>> +      (3) |drive-backup sync=none         ||                   |
>>> +--------. |  +-----------------+          ||                   |
>>> +Primary | |                    |          ||        backing    |
>>> +--------' |                    |          ||                   |
>>> +          V                    |                               |
>>> +      +-------------------------------------------+            |
>>> +      |                shared disk                | <----------+
>>> +      +-------------------------------------------+
>>> +
>>> +
>>> +    1) Primary writes will read original data and forward it to
>>> Secondary
>>> +       QEMU.
>>> +    2) The hidden-disk buffers the original content that is modified
>>> by the
>>> +       primary VM. It should also be an empty disk, and
>>
>> extra spaces at end of line
>>
>
>>> +       the driver supports bdrv_make_empty() and backing file.
>>> +    3) Primary write requests will be written to Shared disk.
>>> +    4) Secondary write requests will be buffered in the active disk
>>> and it
>>> +       will overwrite the existing sector content in the buffer.
>>> +
>>> == Failure Handling ==
>>> There are 7 internal errors when block replication is running:
>>> 1. I/O error on primary disk
>>> @@ -145,7 +213,7 @@ d. replication_stop_all()
>>> things except failover. The caller must hold the I/O mutex lock
>>> if it is
>>> in migration/checkpoint thread.
>>>
>>> -== Usage ==
>>> +== Non-shared disk usage ==
>>> Primary:
>>> -drive
>>> if=xxx,driver=quorum,read-pattern=fifo,id=colo1,vote-threshold=1,\
>>> children.0.file.filename=1.raw,\
>>> @@ -234,6 +302,61 @@ Secondary:
>>> The primary host is down, so we should do the following thing:
>>> { 'execute': 'nbd-server-stop' }
>>>
>>> +== Shared disk usage ==
>>
>> Keeping the same coding style as the "== Non-shared disk usage ==" part
>> would be good.
>>
>
>>> +Primary:
>>> + -drive if=virtio,id=primary_disk0,file.filename=1.raw,driver=raw
>>> +
>>> +Issue qmp command:
>>> + {'execute': 'human-monitor-command',
>>
>> two space indentation for the whole "{...}" part
>>
>>> +   'arguments': {
>>> +     'command-line': 'drive_add-nbuddydriver=replication,
>>
>> missing spaces
>>
>>> +     mode=primary,
>>> +     file.driver=nbd,
>>> +     file.host=9.42.3.17,
>>> +     file.port=9998,
>>> +     file.export=hidden_disk0,
>>> +     shared-disk-id=primary_disk0,
>>> +     shared-disk=on,
>>> +     node-name=rep'
>>
>
>> Keep the whole command after "command-line" on one line, or you can't
>> execute it correctly, IIRC.
>>
>
> Hmm, i will change this hmp command to qmp 'blockdev-add' command in next
> version, because it is supported now, though it is ready for production.
> It's a good start, but i'm not sure here.
http://lists.nongnu.org/archive/html/qemu-devel/2016-11/msg01062.html

Thanks
	-Xie

>>> +   }
>>> + }
>>
>> Secondary:
>>
>>> + -drive
>>> if=none,driver=qcow2,file.filename=/mnt/ramfs/hidden_disk.img,id=hidden_disk0,\
>>>
>>> +        backing.driver=raw,backing.file.filename=1.raw \
>>> + -drive if=virtio,id=active-disk0,driver=replication,mode=secondary,\
>>> +        file.driver=qcow2,top-id=active-disk0,\
>>> +        file.file.filename=/mnt/ramfs/active_disk.img,\
>>> +        file.backing=hidden_disk0,shared-disk=on
>>> +
>>> +Issue qmp command:
>>> +1. {'execute': 'nbd-server-start',
>>> +     'arguments': {
>>> +       'addr': {
>>> +         'type': 'inet',
>>> +         'data': {
>>> +           'host': '0',
>>
>> s/0/9.42.3.17/g, since you use designated ip address above
>>
>
>>> +           'port': '9998'
>>> +         }
>>> +       }
>>> +     }
>>> +   }
>>> +2. {
>>> +     'execute': 'nbd-server-add',
>>> +     'arguments': {
>>> +       'device': 'hidden_disk0',
>>> +       'writable': true
>>> +     }
>>> +   }
>>> +
>>> +After Failover:
>>> +Primary:
>>> +{'execute': 'human-monitor-command',
>>> +  'arguments': {
>>> +    'command-line': 'drive_delrep'
>>
>> drive_del rep
>>
>
> I'll use the qmp command instead here.
>
>>> +  }
>>> +}
>>> +
>>> +Secondary:
>>> + {'execute': 'nbd-server-stop' }
>>> +
>>> TODO:
>>> 1. Continuous block replication
>>> -2. Shared disk
>>>
>>
>
> I will fix all the above problems in next version, thanks.
>
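
For anyone following the thread, the four steps of the quoted "Shared Disk Mode Workflow" can be modelled with a small Python sketch. This is only a toy illustration of the buffering rules described in the patch text, not QEMU code: the class name, the method names and the dict-based "disks" are all hypothetical, and the read path (buffer first, then the shared disk as backing file) is an assumption drawn from the "backing file" arrow in the diagram.

```python
class SharedDiskReplication:
    """Toy model of workflow steps (1)-(4); all names here are hypothetical."""

    def __init__(self, shared_disk):
        self.shared_disk = shared_disk  # sector -> data, seen by both sides
        self.disk_buffer = {}           # Secondary-side Disk buffer

    def primary_write(self, sector, data):
        # (1) Read the original sector content from the shared disk.
        original = self.shared_disk.get(sector)
        # (2) Forward it to the Disk buffer, but do not overwrite content
        #     already buffered there (from a Secondary write or an earlier
        #     copy-on-write of a Primary write).
        self.disk_buffer.setdefault(sector, original)
        # (3) Only then write the new data to the shared disk.
        self.shared_disk[sector] = data

    def secondary_write(self, sector, data):
        # (4) Secondary writes go to the Disk buffer and overwrite
        #     whatever is buffered for that sector.
        self.disk_buffer[sector] = data

    def secondary_read(self, sector):
        # Assumption: the buffer is consulted first, and the shared disk
        # acts as its backing file for sectors that were never buffered.
        if sector in self.disk_buffer:
            return self.disk_buffer[sector]
        return self.shared_disk.get(sector)


rep = SharedDiskReplication({0: "A", 1: "B"})
rep.primary_write(0, "A2")    # buffer keeps the original "A"
rep.secondary_write(1, "S")   # Secondary's own data for sector 1
rep.primary_write(1, "B2")    # must not clobber the buffered "S"
print(rep.secondary_read(0), rep.secondary_read(1), rep.shared_disk)
# prints: A S {0: 'A2', 1: 'B2'}
```

The point of the sketch is the asymmetry between steps (2) and (4): forwarded originals never replace buffered content, while Secondary writes always do, so the Secondary keeps a consistent view even though the Primary modifies the shared disk underneath it.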