Re: [PATCH v11 20/27] Support colo mode for qemu disk

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Wen Congyang <wency@cn.fujitsu.com>
To: Ian Jackson <Ian.Jackson@eu.citrix.com>,
	Changlong Xie <xiecl.fnst@cn.fujitsu.com>
Cc: Lars Kurth <lars.kurth@citrix.com>, Wei Liu <wei.liu2@citrix.com>,
	Ian Campbell <ian.campbell@citrix.com>,
	Andrew Cooper <andrew.cooper3@citrix.com>,
	Jiang Yunhong <yunhong.jiang@intel.com>,
	Dong Eddie <eddie.dong@intel.com>,
	xen devel <xen-devel@lists.xen.org>,
	Anthony Perard <anthony.perard@citrix.com>,
	Gui Jianfeng <guijianfeng@cn.fujitsu.com>,
	Shriram Rajagopalan <rshriram@cs.ubc.ca>,
	Yang Hongyang <hongyang.yang@easystack.cn>
Subject: Re: [PATCH v11 20/27] Support colo mode for qemu disk
Date: Mon, 7 Mar 2016 10:06:06 +0800	[thread overview]
Message-ID: <56DCE20E.7060804@cn.fujitsu.com> (raw)
In-Reply-To: <22233.51582.196386.722857@mariner.uk.xensource.com>

On 03/05/2016 01:44 AM, Ian Jackson wrote:
> Changlong Xie writes ("[PATCH v11 20/27] Support colo mode for qemu disk"):
>> From: Wen Congyang <wency@cn.fujitsu.com>
>>
>> Usage: disk = ['...,colo,colo-host=xxx,colo-port=xxx,colo-export=xxx,active-disk=xxx,hidden-disk=xxx...']
>> For QEMU block replication details:
>> http://wiki.qemu.org/Features/BlockReplication
> 
> So now I am slightly confused by the design, I think.
> 
> When you replicate a VM with COLO using xl, its memory state is
> transferred over ssh.  But its disk replication is done unencrypted
> and unauthenticated ?

Yes, it is a problem. I will think how to improve it.

> 
> And the disk replication is, out of band, and needs to be configured
> separately ?  This is rather awkward, although maybe not a
> showstopper.  (Maybe we can have a plan to fix it in the future...)

colo-host,colo-port should be the global configuration. And colo-export,
active-disk,hidden-disk must be configured separately, because each
disk should have a different configuration.

> 
> And, how does the disk replication, which doesn't depend on there
> being xl running, relate to the vm state replication, which does ?  I
> think at the very least I'd like to see some information about the
> principles of operation - either explained, or referred to, in the
> user manual.

OK. The disk replication doesn't depend on xl. We only can operate it
via qemu monitor command:
1. stop the vm
2. do the checkpoint
3. start the vm
1/3 is suspend/resume the guest. We only need to do 2 when both vm are
in the consistent state.

> 
> Is it possible to use COLO with an existing full-service disk
> replication service such as DRBD ?

DRBD doesn's support the case like COLO. Because both primary guest
and secondary guest need to write to the disk.

> 
>> +(a) An example for COLO replication's configuration: disk =['...,colo,colo-host
>> +=xxx,colo-port=xxx,colo-export=xxx,active-disk=xxx,hidden-disk=xxx...']
>> +
>> +=item B<colo-host>      :Secondary host's ip address.
>> +
>> +=item B<colo-port>      :Secondary host's port, we will run a nbd server on
>> +secondary host, and the nbd server will listen this port.
>> +
>> +=item B<colo-export>    :Nbd server's disk export name of secondary host.
>> +
>> +=item B<active-disk>    :Secondary's guest write will be buffered in this disk,
>> +and it's used by secondary.
>> +
>> +=item B<hidden-disk>    :Primary's modified contents will be buffered in this
>> +disk, and it's used by secondary.
> 
> What would a typical configuration look like ?  I don't understand the
> relationship between active-disk and hidden-disk, etc.

QEMU has a feature: backing file
For example: A's backing file is B
1. If we read from A, but the sector is not allocated in A. We wil return a zero
   sector to the guest. If A has a backing file, we will read the sector from B
   instead of returning a zero sector.
2. The backing file doesn't affect the write operation.

QEMU has another feature: backup block job
Backup job has two file: one is source and another is the target. It has some running
mode. For block replication, we use the mode "sync=none". In this mode, we will read
the data from the source disk before we modify it, and write it to the target disk.
We keep a bitmap to remeber which sector is backuped from the source disk to the
target disk. If the target disk is an empty disk, and empty disk's backing file is
the source disk, we can read from the target disk to get the source disk's originnal data.


How does block replication work:
A. primary qemu:
1. use the block driver quorum: it will read from all children and write to all children.
   child 0: real disk
   child 1: nbd client
   reading from child 1 will fail, but we use the fifo mode. In this mode, we read from
   child 0 will success and we don't read from child 0
   write to child 1: because child 1 is nbd client, it will forward the write request to
   nbd server

B. secondary qemu:
We have 3 disks: active disk(called it A), hidden disk(called it H), and secondary disk
(real disk, called it S).
A's backing file is H, and H's backing file is S.
We also start a backup job: the source disk is S, and the target disk is H.
we run nbd server in secondary qemu. And the nbd server will write to S.

Before resuming both primary vm and secondary vm: the state is:
1. primary disk and secondary disk are in the consistent state(contain the same data)
2. active disk and hidden disk are the empty disk
When the guest is running:
1. NBD server receives the primary write operation and writes the data to S
2. Before we write data to S, the backup job will read the original data and backup it
   to H
3. The secondary vm will write data to A.
4. If secondary vm will read data from A:
   I. If the sector is allocated in A, read it from A.
  II. Otherwise, the secondary vm doesn't modify this sector after the latest is resumed.
 III. In this case, we read it from H. We can read S's original data from H(See the explanation
      In backup job).

If we have more than 1 real disk, we can use exportname to tag each disk. Each pair of primary disk and
secondary disk should have the same export name.

> 
>> +colo-host
>> +---------
>> +
>> +Description:           Secondary host's address
>> +Mandatory:             Yes when COLO enabled
> 
> Is it permitted to specify a host DNS name ?

IIRC, I think it is OK.

> 
>> +    if (libxl_defbool_val(disk->colo_enable)) {
>> +        tmp = xs_read(ctx->xsh, XBT_NULL,
>> +                      GCSPRINTF("%s/colo-port", be_path), &len);
>> +        if (!tmp) {
>> +            LOG(ERROR, "Missing xenstore node %s/colo-port", be_path);
>> +            goto cleanup;
>> +        }
>> +        disk->colo_port = tmp;
> 
> This is quite repetitive code.

Yes. Will introduce a new function to avoid it in the next version.

> 
> 
>> diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c
>> index 3610a39..bff08b0 100644
>> --- a/tools/libxl/libxl_create.c
>> +++ b/tools/libxl/libxl_create.c
>> @@ -1800,12 +1800,29 @@ static void domain_create_cb(libxl__egc *egc,
> ...
>> @@ -256,6 +280,36 @@ static int disk_try_backend(disk_try_backend_args *a,
>>      LOG(DEBUG, "Disk vdev=%s, backend %s not compatible with script=...",
>>          a->disk->vdev, libxl_disk_backend_to_string(backend));
>>      return 0;
>> +
>> + bad_colo:
>> +    LOG(DEBUG, "Disk vdev=%s, backend %s not compatible with colo",
>> +        a->disk->vdev, libxl_disk_backend_to_string(backend));
>> +    return 0;
> 
> This is correct here, I think.
> 
>> + bad_colo_host:
>> +    LOG(DEBUG, "Disk vdev=%s, backend %s needs colo-host=... for colo",
>> +        a->disk->vdev, libxl_disk_backend_to_string(backend));
>> +    return 0;
> 
> I think these options should be checked later.  disk_try_backend isn't
> the place for general parameter checking; it is searching for which
> backend to try.

Hmm, do you mean we check it when we need to use COLO?

> 
>>  int libxl__device_disk_set_backend(libxl__gc *gc, libxl_device_disk *disk) {
>> diff --git a/tools/libxl/libxl_dm.c b/tools/libxl/libxl_dm.c
>> index 4aca38e..ba17251 100644
>> --- a/tools/libxl/libxl_dm.c
>> +++ b/tools/libxl/libxl_dm.c
>> @@ -751,6 +751,139 @@ static int libxl__dm_runas_helper(libxl__gc *gc, const char *username)
> ...
>> +static char *qemu_disk_scsi_drive_string(libxl__gc *gc, const char *pdev_path,
>> +                                         int unit, const char *format,
>> +                                         const libxl_device_disk *disk,
>> +                                         int colo_mode)
>> +{
>> +    char *drive = NULL;
>> +    const char *exportname = disk->colo_export;
>> +    const char *active_disk = disk->active_disk;
>> +    const char *hidden_disk = disk->hidden_disk;
>> +
>> +    switch (colo_mode) {
>> +    case LIBXL__COLO_NONE:
>> +        drive = libxl__sprintf
>> +            (gc, "file=%s,if=scsi,bus=0,unit=%d,format=%s,cache=writeback",
>> +             pdev_path, unit, format);
> 
> I think this would be a lot clearer if the refactoring was split into
> a seperate patch.

OK.

> 
>>                  if (strncmp(disks[i].vdev, "sd", 2) == 0) {
>> -                    drive = libxl__sprintf
>> -                        (gc, "file=%s,if=scsi,bus=0,unit=%d,format=%s,readonly=%s,cache=writeback",
>> -                         pdev_path, disk, format, disks[i].readwrite ? "off" : "on");
>> +                    if (colo_mode == LIBXL__COLO_SECONDARY) {
>> +                        /*
>> +                         * -drive if=none,driver=format,file=pdev_path,\
>> +                         * id=exportname
>> +                         */
> 
> I think this comment adds nothing to the code and could be profitably
> omitted.

OK.

> 
>> +                        drive = libxl__sprintf
>> +                            (gc, "if=none,driver=%s,file=%s,id=%s",
>> +                             format, pdev_path, disks[i].colo_export);
> 
> I don't understand how this works here.  COLO_SECONDARY seems to
> suppress the rest of the disk specification.

COLO_SECONDAY will use two disks: one is for S, and one is for A.
H is sepecified in A. This line is for S, and the codes for A is
in the function qemu_disk_scsi_drive_string().

> 
> Also, this same logic seems to appear many times.  Maybe it could be
> centralised.  Perhaps I would be able to advise more clearly if I
> understood how this was put together.
> 
>> diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl
>> index 9b0a537..a2078d1 100644
>> --- a/tools/libxl/libxl_types.idl
>> +++ b/tools/libxl/libxl_types.idl
>> @@ -575,6 +575,13 @@ libxl_device_disk = Struct("device_disk", [
>>      ("is_cdrom", integer),
>>      ("direct_io_safe", bool),
>>      ("discard_enable", libxl_defbool),
>> +    ("colo_enable", libxl_defbool),
>> +    ("colo_restore_enable", libxl_defbool),
>> +    ("colo_host", string),
>> +    ("colo_port", string),
>> +    ("colo_export", string),
>> +    ("active_disk", string),
>> +    ("hidden_disk", string)
> 
> In general, many these should probably not be strings.  Certainly the
> port should be an integer.  I don't quite understand the semantics of
> the others.

Yes, the port should be an integer. Will fix it in the next version.

Thanks
Wen Congyang

> 
> Ian.
> 
> 
> .
> 




_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

next prev parent reply	other threads:[~2016-03-07  2:06 UTC|newest]

Thread overview: 100+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-03-04  8:41 [PATCH v11 00/27] COarse-grain LOck-stepping Virtual Machines for Non-stop Service Changlong Xie
2016-03-04  8:41 ` [PATCH v11 01/27] tools/libxl: introduction of libxl__qmp_restore to load qemu state Changlong Xie
2016-03-04 16:30   ` Ian Jackson
2016-03-14  9:03     ` Changlong Xie
2016-03-04  8:41 ` [PATCH v11 02/27] tools/libxl: introduce libxl__domain_common_switch_qemu_logdirty() Changlong Xie
2016-03-04  8:41 ` [PATCH v11 03/27] tools/libxl: Add back channel to allow migration target send data back Changlong Xie
2016-03-04 16:38   ` Ian Jackson
2016-03-08 16:38     ` Wei Liu
2016-03-17  8:07     ` Changlong Xie
2016-03-04  8:41 ` [PATCH v11 04/27] tools/libxl: Introduce new helper function dup_fd_helper() Changlong Xie
2016-03-04 16:42   ` Ian Jackson
2016-03-17  8:08     ` Changlong Xie
2016-03-04  8:41 ` [PATCH v11 05/27] tools/libx{l, c}: add back channel to libxc Changlong Xie
2016-03-04 16:45   ` Ian Jackson
2016-03-04  8:41 ` [PATCH v11 06/27] docs: add colo readme Changlong Xie
2016-03-04  8:41 ` [PATCH v11 07/27] docs/libxl: Introduce CHECKPOINT_CONTEXT to support migration v2 colo streams Changlong Xie
2016-03-04 16:51   ` Ian Jackson
2016-03-08 16:38     ` Wei Liu
2016-03-11  7:13     ` Wen Congyang
2016-03-04  8:41 ` [PATCH v11 08/27] libxc/migration: Specification update for DIRTY_PFN_LIST records Changlong Xie
2016-03-04 16:53   ` Ian Jackson
2016-03-17  8:10     ` Changlong Xie
2016-03-04  8:41 ` [PATCH v11 09/27] libxc/migration: export read_record for common use Changlong Xie
2016-03-04 16:55   ` Ian Jackson
2016-03-04  8:41 ` [PATCH v11 10/27] tools/libxl: add back channel support to write stream Changlong Xie
2016-03-04 17:00   ` Ian Jackson
2016-03-07  2:13     ` Wen Congyang
2016-03-11  9:05     ` Wen Congyang
2016-03-17  8:11     ` Changlong Xie
2016-03-04  8:41 ` [PATCH v11 11/27] tools/libxl: add back channel support to read stream Changlong Xie
2016-03-04 17:01   ` Ian Jackson
2016-03-04  8:41 ` [PATCH v11 12/27] tools/libx{l, c}: introduce wait_checkpoint callback Changlong Xie
2016-03-04 17:03   ` Ian Jackson
2016-03-04 20:23     ` Konrad Rzeszutek Wilk
2016-03-07  2:16       ` Wen Congyang
2016-03-17  8:16     ` Changlong Xie
2016-03-04  8:41 ` [PATCH v11 13/27] tools/libx{l, c}: add postcopy/suspend callback to restore side Changlong Xie
2016-03-04 17:05   ` Ian Jackson
2016-03-17  8:17     ` Changlong Xie
2016-03-04  8:41 ` [PATCH v11 14/27] secondary vm suspend/resume/checkpoint code Changlong Xie
2016-03-04 17:11   ` Ian Jackson
2016-03-07  2:57     ` Wen Congyang
2016-03-17  9:03     ` Changlong Xie
2016-03-17 12:19       ` Wei Liu
2016-03-04  8:41 ` [PATCH v11 15/27] primary " Changlong Xie
2016-03-04 17:14   ` Ian Jackson
2016-03-07  2:59     ` Wen Congyang
2016-03-04  8:41 ` [PATCH v11 16/27] libxc/restore: support COLO restore Changlong Xie
2016-03-04 17:16   ` Ian Jackson
2016-03-04  8:41 ` [PATCH v11 17/27] libxc/save: support COLO save Changlong Xie
2016-03-04 17:18   ` Ian Jackson
2016-03-07  3:00     ` Wen Congyang
2016-03-04  8:41 ` [PATCH v11 18/27] implement the cmdline for COLO Changlong Xie
2016-03-04 17:22   ` Ian Jackson
2016-03-07  3:04     ` Wen Congyang
2016-03-04  8:41 ` [PATCH v11 19/27] COLO: introduce new API to prepare/start/do/get_error/stop replication Changlong Xie
2016-03-04 17:26   ` Ian Jackson
2016-03-08 16:46     ` Wei Liu
2016-03-18  3:44     ` Changlong Xie
2016-03-18 11:35       ` Wei Liu
2016-03-18  3:45     ` Changlong Xie
2016-03-04 17:29   ` Ian Jackson
2016-03-18  3:49     ` Changlong Xie
2016-03-04  8:41 ` [PATCH v11 20/27] Support colo mode for qemu disk Changlong Xie
2016-03-04 17:44   ` Ian Jackson
2016-03-07  2:06     ` Wen Congyang [this message]
2016-03-17 17:18       ` Ian Jackson
2016-03-18  5:42         ` Wen Congyang
2016-03-04 17:52   ` Ian Jackson
2016-03-04 20:30     ` Konrad Rzeszutek Wilk
2016-03-07  2:10       ` Wen Congyang
2016-03-08 17:22         ` Wei Liu
2016-03-09  2:09           ` Konrad Rzeszutek Wilk
2016-03-09 16:55             ` Wei Liu
2016-03-17 17:09           ` Ian Jackson
2016-03-17 17:10       ` Ian Jackson
2016-03-04  8:41 ` [PATCH v11 21/27] COLO: use qemu block replication Changlong Xie
2016-03-04  8:41 ` [PATCH v11 22/27] COLO proxy: implement setup/teardown of COLO proxy module Changlong Xie
2016-03-04 17:59   ` Ian Jackson
2016-03-18  8:22     ` Changlong Xie
2016-03-22  5:44     ` Changlong Xie
2016-03-22  5:55       ` Changlong Xie
2016-03-04  8:41 ` [PATCH v11 23/27] COLO proxy: preresume, postresume and checkpoint Changlong Xie
2016-03-04 18:01   ` Ian Jackson
2016-03-18  8:20     ` Changlong Xie
2016-03-04  8:41 ` [PATCH v11 24/27] COLO nic: implement COLO nic subkind Changlong Xie
2016-03-04 18:02   ` Ian Jackson
2016-03-18  8:20     ` Changlong Xie
2016-03-04  8:41 ` [PATCH v11 25/27] setup and control colo proxy on primary side Changlong Xie
2016-03-04 18:05   ` Ian Jackson
2016-03-22  6:01     ` Changlong Xie
2016-03-04  8:41 ` [PATCH v11 26/27] setup and control colo proxy on secondary side Changlong Xie
2016-03-04 18:05   ` Ian Jackson
2016-03-04  8:41 ` [PATCH v11 27/27] cmdline switches and config vars to control colo-proxy Changlong Xie
2016-03-04 18:09   ` Ian Jackson
2016-03-22  4:13     ` Changlong Xie
2016-03-04 18:17 ` [PATCH v11 00/27] COarse-grain LOck-stepping Virtual Machines for Non-stop Service Ian Jackson
2016-03-04 20:35   ` Konrad Rzeszutek Wilk
2016-03-17 17:19     ` Ian Jackson
2016-03-17 17:41   ` Ian Jackson

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=56DCE20E.7060804@cn.fujitsu.com \
    --to=wency@cn.fujitsu.com \
    --cc=Ian.Jackson@eu.citrix.com \
    --cc=andrew.cooper3@citrix.com \
    --cc=anthony.perard@citrix.com \
    --cc=eddie.dong@intel.com \
    --cc=guijianfeng@cn.fujitsu.com \
    --cc=hongyang.yang@easystack.cn \
    --cc=ian.campbell@citrix.com \
    --cc=lars.kurth@citrix.com \
    --cc=rshriram@cs.ubc.ca \
    --cc=wei.liu2@citrix.com \
    --cc=xen-devel@lists.xen.org \
    --cc=xiecl.fnst@cn.fujitsu.com \
    --cc=yunhong.jiang@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.