[RFC Patch v2 00/45] COarse-grain LOck-stepping Virtual Machines for Non-stop Service

xen-devel.lists.xenproject.org archive mirror

* [RFC Patch v2 00/45] COarse-grain LOck-stepping Virtual Machines for Non-stop Service
@ 2014-08-08  7:00 Wen Congyang
  2014-08-08  7:01 ` [RFC Patch v2 01/45] copy the correct page to memory Wen Congyang
                   ` (46 more replies)
  0 siblings, 47 replies; 64+ messages in thread
From: Wen Congyang @ 2014-08-08  7:00 UTC
  To: xen devel
  Cc: Ian Campbell, Wen Congyang, Ian Jackson, Jiang Yunhong,
	Dong Eddie, Yang Hongyang, Lai Jiangshan

Virtual machine (VM) replication is a well-known technique for providing
application-agnostic, software-implemented hardware fault tolerance, i.e.
"non-stop service". Remus currently provides this function, but it buffers
all output packets until the next checkpoint, so the added latency is
unacceptable.

At Xen Summit 2012, we introduced a new VM replication solution: COLO
(COarse-grain LOck-stepping virtual machine). The presentation is available
at the following URL:
http://www.slideshare.net/xen_com_mgr/colo-coarsegrain-lockstepping-virtual-machines-for-nonstop-service

Here is a summary of the solution:
From the client's point of view, as long as the client observes identical
responses from the primary and secondary VMs, according to the service
semantics, the secondary VM is a valid replica of the primary VM and can
successfully take over when a hardware failure of the primary VM is
detected.
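
The idea above can be sketched as a comparison of the two VMs' outputs. This
is only an illustration, not code from this patchset (nic replication is still
on the TODO list). The names colo_action and colo_compare_output are
hypothetical, and the choice to request a checkpoint when the outputs differ
is an assumption taken from the general COLO design, not from this series.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical decision taken for one pair of output packets. */
enum colo_action {
    COLO_RELEASE_OUTPUT,  /* outputs match: the secondary is a valid replica */
    COLO_DO_CHECKPOINT    /* outputs differ: resynchronize the secondary */
};

/*
 * Compare one response produced by the primary VM with the response
 * produced by the secondary VM. A byte-wise comparison is the simplest
 * possible stand-in for "identical according to the service semantics".
 */
enum colo_action colo_compare_output(const void *primary, size_t primary_len,
                                     const void *secondary,
                                     size_t secondary_len)
{
    assert(primary != NULL && secondary != NULL);

    if (primary_len == secondary_len &&
        memcmp(primary, secondary, primary_len) == 0)
        return COLO_RELEASE_OUTPUT;

    return COLO_DO_CHECKPOINT;
}
```

As long as the function returns COLO_RELEASE_OUTPUT, the client cannot tell
the two VMs apart, so no checkpoint is needed for correctness.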

This patchset is an RFC. It implements the COLO framework and disk replication:
1. Both the primary VM and the secondary VM are running
2. Do checkpoints
3. Disk replication (using blktap2)

This patchset is based on remus-v18 and uses migration v1. Only HVM guests
are supported for now.

TODO list:
1. Use migration v2 to implement COLO
2. NIC replication
3. Support PV guests

Patch 1-3  : bugfixes
Patch 4-11 : update some APIs which will be used by COLO
Patch 12-15: temporarily update remus to reuse the remus device code
Patch 16-23: COLO framework code
Patch 24   : hack patch, just for testing
Patch 25-34: bugfixes for blktap2
Patch 35-38: move some block-remus code to block-replication.c. This code
             will be reused by COLO.
Patch 39   : implement block-colo
Patch 40-43: update libxl to support blktap2
Patch 44   : implement disk replication
Patch 45   : hypervisor bugfix. We found this bug before rebasing COLO onto
             the newest Xen, but we can no longer trigger it.
Patch 46   : a patch for qemu-xen

Changelog from v1 to v2:
1. rebase to newest remus
2. add disk replication support

Hong Tao (1):
  copy the correct page to memory

Lai Jiangshan (1):
  colo: dynamic allocate aio_requests to avoid -EBUSY error

Wen Congyang (43):
  csum the correct page
  don't zero out ioreq page
  Refactor domain_suspend_callback_common()
  Update libxl__domain_resume() for colo
  Update libxl__domain_suspend_common_switch_qemu_logdirty() for colo
  Introduce a new internal API libxl__domain_unpause()
  Update libxl__domain_unpause() to support qemu-xen
  support to resume uncooperative HVM guests
  update datecopier to support sending data only
  introduce a new API to aync read data from fd
  move remus related codes to libxl_remus.c
  rename remus device to checkpoint device
  adjust the indentation
  don't touch remus in checkpoint_device
  Update libxl_save_msgs_gen.pl to support return data from xl to xc
  Allow slave sends data to master
  secondary vm suspend/resume/checkpoint code
  primary vm suspend/get_dirty_pfn/resume/checkpoint code
  xc_domain_save: flush cache before calling callbacks->postcopy() in
    colo mode
  COLO: xc related codes
  send store mfn and console mfn to xl before resuming secondary vm
  implement the cmdline for COLO
  HACK: do checkpoint per 20ms
  fix memory leak in block-remus
  pass uuid to the callback td_open
  return the correct dev path
  blktap2: use correct way to get remus_image
  don't call client_flush() when switching to unprotected mode
  remus: fix bug in tdremus_close()
  blktap2: use correct way to get free event id
  blktap2: don't return negative event id
  blktap2: use correct way to define array.
  blktap2: connect to backup asynchronously
  switch to unprotected mode before closing
  blktap2: move async connect related codes to block-replication.c
  blktap2: move ramdisk related codes to block-replication.c
  block-colo: implement colo disk replication
  pass correct file to qemu if we use blktap2
  support blktap remus in xl
  support blktap colo in xl:
  update libxl__device_disk_from_xs_be() to support blktap device
  libxl/colo: setup and control disk replication for blktap2 backends
  x86/hvm: Always set pending event injection when loading VMC[BS]
    state.

 docs/man/xl.pod.1                                  |   11 +-
 tools/blktap2/drivers/Makefile                     |    5 +-
 tools/blktap2/drivers/block-aio.c                  |   41 +-
 tools/blktap2/drivers/block-cache.c                |    4 +-
 tools/blktap2/drivers/block-colo.c                 | 1151 ++++++++++++++++++
 tools/blktap2/drivers/block-log.c                  |    4 +-
 tools/blktap2/drivers/block-qcow.c                 |    5 +-
 tools/blktap2/drivers/block-ram.c                  |    5 +-
 tools/blktap2/drivers/block-remus.c                | 1266 +++++---------------
 tools/blktap2/drivers/block-replication.c          | 1116 +++++++++++++++++
 tools/blktap2/drivers/block-replication.h          |  217 ++++
 tools/blktap2/drivers/block-vhd.c                  |    5 +-
 tools/blktap2/drivers/scheduler.c                  |   33 +-
 tools/blktap2/drivers/tapdisk-control.c            |   17 +-
 tools/blktap2/drivers/tapdisk-disktype.c           |   21 +-
 tools/blktap2/drivers/tapdisk-disktype.h           |    3 +-
 tools/blktap2/drivers/tapdisk-interface.c          |   21 +-
 tools/blktap2/drivers/tapdisk-interface.h          |    1 +
 tools/blktap2/drivers/tapdisk-vbd.c                |    9 +
 tools/blktap2/drivers/tapdisk-vbd.h                |    1 +
 tools/blktap2/drivers/tapdisk.h                    |    3 +-
 tools/libxc/xc_domain_restore.c                    |   74 +-
 tools/libxc/xc_domain_save.c                       |   66 +-
 tools/libxc/xc_resume.c                            |   20 +-
 tools/libxc/xenguest.h                             |   40 +
 tools/libxl/Makefile                               |    5 +-
 tools/libxl/libxl.c                                |  148 ++-
 tools/libxl/libxl.h                                |    3 +-
 tools/libxl/libxl_aoutils.c                        |   81 +-
 tools/libxl/libxl_blktap2.c                        |   35 +
 ...xl_remus_device.c => libxl_checkpoint_device.c} |  221 ++--
 tools/libxl/libxl_colo.h                           |   48 +
 tools/libxl/libxl_colo_restore.c                   |  878 ++++++++++++++
 tools/libxl/libxl_colo_save.c                      |  628 ++++++++++
 tools/libxl/libxl_colo_save_disk_blktap2.c         |  216 ++++
 tools/libxl/libxl_create.c                         |  138 ++-
 tools/libxl/libxl_device.c                         |    6 +-
 tools/libxl/libxl_dm.c                             |   20 +-
 tools/libxl/libxl_dom.c                            |  565 ++++-----
 tools/libxl/libxl_internal.h                       |  309 +++--
 tools/libxl/libxl_netbuffer.c                      |  127 +-
 tools/libxl/libxl_noblktap2.c                      |   35 +
 tools/libxl/libxl_nonetbuffer.c                    |   14 +-
 tools/libxl/libxl_qmp.c                            |   10 +
 tools/libxl/libxl_remus.c                          |  335 ++++++
 tools/libxl/libxl_remus.h                          |   27 +
 tools/libxl/libxl_remus_disk_drbd.c                |   67 +-
 tools/libxl/libxl_save_callout.c                   |   37 +-
 tools/libxl/libxl_save_helper.c                    |   17 +
 tools/libxl/libxl_save_msgs_gen.pl                 |   74 +-
 tools/libxl/libxl_types.idl                        |   14 +-
 tools/libxl/libxl_utils.c                          |   23 +
 tools/libxl/libxl_utils.h                          |    1 +
 tools/libxl/libxlu_disk_l.l                        |    2 +
 tools/libxl/xl_cmdimpl.c                           |   54 +-
 tools/libxl/xl_cmdtable.c                          |    3 +-
 xen/arch/x86/hvm/svm/svm.c                         |   16 +-
 xen/arch/x86/hvm/vmx/vmx.c                         |   25 +-
 58 files changed, 6558 insertions(+), 1763 deletions(-)
 create mode 100644 tools/blktap2/drivers/block-colo.c
 create mode 100644 tools/blktap2/drivers/block-replication.c
 create mode 100644 tools/blktap2/drivers/block-replication.h
 rename tools/libxl/{libxl_remus_device.c => libxl_checkpoint_device.c} (41%)
 create mode 100644 tools/libxl/libxl_colo.h
 create mode 100644 tools/libxl/libxl_colo_restore.c
 create mode 100644 tools/libxl/libxl_colo_save.c
 create mode 100644 tools/libxl/libxl_colo_save_disk_blktap2.c
 create mode 100644 tools/libxl/libxl_remus.c
 create mode 100644 tools/libxl/libxl_remus.h

-- 
1.9.3


* [RFC Patch v2 01/45] copy the correct page to memory
  2014-08-08  7:00 [RFC Patch v2 00/45] COarse-grain LOck-stepping Virtual Machines for Non-stop Service Wen Congyang
@ 2014-08-08  7:01 ` Wen Congyang
  2014-08-08  7:01 ` [RFC Patch v2 02/45] csum the correct page Wen Congyang
                   ` (45 subsequent siblings)
  46 siblings, 0 replies; 64+ messages in thread
From: Wen Congyang @ 2014-08-08  7:01 UTC
  To: xen devel
  Cc: Ian Campbell, Wen Congyang, Ian Jackson, Jiang Yunhong,
	Dong Eddie, Hong Tao, Yang Hongyang, Lai Jiangshan

From: Hong Tao <bobby.hong@huawei.com>

apply_batch() handles at most MAX_BATCH_SIZE pages per call. Bogus,
unmapped, allocate-only and broken pages are skipped and have no data
in pagebuf->pages. So when apply_batch() is called again, the index of
the first page in pagebuf->pages is curbatch - invalid_pages, not
curbatch, where invalid_pages is the number of
bogus/unmapped/allocate-only/broken pages found so far.

In many cases invalid_pages is 0, so this error went unnoticed.
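
The index arithmetic can be illustrated with a small standalone sketch. It is
not the libxc code: enum page_type, has_no_data() and first_data_page() are
hypothetical names, and the enum values only stand in for the
XEN_DOMCTL_PFINFO_* page types.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical page types, standing in for XEN_DOMCTL_PFINFO_* values. */
enum page_type { PAGE_NORMAL, PAGE_XTAB, PAGE_XALLOC, PAGE_BROKEN };

/* Entries of these types carry no data in the packed page buffer. */
static int has_no_data(enum page_type t)
{
    return t == PAGE_XTAB || t == PAGE_XALLOC || t == PAGE_BROKEN;
}

/*
 * Return the index, within the packed data buffer, of the first data page
 * of the batch that starts at entry `curbatch`: curbatch minus the number
 * of data-less entries that precede it.
 */
size_t first_data_page(const enum page_type *types, size_t curbatch)
{
    size_t invalid_pages = 0;

    for (size_t i = 0; i < curbatch; i++)
        if (has_no_data(types[i]))
            invalid_pages++;

    assert(invalid_pages <= curbatch);
    return curbatch - invalid_pages;
}
```

With no data-less entries the result equals curbatch, which is why the old
code appeared to work in the common case.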

Signed-off-by: Hong Tao <bobby.hong@huawei.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
 tools/libxc/xc_domain_restore.c | 24 +++++++++++++++++++-----
 1 file changed, 19 insertions(+), 5 deletions(-)

diff --git a/tools/libxc/xc_domain_restore.c b/tools/libxc/xc_domain_restore.c
index e73e0a2..6c346f9 100644
--- a/tools/libxc/xc_domain_restore.c
+++ b/tools/libxc/xc_domain_restore.c
@@ -1106,7 +1106,7 @@ static int pagebuf_get(xc_interface *xch, struct restore_ctx *ctx,
 static int apply_batch(xc_interface *xch, uint32_t dom, struct restore_ctx *ctx,
                        xen_pfn_t* region_mfn, unsigned long* pfn_type, int pae_extended_cr3,
                        struct xc_mmu* mmu,
-                       pagebuf_t* pagebuf, int curbatch)
+                       pagebuf_t* pagebuf, int curbatch, int *invalid_pages)
 {
     int i, j, curpage, nr_mfns;
     int k, scount;
@@ -1121,6 +1121,12 @@ static int apply_batch(xc_interface *xch, uint32_t dom, struct restore_ctx *ctx,
     struct domain_info_context *dinfo = &ctx->dinfo;
     int* pfn_err = NULL;
     int rc = -1;
+    int local_invalid_pages = 0;
+    /* We have handled curbatch pages before this batch, and there are
+     * *invalid_pages pages that are not in pagebuf->pages. So the first
+     * page for this batch is page (curbatch - *invalid_pages).
+     */
+    int first_page = curbatch - *invalid_pages;
 
     unsigned long mfn, pfn, pagetype;
 
@@ -1293,10 +1299,13 @@ static int apply_batch(xc_interface *xch, uint32_t dom, struct restore_ctx *ctx,
         pfn      = pagebuf->pfn_types[i + curbatch] & ~XEN_DOMCTL_PFINFO_LTAB_MASK;
         pagetype = pagebuf->pfn_types[i + curbatch] &  XEN_DOMCTL_PFINFO_LTAB_MASK;
 
-        if ( pagetype == XEN_DOMCTL_PFINFO_XTAB 
+        if ( pagetype == XEN_DOMCTL_PFINFO_XTAB
              || pagetype == XEN_DOMCTL_PFINFO_XALLOC)
+        {
+            local_invalid_pages++;
             /* a bogus/unmapped/allocate-only page: skip it */
             continue;
+        }
 
         if ( pagetype == XEN_DOMCTL_PFINFO_BROKEN )
         {
@@ -1306,6 +1315,8 @@ static int apply_batch(xc_interface *xch, uint32_t dom, struct restore_ctx *ctx,
                       "dom=%d, pfn=%lx\n", dom, pfn);
                 goto err_mapped;
             }
+
+            local_invalid_pages++;
             continue;
         }
 
@@ -1344,7 +1355,7 @@ static int apply_batch(xc_interface *xch, uint32_t dom, struct restore_ctx *ctx,
             }
         }
         else
-            memcpy(page, pagebuf->pages + (curpage + curbatch) * PAGE_SIZE,
+            memcpy(page, pagebuf->pages + (first_page + curpage) * PAGE_SIZE,
                    PAGE_SIZE);
 
         pagetype &= XEN_DOMCTL_PFINFO_LTABTYPE_MASK;
@@ -1418,6 +1429,7 @@ static int apply_batch(xc_interface *xch, uint32_t dom, struct restore_ctx *ctx,
     } /* end of 'batch' for loop */
 
     rc = nraces;
+    *invalid_pages += local_invalid_pages;
 
   err_mapped:
     munmap(region_base, j*PAGE_SIZE);
@@ -1621,7 +1633,7 @@ int xc_domain_restore(xc_interface *xch, int io_fd, uint32_t dom,
  loadpages:
     for ( ; ; )
     {
-        int j, curbatch;
+        int j, curbatch, invalid_pages;
 
         xc_report_progress_step(xch, n, dinfo->p2m_size);
 
@@ -1665,11 +1677,13 @@ int xc_domain_restore(xc_interface *xch, int io_fd, uint32_t dom,
 
         /* break pagebuf into batches */
         curbatch = 0;
+        invalid_pages = 0;
         while ( curbatch < j ) {
             int brc;
 
             brc = apply_batch(xch, dom, ctx, region_mfn, pfn_type,
-                              pae_extended_cr3, mmu, &pagebuf, curbatch);
+                              pae_extended_cr3, mmu, &pagebuf, curbatch,
+                              &invalid_pages);
             if ( brc < 0 )
                 goto out;
 
-- 
1.9.3


* [RFC Patch v2 02/45] csum the correct page
  2014-08-08  7:00 [RFC Patch v2 00/45] COarse-grain LOck-stepping Virtual Machines for Non-stop Service Wen Congyang
  2014-08-08  7:01 ` [RFC Patch v2 01/45] copy the correct page to memory Wen Congyang
@ 2014-08-08  7:01 ` Wen Congyang
  2014-08-08  7:01 ` [RFC Patch v2 03/45] don't zero out ioreq page Wen Congyang
                   ` (44 subsequent siblings)
  46 siblings, 0 replies; 64+ messages in thread
From: Wen Congyang @ 2014-08-08  7:01 UTC
  To: xen devel
  Cc: Ian Campbell, Wen Congyang, Ian Jackson, Jiang Yunhong,
	Dong Eddie, Yang Hongyang, Lai Jiangshan

In verify mode, we map the guest memory, and guest page i is at
region_base + i * PAGE_SIZE. So we should csum the page at
(region_base + i * PAGE_SIZE), not (region_base + (i + curbatch) * PAGE_SIZE).

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
 tools/libxc/xc_domain_restore.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/libxc/xc_domain_restore.c b/tools/libxc/xc_domain_restore.c
index 6c346f9..42abb22 100644
--- a/tools/libxc/xc_domain_restore.c
+++ b/tools/libxc/xc_domain_restore.c
@@ -1405,7 +1405,7 @@ static int apply_batch(xc_interface *xch, uint32_t dom, struct restore_ctx *ctx,
 
                 DPRINTF("************** pfn=%lx type=%lx gotcs=%08lx "
                         "actualcs=%08lx\n", pfn, pagebuf->pfn_types[pfn],
-                        csum_page(region_base + (i + curbatch)*PAGE_SIZE),
+                        csum_page(region_base + i * PAGE_SIZE),
                         csum_page(buf));
 
                 for ( v = 0; v < 4; v++ )
-- 
1.9.3


* [RFC Patch v2 03/45] don't zero out ioreq page
  2014-08-08  7:00 [RFC Patch v2 00/45] COarse-grain LOck-stepping Virtual Machines for Non-stop Service Wen Congyang
  2014-08-08  7:01 ` [RFC Patch v2 01/45] copy the correct page to memory Wen Congyang
  2014-08-08  7:01 ` [RFC Patch v2 02/45] csum the correct page Wen Congyang
@ 2014-08-08  7:01 ` Wen Congyang
  2014-08-08  7:01 ` [RFC Patch v2 04/45] Refactor domain_suspend_callback_common() Wen Congyang
                   ` (43 subsequent siblings)
  46 siblings, 0 replies; 64+ messages in thread
From: Wen Congyang @ 2014-08-08  7:01 UTC
  To: xen devel
  Cc: Ian Campbell, Wen Congyang, Ian Jackson, Jiang Yunhong,
	Dong Eddie, Paul Durrant, Yang Hongyang, Lai Jiangshan

The ioreq page may contain pending I/O requests, which need to be
handled after migration, so do not zero it out.

TODO:
1. Update qemu to handle the pending I/O requests

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Paul Durrant <paul.durrant@citrix.com>
---
 tools/libxc/xc_domain_restore.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/tools/libxc/xc_domain_restore.c b/tools/libxc/xc_domain_restore.c
index 42abb22..2d6139c 100644
--- a/tools/libxc/xc_domain_restore.c
+++ b/tools/libxc/xc_domain_restore.c
@@ -2301,9 +2301,7 @@ int xc_domain_restore(xc_interface *xch, int io_fd, uint32_t dom,
     }
 
     /* These comms pages need to be zeroed at the start of day */
-    if ( xc_clear_domain_page(xch, dom, tailbuf.u.hvm.magicpfns[0]) ||
-         xc_clear_domain_page(xch, dom, tailbuf.u.hvm.magicpfns[1]) ||
-         xc_clear_domain_page(xch, dom, tailbuf.u.hvm.magicpfns[2]) )
+    if ( xc_clear_domain_page(xch, dom, tailbuf.u.hvm.magicpfns[2]) )
     {
         PERROR("error zeroing magic pages");
         goto out;
-- 
1.9.3


* [RFC Patch v2 04/45] Refactor domain_suspend_callback_common()
  2014-08-08  7:00 [RFC Patch v2 00/45] COarse-grain LOck-stepping Virtual Machines for Non-stop Service Wen Congyang
                   ` (2 preceding siblings ...)
  2014-08-08  7:01 ` [RFC Patch v2 03/45] don't zero out ioreq page Wen Congyang
@ 2014-08-08  7:01 ` Wen Congyang
  2014-08-08  7:01 ` [RFC Patch v2 05/45] Update libxl__domain_resume() for colo Wen Congyang
                   ` (42 subsequent siblings)
  46 siblings, 0 replies; 64+ messages in thread
From: Wen Congyang @ 2014-08-08  7:01 UTC
  To: xen devel
  Cc: Ian Campbell, Wen Congyang, Ian Jackson, Jiang Yunhong,
	Dong Eddie, Yang Hongyang, Lai Jiangshan

libxl__domain_suspend() saves the guest. I think it should be
called libxl__domain_save(), but this patch does not rename it.

In COLO mode the secondary VM is running, so we repeat the
following steps again and again:
1. Suspend both the primary VM and the secondary VM
2. Sync the state
3. Resume both the primary VM and the secondary VM
To suspend the secondary VM, we need an independent API that
only suspends a VM.

The core function for suspending a VM is
domain_suspend_callback_common(). So introduce a new structure,
libxl__domain_suspend_state2, and use it there instead of
libxl__domain_suspend_state. The dss members that are used in
domain_suspend_callback_common() are moved to dss2.

Also introduce a new API, libxl__domain_suspend2().
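
The shape of this refactoring can be sketched in standalone C. All names below
(sketch_suspend_state2, sketch_save_state, sketch_suspend_only and so on) are
hypothetical, and embedding the suspend-only state inside the save state is an
assumption made for the illustration. The real libxl structures are
asynchronous and carry many more members.

```c
#include <assert.h>
#include <stddef.h>

/* Generic container_of; libxl's own CONTAINER_OF macro has another form. */
#define sketch_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* State needed only to suspend a domain (the role of "dss2"). */
struct sketch_suspend_state2 {
    unsigned int domid;
    int hvm;
    int save_dm;  /* whether to save the device model state as well */
    void (*callback_common_done)(struct sketch_suspend_state2 *dss2, int ok);
};

/* State for a full save (the role of "dss"); it embeds the suspend state. */
struct sketch_save_state {
    int suspend_ok;
    struct sketch_suspend_state2 dss2;
};

/* Completion callback of the save path: recover the outer state. */
static void save_suspend_done(struct sketch_suspend_state2 *dss2, int ok)
{
    struct sketch_save_state *dss =
        sketch_container_of(dss2, struct sketch_save_state, dss2);
    dss->suspend_ok = ok;
}

void sketch_save_state_init(struct sketch_save_state *dss, unsigned int domid)
{
    dss->suspend_ok = 0;
    dss->dss2.domid = domid;
    dss->dss2.hvm = 1;
    dss->dss2.save_dm = 1;
    dss->dss2.callback_common_done = save_suspend_done;
}

/*
 * Suspend-only entry point: it sees nothing but the suspend state, so a
 * caller that is not saving the guest can use it with its own callback.
 * The real code would suspend the guest here; the sketch reports success.
 */
void sketch_suspend_only(struct sketch_suspend_state2 *dss2)
{
    assert(dss2 != NULL && dss2->callback_common_done != NULL);
    dss2->callback_common_done(dss2, 1);
}
```

Because the suspend path only touches the inner structure, the secondary side
can reuse it without going through the save path.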

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
 tools/libxl/libxl_dom.c      | 235 ++++++++++++++++++++++++-------------------
 tools/libxl/libxl_internal.h |  39 +++++--
 2 files changed, 159 insertions(+), 115 deletions(-)

diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
index 302062a..1607930 100644
--- a/tools/libxl/libxl_dom.c
+++ b/tools/libxl/libxl_dom.c
@@ -842,7 +842,7 @@ int libxl__toolstack_restore(uint32_t domid, const uint8_t *buf,
 static void domain_suspend_done(libxl__egc *egc,
                         libxl__domain_suspend_state *dss, int rc);
 static void domain_suspend_callback_common_done(libxl__egc *egc,
-                                libxl__domain_suspend_state *dss, int ok);
+                                libxl__domain_suspend_state2 *dss2, int ok);
 
 /*----- complicated callback, called by xc_domain_save -----*/
 
@@ -1060,16 +1060,17 @@ static void switch_logdirty_done(libxl__egc *egc,
 /*----- callbacks, called by xc_domain_save -----*/
 
 int libxl__domain_suspend_device_model(libxl__gc *gc,
-                                       libxl__domain_suspend_state *dss)
+                                       libxl__domain_suspend_state2 *dss2)
 {
     int ret = 0;
-    uint32_t const domid = dss->domid;
-    const char *const filename = dss->dm_savefile;
+    uint32_t const domid = dss2->domid;
+    const char *const filename = dss2->dm_savefile;
 
     switch (libxl__device_model_version_running(gc, domid)) {
     case LIBXL_DEVICE_MODEL_VERSION_QEMU_XEN_TRADITIONAL: {
         LOG(DEBUG, "Saving device model state to %s", filename);
-        libxl__qemu_traditional_cmd(gc, domid, "save");
+        if (dss2->save_dm)
+            libxl__qemu_traditional_cmd(gc, domid, "save");
         libxl__wait_for_device_model_deprecated(gc, domid, "paused", NULL, NULL, NULL);
         break;
     }
@@ -1077,9 +1078,11 @@ int libxl__domain_suspend_device_model(libxl__gc *gc,
         if (libxl__qmp_stop(gc, domid))
             return ERROR_FAIL;
         /* Save DM state into filename */
-        ret = libxl__qmp_save(gc, domid, filename);
-        if (ret)
-            unlink(filename);
+        if (dss2->save_dm) {
+            ret = libxl__qmp_save(gc, domid, filename);
+            if (ret)
+                unlink(filename);
+        }
         break;
     default:
         return ERROR_INVAL;
@@ -1109,9 +1112,9 @@ int libxl__domain_resume_device_model(libxl__gc *gc, uint32_t domid)
 }
 
 static void domain_suspend_common_wait_guest(libxl__egc *egc,
-                                             libxl__domain_suspend_state *dss);
+                                             libxl__domain_suspend_state2 *dss2);
 static void domain_suspend_common_guest_suspended(libxl__egc *egc,
-                                         libxl__domain_suspend_state *dss);
+                                         libxl__domain_suspend_state2 *dss2);
 
 static void domain_suspend_common_pvcontrol_suspending(libxl__egc *egc,
       libxl__xswait_state *xswa, int rc, const char *state);
@@ -1120,14 +1123,14 @@ static void domain_suspend_common_wait_guest_evtchn(libxl__egc *egc,
 static void suspend_common_wait_guest_watch(libxl__egc *egc,
       libxl__ev_xswatch *xsw, const char *watch_path, const char *event_path);
 static void suspend_common_wait_guest_check(libxl__egc *egc,
-        libxl__domain_suspend_state *dss);
+                                            libxl__domain_suspend_state2 *dss2);
 static void suspend_common_wait_guest_timeout(libxl__egc *egc,
       libxl__ev_time *ev, const struct timeval *requested_abs);
 
 static void domain_suspend_common_failed(libxl__egc *egc,
-                                         libxl__domain_suspend_state *dss);
+                                         libxl__domain_suspend_state2 *dss2);
 static void domain_suspend_common_done(libxl__egc *egc,
-                                       libxl__domain_suspend_state *dss,
+                                       libxl__domain_suspend_state2 *dss2,
                                        bool ok);
 
 static bool domain_suspend_pvcontrol_acked(const char *state) {
@@ -1136,36 +1139,36 @@ static bool domain_suspend_pvcontrol_acked(const char *state) {
     return strcmp(state,"suspend");
 }
 
-/* calls dss->callback_common_done when done */
+/* calls dss2->callback_common_done when done */
 static void domain_suspend_callback_common(libxl__egc *egc,
-                                           libxl__domain_suspend_state *dss)
+                                           libxl__domain_suspend_state2 *dss2)
 {
-    STATE_AO_GC(dss->ao);
+    STATE_AO_GC(dss2->ao);
     uint64_t hvm_s_state = 0, hvm_pvdrv = 0;
     int ret, rc;
 
     /* Convenience aliases */
-    const uint32_t domid = dss->domid;
+    const uint32_t domid = dss2->domid;
 
-    if (dss->hvm) {
+    if (dss2->hvm) {
         xc_hvm_param_get(CTX->xch, domid, HVM_PARAM_CALLBACK_IRQ, &hvm_pvdrv);
         xc_hvm_param_get(CTX->xch, domid, HVM_PARAM_ACPI_S_STATE, &hvm_s_state);
     }
 
-    if ((hvm_s_state == 0) && (dss->guest_evtchn.port >= 0)) {
+    if ((hvm_s_state == 0) && (dss2->guest_evtchn.port >= 0)) {
         LOG(DEBUG, "issuing %s suspend request via event channel",
-            dss->hvm ? "PVHVM" : "PV");
-        ret = xc_evtchn_notify(CTX->xce, dss->guest_evtchn.port);
+            dss2->hvm ? "PVHVM" : "PV");
+        ret = xc_evtchn_notify(CTX->xce, dss2->guest_evtchn.port);
         if (ret < 0) {
             LOG(ERROR, "xc_evtchn_notify failed ret=%d", ret);
             goto err;
         }
 
-        dss->guest_evtchn.callback = domain_suspend_common_wait_guest_evtchn;
-        rc = libxl__ev_evtchn_wait(gc, &dss->guest_evtchn);
+        dss2->guest_evtchn.callback = domain_suspend_common_wait_guest_evtchn;
+        rc = libxl__ev_evtchn_wait(gc, &dss2->guest_evtchn);
         if (rc) goto err;
 
-        rc = libxl__ev_time_register_rel(gc, &dss->guest_timeout,
+        rc = libxl__ev_time_register_rel(gc, &dss2->guest_timeout,
                                          suspend_common_wait_guest_timeout,
                                          60*1000);
         if (rc) goto err;
@@ -1173,7 +1176,7 @@ static void domain_suspend_callback_common(libxl__egc *egc,
         return;
     }
 
-    if (dss->hvm && (!hvm_pvdrv || hvm_s_state)) {
+    if (dss2->hvm && (!hvm_pvdrv || hvm_s_state)) {
         LOG(DEBUG, "Calling xc_domain_shutdown on HVM domain");
         ret = xc_domain_shutdown(CTX->xch, domid, SHUTDOWN_suspend);
         if (ret < 0) {
@@ -1181,55 +1184,55 @@ static void domain_suspend_callback_common(libxl__egc *egc,
             goto err;
         }
         /* The guest does not (need to) respond to this sort of request. */
-        dss->guest_responded = 1;
-        domain_suspend_common_wait_guest(egc, dss);
+        dss2->guest_responded = 1;
+        domain_suspend_common_wait_guest(egc, dss2);
         return;
     }
 
     LOG(DEBUG, "issuing %s suspend request via XenBus control node",
-        dss->hvm ? "PVHVM" : "PV");
+        dss2->hvm ? "PVHVM" : "PV");
 
     libxl__domain_pvcontrol_write(gc, XBT_NULL, domid, "suspend");
 
-    dss->pvcontrol.path = libxl__domain_pvcontrol_xspath(gc, domid);
-    if (!dss->pvcontrol.path) goto err;
+    dss2->pvcontrol.path = libxl__domain_pvcontrol_xspath(gc, domid);
+    if (!dss2->pvcontrol.path) goto err;
 
-    dss->pvcontrol.ao = ao;
-    dss->pvcontrol.what = "guest acknowledgement of suspend request";
-    dss->pvcontrol.timeout_ms = 60 * 1000;
-    dss->pvcontrol.callback = domain_suspend_common_pvcontrol_suspending;
-    libxl__xswait_start(gc, &dss->pvcontrol);
+    dss2->pvcontrol.ao = ao;
+    dss2->pvcontrol.what = "guest acknowledgement of suspend request";
+    dss2->pvcontrol.timeout_ms = 60 * 1000;
+    dss2->pvcontrol.callback = domain_suspend_common_pvcontrol_suspending;
+    libxl__xswait_start(gc, &dss2->pvcontrol);
     return;
 
  err:
-    domain_suspend_common_failed(egc, dss);
+    domain_suspend_common_failed(egc, dss2);
 }
 
 static void domain_suspend_common_wait_guest_evtchn(libxl__egc *egc,
-        libxl__ev_evtchn *evev)
+                                                    libxl__ev_evtchn *evev)
 {
-    libxl__domain_suspend_state *dss = CONTAINER_OF(evev, *dss, guest_evtchn);
-    STATE_AO_GC(dss->ao);
+    libxl__domain_suspend_state2 *dss2 = CONTAINER_OF(evev, *dss2, guest_evtchn);
+    STATE_AO_GC(dss2->ao);
     /* If we should be done waiting, suspend_common_wait_guest_check
      * will end up calling domain_suspend_common_guest_suspended or
      * domain_suspend_common_failed, both of which cancel the evtchn
      * wait.  So re-enable it now. */
-    libxl__ev_evtchn_wait(gc, &dss->guest_evtchn);
-    suspend_common_wait_guest_check(egc, dss);
+    libxl__ev_evtchn_wait(gc, &dss2->guest_evtchn);
+    suspend_common_wait_guest_check(egc, dss2);
 }
 
 static void domain_suspend_common_pvcontrol_suspending(libxl__egc *egc,
       libxl__xswait_state *xswa, int rc, const char *state)
 {
-    libxl__domain_suspend_state *dss = CONTAINER_OF(xswa, *dss, pvcontrol);
-    STATE_AO_GC(dss->ao);
+    libxl__domain_suspend_state2 *dss2 = CONTAINER_OF(xswa, *dss2, pvcontrol);
+    STATE_AO_GC(dss2->ao);
     xs_transaction_t t = 0;
 
     if (!rc && !domain_suspend_pvcontrol_acked(state))
         /* keep waiting */
         return;
 
-    libxl__xswait_stop(gc, &dss->pvcontrol);
+    libxl__xswait_stop(gc, &dss2->pvcontrol);
 
     if (rc == ERROR_TIMEDOUT) {
         /*
@@ -1272,56 +1275,56 @@ static void domain_suspend_common_pvcontrol_suspending(libxl__egc *egc,
     LOG(DEBUG, "guest acknowledged suspend request");
 
     libxl__xs_transaction_abort(gc, &t);
-    dss->guest_responded = 1;
-    domain_suspend_common_wait_guest(egc,dss);
+    dss2->guest_responded = 1;
+    domain_suspend_common_wait_guest(egc,dss2);
     return;
 
  err:
     libxl__xs_transaction_abort(gc, &t);
-    domain_suspend_common_failed(egc, dss);
+    domain_suspend_common_failed(egc, dss2);
     return;
 }
 
 static void domain_suspend_common_wait_guest(libxl__egc *egc,
-                                             libxl__domain_suspend_state *dss)
+                                             libxl__domain_suspend_state2 *dss2)
 {
-    STATE_AO_GC(dss->ao);
+    STATE_AO_GC(dss2->ao);
     int rc;
 
     LOG(DEBUG, "wait for the guest to suspend");
 
-    rc = libxl__ev_xswatch_register(gc, &dss->guest_watch,
+    rc = libxl__ev_xswatch_register(gc, &dss2->guest_watch,
                                     suspend_common_wait_guest_watch,
                                     "@releaseDomain");
     if (rc) goto err;
 
-    rc = libxl__ev_time_register_rel(gc, &dss->guest_timeout,
+    rc = libxl__ev_time_register_rel(gc, &dss2->guest_timeout,
                                      suspend_common_wait_guest_timeout,
                                      60*1000);
     if (rc) goto err;
     return;
 
  err:
-    domain_suspend_common_failed(egc, dss);
+    domain_suspend_common_failed(egc, dss2);
 }
 
 static void suspend_common_wait_guest_watch(libxl__egc *egc,
       libxl__ev_xswatch *xsw, const char *watch_path, const char *event_path)
 {
-    libxl__domain_suspend_state *dss = CONTAINER_OF(xsw, *dss, guest_watch);
-    suspend_common_wait_guest_check(egc, dss);
+    libxl__domain_suspend_state2 *dss2 = CONTAINER_OF(xsw, *dss2, guest_watch);
+    suspend_common_wait_guest_check(egc, dss2);
 }
 
 static void suspend_common_wait_guest_check(libxl__egc *egc,
-        libxl__domain_suspend_state *dss)
+                                            libxl__domain_suspend_state2 *dss2)
 {
-    STATE_AO_GC(dss->ao);
+    STATE_AO_GC(dss2->ao);
     xc_domaininfo_t info;
     int ret;
     int shutdown_reason;
 
     /* Convenience aliases */
-    const uint32_t domid = dss->domid;
+    const uint32_t domid = dss2->domid;
 
     ret = xc_domain_getinfolist(CTX->xch, domid, 1, &info);
     if (ret < 0) {
@@ -1348,59 +1351,59 @@ static void suspend_common_wait_guest_check(libxl__egc *egc,
     }
 
     LOG(DEBUG, "guest has suspended");
-    domain_suspend_common_guest_suspended(egc, dss);
+    domain_suspend_common_guest_suspended(egc, dss2);
     return;
 
  err:
-    domain_suspend_common_failed(egc, dss);
+    domain_suspend_common_failed(egc, dss2);
 }
 
 static void suspend_common_wait_guest_timeout(libxl__egc *egc,
       libxl__ev_time *ev, const struct timeval *requested_abs)
 {
-    libxl__domain_suspend_state *dss = CONTAINER_OF(ev, *dss, guest_timeout);
-    STATE_AO_GC(dss->ao);
+    libxl__domain_suspend_state2 *dss2 = CONTAINER_OF(ev, *dss2, guest_timeout);
+    STATE_AO_GC(dss2->ao);
     LOG(ERROR, "guest did not suspend, timed out");
-    domain_suspend_common_failed(egc, dss);
+    domain_suspend_common_failed(egc, dss2);
 }
 
 static void domain_suspend_common_guest_suspended(libxl__egc *egc,
-                                         libxl__domain_suspend_state *dss)
+                                            libxl__domain_suspend_state2 *dss2)
 {
-    STATE_AO_GC(dss->ao);
+    STATE_AO_GC(dss2->ao);
     int ret;
 
-    libxl__ev_evtchn_cancel(gc, &dss->guest_evtchn);
-    libxl__ev_xswatch_deregister(gc, &dss->guest_watch);
-    libxl__ev_time_deregister(gc, &dss->guest_timeout);
+    libxl__ev_evtchn_cancel(gc, &dss2->guest_evtchn);
+    libxl__ev_xswatch_deregister(gc, &dss2->guest_watch);
+    libxl__ev_time_deregister(gc, &dss2->guest_timeout);
 
-    if (dss->hvm) {
-        ret = libxl__domain_suspend_device_model(gc, dss);
+    if (dss2->hvm) {
+        ret = libxl__domain_suspend_device_model(gc, dss2);
         if (ret) {
             LOG(ERROR, "libxl__domain_suspend_device_model failed ret=%d", ret);
-            domain_suspend_common_failed(egc, dss);
+            domain_suspend_common_failed(egc, dss2);
             return;
         }
     }
-    domain_suspend_common_done(egc, dss, 1);
+    domain_suspend_common_done(egc, dss2, 1);
 }
 
 static void domain_suspend_common_failed(libxl__egc *egc,
-                                         libxl__domain_suspend_state *dss)
+                                         libxl__domain_suspend_state2 *dss2)
 {
-    domain_suspend_common_done(egc, dss, 0);
+    domain_suspend_common_done(egc, dss2, 0);
 }
 
 static void domain_suspend_common_done(libxl__egc *egc,
-                                       libxl__domain_suspend_state *dss,
+                                       libxl__domain_suspend_state2 *dss2,
                                        bool ok)
 {
     EGC_GC;
-    assert(!libxl__xswait_inuse(&dss->pvcontrol));
-    libxl__ev_evtchn_cancel(gc, &dss->guest_evtchn);
-    libxl__ev_xswatch_deregister(gc, &dss->guest_watch);
-    libxl__ev_time_deregister(gc, &dss->guest_timeout);
-    dss->callback_common_done(egc, dss, ok);
+    assert(!libxl__xswait_inuse(&dss2->pvcontrol));
+    libxl__ev_evtchn_cancel(gc, &dss2->guest_evtchn);
+    libxl__ev_xswatch_deregister(gc, &dss2->guest_watch);
+    libxl__ev_time_deregister(gc, &dss2->guest_timeout);
+    dss2->callback_common_done(egc, dss2, ok);
 }
 
 static inline char *physmap_path(libxl__gc *gc, uint32_t domid,
@@ -1493,19 +1496,24 @@ static void libxl__domain_suspend_callback(void *data)
     libxl__egc *egc = shs->egc;
     libxl__domain_suspend_state *dss = CONTAINER_OF(shs, *dss, shs);
 
-    dss->callback_common_done = domain_suspend_callback_common_done;
-    domain_suspend_callback_common(egc, dss);
+    /* Convenience aliases */
+    libxl__domain_suspend_state2 *dss2 = &dss->dss2;
+
+    dss2->callback_common_done = domain_suspend_callback_common_done;
+    domain_suspend_callback_common(egc, dss2);
 }
 
 static void domain_suspend_callback_common_done(libxl__egc *egc,
-                                libxl__domain_suspend_state *dss, int ok)
+                                libxl__domain_suspend_state2 *dss2, int ok)
 {
+    libxl__domain_suspend_state *dss = CONTAINER_OF(dss2, *dss, dss2);
+
     libxl__xc_domain_saverestore_async_callback_done(egc, &dss->shs, ok);
 }
 
 /*----- remus callbacks -----*/
 static void remus_domain_suspend_callback_common_done(libxl__egc *egc,
-                                libxl__domain_suspend_state *dss, int ok);
+                                libxl__domain_suspend_state2 *dss2, int ok);
 static void remus_devices_postsuspend_cb(libxl__egc *egc,
                                          libxl__remus_devices_state *rds,
                                          int rc);
@@ -1519,13 +1527,18 @@ static void libxl__remus_domain_suspend_callback(void *data)
     libxl__egc *egc = shs->egc;
     libxl__domain_suspend_state *dss = CONTAINER_OF(shs, *dss, shs);
 
-    dss->callback_common_done = remus_domain_suspend_callback_common_done;
-    domain_suspend_callback_common(egc, dss);
+    /* Convenience aliases */
+    libxl__domain_suspend_state2 *const dss2 = &dss->dss2;
+
+    dss2->callback_common_done = remus_domain_suspend_callback_common_done;
+    domain_suspend_callback_common(egc, dss2);
 }
 
 static void remus_domain_suspend_callback_common_done(libxl__egc *egc,
-                                libxl__domain_suspend_state *dss, int ok)
+                                libxl__domain_suspend_state2 *dss2, int ok)
 {
+    libxl__domain_suspend_state *dss = CONTAINER_OF(dss2, *dss, dss2);
+
     if (!ok)
         goto out;
 
@@ -1687,6 +1700,11 @@ static void remus_next_checkpoint(libxl__egc *egc, libxl__ev_time *ev,
 }
 
 /*----- main code for suspending, in order of execution -----*/
+void libxl__domain_suspend2(libxl__egc *egc,
+                            libxl__domain_suspend_state2 *dss2)
+{
+    domain_suspend_callback_common(egc, dss2);
+}
 
 void libxl__domain_suspend(libxl__egc *egc, libxl__domain_suspend_state *dss)
 {
@@ -1702,20 +1720,23 @@ void libxl__domain_suspend(libxl__egc *egc, libxl__domain_suspend_state *dss)
     const libxl_domain_remus_info *const r_info = dss->remus;
     libxl__srm_save_autogen_callbacks *const callbacks =
         &dss->shs.callbacks.save.a;
+    libxl__domain_suspend_state2 *dss2 = &dss->dss2;
 
     logdirty_init(&dss->logdirty);
-    libxl__xswait_init(&dss->pvcontrol);
-    libxl__ev_evtchn_init(&dss->guest_evtchn);
-    libxl__ev_xswatch_init(&dss->guest_watch);
-    libxl__ev_time_init(&dss->guest_timeout);
+    libxl__xswait_init(&dss2->pvcontrol);
+    libxl__ev_evtchn_init(&dss2->guest_evtchn);
+    libxl__ev_xswatch_init(&dss2->guest_watch);
+    libxl__ev_time_init(&dss2->guest_timeout);
 
     switch (type) {
     case LIBXL_DOMAIN_TYPE_HVM: {
         dss->hvm = 1;
+        dss2->hvm = 1;
         break;
     }
     case LIBXL_DOMAIN_TYPE_PV:
         dss->hvm = 0;
+        dss2->hvm = 0;
         break;
     default:
         abort();
@@ -1725,10 +1746,13 @@ void libxl__domain_suspend(libxl__egc *egc, libxl__domain_suspend_state *dss)
           | (debug ? XCFLAGS_DEBUG : 0)
           | (dss->hvm ? XCFLAGS_HVM : 0);
 
-    dss->guest_evtchn.port = -1;
-    dss->guest_evtchn_lockfd = -1;
-    dss->guest_responded = 0;
-    dss->dm_savefile = libxl__device_model_savefile(gc, domid);
+    dss2->guest_evtchn.port = -1;
+    dss2->guest_evtchn_lockfd = -1;
+    dss2->guest_responded = 0;
+    dss2->dm_savefile = libxl__device_model_savefile(gc, domid);
+    dss2->domid = domid;
+    dss2->ao = ao;
+    dss2->save_dm = 1;
 
     if (r_info != NULL) {
         dss->interval = r_info->interval;
@@ -1739,11 +1763,11 @@ void libxl__domain_suspend(libxl__egc *egc, libxl__domain_suspend_state *dss)
     port = xs_suspend_evtchn_port(dss->domid);
 
     if (port >= 0) {
-        dss->guest_evtchn.port =
+        dss2->guest_evtchn.port =
             xc_suspend_evtchn_init_exclusive(CTX->xch, CTX->xce,
-                                  dss->domid, port, &dss->guest_evtchn_lockfd);
+                                  dss->domid, port, &dss2->guest_evtchn_lockfd);
 
-        if (dss->guest_evtchn.port < 0) {
+        if (dss2->guest_evtchn.port < 0) {
             LOG(WARN, "Suspend event channel initialization failed");
             rc = ERROR_FAIL;
             goto out;
@@ -1782,10 +1806,10 @@ void libxl__xc_domain_save_done(libxl__egc *egc, void *dss_void,
 
     if (retval) {
         LOGEV(ERROR, errnoval, "saving domain: %s",
-                         dss->guest_responded ?
+                         dss->dss2.guest_responded ?
                          "domain responded to suspend request" :
                          "domain did not respond to suspend request");
-        if ( !dss->guest_responded )
+        if ( !dss->dss2.guest_responded )
             rc = ERROR_GUEST_TIMEDOUT;
         else
             rc = ERROR_FAIL;
@@ -1793,7 +1817,7 @@ void libxl__xc_domain_save_done(libxl__egc *egc, void *dss_void,
     }
 
     if (type == LIBXL_DOMAIN_TYPE_HVM) {
-        rc = libxl__domain_suspend_device_model(gc, dss);
+        rc = libxl__domain_suspend_device_model(gc, &dss->dss2);
         if (rc) goto out;
 
         libxl__domain_save_device_model(egc, dss, domain_suspend_done);
@@ -1821,7 +1845,7 @@ void libxl__domain_save_device_model(libxl__egc *egc,
     dss->save_dm_callback = callback;
 
     /* Convenience aliases */
-    const char *const filename = dss->dm_savefile;
+    const char *const filename = dss->dss2.dm_savefile;
     const int fd = dss->fd;
 
     libxl__datacopier_state *dc = &dss->save_dm_datacopier;
@@ -1877,7 +1901,7 @@ static void save_device_model_datacopier_done(libxl__egc *egc,
     STATE_AO_GC(dss->ao);
 
     /* Convenience aliases */
-    const char *const filename = dss->dm_savefile;
+    const char *const filename = dss->dss2.dm_savefile;
     int our_rc = 0;
     int rc;
 
@@ -1908,12 +1932,13 @@ static void domain_suspend_done(libxl__egc *egc,
 
     /* Convenience aliases */
     const uint32_t domid = dss->domid;
+    libxl__domain_suspend_state2 *const dss2 = &dss->dss2;
 
-    libxl__ev_evtchn_cancel(gc, &dss->guest_evtchn);
+    libxl__ev_evtchn_cancel(gc, &dss2->guest_evtchn);
 
-    if (dss->guest_evtchn.port > 0)
+    if (dss2->guest_evtchn.port > 0)
         xc_suspend_evtchn_release(CTX->xch, CTX->xce, domid,
-                           dss->guest_evtchn.port, &dss->guest_evtchn_lockfd);
+                           dss2->guest_evtchn.port, &dss2->guest_evtchn_lockfd);
 
     if (!dss->remus) {
         remus_teardown_done(egc, &dss->rds, rc);
diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
index e631eaf..901cacd 100644
--- a/tools/libxl/libxl_internal.h
+++ b/tools/libxl/libxl_internal.h
@@ -2670,6 +2670,7 @@ _hidden int libxl__netbuffer_enabled(libxl__gc *gc);
 /*----- Domain suspend (save) state structure -----*/
 
 typedef struct libxl__domain_suspend_state libxl__domain_suspend_state;
+typedef struct libxl__domain_suspend_state2 libxl__domain_suspend_state2;
 
 typedef void libxl__domain_suspend_cb(libxl__egc*,
                                       libxl__domain_suspend_state*, int rc);
@@ -2684,6 +2685,29 @@ typedef struct libxl__logdirty_switch {
     libxl__ev_time timeout;
 } libxl__logdirty_switch;
 
+/*
+ * libxl__domain_suspend_state is for saving a guest, not
+ * for suspending it. We need an independent API
+ * to suspend the guest only.
+ */
+struct libxl__domain_suspend_state2 {
+    /* set by caller of libxl__domain_suspend2 */
+    libxl__ao *ao;
+
+    uint32_t domid;
+    libxl__ev_evtchn guest_evtchn;
+    int guest_evtchn_lockfd;
+    int hvm;
+    const char *dm_savefile;
+    void (*callback_common_done)(libxl__egc*,
+                                 libxl__domain_suspend_state2*, int ok);
+    int save_dm;
+    int guest_responded;
+    libxl__xswait_state pvcontrol;
+    libxl__ev_xswatch guest_watch;
+    libxl__ev_time guest_timeout;
+};
+
 struct libxl__domain_suspend_state {
     /* set by caller of libxl__domain_suspend */
     libxl__ao *ao;
@@ -2696,22 +2720,14 @@ struct libxl__domain_suspend_state {
     int debug;
     const libxl_domain_remus_info *remus;
     /* private */
-    libxl__ev_evtchn guest_evtchn;
-    int guest_evtchn_lockfd;
+    libxl__domain_suspend_state2 dss2;
     int hvm;
     int xcflags;
-    int guest_responded;
-    libxl__xswait_state pvcontrol;
-    libxl__ev_xswatch guest_watch;
-    libxl__ev_time guest_timeout;
-    const char *dm_savefile;
     libxl__remus_devices_state rds;
     libxl__ev_time checkpoint_timeout; /* used for Remus checkpoint */
     int interval; /* checkpoint interval (for Remus) */
     libxl__save_helper_state shs;
     libxl__logdirty_switch logdirty;
-    void (*callback_common_done)(libxl__egc*,
-                                 struct libxl__domain_suspend_state*, int ok);
     /* private for libxl__domain_save_device_model */
     libxl__save_device_model_cb *save_dm_callback;
     libxl__datacopier_state save_dm_datacopier;
@@ -2983,6 +2999,9 @@ struct libxl__domain_create_state {
 
 /*----- Domain suspend (save) functions -----*/
 
+/* calls dss2->callback_common_done when done */
+_hidden void libxl__domain_suspend2(libxl__egc *egc,
+                                    libxl__domain_suspend_state2 *dss2);
 /* calls dss->callback when done */
 _hidden void libxl__domain_suspend(libxl__egc *egc,
                                    libxl__domain_suspend_state *dss);
@@ -3022,7 +3041,7 @@ _hidden void libxl__xc_domain_restore_done(libxl__egc *egc, void *dcs_void,
 
 /* Each time the dm needs to be saved, we must call suspend and then save */
 _hidden int libxl__domain_suspend_device_model(libxl__gc *gc,
-                                           libxl__domain_suspend_state *dss);
+                                           libxl__domain_suspend_state2 *dss2);
 _hidden void libxl__domain_save_device_model(libxl__egc *egc,
                                      libxl__domain_suspend_state *dss,
                                      libxl__save_device_model_cb *callback);
-- 
1.9.3


* [RFC Patch v2 05/45] Update libxl__domain_resume() for colo
  2014-08-08  7:00 [RFC Patch v2 00/45] COarse-grain LOck-stepping Virtual Machines for Non-stop Service Wen Congyang
                   ` (3 preceding siblings ...)
  2014-08-08  7:01 ` [RFC Patch v2 04/45] Refactor domain_suspend_callback_common() Wen Congyang
@ 2014-08-08  7:01 ` Wen Congyang
  2014-08-08  7:01 ` [RFC Patch v2 06/45] Update libxl__domain_suspend_common_switch_qemu_logdirty() " Wen Congyang
                   ` (41 subsequent siblings)
  46 siblings, 0 replies; 64+ messages in thread
From: Wen Congyang @ 2014-08-08  7:01 UTC (permalink / raw)
  To: xen devel
  Cc: Ian Campbell, Wen Congyang, Ian Jackson, Jiang Yunhong,
	Dong Eddie, Yang Hongyang, Lai Jiangshan

The secondary vm runs in colo mode, so we repeat the
following steps for every checkpoint:
1. suspend both the primary vm and the secondary vm
2. sync the state
3. resume both the primary vm and the secondary vm
We send qemu's state in step 2 each time, and the
secondary's qemu must read it each time before the
secondary vm is resumed. libxl__domain_resume() doesn't
read qemu's state, so add a new parameter that
controls whether qemu's state is read
before resuming.

Note: qemu must be updated to support this.
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
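For readers who want the checkpoint cycle above in code form, here is a
minimal, self-contained sketch. It is not libxl code: struct toy_dm,
toy_suspend(), toy_resume() and toy_colo_rounds() are hypothetical names
used only for illustration. The only thing it is meant to show is that the
secondary loads the device state before it continues (read_savefile = 1),
while the primary just continues (read_savefile = 0).

```c
#include <assert.h>

/* Hypothetical stand-in for a device model; not a libxl type. */
struct toy_dm {
    int running;            /* 1 while the device model is executing */
    int loaded_checkpoint;  /* last checkpoint whose state was loaded */
};

static void toy_suspend(struct toy_dm *dm)
{
    dm->running = 0;
}

/*
 * Same shape as the new read_savefile parameter: when it is set, the
 * saved device state is loaded before the device model continues.
 */
static int toy_resume(struct toy_dm *dm, int read_savefile, int checkpoint)
{
    assert(!dm->running);
    if (read_savefile)
        dm->loaded_checkpoint = checkpoint;   /* "restore" step */
    dm->running = 1;                          /* "continue" step */
    return 0;
}

/* Runs the three steps `rounds` times; returns the secondary's last
 * loaded checkpoint, or -1 if a resume failed. */
int toy_colo_rounds(int rounds)
{
    struct toy_dm primary = { 1, 0 };
    struct toy_dm secondary = { 1, 0 };
    int i;

    for (i = 1; i <= rounds; i++) {
        /* 1. suspend both sides */
        toy_suspend(&primary);
        toy_suspend(&secondary);
        /* 2. sync the state: the primary's device state for
         *    checkpoint i would be sent to the secondary here */
        /* 3. resume both sides; only the secondary reads the state */
        if (toy_resume(&primary, 0, i) || toy_resume(&secondary, 1, i))
            return -1;
    }
    assert(primary.loaded_checkpoint == 0);
    return secondary.loaded_checkpoint;
}
```

In the patch itself, the existing callers pass 0 for the new parameter, so
their behaviour is unchanged.
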
 tools/libxl/libxl.c          |  7 ++++---
 tools/libxl/libxl_dom.c      | 24 +++++++++++++++++++++---
 tools/libxl/libxl_internal.h |  8 ++++++--
 tools/libxl/libxl_qmp.c      | 10 ++++++++++
 4 files changed, 41 insertions(+), 8 deletions(-)

diff --git a/tools/libxl/libxl.c b/tools/libxl/libxl.c
index 8182966..b262309 100644
--- a/tools/libxl/libxl.c
+++ b/tools/libxl/libxl.c
@@ -466,7 +466,8 @@ int libxl_domain_rename(libxl_ctx *ctx, uint32_t domid,
     return rc;
 }
 
-int libxl__domain_resume(libxl__gc *gc, uint32_t domid, int suspend_cancel)
+int libxl__domain_resume(libxl__gc *gc, uint32_t domid,
+                         int suspend_cancel, int read_savefile)
 {
     int rc = 0;
 
@@ -483,7 +484,7 @@ int libxl__domain_resume(libxl__gc *gc, uint32_t domid, int suspend_cancel)
     }
 
     if (type == LIBXL_DOMAIN_TYPE_HVM) {
-        rc = libxl__domain_resume_device_model(gc, domid);
+        rc = libxl__domain_resume_device_model(gc, domid, read_savefile);
         if (rc) {
             LOG(ERROR, "failed to resume device model for domain %u:%d",
                 domid, rc);
@@ -503,7 +504,7 @@ int libxl_domain_resume(libxl_ctx *ctx, uint32_t domid, int suspend_cancel,
                         const libxl_asyncop_how *ao_how)
 {
     AO_CREATE(ctx, domid, ao_how);
-    int rc = libxl__domain_resume(gc, domid, suspend_cancel);
+    int rc = libxl__domain_resume(gc, domid, suspend_cancel, 0);
     libxl__ao_complete(egc, ao, rc);
     return AO_INPROGRESS;
 }
diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
index 1607930..288cbd8 100644
--- a/tools/libxl/libxl_dom.c
+++ b/tools/libxl/libxl_dom.c
@@ -1091,16 +1091,34 @@ int libxl__domain_suspend_device_model(libxl__gc *gc,
     return ret;
 }
 
-int libxl__domain_resume_device_model(libxl__gc *gc, uint32_t domid)
+int libxl__domain_resume_device_model(libxl__gc *gc, uint32_t domid,
+                                      int read_savefile)
 {
 
     switch (libxl__device_model_version_running(gc, domid)) {
     case LIBXL_DEVICE_MODEL_VERSION_QEMU_XEN_TRADITIONAL: {
-        libxl__qemu_traditional_cmd(gc, domid, "continue");
+        if (read_savefile)
+            libxl__qemu_traditional_cmd(gc, domid, "resume");
+        else
+            libxl__qemu_traditional_cmd(gc, domid, "continue");
         libxl__wait_for_device_model_deprecated(gc, domid, "running", NULL, NULL, NULL);
         break;
     }
     case LIBXL_DEVICE_MODEL_VERSION_QEMU_XEN:
+        if (read_savefile) {
+            char *state_file;
+            int rc;
+
+            state_file = libxl__sprintf(NOGC,
+                                        XC_DEVICE_MODEL_RESTORE_FILE".%d",
+                                        domid);
+            /* This command only restores the device state */
+            rc = libxl__qmp_restore(gc, domid, state_file);
+            free(state_file);
+            if (rc)
+                return ERROR_FAIL;
+        }
+
         if (libxl__qmp_resume(gc, domid))
             return ERROR_FAIL;
         break;
@@ -1591,7 +1609,7 @@ static void remus_devices_preresume_cb(libxl__egc *egc,
         goto out;
 
     /* Resumes the domain and the device model */
-    if (!libxl__domain_resume(gc, dss->domid, /* Fast Suspend */1))
+    if (!libxl__domain_resume(gc, dss->domid, /* Fast Suspend */1, 0))
         ok = 1;
 
 out:
diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
index 901cacd..437a9cd 100644
--- a/tools/libxl/libxl_internal.h
+++ b/tools/libxl/libxl_internal.h
@@ -989,12 +989,14 @@ _hidden int libxl__domain_rename(libxl__gc *gc, uint32_t domid,
 
 _hidden int libxl__toolstack_restore(uint32_t domid, const uint8_t *buf,
                                      uint32_t size, void *data);
-_hidden int libxl__domain_resume_device_model(libxl__gc *gc, uint32_t domid);
+_hidden int libxl__domain_resume_device_model(libxl__gc *gc,
+                                              uint32_t domid,
+                                              int read_savefile);
 
 _hidden void libxl__userdata_destroyall(libxl__gc *gc, uint32_t domid);
 
 _hidden int libxl__domain_resume(libxl__gc *gc, uint32_t domid,
-                                 int suspend_cancel);
+                                 int suspend_cancel, int read_savefile);
 
 /* returns 0 or 1, or a libxl error code */
 _hidden int libxl__domain_pvcontrol_available(libxl__gc *gc, uint32_t domid);
@@ -1580,6 +1582,8 @@ _hidden int libxl__qmp_stop(libxl__gc *gc, int domid);
 _hidden int libxl__qmp_resume(libxl__gc *gc, int domid);
 /* Save current QEMU state into fd. */
 _hidden int libxl__qmp_save(libxl__gc *gc, int domid, const char *filename);
+/* Load QEMU device state from the named file. */
+_hidden int libxl__qmp_restore(libxl__gc *gc, int domid, const char *filename);
 /* Set dirty bitmap logging status */
 _hidden int libxl__qmp_set_global_dirty_log(libxl__gc *gc, int domid, bool enable);
 _hidden int libxl__qmp_insert_cdrom(libxl__gc *gc, int domid, const libxl_device_disk *disk);
diff --git a/tools/libxl/libxl_qmp.c b/tools/libxl/libxl_qmp.c
index c7324e6..e1c2fd1 100644
--- a/tools/libxl/libxl_qmp.c
+++ b/tools/libxl/libxl_qmp.c
@@ -871,6 +871,16 @@ int libxl__qmp_save(libxl__gc *gc, int domid, const char *filename)
                            NULL, NULL);
 }
 
+int libxl__qmp_restore(libxl__gc *gc, int domid, const char *state_file)
+{
+    libxl__json_object *args = NULL;
+
+    qmp_parameters_add_string(gc, &args, "filename", (char *)state_file);
+
+    return qmp_run_command(gc, domid, "xen-load-devices-state", args,
+                           NULL, NULL);
+}
+
 static int qmp_change(libxl__gc *gc, libxl__qmp_handler *qmp,
                       char *device, char *target, char *arg)
 {
-- 
1.9.3


* [RFC Patch v2 06/45] Update libxl__domain_suspend_common_switch_qemu_logdirty() for colo
  2014-08-08  7:00 [RFC Patch v2 00/45] COarse-grain LOck-stepping Virtual Machines for Non-stop Service Wen Congyang
                   ` (4 preceding siblings ...)
  2014-08-08  7:01 ` [RFC Patch v2 05/45] Update libxl__domain_resume() for colo Wen Congyang
@ 2014-08-08  7:01 ` Wen Congyang
  2014-08-08  7:01 ` [RFC Patch v2 07/45] Introduce a new internal API libxl__domain_unpause() Wen Congyang
                   ` (40 subsequent siblings)
  46 siblings, 0 replies; 64+ messages in thread
From: Wen Congyang @ 2014-08-08  7:01 UTC (permalink / raw)
  To: xen devel
  Cc: Ian Campbell, Wen Congyang, Ian Jackson, Jiang Yunhong,
	Dong Eddie, Yang Hongyang, Lai Jiangshan

The secondary vm runs in colo mode, so we need to send the
secondary vm's dirty page information to the primary.
libxl__domain_suspend_common_switch_qemu_logdirty() enables
qemu logdirty, but it uses domain_suspend_state and calls
libxl__xc_domain_saverestore_async_callback_done()
before it exits.

Introduce a new API libxl__domain_common_switch_qemu_logdirty().
This API only uses libxl__logdirty_switch, and calls
lds->callback before it exits.

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
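The decoupling described above can be sketched outside libxl as follows.
This is an illustration only, under the assumption that the sub-state is
embedded in the caller's larger state: TOY_CONTAINER_OF, struct toy_switch,
struct toy_save_state and the toy_* functions are hypothetical names, and
simulated_rc stands in for the result of the real logdirty switch. The
common code only sees the small struct and reports through its callback;
the save path recovers its own state from the embedded member.

```c
#include <assert.h>
#include <stddef.h>

/* Recover a pointer to the enclosing struct from a pointer to a member. */
#define TOY_CONTAINER_OF(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct toy_switch;
typedef void toy_switch_cb(struct toy_switch *sw, int rc);

/* The small state the common code works on (hypothetical). */
struct toy_switch {
    toy_switch_cb *callback;   /* set by the caller before the switch */
};

/* A larger caller-side state that embeds the small one (hypothetical). */
struct toy_save_state {
    int last_rc;
    struct toy_switch logdirty;
};

/* Common code: knows nothing about toy_save_state. */
static void toy_common_switch(struct toy_switch *sw, int simulated_rc)
{
    assert(sw->callback != NULL);
    sw->callback(sw, simulated_rc);
}

/* Save-path callback: maps the sub-state back to its container. */
static void toy_save_switch_done(struct toy_switch *sw, int rc)
{
    struct toy_save_state *ss =
        TOY_CONTAINER_OF(sw, struct toy_save_state, logdirty);

    ss->last_rc = rc;
}

/* Save-path entry point: installs its callback, then calls common code. */
int toy_save_switch(struct toy_save_state *ss, int simulated_rc)
{
    ss->logdirty.callback = toy_save_switch_done;
    toy_common_switch(&ss->logdirty, simulated_rc);
    return ss->last_rc;
}
```

A second caller with a different container can reuse toy_common_switch()
by installing its own callback, which is the point of the split.
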
 tools/libxl/libxl_dom.c      | 79 +++++++++++++++++++++++++++-----------------
 tools/libxl/libxl_internal.h | 12 +++++--
 2 files changed, 59 insertions(+), 32 deletions(-)

diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
index 288cbd8..3e22322 100644
--- a/tools/libxl/libxl_dom.c
+++ b/tools/libxl/libxl_dom.c
@@ -859,7 +859,7 @@ static void switch_logdirty_timeout(libxl__egc *egc, libxl__ev_time *ev,
 static void switch_logdirty_xswatch(libxl__egc *egc, libxl__ev_xswatch*,
                             const char *watch_path, const char *event_path);
 static void switch_logdirty_done(libxl__egc *egc,
-                                 libxl__domain_suspend_state *dss, int ok);
+                                 libxl__logdirty_switch *lds, int ok);
 
 static void logdirty_init(libxl__logdirty_switch *lds)
 {
@@ -870,12 +870,10 @@ static void logdirty_init(libxl__logdirty_switch *lds)
 
 static void domain_suspend_switch_qemu_xen_traditional_logdirty
                                (int domid, unsigned enable,
-                                libxl__save_helper_state *shs)
+                                libxl__logdirty_switch *lds,
+                                libxl__egc *egc)
 {
-    libxl__egc *egc = shs->egc;
-    libxl__domain_suspend_state *dss = CONTAINER_OF(shs, *dss, shs);
-    libxl__logdirty_switch *lds = &dss->logdirty;
-    STATE_AO_GC(dss->ao);
+    STATE_AO_GC(lds->ao);
     int rc;
     xs_transaction_t t = 0;
     const char *got;
@@ -936,64 +934,85 @@ static void domain_suspend_switch_qemu_xen_traditional_logdirty
  out:
     LOG(ERROR,"logdirty switch failed (rc=%d), aborting suspend",rc);
     libxl__xs_transaction_abort(gc, &t);
-    switch_logdirty_done(egc,dss,-1);
+    switch_logdirty_done(egc,lds,-1);
 }
 
 static void domain_suspend_switch_qemu_xen_logdirty
                                (int domid, unsigned enable,
-                                libxl__save_helper_state *shs)
+                                libxl__logdirty_switch *lds,
+                                libxl__egc *egc)
 {
-    libxl__egc *egc = shs->egc;
-    libxl__domain_suspend_state *dss = CONTAINER_OF(shs, *dss, shs);
-    STATE_AO_GC(dss->ao);
+    STATE_AO_GC(lds->ao);
     int rc;
 
     rc = libxl__qmp_set_global_dirty_log(gc, domid, enable);
     if (!rc) {
-        libxl__xc_domain_saverestore_async_callback_done(egc, shs, 0);
+        lds->callback(egc, lds, 0);
     } else {
         LOG(ERROR,"logdirty switch failed (rc=%d), aborting suspend",rc);
-        libxl__xc_domain_saverestore_async_callback_done(egc, shs, -1);
+        lds->callback(egc, lds, -1);
     }
 }
 
+static void libxl__domain_suspend_switch_qemu_logdirty_done
+                                (libxl__egc *egc,
+                                 libxl__logdirty_switch *lds,
+                                 int rc)
+{
+    libxl__domain_suspend_state *dss = CONTAINER_OF(lds, *dss, logdirty);
+
+    libxl__xc_domain_saverestore_async_callback_done(egc, &dss->shs, rc);
+}
+
 void libxl__domain_suspend_common_switch_qemu_logdirty
                                (int domid, unsigned enable, void *user)
 {
     libxl__save_helper_state *shs = user;
     libxl__egc *egc = shs->egc;
     libxl__domain_suspend_state *dss = CONTAINER_OF(shs, *dss, shs);
-    STATE_AO_GC(dss->ao);
+
+    /* convenience aliases */
+    libxl__logdirty_switch *const lds = &dss->logdirty;
+
+    lds->callback = libxl__domain_suspend_switch_qemu_logdirty_done;
+
+    libxl__domain_common_switch_qemu_logdirty(domid, enable, lds, egc);
+}
+
+void libxl__domain_common_switch_qemu_logdirty(int domid, unsigned enable,
+                                               libxl__logdirty_switch *lds,
+                                               libxl__egc *egc)
+{
+    STATE_AO_GC(lds->ao);
 
     switch (libxl__device_model_version_running(gc, domid)) {
     case LIBXL_DEVICE_MODEL_VERSION_QEMU_XEN_TRADITIONAL:
-        domain_suspend_switch_qemu_xen_traditional_logdirty(domid, enable, shs);
+        domain_suspend_switch_qemu_xen_traditional_logdirty(domid, enable,
+                                                            lds, egc);
         break;
     case LIBXL_DEVICE_MODEL_VERSION_QEMU_XEN:
-        domain_suspend_switch_qemu_xen_logdirty(domid, enable, shs);
+        domain_suspend_switch_qemu_xen_logdirty(domid, enable, lds, egc);
         break;
     default:
         LOG(ERROR,"logdirty switch failed"
             ", no valid device model version found, aborting suspend");
-        libxl__xc_domain_saverestore_async_callback_done(egc, shs, -1);
+        lds->callback(egc, lds, -1);
     }
 }
 static void switch_logdirty_timeout(libxl__egc *egc, libxl__ev_time *ev,
                                     const struct timeval *requested_abs)
 {
-    libxl__domain_suspend_state *dss = CONTAINER_OF(ev, *dss, logdirty.timeout);
-    STATE_AO_GC(dss->ao);
+    libxl__logdirty_switch *lds = CONTAINER_OF(ev, *lds, timeout);
+    STATE_AO_GC(lds->ao);
     LOG(ERROR,"logdirty switch: wait for device model timed out");
-    switch_logdirty_done(egc,dss,-1);
+    switch_logdirty_done(egc,lds,-1);
 }
 
 static void switch_logdirty_xswatch(libxl__egc *egc, libxl__ev_xswatch *watch,
                             const char *watch_path, const char *event_path)
 {
-    libxl__domain_suspend_state *dss =
-        CONTAINER_OF(watch, *dss, logdirty.watch);
-    libxl__logdirty_switch *lds = &dss->logdirty;
-    STATE_AO_GC(dss->ao);
+    libxl__logdirty_switch *lds = CONTAINER_OF(watch, *lds, watch);
+    STATE_AO_GC(lds->ao);
     const char *got;
     xs_transaction_t t = 0;
     int rc;
@@ -1037,24 +1056,23 @@ static void switch_logdirty_xswatch(libxl__egc *egc, libxl__ev_xswatch *watch,
     libxl__xs_transaction_abort(gc, &t);
 
     if (!rc) {
-        switch_logdirty_done(egc,dss,0);
+        switch_logdirty_done(egc,lds,0);
     } else if (rc < 0) {
         LOG(ERROR,"logdirty switch: failed (rc=%d)",rc);
-        switch_logdirty_done(egc,dss,-1);
+        switch_logdirty_done(egc,lds,-1);
     }
 }
 
 static void switch_logdirty_done(libxl__egc *egc,
-                                 libxl__domain_suspend_state *dss,
+                                 libxl__logdirty_switch *lds,
                                  int broke)
 {
-    STATE_AO_GC(dss->ao);
-    libxl__logdirty_switch *lds = &dss->logdirty;
+    STATE_AO_GC(lds->ao);
 
     libxl__ev_xswatch_deregister(gc, &lds->watch);
     libxl__ev_time_deregister(gc, &lds->timeout);
 
-    libxl__xc_domain_saverestore_async_callback_done(egc, &dss->shs, broke);
+    lds->callback(egc, lds, broke);
 }
 
 /*----- callbacks, called by xc_domain_save -----*/
@@ -1741,6 +1759,7 @@ void libxl__domain_suspend(libxl__egc *egc, libxl__domain_suspend_state *dss)
     libxl__domain_suspend_state2 *dss2 = &dss->dss2;
 
     logdirty_init(&dss->logdirty);
+    dss->logdirty.ao = ao;
     libxl__xswait_init(&dss2->pvcontrol);
     libxl__ev_evtchn_init(&dss2->guest_evtchn);
     libxl__ev_xswatch_init(&dss2->guest_watch);
diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
index 437a9cd..ee3561c 100644
--- a/tools/libxl/libxl_internal.h
+++ b/tools/libxl/libxl_internal.h
@@ -2681,13 +2681,18 @@ typedef void libxl__domain_suspend_cb(libxl__egc*,
 typedef void libxl__save_device_model_cb(libxl__egc*,
                                          libxl__domain_suspend_state*, int rc);
 
-typedef struct libxl__logdirty_switch {
+typedef struct libxl__logdirty_switch libxl__logdirty_switch;
+struct libxl__logdirty_switch {
+    /* set by caller of libxl__domain_common_switch_qemu_logdirty */
+    libxl__ao *ao;
+    void (*callback)(libxl__egc *egc, libxl__logdirty_switch *lds, int rc);
+
     const char *cmd;
     const char *cmd_path;
     const char *ret_path;
     libxl__ev_xswatch watch;
     libxl__ev_time timeout;
-} libxl__logdirty_switch;
+};
 
 /*
  * libxl__domain_suspend_state is for saving guest, not
@@ -3029,6 +3034,9 @@ void libxl__xc_domain_saverestore_async_callback_done(libxl__egc *egc,
 
 _hidden void libxl__domain_suspend_common_switch_qemu_logdirty
                                (int domid, unsigned int enable, void *data);
+_hidden void libxl__domain_common_switch_qemu_logdirty
+                                (int domid, unsigned int enable,
+                                 libxl__logdirty_switch *lds, libxl__egc *egc);
 _hidden int libxl__toolstack_save(uint32_t domid, uint8_t **buf,
         uint32_t *len, void *data);
 
-- 
1.9.3


* [RFC Patch v2 07/45] Introduce a new internal API libxl__domain_unpause()
  2014-08-08  7:00 [RFC Patch v2 00/45] COarse-grain LOck-stepping Virtual Machines for Non-stop Service Wen Congyang
                   ` (5 preceding siblings ...)
  2014-08-08  7:01 ` [RFC Patch v2 06/45] Update libxl__domain_suspend_common_switch_qemu_logdirty() " Wen Congyang
@ 2014-08-08  7:01 ` Wen Congyang
  2014-08-08  7:01 ` [RFC Patch v2 08/45] Update libxl__domain_unpause() to support qemu-xen Wen Congyang
                   ` (39 subsequent siblings)
  46 siblings, 0 replies; 64+ messages in thread
From: Wen Congyang @ 2014-08-08  7:01 UTC (permalink / raw)
  To: xen devel
  Cc: Ian Campbell, Wen Congyang, Ian Jackson, Jiang Yunhong,
	Dong Eddie, Yang Hongyang, Lai Jiangshan

The guest is paused after libxl_domain_create_restore().
The secondary vm runs in colo mode, so we need to unpause
the guest. The existing API libxl_domain_unpause() is a
public API that takes a libxl_ctx. Introduce an internal
API, libxl__domain_unpause(), that takes a libxl__gc.

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
 tools/libxl/libxl.c          | 21 +++++++++++++++------
 tools/libxl/libxl_internal.h |  1 +
 2 files changed, 16 insertions(+), 6 deletions(-)

diff --git a/tools/libxl/libxl.c b/tools/libxl/libxl.c
index b262309..50213a9 100644
--- a/tools/libxl/libxl.c
+++ b/tools/libxl/libxl.c
@@ -965,9 +965,8 @@ out:
     return AO_INPROGRESS;
 }
 
-int libxl_domain_unpause(libxl_ctx *ctx, uint32_t domid)
+int libxl__domain_unpause(libxl__gc *gc, uint32_t domid)
 {
-    GC_INIT(ctx);
     char *path;
     char *state;
     int ret, rc = 0;
@@ -987,12 +986,22 @@ int libxl_domain_unpause(libxl_ctx *ctx, uint32_t domid)
                                          NULL, NULL, NULL);
         }
     }
-    ret = xc_domain_unpause(ctx->xch, domid);
-    if (ret<0) {
-        LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "unpausing domain %d", domid);
+
+    ret = xc_domain_unpause(CTX->xch, domid);
+    if (ret < 0) {
+        LOGE(ERROR, "unpausing domain %d", domid);
         rc = ERROR_FAIL;
     }
- out:
+
+out:
+    return rc;
+}
+
+int libxl_domain_unpause(libxl_ctx *ctx, uint32_t domid)
+{
+    GC_INIT(ctx);
+    int rc = libxl__domain_unpause(gc, domid);
+
     GC_FREE;
     return rc;
 }
diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
index ee3561c..c60466c 100644
--- a/tools/libxl/libxl_internal.h
+++ b/tools/libxl/libxl_internal.h
@@ -997,6 +997,7 @@ _hidden void libxl__userdata_destroyall(libxl__gc *gc, uint32_t domid);
 
 _hidden int libxl__domain_resume(libxl__gc *gc, uint32_t domid,
                                  int suspend_cancel, int read_savefile);
+_hidden int libxl__domain_unpause(libxl__gc *gc, uint32_t domid);
 
 /* returns 0 or 1, or a libxl error code */
 _hidden int libxl__domain_pvcontrol_available(libxl__gc *gc, uint32_t domid);
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 64+ messages in thread
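
The split between a public entry point and an internal one can be sketched outside libxl as follows. This is only an illustration: struct ctx, struct gc, gc_init(), gc_free(), domain_unpause_internal() and domain_unpause() are hypothetical stand-ins, not the real libxl types or functions, and the counter merely stands in for the actual unpause call.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-ins for libxl_ctx and libxl__gc; not the real types. */
struct ctx {
    int unpause_calls;   /* counts how often the "hypercall" was issued */
};

struct gc {
    struct ctx *owner;   /* context this gc belongs to */
    char *scratch;       /* temporary allocation released by gc_free() */
};

void gc_init(struct gc *gc, struct ctx *ctx)
{
    gc->owner = ctx;
    gc->scratch = NULL;
}

void gc_free(struct gc *gc)
{
    free(gc->scratch);
    gc->scratch = NULL;
}

/* Internal entry point: borrows the caller's gc and never frees it. */
int domain_unpause_internal(struct gc *gc, uint32_t domid)
{
    free(gc->scratch);
    gc->scratch = malloc(64);
    if (!gc->scratch)
        return -1;
    /* A temporary string owned by the gc, like a xenstore path would be. */
    snprintf(gc->scratch, 64, "/demo/domain/%u/state", (unsigned)domid);

    gc->owner->unpause_calls++;   /* stands in for the real unpause call */
    return 0;
}

/* Public entry point: owns the gc for exactly the duration of the call. */
int domain_unpause(struct ctx *ctx, uint32_t domid)
{
    struct gc gc;
    int rc;

    gc_init(&gc, ctx);
    rc = domain_unpause_internal(&gc, domid);
    gc_free(&gc);
    return rc;
}
```

Internal callers that already hold a gc would call domain_unpause_internal() directly, while external users keep calling domain_unpause() with only a context.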

* [RFC Patch v2 08/45] Update libxl__domain_unpause() to support qemu-xen
  2014-08-08  7:00 [RFC Patch v2 00/45] COarse-grain LOck-stepping Virtual Machines for Non-stop Service Wen Congyang
                   ` (6 preceding siblings ...)
  2014-08-08  7:01 ` [RFC Patch v2 07/45] Introduce a new internal API libxl__domain_unpause() Wen Congyang
@ 2014-08-08  7:01 ` Wen Congyang
  2014-08-08  7:01 ` [RFC Patch v2 09/45] support to resume uncooperative HVM guests Wen Congyang
                   ` (38 subsequent siblings)
  46 siblings, 0 replies; 64+ messages in thread
From: Wen Congyang @ 2014-08-08  7:01 UTC (permalink / raw)
  To: xen devel
  Cc: Ian Campbell, Wen Congyang, Ian Jackson, Jiang Yunhong,
	Dong Eddie, Yang Hongyang, Lai Jiangshan

Currently, libxl__domain_unpause() only supports
qemu-xen-traditional. Update it to support qemu-xen.

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
 tools/libxl/libxl.c          | 13 +++++--------
 tools/libxl/libxl_dom.c      | 25 +++++++++++++++++++++++++
 tools/libxl/libxl_internal.h |  2 ++
 3 files changed, 32 insertions(+), 8 deletions(-)

diff --git a/tools/libxl/libxl.c b/tools/libxl/libxl.c
index 50213a9..c51fd63 100644
--- a/tools/libxl/libxl.c
+++ b/tools/libxl/libxl.c
@@ -967,8 +967,6 @@ out:
 
 int libxl__domain_unpause(libxl__gc *gc, uint32_t domid)
 {
-    char *path;
-    char *state;
     int ret, rc = 0;
 
     libxl_domain_type type = libxl__domain_type(gc, domid);
@@ -978,12 +976,11 @@ int libxl__domain_unpause(libxl__gc *gc, uint32_t domid)
     }
 
     if (type == LIBXL_DOMAIN_TYPE_HVM) {
-        path = libxl__sprintf(gc, "/local/domain/0/device-model/%d/state", domid);
-        state = libxl__xs_read(gc, XBT_NULL, path);
-        if (state != NULL && !strcmp(state, "paused")) {
-            libxl__qemu_traditional_cmd(gc, domid, "continue");
-            libxl__wait_for_device_model_deprecated(gc, domid, "running",
-                                         NULL, NULL, NULL);
+        rc = libxl__domain_unpause_device_model(gc, domid);
+        if (rc < 0) {
+            LOG(ERROR, "failed to unpause device model for domain %u:%d",
+                domid, rc);
+            goto out;
         }
     }
 
diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
index 3e22322..55a579b 100644
--- a/tools/libxl/libxl_dom.c
+++ b/tools/libxl/libxl_dom.c
@@ -2008,6 +2008,31 @@ static void remus_teardown_done(libxl__egc *egc,
     dss->callback(egc, dss, rc);
 }
 
+int libxl__domain_unpause_device_model(libxl__gc *gc, uint32_t domid)
+{
+    char *path;
+    char *state;
+
+    switch (libxl__device_model_version_running(gc, domid)) {
+    case LIBXL_DEVICE_MODEL_VERSION_QEMU_XEN_TRADITIONAL:
+        path = libxl__sprintf(gc, "/local/domain/0/device-model/%d/state", domid);
+        state = libxl__xs_read(gc, XBT_NULL, path);
+        if (state != NULL && !strcmp(state, "paused")) {
+            libxl__qemu_traditional_cmd(gc, domid, "continue");
+            libxl__wait_for_device_model_deprecated(gc, domid, "running",
+                                         NULL, NULL, NULL);
+        }
+        break;
+    case LIBXL_DEVICE_MODEL_VERSION_QEMU_XEN:
+        if (libxl__qmp_resume(gc, domid))
+            return ERROR_FAIL;
+        break;
+    default:
+        return ERROR_FAIL;
+    }
+
+    return 0;
+}
+
 /*==================== Miscellaneous ====================*/
 
 char *libxl__uuid2string(libxl__gc *gc, const libxl_uuid uuid)
diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
index c60466c..6bfe5e2 100644
--- a/tools/libxl/libxl_internal.h
+++ b/tools/libxl/libxl_internal.h
@@ -992,6 +992,8 @@ _hidden int libxl__toolstack_restore(uint32_t domid, const uint8_t *buf,
 _hidden int libxl__domain_resume_device_model(libxl__gc *gc,
                                               uint32_t domid,
                                               int read_savefile);
+_hidden int libxl__domain_unpause_device_model(libxl__gc *gc,
+                                               uint32_t domid);
 
 _hidden void libxl__userdata_destroyall(libxl__gc *gc, uint32_t domid);
 
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 64+ messages in thread
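
A minimal sketch of dispatching on the running device model, in which every case ends with its own break so that only the matching backend is invoked. The enum, the struct dm_calls counters and unpause_device_model() are hypothetical names used for illustration; they are not libxl identifiers, and the counters stand in for the real "continue" command and QMP resume.

```c
#include <assert.h>

/* Hypothetical device model versions; not the libxl enum. */
enum dm_version { DM_UNKNOWN, DM_QEMU_TRADITIONAL, DM_QEMU_XEN };

/* Records which backend was asked to resume. */
struct dm_calls {
    int traditional_continue;
    int qmp_resume;
};

int unpause_device_model(enum dm_version v, struct dm_calls *calls)
{
    switch (v) {
    case DM_QEMU_TRADITIONAL:
        calls->traditional_continue++;
        break;   /* without this, control would fall into the next case */
    case DM_QEMU_XEN:
        calls->qmp_resume++;
        break;
    default:
        return -1;   /* unknown device model */
    }
    return 0;
}
```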

* [RFC Patch v2 09/45] support to resume uncooperative HVM guests
  2014-08-08  7:00 [RFC Patch v2 00/45] COarse-grain LOck-stepping Virtual Machines for Non-stop Service Wen Congyang
                   ` (7 preceding siblings ...)
  2014-08-08  7:01 ` [RFC Patch v2 08/45] Update libxl__domain_unpause() to support qemu-xen Wen Congyang
@ 2014-08-08  7:01 ` Wen Congyang
  2014-08-08  7:01 ` [RFC Patch v2 10/45] update datacopier to support sending data only Wen Congyang
                   ` (37 subsequent siblings)
  46 siblings, 0 replies; 64+ messages in thread
From: Wen Congyang @ 2014-08-08  7:01 UTC (permalink / raw)
  To: xen devel
  Cc: Ian Campbell, Wen Congyang, Ian Jackson, Jiang Yunhong,
	Dong Eddie, Yang Hongyang, Lai Jiangshan

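xc_domain_resume_any() currently refuses to resume HVM guests.
Add xc_domain_resume_hvm(), which issues XEN_DOMCTL_resumedomain.

For a PVHVM guest, the suspend hypercall returns 0, so the guest
resumes in a new domain context.

For a plain HVM guest, nothing else needs to be done.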

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
 tools/libxc/xc_resume.c | 20 ++++++++++++++++----
 1 file changed, 16 insertions(+), 4 deletions(-)

diff --git a/tools/libxc/xc_resume.c b/tools/libxc/xc_resume.c
index e67bebd..b862ce3 100644
--- a/tools/libxc/xc_resume.c
+++ b/tools/libxc/xc_resume.c
@@ -109,6 +109,21 @@ static int xc_domain_resume_cooperative(xc_interface *xch, uint32_t domid)
     return do_domctl(xch, &domctl);
 }
 
+static int xc_domain_resume_hvm(xc_interface *xch, uint32_t domid)
+{
+    DECLARE_DOMCTL;
+
+    /*
+     * For a PVHVM guest, the suspend hypercall returns 0, so the guest
+     * resumes in a new domain context.
+     *
+     * For a plain HVM guest, nothing else needs to be done.
+     */
+    domctl.cmd = XEN_DOMCTL_resumedomain;
+    domctl.domain = domid;
+    return do_domctl(xch, &domctl);
+}
+
 static int xc_domain_resume_any(xc_interface *xch, uint32_t domid)
 {
     DECLARE_DOMCTL;
@@ -138,10 +153,7 @@ static int xc_domain_resume_any(xc_interface *xch, uint32_t domid)
      */
 #if defined(__i386__) || defined(__x86_64__)
     if ( info.hvm )
-    {
-        ERROR("Cannot resume uncooperative HVM guests");
-        return rc;
-    }
+        return xc_domain_resume_hvm(xch, domid);
 
     if ( xc_domain_get_guest_width(xch, domid, &dinfo->guest_width) != 0 )
     {
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [RFC Patch v2 10/45] update datacopier to support sending data only
  2014-08-08  7:00 [RFC Patch v2 00/45] COarse-grain LOck-stepping Virtual Machines for Non-stop Service Wen Congyang
                   ` (8 preceding siblings ...)
  2014-08-08  7:01 ` [RFC Patch v2 09/45] support to resume uncooperative HVM guests Wen Congyang
@ 2014-08-08  7:01 ` Wen Congyang
  2014-08-08  7:01 ` [RFC Patch v2 11/45] introduce a new API to async read data from fd Wen Congyang
                   ` (36 subsequent siblings)
  46 siblings, 0 replies; 64+ messages in thread
From: Wen Congyang @ 2014-08-08  7:01 UTC (permalink / raw)
  To: xen devel
  Cc: Ian Campbell, Wen Congyang, Ian Jackson, Jiang Yunhong,
	Dong Eddie, Yang Hongyang, Lai Jiangshan

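datacopier reads data from one fd and writes it out to another.
If we already have the data and only need to send it over the
network, there is no fd to read from, so datacopier cannot be
used. Update it to skip registering the read fd when
dc->readfd is negative, so that it can be used to send data only.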

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
 tools/libxl/libxl_aoutils.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/tools/libxl/libxl_aoutils.c b/tools/libxl/libxl_aoutils.c
index b10d2e1..3e0c0ae 100644
--- a/tools/libxl/libxl_aoutils.c
+++ b/tools/libxl/libxl_aoutils.c
@@ -309,9 +309,11 @@ int libxl__datacopier_start(libxl__datacopier_state *dc)
 
     libxl__datacopier_init(dc);
 
-    rc = libxl__ev_fd_register(gc, &dc->toread, datacopier_readable,
-                               dc->readfd, POLLIN);
-    if (rc) goto out;
+    if (dc->readfd >= 0) {
+        rc = libxl__ev_fd_register(gc, &dc->toread, datacopier_readable,
+                                   dc->readfd, POLLIN);
+        if (rc) goto out;
+    }
 
     rc = libxl__ev_fd_register(gc, &dc->towrite, datacopier_writable,
                                dc->writefd, POLLOUT);
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 64+ messages in thread
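
The idea of making the read side optional can be sketched with plain poll(2) structures. This is an illustration only: copier_pollfds() is a hypothetical helper, not part of libxl, and it models the "negative readfd means send only" convention with a pollfd array instead of libxl's event registration.

```c
#include <assert.h>
#include <poll.h>

/*
 * Fill `fds` (room for two entries) with the descriptors a copier
 * should wait on, and return how many entries were used.
 */
int copier_pollfds(int readfd, int writefd, struct pollfd fds[2])
{
    int n = 0;

    if (readfd >= 0) {          /* send-only callers pass a negative readfd */
        fds[n].fd = readfd;
        fds[n].events = POLLIN;
        fds[n].revents = 0;
        n++;
    }

    /* The write side is always registered. */
    fds[n].fd = writefd;
    fds[n].events = POLLOUT;
    fds[n].revents = 0;
    n++;

    return n;
}
```

With a negative readfd only the write descriptor is watched, so a caller that already holds its data never waits on input.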

* [RFC Patch v2 11/45] introduce a new API to async read data from fd
  2014-08-08  7:00 [RFC Patch v2 00/45] COarse-grain LOck-stepping Virtual Machines for Non-stop Service Wen Congyang
                   ` (9 preceding siblings ...)
  2014-08-08  7:01 ` [RFC Patch v2 10/45] update datacopier to support sending data only Wen Congyang
@ 2014-08-08  7:01 ` Wen Congyang
  2014-08-08  7:01 ` [RFC Patch v2 12/45] move remus related codes to libxl_remus.c Wen Congyang
                   ` (35 subsequent siblings)
  46 siblings, 0 replies; 64+ messages in thread
From: Wen Congyang @ 2014-08-08  7:01 UTC (permalink / raw)
  To: xen devel
  Cc: Ian Campbell, Wen Congyang, Ian Jackson, Jiang Yunhong,
	Dong Eddie, Yang Hongyang, Lai Jiangshan

In COLO mode, we need to read data from an fd asynchronously.
Introduce a new API, libxl__datareader_start(), to avoid
duplicating that code.

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
 tools/libxl/libxl_aoutils.c  | 73 ++++++++++++++++++++++++++++++++++++++++++++
 tools/libxl/libxl_internal.h | 30 ++++++++++++++++++
 2 files changed, 103 insertions(+)

diff --git a/tools/libxl/libxl_aoutils.c b/tools/libxl/libxl_aoutils.c
index 3e0c0ae..2d36403 100644
--- a/tools/libxl/libxl_aoutils.c
+++ b/tools/libxl/libxl_aoutils.c
@@ -542,3 +542,76 @@ bool libxl__async_exec_inuse(const libxl__async_exec_state *aes)
     assert(time_inuse == child_inuse);
     return child_inuse;
 }
+
+
+/*----- data reader -----*/
+
+static void libxl__datareader_init(libxl__datareader_state *drs)
+{
+    assert(drs->ao);
+    libxl__ev_fd_init(&drs->toread);
+    drs->used = 0;
+}
+
+static void libxl__datareader_kill(libxl__datareader_state *drs)
+{
+    STATE_AO_GC(drs->ao);
+
+    libxl__ev_fd_deregister(gc, &drs->toread);
+}
+
+static void datareader_callback(libxl__egc *egc, libxl__datareader_state *drs,
+                                ssize_t size, int errnoval)
+{
+    libxl__datareader_kill(drs);
+    drs->callback(egc, drs, size, errnoval);
+}
+
+static void datareader_readable(libxl__egc *egc, libxl__ev_fd *ev,
+                                int fd, short events, short revents)
+{
+    libxl__datareader_state *drs = CONTAINER_OF(ev, *drs, toread);
+    STATE_AO_GC(drs->ao);
+    int r;
+
+    if (revents & ~POLLIN) {
+        LOG(ERROR, "unexpected poll event 0x%x (should be POLLIN) on %s",
+            revents, drs->readwhat);
+        datareader_callback(egc, drs, -1, 0);
+        return;
+    }
+
+    assert(revents & POLLIN);
+    while (1) {
+        r = read(ev->fd, drs->buf + drs->used, drs->readsize - drs->used);
+        if (r < 0) {
+            if (errno == EINTR)
+                continue;
+            if (errno == EWOULDBLOCK)
+                break;
+            LOGE(ERROR, "error reading %s",
+                 drs->readwhat);
+            datareader_callback(egc, drs, 0, errno);
+            return;
+        }
+        if (r == 0) {
+            datareader_callback(egc, drs, drs->used, 0);
+            break;
+        }
+
+        drs->used += r;
+    }
+}
+
+int libxl__datareader_start(libxl__datareader_state *drs)
+{
+    int rc;
+    STATE_AO_GC(drs->ao);
+
+    libxl__datareader_init(drs);
+
+    rc = libxl__ev_fd_register(gc, &drs->toread, datareader_readable,
+                               drs->readfd, POLLIN);
+
+    return rc;
+}
diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
index 6bfe5e2..0a615e8 100644
--- a/tools/libxl/libxl_internal.h
+++ b/tools/libxl/libxl_internal.h
@@ -2092,6 +2092,36 @@ void libxl__async_exec_init(libxl__async_exec_state *aes);
 int libxl__async_exec_start(libxl__gc *gc, libxl__async_exec_state *aes);
 bool libxl__async_exec_inuse(const libxl__async_exec_state *aes);
 
+/*----- datareader: read data from one fd to buffer -----*/
+
+typedef struct libxl__datareader_state libxl__datareader_state;
+
+/*
+ * real_size>=1 means all data was read
+ * real_size==0 means failure happened when reading, errnoval is valid, logged
+ * real_size==-1 means some other internal failure, errnoval not valid, logged
+ * In all cases reader is killed before calling this callback
+ */
+typedef void libxl__datareader_callback(libxl__egc *egc,
+     libxl__datareader_state *drs, ssize_t real_size, int errnoval);
+
+struct libxl__datareader_state {
+    /* caller must fill these in, and they must all remain valid */
+    libxl__ao *ao;
+    int readfd;
+    ssize_t readsize;
+    /* for error msgs */
+    const char *readwhat;
+    libxl__datareader_callback *callback;
+    /* It must contain enough space to store readsize bytes */
+    void *buf;
+    /* remaining fields are private to datareader */
+    libxl__ev_fd toread;
+    ssize_t used;
+};
+
+_hidden int libxl__datareader_start(libxl__datareader_state *drs);
+
 /*----- device addition/removal -----*/
 
 typedef struct libxl__ao_device libxl__ao_device;
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 64+ messages in thread
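
A minimal sketch of the read loop behind such a reader, assuming a non-blocking fd and a caller-supplied buffer. read_some() and the READ_* codes are hypothetical names for illustration, not libxl API. Unlike the patch, the sketch stops explicitly once the buffer is full rather than relying on a zero-length read.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

enum { READ_ERROR = -1, READ_AGAIN = 0, READ_DONE = 1 };

/*
 * Read from a non-blocking fd into buf until it holds `size` bytes.
 * `*used` carries the fill level across calls, so the function can be
 * called again each time the fd becomes readable.
 */
int read_some(int fd, char *buf, size_t size, size_t *used)
{
    while (*used < size) {
        ssize_t r = read(fd, buf + *used, size - *used);

        if (r < 0) {
            if (errno == EINTR)
                continue;               /* interrupted, retry */
            if (errno == EAGAIN || errno == EWOULDBLOCK)
                return READ_AGAIN;      /* wait for the next POLLIN */
            return READ_ERROR;          /* errno describes the failure */
        }
        if (r == 0)
            return READ_DONE;           /* EOF before the buffer filled */

        *used += (size_t)r;
    }
    return READ_DONE;                   /* buffer is full */
}
```

A caller would invoke read_some() from its readable-event handler and fire its completion callback once READ_DONE or READ_ERROR is returned.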

* [RFC Patch v2 12/45] move remus related codes to libxl_remus.c
  2014-08-08  7:00 [RFC Patch v2 00/45] COarse-grain LOck-stepping Virtual Machines for Non-stop Service Wen Congyang
                   ` (10 preceding siblings ...)
  2014-08-08  7:01 ` [RFC Patch v2 11/45] introduce a new API to async read data from fd Wen Congyang
@ 2014-08-08  7:01 ` Wen Congyang
  2014-08-08  7:01 ` [RFC Patch v2 13/45] rename remus device to checkpoint device Wen Congyang
                   ` (34 subsequent siblings)
  46 siblings, 0 replies; 64+ messages in thread
From: Wen Congyang @ 2014-08-08  7:01 UTC (permalink / raw)
  To: xen devel
  Cc: Ian Campbell, Wen Congyang, Ian Jackson, Jiang Yunhong,
	Dong Eddie, Yang Hongyang, Lai Jiangshan

Move the Remus-related code from libxl.c and libxl_dom.c into a
new file, libxl_remus.c. libxl_domain_remus_start() is an
external API, so it is not moved.

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
 tools/libxl/Makefile      |   2 +-
 tools/libxl/libxl.c       |  57 +--------
 tools/libxl/libxl_dom.c   | 220 +-------------------------------
 tools/libxl/libxl_remus.c | 319 ++++++++++++++++++++++++++++++++++++++++++++++
 tools/libxl/libxl_remus.h |  28 ++++
 5 files changed, 353 insertions(+), 273 deletions(-)
 create mode 100644 tools/libxl/libxl_remus.c
 create mode 100644 tools/libxl/libxl_remus.h

diff --git a/tools/libxl/Makefile b/tools/libxl/Makefile
index ba10ab7..69a6a1d 100644
--- a/tools/libxl/Makefile
+++ b/tools/libxl/Makefile
@@ -56,7 +56,7 @@ else
 LIBXL_OBJS-y += libxl_nonetbuffer.o
 endif
 
-LIBXL_OBJS-y += libxl_remus_device.o libxl_remus_disk_drbd.o
+LIBXL_OBJS-y += libxl_remus.o libxl_remus_device.o libxl_remus_disk_drbd.o
 
 LIBXL_OBJS-$(CONFIG_X86) += libxl_cpuid.o libxl_x86.o
 LIBXL_OBJS-$(CONFIG_ARM) += libxl_nocpuid.o libxl_arm.o
diff --git a/tools/libxl/libxl.c b/tools/libxl/libxl.c
index c51fd63..ff93af3 100644
--- a/tools/libxl/libxl.c
+++ b/tools/libxl/libxl.c
@@ -17,6 +17,7 @@
 #include "libxl_osdeps.h"
 
 #include "libxl_internal.h"
+#include "libxl_remus.h"
 
 #define PAGE_TO_MEMKB(pages) ((pages) * 4)
 #define BACKEND_STRING_SIZE 5
@@ -781,11 +782,6 @@ out:
     GC_FREE;
     return ptr;
 }
-
-static void libxl__remus_setup_done(libxl__egc *egc,
-                                    libxl__remus_devices_state *rds, int rc);
-static void libxl__remus_setup_failed(libxl__egc *egc,
-                                      libxl__remus_devices_state *rds, int rc);
 static void remus_failover_cb(libxl__egc *egc,
                               libxl__domain_suspend_state *dss, int rc);
 
@@ -823,63 +819,14 @@ int libxl_domain_remus_start(libxl_ctx *ctx, libxl_domain_remus_info *info,
 
     assert(info);
 
-    /* Convenience aliases */
-    libxl__remus_devices_state *const rds = &dss->rds;
-
-    if (info->netbuf) {
-        if (!libxl__netbuffer_enabled(gc)) {
-            LOG(ERROR, "Remus: No support for network buffering");
-            goto out;
-        }
-        rds->device_kind_flags |= LIBXL__REMUS_DEVICE_NIC;
-    }
-
-    if (info->diskbuf)
-        rds->device_kind_flags |= LIBXL__REMUS_DEVICE_DISK;
-
-    rds->ao = ao;
-    rds->egc = egc;
-    rds->domid = domid;
-    rds->callback = libxl__remus_setup_done;
-
     /* Point of no return */
-    libxl__remus_devices_setup(egc, rds);
+    libxl__remus_setup(egc, dss);
     return AO_INPROGRESS;
 
  out:
     return AO_ABORT(rc);
 }
 
-static void libxl__remus_setup_done(libxl__egc *egc,
-                                    libxl__remus_devices_state *rds, int rc)
-{
-    libxl__domain_suspend_state *dss = CONTAINER_OF(rds, *dss, rds);
-    STATE_AO_GC(dss->ao);
-
-    if (!rc) {
-        libxl__domain_suspend(egc, dss);
-        return;
-    }
-
-    LOG(ERROR, "Remus: failed to setup device for guest with domid %u, rc %d",
-        dss->domid, rc);
-    rds->callback = libxl__remus_setup_failed;
-    libxl__remus_devices_teardown(egc, rds);
-}
-
-static void libxl__remus_setup_failed(libxl__egc *egc,
-                                      libxl__remus_devices_state *rds, int rc)
-{
-    libxl__domain_suspend_state *dss = CONTAINER_OF(rds, *dss, rds);
-    STATE_AO_GC(dss->ao);
-
-    if (rc)
-        LOG(ERROR, "Remus: failed to teardown device after setup failed"
-            " for guest with domid %u, rc %d", dss->domid, rc);
-
-    dss->callback(egc, dss, rc);
-}
-
 static void remus_failover_cb(libxl__egc *egc,
                               libxl__domain_suspend_state *dss, int rc)
 {
diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
index 55a579b..f819846 100644
--- a/tools/libxl/libxl_dom.c
+++ b/tools/libxl/libxl_dom.c
@@ -19,6 +19,7 @@
 
 #include "libxl_internal.h"
 #include "libxl_arch.h"
+#include "libxl_remus.h"
 
 #include <xc_dom.h>
 #include <xen/hvm/hvm_info_table.h>
@@ -1547,194 +1548,6 @@ static void domain_suspend_callback_common_done(libxl__egc *egc,
     libxl__xc_domain_saverestore_async_callback_done(egc, &dss->shs, ok);
 }
 
-/*----- remus callbacks -----*/
-static void remus_domain_suspend_callback_common_done(libxl__egc *egc,
-                                libxl__domain_suspend_state2 *dss2, int ok);
-static void remus_devices_postsuspend_cb(libxl__egc *egc,
-                                         libxl__remus_devices_state *rds,
-                                         int rc);
-static void remus_devices_preresume_cb(libxl__egc *egc,
-                                       libxl__remus_devices_state *rds,
-                                       int rc);
-
-static void libxl__remus_domain_suspend_callback(void *data)
-{
-    libxl__save_helper_state *shs = data;
-    libxl__egc *egc = shs->egc;
-    libxl__domain_suspend_state *dss = CONTAINER_OF(shs, *dss, shs);
-
-    /* Convenience aliases */
-    libxl__domain_suspend_state2 *const dss2 = &dss->dss2;
-
-    dss2->callback_common_done = remus_domain_suspend_callback_common_done;
-    domain_suspend_callback_common(egc, dss2);
-}
-
-static void remus_domain_suspend_callback_common_done(libxl__egc *egc,
-                                libxl__domain_suspend_state2 *dss2, int ok)
-{
-    libxl__domain_suspend_state *dss = CONTAINER_OF(dss2, *dss, dss2);
-
-    if (!ok)
-        goto out;
-
-    libxl__remus_devices_state *const rds = &dss->rds;
-    rds->callback = remus_devices_postsuspend_cb;
-    libxl__remus_devices_postsuspend(egc, rds);
-    return;
-
-out:
-    libxl__xc_domain_saverestore_async_callback_done(egc, &dss->shs, ok);
-}
-
-static void remus_devices_postsuspend_cb(libxl__egc *egc,
-                                         libxl__remus_devices_state *rds,
-                                         int rc)
-{
-    int ok = 0;
-    libxl__domain_suspend_state *dss = CONTAINER_OF(rds, *dss, rds);
-
-    if (rc)
-        goto out;
-
-    ok = 1;
-
-out:
-    libxl__xc_domain_saverestore_async_callback_done(egc, &dss->shs, ok);
-}
-
-static void libxl__remus_domain_resume_callback(void *data)
-{
-    libxl__save_helper_state *shs = data;
-    libxl__egc *egc = shs->egc;
-    libxl__domain_suspend_state *dss = CONTAINER_OF(shs, *dss, shs);
-    STATE_AO_GC(dss->ao);
-
-    libxl__remus_devices_state *const rds = &dss->rds;
-    rds->callback = remus_devices_preresume_cb;
-    libxl__remus_devices_preresume(egc, rds);
-}
-
-static void remus_devices_preresume_cb(libxl__egc *egc,
-                                       libxl__remus_devices_state *rds,
-                                       int rc)
-{
-    int ok = 0;
-    libxl__domain_suspend_state *dss = CONTAINER_OF(rds, *dss, rds);
-    STATE_AO_GC(dss->ao);
-
-    if (rc)
-        goto out;
-
-    /* Resumes the domain and the device model */
-    if (!libxl__domain_resume(gc, dss->domid, /* Fast Suspend */1, 0))
-        ok = 1;
-
-out:
-    libxl__xc_domain_saverestore_async_callback_done(egc, &dss->shs, ok);
-}
-
-/*----- remus asynchronous checkpoint callback -----*/
-
-static void remus_checkpoint_dm_saved(libxl__egc *egc,
-                                      libxl__domain_suspend_state *dss, int rc);
-static void remus_devices_commit_cb(libxl__egc *egc,
-                                    libxl__remus_devices_state *rds,
-                                    int rc);
-static void remus_next_checkpoint(libxl__egc *egc, libxl__ev_time *ev,
-                                  const struct timeval *requested_abs);
-
-static void libxl__remus_domain_checkpoint_callback(void *data)
-{
-    libxl__save_helper_state *shs = data;
-    libxl__domain_suspend_state *dss = CONTAINER_OF(shs, *dss, shs);
-    libxl__egc *egc = dss->shs.egc;
-    STATE_AO_GC(dss->ao);
-
-    /* This would go into tailbuf. */
-    if (dss->hvm) {
-        libxl__domain_save_device_model(egc, dss, remus_checkpoint_dm_saved);
-    } else {
-        remus_checkpoint_dm_saved(egc, dss, 0);
-    }
-}
-
-static void remus_checkpoint_dm_saved(libxl__egc *egc,
-                                      libxl__domain_suspend_state *dss, int rc)
-{
-    /* Convenience aliases */
-    libxl__remus_devices_state *const rds = &dss->rds;
-
-    STATE_AO_GC(dss->ao);
-
-    if (rc) {
-        LOG(ERROR, "Failed to save device model. Terminating Remus..");
-        goto out;
-    }
-
-    rds->callback = remus_devices_commit_cb;
-    libxl__remus_devices_commit(egc, rds);
-
-    return;
-
-out:
-    libxl__xc_domain_saverestore_async_callback_done(egc, &dss->shs, 0);
-}
-
-static void remus_devices_commit_cb(libxl__egc *egc,
-                                    libxl__remus_devices_state *rds,
-                                    int rc)
-{
-    libxl__domain_suspend_state *dss = CONTAINER_OF(rds, *dss, rds);
-
-    STATE_AO_GC(dss->ao);
-
-    if (rc) {
-        LOG(ERROR, "Failed to do device commit op."
-            " Terminating Remus..");
-        goto out;
-    }
-
-    /*
-     * At this point, we have successfully checkpointed the guest and
-     * committed it at the backup. We'll come back after the checkpoint
-     * interval to checkpoint the guest again. Until then, let the guest
-     * continue execution.
-     */
-
-    /* Set checkpoint interval timeout */
-    rc = libxl__ev_time_register_rel(gc, &dss->checkpoint_timeout,
-                                     remus_next_checkpoint,
-                                     dss->interval);
-
-    if (rc) {
-        LOG(ERROR, "unable to register timeout for next epoch."
-            " Terminating Remus..");
-        goto out;
-    }
-
-    return;
-
-out:
-    libxl__xc_domain_saverestore_async_callback_done(egc, &dss->shs, 0);
-}
-
-static void remus_next_checkpoint(libxl__egc *egc, libxl__ev_time *ev,
-                                  const struct timeval *requested_abs)
-{
-    libxl__domain_suspend_state *dss =
-                            CONTAINER_OF(ev, *dss, checkpoint_timeout);
-
-    STATE_AO_GC(dss->ao);
-
-    /*
-     * Time to checkpoint the guest again. We return 1 to libxc
-     * (xc_domain_save.c). in order to continue executing the infinite loop
-     * (suspend, checkpoint, resume) in xc_domain_save().
-     */
-    libxl__xc_domain_saverestore_async_callback_done(egc, &dss->shs, 1);
-}
-
 /*----- main code for suspending, in order of execution -----*/
 void libxl__domain_suspend2(libxl__egc *egc,
                             libxl__domain_suspend_state2 *dss2)
@@ -1958,10 +1771,6 @@ static void save_device_model_datacopier_done(libxl__egc *egc,
     dss->save_dm_callback(egc, dss, our_rc);
 }
 
-static void remus_teardown_done(libxl__egc *egc,
-                                       libxl__remus_devices_state *rds,
-                                       int rc);
-
 static void domain_suspend_done(libxl__egc *egc,
                         libxl__domain_suspend_state *dss, int rc)
 {
@@ -1977,34 +1786,11 @@ static void domain_suspend_done(libxl__egc *egc,
         xc_suspend_evtchn_release(CTX->xch, CTX->xce, domid,
                            dss2->guest_evtchn.port, &dss2->guest_evtchn_lockfd);
 
-    if (!dss->remus) {
-        remus_teardown_done(egc, &dss->rds, rc);
+    if (dss->remus) {
+        libxl__remus_teardown(egc, dss, rc);
         return;
     }
 
-    /*
-     * With Remus, if we reach this point, it means either
-     * backup died or some network error occurred preventing us
-     * from sending checkpoints. Teardown the network buffers and
-     * release netlink resources.  This is an async op.
-     */
-    LOG(WARN, "Remus: Domain suspend terminated with rc %d,"
-        " teardown Remus devices...", rc);
-    dss->rds.callback = remus_teardown_done;
-    libxl__remus_devices_teardown(egc, &dss->rds);
-}
-
-static void remus_teardown_done(libxl__egc *egc,
-                                       libxl__remus_devices_state *rds,
-                                       int rc)
-{
-    libxl__domain_suspend_state *dss = CONTAINER_OF(rds, *dss, rds);
-    STATE_AO_GC(dss->ao);
-
-    if (rc)
-        LOG(ERROR, "Remus: failed to teardown device for guest with domid %u,"
-            " rc %d", dss->domid, rc);
-
     dss->callback(egc, dss, rc);
 }
 
diff --git a/tools/libxl/libxl_remus.c b/tools/libxl/libxl_remus.c
new file mode 100644
index 0000000..9747a13
--- /dev/null
+++ b/tools/libxl/libxl_remus.c
@@ -0,0 +1,319 @@
+/*
+ * Copyright (C) 2014 FUJITSU LIMITED
+ * Author Wen Congyang <wency@cn.fujitsu.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU Lesser General Public License as published
+ * by the Free Software Foundation; version 2.1 only. with the special
+ * exception on linking described in file LICENSE.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU Lesser General Public License for more details.
+ */
+
+#include "libxl_osdeps.h" /* must come before any other headers */
+
+#include "libxl_internal.h"
+#include "libxl_remus.h"
+
+
+/*----- remus: setup the environment -----*/
+static void libxl__remus_setup_done(libxl__egc *egc,
+                                    libxl__remus_devices_state *rds, int rc);
+static void libxl__remus_setup_failed(libxl__egc *egc,
+                                      libxl__remus_devices_state *rds, int rc);
+
+void libxl__remus_setup(libxl__egc *egc,
+                        libxl__domain_suspend_state *dss)
+{
+    /* Convenience aliases */
+    libxl__remus_devices_state *const rds = &dss->rds;
+    const libxl_domain_remus_info *const info = dss->remus;
+
+    STATE_AO_GC(dss->ao);
+
+    if (info->netbuf) {
+        if (!libxl__netbuffer_enabled(gc)) {
+            LOG(ERROR, "Remus: No support for network buffering");
+            goto out;
+        }
+        rds->device_kind_flags |= LIBXL__REMUS_DEVICE_NIC;
+    }
+
+    if (info->diskbuf)
+        rds->device_kind_flags |= LIBXL__REMUS_DEVICE_DISK;
+
+    rds->ao = ao;
+    rds->egc = egc;
+    rds->domid = dss->domid;
+    rds->callback = libxl__remus_setup_done;
+
+    libxl__remus_devices_setup(egc, rds);
+    return;
+
+out:
+    libxl__remus_setup_failed(egc, rds, ERROR_FAIL);
+}
+
+static void libxl__remus_setup_done(libxl__egc *egc,
+                                    libxl__remus_devices_state *rds, int rc)
+{
+    libxl__domain_suspend_state *dss = CONTAINER_OF(rds, *dss, rds);
+    STATE_AO_GC(dss->ao);
+
+    if (!rc) {
+        libxl__domain_suspend(egc, dss);
+        return;
+    }
+
+    LOG(ERROR, "Remus: failed to setup device for guest with domid %u, rc %d",
+        dss->domid, rc);
+    rds->callback = libxl__remus_setup_failed;
+    libxl__remus_devices_teardown(egc, rds);
+}
+
+static void libxl__remus_setup_failed(libxl__egc *egc,
+                                      libxl__remus_devices_state *rds,
+                                      int rc)
+{
+    libxl__domain_suspend_state *dss = CONTAINER_OF(rds, *dss, rds);
+    STATE_AO_GC(dss->ao);
+
+    if (rc)
+        LOG(ERROR, "Remus: failed to teardown device after setup failed"
+            " for guest with domid %u, rc %d", dss->domid, rc);
+
+    dss->callback(egc, dss, rc);
+}
+
+
+/*----- remus: teardown the environment -----*/
+static void remus_teardown_done(libxl__egc *egc,
+                                libxl__remus_devices_state *rds,
+                                int rc);
+
+void libxl__remus_teardown(libxl__egc *egc,
+                           libxl__domain_suspend_state *dss,
+                           int rc)
+{
+    EGC_GC;
+
+    /*
+     * If we reach this point, it means either backup died or some
+     * network error occurred preventing us from sending checkpoints.
+     * Teardown the network buffers and release netlink resources.
+     * This is an async op.
+     */
+    LOG(WARN, "Remus: Domain suspend terminated with rc %d,"
+        " teardown Remus devices...", rc);
+    dss->rds.callback = remus_teardown_done;
+    libxl__remus_devices_teardown(egc, &dss->rds);
+}
+
+static void remus_teardown_done(libxl__egc *egc,
+                                libxl__remus_devices_state *rds,
+                                int rc)
+{
+    libxl__domain_suspend_state *dss = CONTAINER_OF(rds, *dss, rds);
+    STATE_AO_GC(dss->ao);
+
+    if (rc)
+        LOG(ERROR, "Remus: failed to teardown device for guest with domid %u,"
+            " rc %d", dss->domid, rc);
+
+    dss->callback(egc, dss, rc);
+}
+
+
+/*----- remus: suspend the guest -----*/
+static void remus_domain_suspend_callback_common_done(libxl__egc *egc,
+                                libxl__domain_suspend_state2 *dss2, int ok);
+static void remus_devices_postsuspend_cb(libxl__egc *egc,
+                                         libxl__remus_devices_state *rds,
+                                         int rc);
+
+void libxl__remus_domain_suspend_callback(void *data)
+{
+    libxl__save_helper_state *shs = data;
+    libxl__egc *egc = shs->egc;
+    libxl__domain_suspend_state *dss = CONTAINER_OF(shs, *dss, shs);
+
+    /* Convenience aliases */
+    libxl__domain_suspend_state2 *const dss2 = &dss->dss2;
+
+    dss2->callback_common_done = remus_domain_suspend_callback_common_done;
+    libxl__domain_suspend2(egc, dss2);
+}
+
+static void remus_domain_suspend_callback_common_done(libxl__egc *egc,
+                                libxl__domain_suspend_state2 *dss2, int ok)
+{
+    libxl__domain_suspend_state *dss = CONTAINER_OF(dss2, *dss, dss2);
+
+    if (!ok)
+        goto out;
+
+    libxl__remus_devices_state *const rds = &dss->rds;
+    rds->callback = remus_devices_postsuspend_cb;
+    libxl__remus_devices_postsuspend(egc, rds);
+    return;
+
+out:
+    libxl__xc_domain_saverestore_async_callback_done(egc, &dss->shs, ok);
+}
+
+static void remus_devices_postsuspend_cb(libxl__egc *egc,
+                                         libxl__remus_devices_state *rds,
+                                         int rc)
+{
+    int ok = 0;
+    libxl__domain_suspend_state *dss = CONTAINER_OF(rds, *dss, rds);
+
+    if (rc)
+        goto out;
+
+    ok = 1;
+
+out:
+    libxl__xc_domain_saverestore_async_callback_done(egc, &dss->shs, ok);
+}
+
+
+/*----- remus: resume the guest -----*/
+static void remus_devices_preresume_cb(libxl__egc *egc,
+                                       libxl__remus_devices_state *rds,
+                                       int rc);
+
+void libxl__remus_domain_resume_callback(void *data)
+{
+    libxl__save_helper_state *shs = data;
+    libxl__egc *egc = shs->egc;
+    libxl__domain_suspend_state *dss = CONTAINER_OF(shs, *dss, shs);
+    STATE_AO_GC(dss->ao);
+
+    libxl__remus_devices_state *const rds = &dss->rds;
+    rds->callback = remus_devices_preresume_cb;
+    libxl__remus_devices_preresume(egc, rds);
+}
+
+static void remus_devices_preresume_cb(libxl__egc *egc,
+                                       libxl__remus_devices_state *rds,
+                                       int rc)
+{
+    int ok = 0;
+    libxl__domain_suspend_state *dss = CONTAINER_OF(rds, *dss, rds);
+    STATE_AO_GC(dss->ao);
+
+    if (rc)
+        goto out;
+
+    /* Resumes the domain and the device model */
+    if (!libxl__domain_resume(gc, dss->domid, /* Fast Suspend */1, 0))
+        ok = 1;
+
+out:
+    libxl__xc_domain_saverestore_async_callback_done(egc, &dss->shs, ok);
+}
+
+
+/*----- remus: wait for a new checkpoint -----*/
+static void remus_checkpoint_dm_saved(libxl__egc *egc,
+                                      libxl__domain_suspend_state *dss, int rc);
+static void remus_devices_commit_cb(libxl__egc *egc,
+                                    libxl__remus_devices_state *rds,
+                                    int rc);
+static void remus_next_checkpoint(libxl__egc *egc, libxl__ev_time *ev,
+                                  const struct timeval *requested_abs);
+
+void libxl__remus_domain_checkpoint_callback(void *data)
+{
+    libxl__save_helper_state *shs = data;
+    libxl__domain_suspend_state *dss = CONTAINER_OF(shs, *dss, shs);
+    libxl__egc *egc = dss->shs.egc;
+    STATE_AO_GC(dss->ao);
+
+    /* This would go into tailbuf. */
+    if (dss->hvm) {
+        libxl__domain_save_device_model(egc, dss, remus_checkpoint_dm_saved);
+    } else {
+        remus_checkpoint_dm_saved(egc, dss, 0);
+    }
+}
+
+static void remus_checkpoint_dm_saved(libxl__egc *egc,
+                                      libxl__domain_suspend_state *dss, int rc)
+{
+    /* Convenience aliases */
+    libxl__remus_devices_state *const rds = &dss->rds;
+
+    STATE_AO_GC(dss->ao);
+
+    if (rc) {
+        LOG(ERROR, "Failed to save device model. Terminating Remus..");
+        goto out;
+    }
+
+    rds->callback = remus_devices_commit_cb;
+    libxl__remus_devices_commit(egc, rds);
+
+    return;
+
+out:
+    libxl__xc_domain_saverestore_async_callback_done(egc, &dss->shs, 0);
+}
+
+static void remus_devices_commit_cb(libxl__egc *egc,
+                                    libxl__remus_devices_state *rds,
+                                    int rc)
+{
+    libxl__domain_suspend_state *dss = CONTAINER_OF(rds, *dss, rds);
+
+    STATE_AO_GC(dss->ao);
+
+    if (rc) {
+        LOG(ERROR, "Failed to do device commit op."
+            " Terminating Remus..");
+        goto out;
+    }
+
+    /*
+     * At this point, we have successfully checkpointed the guest and
+     * committed it at the backup. We'll come back after the checkpoint
+     * interval to checkpoint the guest again. Until then, let the guest
+     * continue execution.
+     */
+
+    /* Set checkpoint interval timeout */
+    rc = libxl__ev_time_register_rel(gc, &dss->checkpoint_timeout,
+                                     remus_next_checkpoint,
+                                     dss->interval);
+
+    if (rc) {
+        LOG(ERROR, "unable to register timeout for next epoch."
+            " Terminating Remus..");
+        goto out;
+    }
+
+    return;
+
+out:
+    libxl__xc_domain_saverestore_async_callback_done(egc, &dss->shs, 0);
+}
+
+static void remus_next_checkpoint(libxl__egc *egc, libxl__ev_time *ev,
+                                  const struct timeval *requested_abs)
+{
+    libxl__domain_suspend_state *dss =
+                            CONTAINER_OF(ev, *dss, checkpoint_timeout);
+
+    STATE_AO_GC(dss->ao);
+
+    /*
+     * Time to checkpoint the guest again. We return 1 to libxc
+     * (xc_domain_save.c) in order to continue executing the infinite loop
+     * (suspend, checkpoint, resume) in xc_domain_save().
+     */
+    libxl__xc_domain_saverestore_async_callback_done(egc, &dss->shs, 1);
+}
diff --git a/tools/libxl/libxl_remus.h b/tools/libxl/libxl_remus.h
new file mode 100644
index 0000000..53e5e81
--- /dev/null
+++ b/tools/libxl/libxl_remus.h
@@ -0,0 +1,28 @@
+/*
+ * Copyright (C) 2014 FUJITSU LIMITED
+ * Author Wen Congyang <wency@cn.fujitsu.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU Lesser General Public License as published
+ * by the Free Software Foundation; version 2.1 only. with the special
+ * exception on linking described in file LICENSE.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU Lesser General Public License for more details.
+ */
+
+#ifndef LIBXL_REMUS_H
+#define LIBXL_REMUS_H
+
+void libxl__remus_setup(libxl__egc *egc,
+                        libxl__domain_suspend_state *dss);
+void libxl__remus_teardown(libxl__egc *egc,
+                           libxl__domain_suspend_state *dss,
+                           int rc);
+void libxl__remus_domain_suspend_callback(void *data);
+void libxl__remus_domain_resume_callback(void *data);
+void libxl__remus_domain_checkpoint_callback(void *data);
+
+#endif
-- 
1.9.3
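
Note on the callbacks moved in this patch: each one is handed a pointer to
an embedded member (rds, dss2, shs or checkpoint_timeout) and recovers the
enclosing libxl__domain_suspend_state with CONTAINER_OF, invoked as
CONTAINER_OF(rds, *dss, rds). The sketch below illustrates only the idea.
The struct names, the field layout and the SKETCH_CONTAINER_OF macro are
hypothetical simplifications, not libxl's definitions.

```c
#include <stddef.h>

/* Simplified stand-in for libxl's CONTAINER_OF (assumption: the real macro
 * has a different argument form and is not reproduced here). */
#define SKETCH_CONTAINER_OF(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical miniature of the per-domain devices state. */
struct sketch_devices_state {
    int num_devices;
};

/* Hypothetical miniature of the suspend state that embeds it by value. */
struct sketch_suspend_state {
    unsigned int domid;
    struct sketch_devices_state rds;   /* embedded, not a pointer */
};

/* A callback that only receives the inner member, in the way
 * remus_teardown_done() only receives rds. */
static unsigned int sketch_domid_from_rds(struct sketch_devices_state *rds)
{
    struct sketch_suspend_state *dss =
        SKETCH_CONTAINER_OF(rds, struct sketch_suspend_state, rds);
    return dss->domid;
}
```

Because the member is embedded by value, no back pointer needs to be stored
in the inner structure; the offset of the member is enough to get back to
the outer one.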


* [RFC Patch v2 13/45] rename remus device to checkpoint device
  2014-08-08  7:00 [RFC Patch v2 00/45] COarse-grain LOck-stepping Virtual Machines for Non-stop Service Wen Congyang
                   ` (11 preceding siblings ...)
  2014-08-08  7:01 ` [RFC Patch v2 12/45] move remus related codes to libxl_remus.c Wen Congyang
@ 2014-08-08  7:01 ` Wen Congyang
  2014-08-08  7:01 ` [RFC Patch v2 14/45] adjust the indentation Wen Congyang
                   ` (33 subsequent siblings)
  46 siblings, 0 replies; 64+ messages in thread
From: Wen Congyang @ 2014-08-08  7:01 UTC (permalink / raw)
  To: xen devel
  Cc: Ian Campbell, Wen Congyang, Ian Jackson, Jiang Yunhong,
	Dong Eddie, Yang Hongyang, Lai Jiangshan

This patch is auto-generated by the following commands:
 1. git mv tools/libxl/libxl_remus_device.c tools/libxl/libxl_checkpoint_device.c
 2. perl -pi -e 's/libxl_remus_device/libxl_checkpoint_device/g' tools/libxl/Makefile
 3. perl -pi -e 's/\blibxl__remus_devices/libxl__checkpoint_devices/g' tools/libxl/*.[ch]
 4. perl -pi -e 's/\blibxl__remus_device\b/libxl__checkpoint_device/g' tools/libxl/*.[ch]
 5. perl -pi -e 's/\blibxl__remus_device_subkind_ops\b/libxl__checkpoint_device_subkind_ops/g' tools/libxl/*.[ch]
 6. perl -pi -e 's/\blibxl__remus_device_kind\b/libxl__checkpoint_device_kind/g' tools/libxl/*.[ch]
 7. perl -pi -e 's/\blibxl__remus_callback\b/libxl__checkpoint_callback/g' tools/libxl/*.[ch]
 8. perl -pi -e 's/\bremus_device_init\b/checkpoint_device_init/g' tools/libxl/*.[ch]
 9. perl -pi -e 's/\bremus_devices_setup\b/checkpoint_devices_setup/g' tools/libxl/*.[ch]
10. perl -pi -e 's/\bdefine_remus_checkpoint_api\b/define_checkpoint_api/g' tools/libxl/*.[ch]
11. perl -pi -e 's/\brds\b/cds/g' tools/libxl/*.[ch]
12. perl -pi -e 's/REMUS_DEVICE/CHECKPOINT_DEVICE/g' tools/libxl/*.[ch] tools/libxl/*.idl
13. perl -pi -e 's/REMUS_DEVOPS/CHECKPOINT_DEVOPS/g' tools/libxl/*.[ch] tools/libxl/*.idl
14. perl -pi -e 's/\bremus\b/checkpoint/g' tools/libxl/libxl_checkpoint_device.[ch]
15. perl -pi -e 's/\bremus device/checkpoint device/g' tools/libxl/libxl_internal.h
16. perl -pi -e 's/\bRemus device/checkpoint device/g' tools/libxl/libxl_internal.h
17. perl -pi -e 's/\bremus abstract/checkpoint abstract/g' tools/libxl/libxl_internal.h
18. perl -pi -e 's/\bremus invocation/checkpoint invocation/g' tools/libxl/libxl_internal.h
19. perl -pi -e 's/\blibxl__remus_device_\(/libxl__checkpoint_device_(/g' tools/libxl/libxl_internal.h

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
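
Note on the \b anchors in the commands above: a substitution anchored on
both sides, such as s/\bremus\b/checkpoint/g, only rewrites the standalone
word, which is why identifiers like remus_ops and remus_device_nic are
unchanged in libxl_checkpoint_device.c while the comment "remus device
setup and teardown" is rewritten. The C sketch below imitates that
behaviour for ASCII input. replace_word() and is_word_char() are
hypothetical helpers written for this note, not part of libxl or Perl.

```c
#include <ctype.h>
#include <string.h>

/* Word characters as \b treats them for ASCII input: [A-Za-z0-9_]. */
static int is_word_char(char c)
{
    return c == '_' || isalnum((unsigned char)c);
}

/*
 * Replace every whole-word occurrence of `from` in `in` with `to`, writing
 * the result to `out` (capacity `cap`, NUL-terminated on success).
 * Assumes `from` starts and ends with a word character, so the check is
 * equivalent to s/\bfrom\b/to/g. Returns the number of replacements, or -1
 * if `out` is too small or `from` is empty.
 */
static int replace_word(const char *in, const char *from, const char *to,
                        char *out, size_t cap)
{
    size_t flen = strlen(from), tlen = strlen(to), o = 0;
    int count = 0;
    const char *p = in;

    if (cap == 0 || flen == 0)
        return -1;

    while (*p) {
        /* Boundary before the match: start of input or a non-word char. */
        int at_start = (p == in) || !is_word_char(p[-1]);

        if (at_start && strncmp(p, from, flen) == 0 &&
            !is_word_char(p[flen])) {      /* boundary after the match */
            if (o + tlen >= cap)
                return -1;
            memcpy(out + o, to, tlen);
            o += tlen;
            p += flen;
            count++;
        } else {
            if (o + 1 >= cap)
                return -1;
            out[o++] = *p++;
        }
    }
    out[o] = '\0';
    return count;
}
```

Commands that carry only a leading \b (for example command 3) deliberately
match prefixes as well, so libxl__remus_devices_setup and its siblings are
all renamed by a single substitution.
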
 tools/libxl/Makefile                               |   2 +-
 ...xl_remus_device.c => libxl_checkpoint_device.c} | 206 ++++++++++-----------
 tools/libxl/libxl_internal.h                       | 118 ++++++------
 tools/libxl/libxl_netbuffer.c                      | 118 ++++++------
 tools/libxl/libxl_nonetbuffer.c                    |  14 +-
 tools/libxl/libxl_remus.c                          |  80 ++++----
 tools/libxl/libxl_remus_disk_drbd.c                |  62 +++----
 tools/libxl/libxl_types.idl                        |   4 +-
 8 files changed, 302 insertions(+), 302 deletions(-)
 rename tools/libxl/{libxl_remus_device.c => libxl_checkpoint_device.c} (42%)
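
Note on define_checkpoint_api (formerly define_remus_checkpoint_api): the
macro stamps out one function per checkpoint operation by pasting the
operation name into the function name and using the same token to select
the member of the ops table. The sketch below shows that pattern in a much
reduced, synchronous form. All sketch_* names are hypothetical, and the
real functions are asynchronous and go through libxl__multidev rather than
returning a count.

```c
#include <stddef.h>

/* Hypothetical ops table: a NULL member means the op is not needed. */
struct sketch_ops {
    void (*postsuspend)(int *counter);
    void (*preresume)(int *counter);
    void (*commit)(int *counter);
};

/* Hypothetical per-device record, reduced to the two fields the loop uses. */
struct sketch_device {
    int set_up;
    const struct sketch_ops *ops;
};

/* `##api` builds the function name; the bare `api` picks the ops member.
 * Devices that are not set up, or whose op is NULL, are skipped. */
#define define_sketch_api(api)                                             \
int sketch_devices_##api(struct sketch_device *devs, int n, int *counter)  \
{                                                                          \
    int i, called = 0;                                                     \
                                                                           \
    for (i = 0; i < n; i++) {                                              \
        if (!devs[i].set_up || !devs[i].ops->api)                          \
            continue;                                                      \
        devs[i].ops->api(counter);                                         \
        called++;                                                          \
    }                                                                      \
    return called;                                                         \
}

define_sketch_api(postsuspend)
define_sketch_api(preresume)
define_sketch_api(commit)

/* Sample op for trying the generated functions: increments a counter. */
static void bump(int *counter)
{
    (*counter)++;
}
```

With an ops table that sets postsuspend and commit to bump and leaves
preresume NULL, sketch_devices_preresume() calls nothing for that device,
which mirrors the "!dev->ops->api" check in the real macro.
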

diff --git a/tools/libxl/Makefile b/tools/libxl/Makefile
index 69a6a1d..5427461 100644
--- a/tools/libxl/Makefile
+++ b/tools/libxl/Makefile
@@ -56,7 +56,7 @@ else
 LIBXL_OBJS-y += libxl_nonetbuffer.o
 endif
 
-LIBXL_OBJS-y += libxl_remus.o libxl_remus_device.o libxl_remus_disk_drbd.o
+LIBXL_OBJS-y += libxl_remus.o libxl_checkpoint_device.o libxl_remus_disk_drbd.o
 
 LIBXL_OBJS-$(CONFIG_X86) += libxl_cpuid.o libxl_x86.o
 LIBXL_OBJS-$(CONFIG_ARM) += libxl_nocpuid.o libxl_arm.o
diff --git a/tools/libxl/libxl_remus_device.c b/tools/libxl/libxl_checkpoint_device.c
similarity index 42%
rename from tools/libxl/libxl_remus_device.c
rename to tools/libxl/libxl_checkpoint_device.c
index b19c372..0036858 100644
--- a/tools/libxl/libxl_remus_device.c
+++ b/tools/libxl/libxl_checkpoint_device.c
@@ -17,9 +17,9 @@
 
 #include "libxl_internal.h"
 
-extern const libxl__remus_device_subkind_ops remus_device_nic;
-extern const libxl__remus_device_subkind_ops remus_device_drbd_disk;
-static const libxl__remus_device_subkind_ops *remus_ops[] = {
+extern const libxl__checkpoint_device_subkind_ops remus_device_nic;
+extern const libxl__checkpoint_device_subkind_ops remus_device_drbd_disk;
+static const libxl__checkpoint_device_subkind_ops *remus_ops[] = {
     &remus_device_nic,
     &remus_device_drbd_disk,
     NULL,
@@ -27,13 +27,13 @@ static const libxl__remus_device_subkind_ops *remus_ops[] = {
 
 /*----- helper functions -----*/
 
-static int init_device_subkind(libxl__remus_devices_state *rds)
+static int init_device_subkind(libxl__checkpoint_devices_state *cds)
 {
     int rc;
-    const libxl__remus_device_subkind_ops **ops;
+    const libxl__checkpoint_device_subkind_ops **ops;
 
     for (ops = remus_ops; *ops; ops++) {
-        rc = (*ops)->init(rds);
+        rc = (*ops)->init(cds);
         if (rc)
             goto out;
     }
@@ -44,12 +44,12 @@ out:
 
 }
 
-static void cleanup_device_subkind(libxl__remus_devices_state *rds)
+static void cleanup_device_subkind(libxl__checkpoint_devices_state *cds)
 {
-    const libxl__remus_device_subkind_ops **ops;
+    const libxl__checkpoint_device_subkind_ops **ops;
 
     for (ops = remus_ops; *ops; ops++)
-        (*ops)->cleanup(rds);
+        (*ops)->cleanup(cds);
 }
 
 /*----- setup() and teardown() -----*/
@@ -63,85 +63,85 @@ static void devices_teardown_cb(libxl__egc *egc,
                                 libxl__multidev *multidev,
                                 int rc);
 
-/* remus device setup and teardown */
+/* checkpoint device setup and teardown */
 
-static libxl__remus_device* remus_device_init(libxl__egc *egc,
-                                              libxl__remus_devices_state *rds,
-                                              libxl__remus_device_kind kind,
+static libxl__checkpoint_device* checkpoint_device_init(libxl__egc *egc,
+                                              libxl__checkpoint_devices_state *cds,
+                                              libxl__checkpoint_device_kind kind,
                                               void *libxl_dev)
 {
-    libxl__remus_device *dev = NULL;
+    libxl__checkpoint_device *dev = NULL;
 
-    STATE_AO_GC(rds->ao);
+    STATE_AO_GC(cds->ao);
     GCNEW(dev);
     dev->backend_dev = libxl_dev;
     dev->kind = kind;
-    dev->rds = rds;
+    dev->cds = cds;
     dev->ops_index = -1;
 
     return dev;
 }
 
-static void remus_devices_setup(libxl__egc *egc,
-                                libxl__remus_devices_state *rds);
+static void checkpoint_devices_setup(libxl__egc *egc,
+                                libxl__checkpoint_devices_state *cds);
 
-void libxl__remus_devices_setup(libxl__egc *egc, libxl__remus_devices_state *rds)
+void libxl__checkpoint_devices_setup(libxl__egc *egc, libxl__checkpoint_devices_state *cds)
 {
     int i, rc;
 
-    STATE_AO_GC(rds->ao);
+    STATE_AO_GC(cds->ao);
 
-    rc = init_device_subkind(rds);
+    rc = init_device_subkind(cds);
     if (rc)
         goto out;
 
-    rds->num_devices = 0;
-    rds->num_nics = 0;
-    rds->num_disks = 0;
+    cds->num_devices = 0;
+    cds->num_nics = 0;
+    cds->num_disks = 0;
 
-    if (rds->device_kind_flags & LIBXL__REMUS_DEVICE_NIC)
-        rds->nics = libxl_device_nic_list(CTX, rds->domid, &rds->num_nics);
+    if (cds->device_kind_flags & LIBXL__CHECKPOINT_DEVICE_NIC)
+        cds->nics = libxl_device_nic_list(CTX, cds->domid, &cds->num_nics);
 
-    if (rds->device_kind_flags & LIBXL__REMUS_DEVICE_DISK)
-        rds->disks = libxl_device_disk_list(CTX, rds->domid, &rds->num_disks);
+    if (cds->device_kind_flags & LIBXL__CHECKPOINT_DEVICE_DISK)
+        cds->disks = libxl_device_disk_list(CTX, cds->domid, &cds->num_disks);
 
-    if (rds->num_nics == 0 && rds->num_disks == 0)
+    if (cds->num_nics == 0 && cds->num_disks == 0)
         goto out;
 
-    GCNEW_ARRAY(rds->dev, rds->num_nics + rds->num_disks);
+    GCNEW_ARRAY(cds->dev, cds->num_nics + cds->num_disks);
 
-    for (i = 0; i < rds->num_nics; i++) {
-        rds->dev[rds->num_devices++] = remus_device_init(egc, rds,
-                                                LIBXL__REMUS_DEVICE_NIC,
-                                                &rds->nics[i]);
+    for (i = 0; i < cds->num_nics; i++) {
+        cds->dev[cds->num_devices++] = checkpoint_device_init(egc, cds,
+                                                LIBXL__CHECKPOINT_DEVICE_NIC,
+                                                &cds->nics[i]);
     }
 
-    for (i = 0; i < rds->num_disks; i++) {
-        rds->dev[rds->num_devices++] = remus_device_init(egc, rds,
-                                                LIBXL__REMUS_DEVICE_DISK,
-                                                &rds->disks[i]);
+    for (i = 0; i < cds->num_disks; i++) {
+        cds->dev[cds->num_devices++] = checkpoint_device_init(egc, cds,
+                                                LIBXL__CHECKPOINT_DEVICE_DISK,
+                                                &cds->disks[i]);
     }
 
-    remus_devices_setup(egc, rds);
+    checkpoint_devices_setup(egc, cds);
 
     return;
 
 out:
-    rds->callback(egc, rds, rc);
+    cds->callback(egc, cds, rc);
 }
 
-static void remus_devices_setup(libxl__egc *egc,
-                                libxl__remus_devices_state *rds)
+static void checkpoint_devices_setup(libxl__egc *egc,
+                                libxl__checkpoint_devices_state *cds)
 {
     int i, rc;
-    libxl__remus_device *dev;
+    libxl__checkpoint_device *dev;
 
-    STATE_AO_GC(rds->ao);
+    STATE_AO_GC(cds->ao);
 
-    libxl__multidev_begin(ao, &rds->multidev);
-    rds->multidev.callback = devices_setup_cb;
-    for (i = 0; i < rds->num_devices; i++) {
-        dev = rds->dev[i];
+    libxl__multidev_begin(ao, &cds->multidev);
+    cds->multidev.callback = devices_setup_cb;
+    for (i = 0; i < cds->num_devices; i++) {
+        dev = cds->dev[i];
         if (dev->set_up)
             continue;
 
@@ -149,18 +149,18 @@ static void remus_devices_setup(libxl__egc *egc,
         do {
             dev->ops = remus_ops[++dev->ops_index];
             if (!dev->ops) {
-                rc = ERROR_REMUS_DEVICE_NOT_SUPPORTED;
+                rc = ERROR_CHECKPOINT_DEVICE_NOT_SUPPORTED;
                 goto out;
             }
         } while (dev->ops->kind != dev->kind);
 
-        libxl__multidev_prepare_with_aodev(&rds->multidev, &dev->aodev);
+        libxl__multidev_prepare_with_aodev(&cds->multidev, &dev->aodev);
         dev->ops->setup(dev);
     }
 
     rc = 0;
 out:
-    libxl__multidev_prepared(egc, &rds->multidev, rc);
+    libxl__multidev_prepared(egc, &cds->multidev, rc);
 }
 
 static void devices_setup_cb(libxl__egc *egc,
@@ -168,55 +168,55 @@ static void devices_setup_cb(libxl__egc *egc,
                              int rc)
 {
     int i;
-    libxl__remus_device *dev;
+    libxl__checkpoint_device *dev;
 
     STATE_AO_GC(multidev->ao);
 
     /* Convenience aliases */
-    libxl__remus_devices_state *const rds =
-                            CONTAINER_OF(multidev, *rds, multidev);
+    libxl__checkpoint_devices_state *const cds =
+                            CONTAINER_OF(multidev, *cds, multidev);
 
-    /* find the error that was not ERROR_REMUS_DEVOPS_DOES_NOT_MATCH */
-    for (i = 0; i < rds->num_devices; i++) {
-        dev = rds->dev[i];
+    /* find the error that was not ERROR_CHECKPOINT_DEVOPS_DOES_NOT_MATCH */
+    for (i = 0; i < cds->num_devices; i++) {
+        dev = cds->dev[i];
 
-        if (!dev->aodev.rc || dev->aodev.rc == ERROR_REMUS_DEVOPS_DOES_NOT_MATCH)
+        if (!dev->aodev.rc || dev->aodev.rc == ERROR_CHECKPOINT_DEVOPS_DOES_NOT_MATCH)
             continue;
 
         rc = dev->aodev.rc;
         goto out;
     }
 
-    /* if the error is still ERROR_REMUS_DEVOPS_DOES_NOT_MATCH, begin next iter */
-    if (rc == ERROR_REMUS_DEVOPS_DOES_NOT_MATCH) {
-        remus_devices_setup(egc, rds);
+    /* if the error is still ERROR_CHECKPOINT_DEVOPS_DOES_NOT_MATCH, begin next iter */
+    if (rc == ERROR_CHECKPOINT_DEVOPS_DOES_NOT_MATCH) {
+        checkpoint_devices_setup(egc, cds);
         return;
     }
 
 out:
-    rds->callback(egc, rds, rc);
+    cds->callback(egc, cds, rc);
 }
 
-void libxl__remus_devices_teardown(libxl__egc *egc,
-                                   libxl__remus_devices_state *rds)
+void libxl__checkpoint_devices_teardown(libxl__egc *egc,
+                                   libxl__checkpoint_devices_state *cds)
 {
     int i;
-    libxl__remus_device *dev;
+    libxl__checkpoint_device *dev;
 
-    STATE_AO_GC(rds->ao);
+    STATE_AO_GC(cds->ao);
 
-    libxl__multidev_begin(ao, &rds->multidev);
-    rds->multidev.callback = devices_teardown_cb;
-    for (i = 0; i < rds->num_devices; i++) {
-        dev = rds->dev[i];
+    libxl__multidev_begin(ao, &cds->multidev);
+    cds->multidev.callback = devices_teardown_cb;
+    for (i = 0; i < cds->num_devices; i++) {
+        dev = cds->dev[i];
         if (!dev->ops || !dev->set_up)
             continue;
 
-        libxl__multidev_prepare_with_aodev(&rds->multidev, &dev->aodev);
+        libxl__multidev_prepare_with_aodev(&cds->multidev, &dev->aodev);
         dev->ops->teardown(dev);
     }
 
-    libxl__multidev_prepared(egc, &rds->multidev, 0);
+    libxl__multidev_prepared(egc, &cds->multidev, 0);
 }
 
 static void devices_teardown_cb(libxl__egc *egc,
@@ -228,26 +228,26 @@ static void devices_teardown_cb(libxl__egc *egc,
     STATE_AO_GC(multidev->ao);
 
     /* Convenience aliases */
-    libxl__remus_devices_state *const rds =
-                            CONTAINER_OF(multidev, *rds, multidev);
+    libxl__checkpoint_devices_state *const cds =
+                            CONTAINER_OF(multidev, *cds, multidev);
 
     /* clean nic */
-    for (i = 0; i < rds->num_nics; i++)
-        libxl_device_nic_dispose(&rds->nics[i]);
-    free(rds->nics);
-    rds->nics = NULL;
-    rds->num_nics = 0;
+    for (i = 0; i < cds->num_nics; i++)
+        libxl_device_nic_dispose(&cds->nics[i]);
+    free(cds->nics);
+    cds->nics = NULL;
+    cds->num_nics = 0;
 
     /* clean disk */
-    for (i = 0; i < rds->num_disks; i++)
-        libxl_device_disk_dispose(&rds->disks[i]);
-    free(rds->disks);
-    rds->disks = NULL;
-    rds->num_disks = 0;
+    for (i = 0; i < cds->num_disks; i++)
+        libxl_device_disk_dispose(&cds->disks[i]);
+    free(cds->disks);
+    cds->disks = NULL;
+    cds->num_disks = 0;
 
-    cleanup_device_subkind(rds);
+    cleanup_device_subkind(cds);
 
-    rds->callback(egc, rds, rc);
+    cds->callback(egc, cds, rc);
 }
 
 /*----- checkpointing APIs -----*/
@@ -260,33 +260,33 @@ static void devices_checkpoint_cb(libxl__egc *egc,
 
 /* API implementations */
 
-#define define_remus_checkpoint_api(api)                                \
-void libxl__remus_devices_##api(libxl__egc *egc,                        \
-                                libxl__remus_devices_state *rds)        \
+#define define_checkpoint_api(api)                                \
+void libxl__checkpoint_devices_##api(libxl__egc *egc,                        \
+                                libxl__checkpoint_devices_state *cds)        \
 {                                                                       \
     int i;                                                              \
-    libxl__remus_device *dev;                                           \
+    libxl__checkpoint_device *dev;                                           \
                                                                         \
-    STATE_AO_GC(rds->ao);                                               \
+    STATE_AO_GC(cds->ao);                                               \
                                                                         \
-    libxl__multidev_begin(ao, &rds->multidev);                          \
-    rds->multidev.callback = devices_checkpoint_cb;                     \
-    for (i = 0; i < rds->num_devices; i++) {                            \
-        dev = rds->dev[i];                                              \
+    libxl__multidev_begin(ao, &cds->multidev);                          \
+    cds->multidev.callback = devices_checkpoint_cb;                     \
+    for (i = 0; i < cds->num_devices; i++) {                            \
+        dev = cds->dev[i];                                              \
         if (!dev->set_up || !dev->ops->api)                             \
             continue;                                                   \
-        libxl__multidev_prepare_with_aodev(&rds->multidev, &dev->aodev);\
+        libxl__multidev_prepare_with_aodev(&cds->multidev, &dev->aodev);\
         dev->ops->api(dev);                                             \
     }                                                                   \
                                                                         \
-    libxl__multidev_prepared(egc, &rds->multidev, 0);                   \
+    libxl__multidev_prepared(egc, &cds->multidev, 0);                   \
 }
 
-define_remus_checkpoint_api(postsuspend);
+define_checkpoint_api(postsuspend);
 
-define_remus_checkpoint_api(preresume);
+define_checkpoint_api(preresume);
 
-define_remus_checkpoint_api(commit);
+define_checkpoint_api(commit);
 
 static void devices_checkpoint_cb(libxl__egc *egc,
                                   libxl__multidev *multidev,
@@ -295,8 +295,8 @@ static void devices_checkpoint_cb(libxl__egc *egc,
     STATE_AO_GC(multidev->ao);
 
     /* Convenience aliases */
-    libxl__remus_devices_state *const rds =
-                            CONTAINER_OF(multidev, *rds, multidev);
+    libxl__checkpoint_devices_state *const cds =
+                            CONTAINER_OF(multidev, *cds, multidev);
 
-    rds->callback(egc, rds, rc);
+    cds->callback(egc, cds, rc);
 }
diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
index 0a615e8..b3b726c 100644
--- a/tools/libxl/libxl_internal.h
+++ b/tools/libxl/libxl_internal.h
@@ -2517,9 +2517,9 @@ typedef struct libxl__save_helper_state {
                       * marshalling and xc callback functions */
 } libxl__save_helper_state;
 
-/*----- remus device related state structure -----*/
+/*----- checkpoint device related state structure -----*/
 /*
- * The abstract Remus device layer exposes a common
+ * The abstract checkpoint device layer exposes a common
  * set of API to [external] libxl for manipulating devices attached to
  * a guest protected by Remus. The device layer also exposes a set of
  * [internal] interfaces that every device type must implement.
@@ -2527,39 +2527,39 @@ typedef struct libxl__save_helper_state {
  * The following API are exposed to libxl:
  *
  * One-time configuration operations:
- *  +libxl__remus_devices_setup
+ *  +libxl__checkpoint_devices_setup
  *    > Enable output buffering for NICs, setup disk replication, etc.
- *  +libxl__remus_devices_teardown
+ *  +libxl__checkpoint_devices_teardown
  *    > Disable output buffering and disk replication; teardown any
  *       associated external setups like qdiscs for NICs.
  *
  * Operations executed every checkpoint (in order of invocation):
- *  +libxl__remus_devices_postsuspend
- *  +libxl__remus_devices_preresume
- *  +libxl__remus_devices_commit
+ *  +libxl__checkpoint_devices_postsuspend
+ *  +libxl__checkpoint_devices_preresume
+ *  +libxl__checkpoint_devices_commit
  *
  * Each device type needs to implement the interfaces specified in
- * the libxl__remus_device_subkind_ops if it wishes to support Remus.
+ * the libxl__checkpoint_device_subkind_ops if it wishes to support Remus.
  *
- * The high-level control flow through the Remus device layer is shown below:
+ * The high-level control flow through the checkpoint device layer is shown below:
  *
  * xl remus
  *  |->  libxl_domain_remus_start
- *    |-> libxl__remus_devices_setup
- *      |-> Per-checkpoint libxl__remus_devices_[postsuspend,preresume,commit]
+ *    |-> libxl__checkpoint_devices_setup
+ *      |-> Per-checkpoint libxl__checkpoint_devices_[postsuspend,preresume,commit]
  *        ...
  *        |-> On backup failure, network error or other internal errors:
- *            libxl__remus_devices_teardown
+ *            libxl__checkpoint_devices_teardown
  */
 
-typedef enum libxl__remus_device_kind {
-    LIBXL__REMUS_DEVICE_NIC  = (1 << 0),
-    LIBXL__REMUS_DEVICE_DISK = (1 << 1),
-} libxl__remus_device_kind;
+typedef enum libxl__checkpoint_device_kind {
+    LIBXL__CHECKPOINT_DEVICE_NIC  = (1 << 0),
+    LIBXL__CHECKPOINT_DEVICE_DISK = (1 << 1),
+} libxl__checkpoint_device_kind;
 
-typedef struct libxl__remus_device libxl__remus_device;
-typedef struct libxl__remus_devices_state libxl__remus_devices_state;
-typedef struct libxl__remus_device_subkind_ops libxl__remus_device_subkind_ops;
+typedef struct libxl__checkpoint_device libxl__checkpoint_device;
+typedef struct libxl__checkpoint_devices_state libxl__checkpoint_devices_state;
+typedef struct libxl__checkpoint_device_subkind_ops libxl__checkpoint_device_subkind_ops;
 
 /*
  * Interfaces to be implemented by every device type that wishes to
@@ -2569,17 +2569,17 @@ typedef struct libxl__remus_device_subkind_ops libxl__remus_device_subkind_ops;
  * synchronous and call dev->aodev.callback directly (as the last
  * thing they do).
  */
-struct libxl__remus_device_subkind_ops {
+struct libxl__checkpoint_device_subkind_ops {
     /* the device kind this ops belongs to... */
-    libxl__remus_device_kind kind;
+    libxl__checkpoint_device_kind kind;
 
     /*
      * init() and cleanup() relate to the subkind-specific state in
      * the libxl ctx, not to any specific device.
      * Synchronous. cleanup() cannot fail.
      */
-    int (*init)(libxl__remus_devices_state *rds);
-    void (*cleanup)(libxl__remus_devices_state *rds);
+    int (*init)(libxl__checkpoint_devices_state *cds);
+    void (*cleanup)(libxl__checkpoint_devices_state *cds);
 
     /*
      * Checkpoint operations. May be NULL, meaning the op is not
@@ -2588,12 +2588,12 @@ struct libxl__remus_device_subkind_ops {
      * Asynchronous.
      */
 
-    void (*postsuspend)(libxl__remus_device *dev);
-    void (*preresume)(libxl__remus_device *dev);
-    void (*commit)(libxl__remus_device *dev);
+    void (*postsuspend)(libxl__checkpoint_device *dev);
+    void (*preresume)(libxl__checkpoint_device *dev);
+    void (*commit)(libxl__checkpoint_device *dev);
 
     /*
-     * setup() and teardown() are refer to the actual remus device.
+     * setup() and teardown() are refer to the actual checkpoint device.
      * Asynchronous.
      * teardown is called even if setup fails.
      */
@@ -2602,40 +2602,40 @@ struct libxl__remus_device_subkind_ops {
      * device. If matched, the device will then be managed with this set of
      * subkind operations.
      * Yields 0 if the device successfully set up.
-     * REMUS_DEVOPS_DOES_NOT_MATCH if the ops does not match the device.
+     * CHECKPOINT_DEVOPS_DOES_NOT_MATCH if the ops does not match the device.
      * any other rc indicates failure.
      */
-    void (*setup)(libxl__remus_device *dev);
-    void (*teardown)(libxl__remus_device *dev);
+    void (*setup)(libxl__checkpoint_device *dev);
+    void (*teardown)(libxl__checkpoint_device *dev);
 };
 
-typedef void libxl__remus_callback(libxl__egc *,
-                                   libxl__remus_devices_state *, int rc);
+typedef void libxl__checkpoint_callback(libxl__egc *,
+                                   libxl__checkpoint_devices_state *, int rc);
 
 /*
- * State associated with a remus invocation, including parameters
- * passed to the remus abstract device layer by the remus
+ * State associated with a checkpoint invocation, including parameters
+ * passed to the checkpoint abstract device layer by the remus
  * save/restore machinery.
  */
-struct libxl__remus_devices_state {
-    /*---- must be set by caller of libxl__remus_device_(setup|teardown) ----*/
+struct libxl__checkpoint_devices_state {
+    /*---- must be set by caller of libxl__checkpoint_device_(setup|teardown) ----*/
 
     libxl__ao *ao;
     libxl__egc *egc;
     uint32_t domid;
-    libxl__remus_callback *callback;
+    libxl__checkpoint_callback *callback;
     int device_kind_flags;
 
     /*----- private for abstract layer only -----*/
 
     int num_devices;
     /*
-     * this array is allocated before setup the remus devices by the
-     * remus abstract layer.
+     * this array is allocated before setup the checkpoint devices by the
+     * checkpoint abstract layer.
      * the size of this array is 'num_devices', which is the total number
      * of libxl nic devices and disk devices(num_nics + num_disks).
      */
-    libxl__remus_device **dev;
+    libxl__checkpoint_device **dev;
 
     libxl_device_nic *nics;
     int num_nics;
@@ -2657,9 +2657,9 @@ struct libxl__remus_devices_state {
 
 /*
  * Information about a single device being handled by remus.
- * Allocated by the remus abstract layer.
+ * Allocated by the checkpoint abstract layer.
  */
-struct libxl__remus_device {
+struct libxl__checkpoint_device {
     /*----- shared between abstract and concrete layers -----*/
     /*
      * if this is true, that means the subkind ops matched the
@@ -2668,11 +2668,11 @@ struct libxl__remus_device {
      */
     int set_up;
 
-    /*----- set by remus device abstruct layer -----*/
-    /* libxl__device_* which this remus device related to */
+    /*----- set by checkpoint device abstract layer -----*/
+    /* libxl__device_* which this checkpoint device is related to */
     const void *backend_dev;
-    libxl__remus_device_kind kind;
-    libxl__remus_devices_state *rds;
+    libxl__checkpoint_device_kind kind;
+    libxl__checkpoint_devices_state *cds;
     libxl__ao_device aodev;
 
     /*----- private for abstract layer only -----*/
@@ -2683,7 +2683,7 @@ struct libxl__remus_device {
      * individual devices.
      */
     int ops_index;
-    const libxl__remus_device_subkind_ops *ops;
+    const libxl__checkpoint_device_subkind_ops *ops;
 
     /*----- private for concrete (device-specific) layer -----*/
 
@@ -2691,17 +2691,17 @@ struct libxl__remus_device {
     void *concrete_data;
 };
 
-/* the following 5 APIs are async ops, call rds->callback when done */
-_hidden void libxl__remus_devices_setup(libxl__egc *egc,
-                                        libxl__remus_devices_state *rds);
-_hidden void libxl__remus_devices_teardown(libxl__egc *egc,
-                                           libxl__remus_devices_state *rds);
-_hidden void libxl__remus_devices_postsuspend(libxl__egc *egc,
-                                              libxl__remus_devices_state *rds);
-_hidden void libxl__remus_devices_preresume(libxl__egc *egc,
-                                            libxl__remus_devices_state *rds);
-_hidden void libxl__remus_devices_commit(libxl__egc *egc,
-                                         libxl__remus_devices_state *rds);
+/* the following 5 APIs are async ops, call cds->callback when done */
+_hidden void libxl__checkpoint_devices_setup(libxl__egc *egc,
+                                        libxl__checkpoint_devices_state *cds);
+_hidden void libxl__checkpoint_devices_teardown(libxl__egc *egc,
+                                           libxl__checkpoint_devices_state *cds);
+_hidden void libxl__checkpoint_devices_postsuspend(libxl__egc *egc,
+                                              libxl__checkpoint_devices_state *cds);
+_hidden void libxl__checkpoint_devices_preresume(libxl__egc *egc,
+                                            libxl__checkpoint_devices_state *cds);
+_hidden void libxl__checkpoint_devices_commit(libxl__egc *egc,
+                                         libxl__checkpoint_devices_state *cds);
 _hidden int libxl__netbuffer_enabled(libxl__gc *gc);
 
 /*----- Domain suspend (save) state structure -----*/
@@ -2765,7 +2765,7 @@ struct libxl__domain_suspend_state {
     libxl__domain_suspend_state2 dss2;
     int hvm;
     int xcflags;
-    libxl__remus_devices_state rds;
+    libxl__checkpoint_devices_state cds;
     libxl__ev_time checkpoint_timeout; /* used for Remus checkpoint */
     int interval; /* checkpoint interval (for Remus) */
     libxl__save_helper_state shs;
diff --git a/tools/libxl/libxl_netbuffer.c b/tools/libxl/libxl_netbuffer.c
index e1d02af..385922f 100644
--- a/tools/libxl/libxl_netbuffer.c
+++ b/tools/libxl/libxl_netbuffer.c
@@ -40,21 +40,21 @@ int libxl__netbuffer_enabled(libxl__gc *gc)
 
 /*----- init() and cleanup() -----*/
 
-static int nic_init(libxl__remus_devices_state *rds)
+static int nic_init(libxl__checkpoint_devices_state *cds)
 {
     int rc, ret;
-    libxl__domain_suspend_state *dss = CONTAINER_OF(rds, *dss, rds);
+    libxl__domain_suspend_state *dss = CONTAINER_OF(cds, *dss, cds);
 
-    STATE_AO_GC(rds->ao);
+    STATE_AO_GC(cds->ao);
 
-    rds->nlsock = nl_socket_alloc();
-    if (!rds->nlsock) {
+    cds->nlsock = nl_socket_alloc();
+    if (!cds->nlsock) {
         LOG(ERROR, "cannot allocate nl socket");
         rc = ERROR_FAIL;
         goto out;
     }
 
-    ret = nl_connect(rds->nlsock, NETLINK_ROUTE);
+    ret = nl_connect(cds->nlsock, NETLINK_ROUTE);
     if (ret) {
         LOG(ERROR, "failed to open netlink socket: %s",
             nl_geterror(ret));
@@ -63,7 +63,7 @@ static int nic_init(libxl__remus_devices_state *rds)
     }
 
     /* get list of all qdiscs installed on network devs. */
-    ret = rtnl_qdisc_alloc_cache(rds->nlsock, &rds->qdisc_cache);
+    ret = rtnl_qdisc_alloc_cache(cds->nlsock, &cds->qdisc_cache);
     if (ret) {
         LOG(ERROR, "failed to allocate qdisc cache: %s",
             nl_geterror(ret));
@@ -72,9 +72,9 @@ static int nic_init(libxl__remus_devices_state *rds)
     }
 
     if (dss->remus->netbufscript) {
-        rds->netbufscript = libxl__strdup(gc, dss->remus->netbufscript);
+        cds->netbufscript = libxl__strdup(gc, dss->remus->netbufscript);
     } else {
-        rds->netbufscript = GCSPRINTF("%s/remus-netbuf-setup",
+        cds->netbufscript = GCSPRINTF("%s/remus-netbuf-setup",
                                       libxl__xen_script_dir_path());
     }
 
@@ -84,22 +84,22 @@ out:
     return rc;
 }
 
-static void nic_cleanup(libxl__remus_devices_state *rds)
+static void nic_cleanup(libxl__checkpoint_devices_state *cds)
 {
-    STATE_AO_GC(rds->ao);
+    STATE_AO_GC(cds->ao);
 
     /* free qdisc cache */
-    if (rds->qdisc_cache) {
-        nl_cache_clear(rds->qdisc_cache);
-        nl_cache_free(rds->qdisc_cache);
-        rds->qdisc_cache = NULL;
+    if (cds->qdisc_cache) {
+        nl_cache_clear(cds->qdisc_cache);
+        nl_cache_free(cds->qdisc_cache);
+        cds->qdisc_cache = NULL;
     }
 
     /* close & free nlsock */
-    if (rds->nlsock) {
-        nl_close(rds->nlsock);
-        nl_socket_free(rds->nlsock);
-        rds->nlsock = NULL;
+    if (cds->nlsock) {
+        nl_close(cds->nlsock);
+        nl_socket_free(cds->nlsock);
+        cds->nlsock = NULL;
     }
 }
 
@@ -113,17 +113,17 @@ static void nic_cleanup(libxl__remus_devices_state *rds)
  * it must ONLY be used for remus because if driver domains
  * were in use it would constitute a security vulnerability.
  */
-static const char *get_vifname(libxl__remus_device *dev,
+static const char *get_vifname(libxl__checkpoint_device *dev,
                                const libxl_device_nic *nic)
 {
     const char *vifname = NULL;
     const char *path;
     int rc;
 
-    STATE_AO_GC(dev->rds->ao);
+    STATE_AO_GC(dev->cds->ao);
 
     /* Convenience aliases */
-    const uint32_t domid = dev->rds->domid;
+    const uint32_t domid = dev->cds->domid;
 
     path = GCSPRINTF("%s/backend/vif/%d/%d/vifname",
                      libxl__xs_get_dompath(gc, 0), domid, nic->devid);
@@ -146,19 +146,19 @@ static void free_qdisc(libxl__remus_device_nic *remus_nic)
     remus_nic->qdisc = NULL;
 }
 
-static int init_qdisc(libxl__remus_devices_state *rds,
+static int init_qdisc(libxl__checkpoint_devices_state *cds,
                       libxl__remus_device_nic *remus_nic)
 {
     int rc, ret, ifindex;
     struct rtnl_link *ifb = NULL;
     struct rtnl_qdisc *qdisc = NULL;
 
-    STATE_AO_GC(rds->ao);
+    STATE_AO_GC(cds->ao);
 
     /* Now that we have brought up REMUS_IFB device with plug qdisc for
      * this vif, so we need to refill the qdisc cache.
      */
-    ret = nl_cache_refill(rds->nlsock, rds->qdisc_cache);
+    ret = nl_cache_refill(cds->nlsock, cds->qdisc_cache);
     if (ret) {
         LOG(ERROR, "cannot refill qdisc cache: %s", nl_geterror(ret));
         rc = ERROR_FAIL;
@@ -166,7 +166,7 @@ static int init_qdisc(libxl__remus_devices_state *rds,
     }
 
     /* get a handle to the REMUS_IFB interface */
-    ret = rtnl_link_get_kernel(rds->nlsock, 0, remus_nic->ifb, &ifb);
+    ret = rtnl_link_get_kernel(cds->nlsock, 0, remus_nic->ifb, &ifb);
     if (ret) {
         LOG(ERROR, "cannot obtain handle for %s: %s", remus_nic->ifb,
             nl_geterror(ret));
@@ -189,7 +189,7 @@ static int init_qdisc(libxl__remus_devices_state *rds,
      * There is no need to explicitly free this qdisc as its just a
      * reference from the qdisc cache we allocated earlier.
      */
-    qdisc = rtnl_qdisc_get_by_parent(rds->qdisc_cache, ifindex, TC_H_ROOT);
+    qdisc = rtnl_qdisc_get_by_parent(cds->qdisc_cache, ifindex, TC_H_ROOT);
     if (qdisc) {
         const char *tc_kind = rtnl_tc_get_kind(TC_CAST(qdisc));
         /* Sanity check: Ensure that the root qdisc is a plug qdisc. */
@@ -233,19 +233,19 @@ static void netbuf_teardown_script_cb(libxl__egc *egc,
  * $REMUS_IFB (for teardown)
  * setup/teardown as command line arg.
  */
-static void setup_async_exec(libxl__remus_device *dev, char *op)
+static void setup_async_exec(libxl__checkpoint_device *dev, char *op)
 {
     int arraysize, nr = 0;
     char **env = NULL, **args = NULL;
     libxl__remus_device_nic *remus_nic = dev->concrete_data;
-    libxl__remus_devices_state *rds = dev->rds;
+    libxl__checkpoint_devices_state *cds = dev->cds;
     libxl__async_exec_state *aes = &dev->aodev.aes;
 
-    STATE_AO_GC(rds->ao);
+    STATE_AO_GC(cds->ao);
 
     /* Convenience aliases */
-    char *const script = libxl__strdup(gc, rds->netbufscript);
-    const uint32_t domid = rds->domid;
+    char *const script = libxl__strdup(gc, cds->netbufscript);
+    const uint32_t domid = cds->domid;
     const int dev_id = remus_nic->devid;
     const char *const vif = remus_nic->vif;
     const char *const ifb = remus_nic->ifb;
@@ -271,7 +271,7 @@ static void setup_async_exec(libxl__remus_device *dev, char *op)
     args[nr++] = NULL;
     assert(nr == arraysize);
 
-    aes->ao = dev->rds->ao;
+    aes->ao = dev->cds->ao;
     aes->what = GCSPRINTF("%s %s", args[0], args[1]);
     aes->env = env;
     aes->args = args;
@@ -288,13 +288,13 @@ static void setup_async_exec(libxl__remus_device *dev, char *op)
 
 /* setup() and teardown() */
 
-static void nic_setup(libxl__remus_device *dev)
+static void nic_setup(libxl__checkpoint_device *dev)
 {
     int rc;
     libxl__remus_device_nic *remus_nic;
     const libxl_device_nic *nic = dev->backend_dev;
 
-    STATE_AO_GC(dev->rds->ao);
+    STATE_AO_GC(dev->cds->ao);
 
     /*
      * thers's no subkind of nic devices, so nic ops is always matched
@@ -320,7 +320,7 @@ static void nic_setup(libxl__remus_device *dev)
 
 out:
     dev->aodev.rc = rc;
-    dev->aodev.callback(dev->rds->egc, &dev->aodev);
+    dev->aodev.callback(dev->cds->egc, &dev->aodev);
 }
 
 /*
@@ -332,16 +332,16 @@ static void netbuf_setup_script_cb(libxl__egc *egc,
                                    int status)
 {
     libxl__ao_device *aodev = CONTAINER_OF(aes, *aodev, aes);
-    libxl__remus_device *dev = CONTAINER_OF(aodev, *dev, aodev);
+    libxl__checkpoint_device *dev = CONTAINER_OF(aodev, *dev, aodev);
     libxl__remus_device_nic *remus_nic = dev->concrete_data;
-    libxl__remus_devices_state *rds = dev->rds;
+    libxl__checkpoint_devices_state *cds = dev->cds;
     const char *out_path_base, *hotplug_error = NULL;
     int rc;
 
-    STATE_AO_GC(rds->ao);
+    STATE_AO_GC(cds->ao);
 
     /* Convenience aliases */
-    const uint32_t domid = rds->domid;
+    const uint32_t domid = cds->domid;
     const int devid = remus_nic->devid;
     const char *const vif = remus_nic->vif;
     const char **const ifb = &remus_nic->ifb;
@@ -375,7 +375,7 @@ static void netbuf_setup_script_cb(libxl__egc *egc,
 
     if (hotplug_error) {
         LOG(ERROR, "netbuf script %s setup failed for vif %s: %s",
-            rds->netbufscript, vif, hotplug_error);
+            cds->netbufscript, vif, hotplug_error);
         rc = ERROR_FAIL;
         goto out;
     }
@@ -386,17 +386,17 @@ static void netbuf_setup_script_cb(libxl__egc *egc,
     }
 
     LOG(DEBUG, "%s will buffer packets from vif %s", *ifb, vif);
-    rc = init_qdisc(rds, remus_nic);
+    rc = init_qdisc(cds, remus_nic);
 
 out:
     aodev->rc = rc;
     aodev->callback(egc, aodev);
 }
 
-static void nic_teardown(libxl__remus_device *dev)
+static void nic_teardown(libxl__checkpoint_device *dev)
 {
     int rc;
-    STATE_AO_GC(dev->rds->ao);
+    STATE_AO_GC(dev->cds->ao);
 
     setup_async_exec(dev, "teardown");
 
@@ -408,7 +408,7 @@ static void nic_teardown(libxl__remus_device *dev)
 
 out:
     dev->aodev.rc = rc;
-    dev->aodev.callback(dev->rds->egc, &dev->aodev);
+    dev->aodev.callback(dev->cds->egc, &dev->aodev);
 }
 
 static void netbuf_teardown_script_cb(libxl__egc *egc,
@@ -417,7 +417,7 @@ static void netbuf_teardown_script_cb(libxl__egc *egc,
 {
     int rc;
     libxl__ao_device *aodev = CONTAINER_OF(aes, *aodev, aes);
-    libxl__remus_device *dev = CONTAINER_OF(aodev, *dev, aodev);
+    libxl__checkpoint_device *dev = CONTAINER_OF(aodev, *dev, aodev);
     libxl__remus_device_nic *remus_nic = dev->concrete_data;
 
     if (status)
@@ -442,12 +442,12 @@ enum {
 /* API implementations */
 
 static int remus_netbuf_op(libxl__remus_device_nic *remus_nic,
-                           libxl__remus_devices_state *rds,
+                           libxl__checkpoint_devices_state *cds,
                            int buffer_op)
 {
     int rc, ret;
 
-    STATE_AO_GC(rds->ao);
+    STATE_AO_GC(cds->ao);
 
     if (buffer_op == tc_buffer_start)
         ret = rtnl_qdisc_plug_buffer(remus_nic->qdisc);
@@ -459,7 +459,7 @@ static int remus_netbuf_op(libxl__remus_device_nic *remus_nic,
         goto out;
     }
 
-    ret = rtnl_qdisc_add(rds->nlsock, remus_nic->qdisc, NLM_F_REQUEST);
+    ret = rtnl_qdisc_add(cds->nlsock, remus_nic->qdisc, NLM_F_REQUEST);
     if (ret) {
         rc = ERROR_FAIL;
         goto out;
@@ -476,34 +476,34 @@ out:
     return rc;
 }
 
-static void nic_postsuspend(libxl__remus_device *dev)
+static void nic_postsuspend(libxl__checkpoint_device *dev)
 {
     int rc;
     libxl__remus_device_nic *remus_nic = dev->concrete_data;
 
-    STATE_AO_GC(dev->rds->ao);
+    STATE_AO_GC(dev->cds->ao);
 
-    rc = remus_netbuf_op(remus_nic, dev->rds, tc_buffer_start);
+    rc = remus_netbuf_op(remus_nic, dev->cds, tc_buffer_start);
 
     dev->aodev.rc = rc;
-    dev->aodev.callback(dev->rds->egc, &dev->aodev);
+    dev->aodev.callback(dev->cds->egc, &dev->aodev);
 }
 
-static void nic_commit(libxl__remus_device *dev)
+static void nic_commit(libxl__checkpoint_device *dev)
 {
     int rc;
     libxl__remus_device_nic *remus_nic = dev->concrete_data;
 
-    STATE_AO_GC(dev->rds->ao);
+    STATE_AO_GC(dev->cds->ao);
 
-    rc = remus_netbuf_op(remus_nic, dev->rds, tc_buffer_release);
+    rc = remus_netbuf_op(remus_nic, dev->cds, tc_buffer_release);
 
     dev->aodev.rc = rc;
-    dev->aodev.callback(dev->rds->egc, &dev->aodev);
+    dev->aodev.callback(dev->cds->egc, &dev->aodev);
 }
 
-const libxl__remus_device_subkind_ops remus_device_nic = {
-    .kind = LIBXL__REMUS_DEVICE_NIC,
+const libxl__checkpoint_device_subkind_ops remus_device_nic = {
+    .kind = LIBXL__CHECKPOINT_DEVICE_NIC,
     .init = nic_init,
     .cleanup = nic_cleanup,
     .setup = nic_setup,
diff --git a/tools/libxl/libxl_nonetbuffer.c b/tools/libxl/libxl_nonetbuffer.c
index 28a8326..8380952 100644
--- a/tools/libxl/libxl_nonetbuffer.c
+++ b/tools/libxl/libxl_nonetbuffer.c
@@ -22,26 +22,26 @@ int libxl__netbuffer_enabled(libxl__gc *gc)
     return 0;
 }
 
-static void nic_setup(libxl__remus_device *dev)
+static void nic_setup(libxl__checkpoint_device *dev)
 {
-    STATE_AO_GC(dev->rds->ao);
+    STATE_AO_GC(dev->cds->ao);
 
     dev->aodev.rc = ERROR_FAIL;
-    dev->aodev.callback(dev->rds->egc, &dev->aodev);
+    dev->aodev.callback(dev->cds->egc, &dev->aodev);
 }
 
-static int nic_init(libxl__remus_devices_state *rds)
+static int nic_init(libxl__checkpoint_devices_state *cds)
 {
     return 0;
 }
 
-static void nic_cleanup(libxl__remus_devices_state *rds)
+static void nic_cleanup(libxl__checkpoint_devices_state *cds)
 {
     return;
 }
 
-const libxl__remus_device_subkind_ops remus_device_nic = {
-    .kind = LIBXL__REMUS_DEVICE_NIC,
+const libxl__checkpoint_device_subkind_ops remus_device_nic = {
+    .kind = LIBXL__CHECKPOINT_DEVICE_NIC,
     .init = nic_init,
     .cleanup = nic_cleanup,
     .setup = nic_setup,
diff --git a/tools/libxl/libxl_remus.c b/tools/libxl/libxl_remus.c
index 9747a13..a04c05e 100644
--- a/tools/libxl/libxl_remus.c
+++ b/tools/libxl/libxl_remus.c
@@ -21,15 +21,15 @@
 
 /*----- remus: setup the environment -----*/
 static void libxl__remus_setup_done(libxl__egc *egc,
-                                    libxl__remus_devices_state *rds, int rc);
+                                    libxl__checkpoint_devices_state *cds, int rc);
 static void libxl__remus_setup_failed(libxl__egc *egc,
-                                      libxl__remus_devices_state *rds, int rc);
+                                      libxl__checkpoint_devices_state *cds, int rc);
 
 void libxl__remus_setup(libxl__egc *egc,
                         libxl__domain_suspend_state *dss)
 {
     /* Convenience aliases */
-    libxl__remus_devices_state *const rds = &dss->rds;
+    libxl__checkpoint_devices_state *const cds = &dss->cds;
     const libxl_domain_remus_info *const info = dss->remus;
 
     STATE_AO_GC(dss->ao);
@@ -39,28 +39,28 @@ void libxl__remus_setup(libxl__egc *egc,
             LOG(ERROR, "Remus: No support for network buffering");
             goto out;
         }
-        rds->device_kind_flags |= LIBXL__REMUS_DEVICE_NIC;
+        cds->device_kind_flags |= LIBXL__CHECKPOINT_DEVICE_NIC;
     }
 
     if (info->diskbuf)
-        rds->device_kind_flags |= LIBXL__REMUS_DEVICE_DISK;
+        cds->device_kind_flags |= LIBXL__CHECKPOINT_DEVICE_DISK;
 
-    rds->ao = ao;
-    rds->egc = egc;
-    rds->domid = dss->domid;
-    rds->callback = libxl__remus_setup_done;
+    cds->ao = ao;
+    cds->egc = egc;
+    cds->domid = dss->domid;
+    cds->callback = libxl__remus_setup_done;
 
-    libxl__remus_devices_setup(egc, rds);
+    libxl__checkpoint_devices_setup(egc, cds);
     return;
 
 out:
-    libxl__remus_setup_failed(egc, rds, ERROR_FAIL);
+    libxl__remus_setup_failed(egc, cds, ERROR_FAIL);
 }
 
 static void libxl__remus_setup_done(libxl__egc *egc,
-                                    libxl__remus_devices_state *rds, int rc)
+                                    libxl__checkpoint_devices_state *cds, int rc)
 {
-    libxl__domain_suspend_state *dss = CONTAINER_OF(rds, *dss, rds);
+    libxl__domain_suspend_state *dss = CONTAINER_OF(cds, *dss, cds);
     STATE_AO_GC(dss->ao);
 
     if (!rc) {
@@ -70,15 +70,15 @@ static void libxl__remus_setup_done(libxl__egc *egc,
 
     LOG(ERROR, "Remus: failed to setup device for guest with domid %u, rc %d",
         dss->domid, rc);
-    rds->callback = libxl__remus_setup_failed;
-    libxl__remus_devices_teardown(egc, rds);
+    cds->callback = libxl__remus_setup_failed;
+    libxl__checkpoint_devices_teardown(egc, cds);
 }
 
 static void libxl__remus_setup_failed(libxl__egc *egc,
-                                      libxl__remus_devices_state *rds,
+                                      libxl__checkpoint_devices_state *cds,
                                       int rc)
 {
-    libxl__domain_suspend_state *dss = CONTAINER_OF(rds, *dss, rds);
+    libxl__domain_suspend_state *dss = CONTAINER_OF(cds, *dss, cds);
     STATE_AO_GC(dss->ao);
 
     if (rc)
@@ -91,7 +91,7 @@ static void libxl__remus_setup_failed(libxl__egc *egc,
 
 /*----- remus: teardown the environment -----*/
 static void remus_teardown_done(libxl__egc *egc,
-                                libxl__remus_devices_state *rds,
+                                libxl__checkpoint_devices_state *cds,
                                 int rc);
 
 void libxl__remus_teardown(libxl__egc *egc,
@@ -108,15 +108,15 @@ void libxl__remus_teardown(libxl__egc *egc,
      */
     LOG(WARN, "Remus: Domain suspend terminated with rc %d,"
         " teardown Remus devices...", rc);
-    dss->rds.callback = remus_teardown_done;
-    libxl__remus_devices_teardown(egc, &dss->rds);
+    dss->cds.callback = remus_teardown_done;
+    libxl__checkpoint_devices_teardown(egc, &dss->cds);
 }
 
 static void remus_teardown_done(libxl__egc *egc,
-                                libxl__remus_devices_state *rds,
+                                libxl__checkpoint_devices_state *cds,
                                 int rc)
 {
-    libxl__domain_suspend_state *dss = CONTAINER_OF(rds, *dss, rds);
+    libxl__domain_suspend_state *dss = CONTAINER_OF(cds, *dss, cds);
     STATE_AO_GC(dss->ao);
 
     if (rc)
@@ -131,7 +131,7 @@ static void remus_teardown_done(libxl__egc *egc,
 static void remus_domain_suspend_callback_common_done(libxl__egc *egc,
                                 libxl__domain_suspend_state2 *dss2, int ok);
 static void remus_devices_postsuspend_cb(libxl__egc *egc,
-                                         libxl__remus_devices_state *rds,
+                                         libxl__checkpoint_devices_state *cds,
                                          int rc);
 
 void libxl__remus_domain_suspend_callback(void *data)
@@ -155,9 +155,9 @@ static void remus_domain_suspend_callback_common_done(libxl__egc *egc,
     if (!ok)
         goto out;
 
-    libxl__remus_devices_state *const rds = &dss->rds;
-    rds->callback = remus_devices_postsuspend_cb;
-    libxl__remus_devices_postsuspend(egc, rds);
+    libxl__checkpoint_devices_state *const cds = &dss->cds;
+    cds->callback = remus_devices_postsuspend_cb;
+    libxl__checkpoint_devices_postsuspend(egc, cds);
     return;
 
 out:
@@ -165,11 +165,11 @@ out:
 }
 
 static void remus_devices_postsuspend_cb(libxl__egc *egc,
-                                         libxl__remus_devices_state *rds,
+                                         libxl__checkpoint_devices_state *cds,
                                          int rc)
 {
     int ok = 0;
-    libxl__domain_suspend_state *dss = CONTAINER_OF(rds, *dss, rds);
+    libxl__domain_suspend_state *dss = CONTAINER_OF(cds, *dss, cds);
 
     if (rc)
         goto out;
@@ -183,7 +183,7 @@ out:
 
 /*----- remus: resume the guest -----*/
 static void remus_devices_preresume_cb(libxl__egc *egc,
-                                       libxl__remus_devices_state *rds,
+                                       libxl__checkpoint_devices_state *cds,
                                        int rc);
 
 void libxl__remus_domain_resume_callback(void *data)
@@ -193,17 +193,17 @@ void libxl__remus_domain_resume_callback(void *data)
     libxl__domain_suspend_state *dss = CONTAINER_OF(shs, *dss, shs);
     STATE_AO_GC(dss->ao);
 
-    libxl__remus_devices_state *const rds = &dss->rds;
-    rds->callback = remus_devices_preresume_cb;
-    libxl__remus_devices_preresume(egc, rds);
+    libxl__checkpoint_devices_state *const cds = &dss->cds;
+    cds->callback = remus_devices_preresume_cb;
+    libxl__checkpoint_devices_preresume(egc, cds);
 }
 
 static void remus_devices_preresume_cb(libxl__egc *egc,
-                                       libxl__remus_devices_state *rds,
+                                       libxl__checkpoint_devices_state *cds,
                                        int rc)
 {
     int ok = 0;
-    libxl__domain_suspend_state *dss = CONTAINER_OF(rds, *dss, rds);
+    libxl__domain_suspend_state *dss = CONTAINER_OF(cds, *dss, cds);
     STATE_AO_GC(dss->ao);
 
     if (rc)
@@ -222,7 +222,7 @@ out:
 static void remus_checkpoint_dm_saved(libxl__egc *egc,
                                       libxl__domain_suspend_state *dss, int rc);
 static void remus_devices_commit_cb(libxl__egc *egc,
-                                    libxl__remus_devices_state *rds,
+                                    libxl__checkpoint_devices_state *cds,
                                     int rc);
 static void remus_next_checkpoint(libxl__egc *egc, libxl__ev_time *ev,
                                   const struct timeval *requested_abs);
@@ -246,7 +246,7 @@ static void remus_checkpoint_dm_saved(libxl__egc *egc,
                                       libxl__domain_suspend_state *dss, int rc)
 {
     /* Convenience aliases */
-    libxl__remus_devices_state *const rds = &dss->rds;
+    libxl__checkpoint_devices_state *const cds = &dss->cds;
 
     STATE_AO_GC(dss->ao);
 
@@ -255,8 +255,8 @@ static void remus_checkpoint_dm_saved(libxl__egc *egc,
         goto out;
     }
 
-    rds->callback = remus_devices_commit_cb;
-    libxl__remus_devices_commit(egc, rds);
+    cds->callback = remus_devices_commit_cb;
+    libxl__checkpoint_devices_commit(egc, cds);
 
     return;
 
@@ -265,10 +265,10 @@ out:
 }
 
 static void remus_devices_commit_cb(libxl__egc *egc,
-                                    libxl__remus_devices_state *rds,
+                                    libxl__checkpoint_devices_state *cds,
                                     int rc)
 {
-    libxl__domain_suspend_state *dss = CONTAINER_OF(rds, *dss, rds);
+    libxl__domain_suspend_state *dss = CONTAINER_OF(cds, *dss, cds);
 
     STATE_AO_GC(dss->ao);
 
diff --git a/tools/libxl/libxl_remus_disk_drbd.c b/tools/libxl/libxl_remus_disk_drbd.c
index 59db54f..5928187 100644
--- a/tools/libxl/libxl_remus_disk_drbd.c
+++ b/tools/libxl/libxl_remus_disk_drbd.c
@@ -27,13 +27,13 @@ typedef struct libxl__remus_drbd_disk {
 } libxl__remus_drbd_disk;
 
 /*----- helper functions, for async calls -----*/
-static void drbd_async_call(libxl__remus_device *dev,
-                            void func(libxl__remus_device *),
+static void drbd_async_call(libxl__checkpoint_device *dev,
+                            void func(libxl__checkpoint_device *),
                             libxl__ev_child_callback callback)
 {
     int pid = -1, rc;
     libxl__ao_device *aodev = &dev->aodev;
-    STATE_AO_GC(dev->rds->ao);
+    STATE_AO_GC(dev->cds->ao);
 
     /* Fork and call */
     pid = libxl__ev_child_fork(gc, &aodev->child, callback);
@@ -54,21 +54,21 @@ static void drbd_async_call(libxl__remus_device *dev,
 
 out:
     aodev->rc = rc;
-    aodev->callback(dev->rds->egc, aodev);
+    aodev->callback(dev->cds->egc, aodev);
 }
 
 /*----- init() and cleanup() -----*/
-static int drbd_init(libxl__remus_devices_state *rds)
+static int drbd_init(libxl__checkpoint_devices_state *cds)
 {
-    STATE_AO_GC(rds->ao);
+    STATE_AO_GC(cds->ao);
 
-    rds->drbd_probe_script = GCSPRINTF("%s/block-drbd-probe",
+    cds->drbd_probe_script = GCSPRINTF("%s/block-drbd-probe",
                                        libxl__xen_script_dir_path());
 
     return 0;
 }
 
-static void drbd_cleanup(libxl__remus_devices_state *rds)
+static void drbd_cleanup(libxl__checkpoint_devices_state *cds)
 {
     return;
 }
@@ -82,21 +82,21 @@ static void match_async_exec_cb(libxl__egc *egc,
 
 /* implementations */
 
-static void match_async_exec(libxl__egc *egc, libxl__remus_device *dev);
+static void match_async_exec(libxl__egc *egc, libxl__checkpoint_device *dev);
 
-static void drbd_setup(libxl__remus_device *dev)
+static void drbd_setup(libxl__checkpoint_device *dev)
 {
-    STATE_AO_GC(dev->rds->ao);
+    STATE_AO_GC(dev->cds->ao);
 
-    match_async_exec(dev->rds->egc, dev);
+    match_async_exec(dev->cds->egc, dev);
 }
 
-static void match_async_exec(libxl__egc *egc, libxl__remus_device *dev)
+static void match_async_exec(libxl__egc *egc, libxl__checkpoint_device *dev)
 {
     int arraysize, nr = 0, rc;
     const libxl_device_disk *disk = dev->backend_dev;
     libxl__async_exec_state *aes = &dev->aodev.aes;
-    STATE_AO_GC(dev->rds->ao);
+    STATE_AO_GC(dev->cds->ao);
 
     /* setup env & args */
     arraysize = 1;
@@ -107,12 +107,12 @@ static void match_async_exec(libxl__egc *egc, libxl__remus_device *dev)
     arraysize = 3;
     nr = 0;
     GCNEW_ARRAY(aes->args, arraysize);
-    aes->args[nr++] = dev->rds->drbd_probe_script;
+    aes->args[nr++] = dev->cds->drbd_probe_script;
     aes->args[nr++] = disk->pdev_path;
     aes->args[nr++] = NULL;
     assert(nr <= arraysize);
 
-    aes->ao = dev->rds->ao;
+    aes->ao = dev->cds->ao;
     aes->what = GCSPRINTF("%s %s", aes->args[0], aes->args[1]);
     aes->timeout_ms = LIBXL_HOTPLUG_TIMEOUT * 1000;
     aes->callback = match_async_exec_cb;
@@ -137,14 +137,14 @@ static void match_async_exec_cb(libxl__egc *egc,
 {
     int rc;
     libxl__ao_device *aodev = CONTAINER_OF(aes, *aodev, aes);
-    libxl__remus_device *dev = CONTAINER_OF(aodev, *dev, aodev);
+    libxl__checkpoint_device *dev = CONTAINER_OF(aodev, *dev, aodev);
     libxl__remus_drbd_disk *drbd_disk;
     const libxl_device_disk *disk = dev->backend_dev;
 
     STATE_AO_GC(aodev->ao);
 
     if (status) {
-        rc = ERROR_REMUS_DEVOPS_DOES_NOT_MATCH;
+        rc = ERROR_CHECKPOINT_DEVOPS_DOES_NOT_MATCH;
         goto out;
     }
 
@@ -167,14 +167,14 @@ out:
     aodev->callback(egc, aodev);
 }
 
-static void drbd_teardown(libxl__remus_device *dev)
+static void drbd_teardown(libxl__checkpoint_device *dev)
 {
     libxl__remus_drbd_disk *drbd_disk = dev->concrete_data;
-    STATE_AO_GC(dev->rds->ao);
+    STATE_AO_GC(dev->cds->ao);
 
     close(drbd_disk->ctl_fd);
     dev->aodev.rc = 0;
-    dev->aodev.callback(dev->rds->egc, &dev->aodev);
+    dev->aodev.callback(dev->cds->egc, &dev->aodev);
 }
 
 /*----- checkpointing APIs -----*/
@@ -187,9 +187,9 @@ static void chekpoint_async_call_done(libxl__egc *egc,
 /* API implementations */
 
 /* this op will not wait and block, so implement as sync op */
-static void drbd_postsuspend(libxl__remus_device *dev)
+static void drbd_postsuspend(libxl__checkpoint_device *dev)
 {
-    STATE_AO_GC(dev->rds->ao);
+    STATE_AO_GC(dev->cds->ao);
 
     libxl__remus_drbd_disk *rdd = dev->concrete_data;
 
@@ -199,20 +199,20 @@ static void drbd_postsuspend(libxl__remus_device *dev)
     }
 
     dev->aodev.rc = 0;
-    dev->aodev.callback(dev->rds->egc, &dev->aodev);
+    dev->aodev.callback(dev->cds->egc, &dev->aodev);
 }
 
 
-static void drbd_preresume_async(libxl__remus_device *dev);
+static void drbd_preresume_async(libxl__checkpoint_device *dev);
 
-static void drbd_preresume(libxl__remus_device *dev)
+static void drbd_preresume(libxl__checkpoint_device *dev)
 {
-    STATE_AO_GC(dev->rds->ao);
+    STATE_AO_GC(dev->cds->ao);
 
     drbd_async_call(dev, drbd_preresume_async, chekpoint_async_call_done);
 }
 
-static void drbd_preresume_async(libxl__remus_device *dev)
+static void drbd_preresume_async(libxl__checkpoint_device *dev)
 {
     libxl__remus_drbd_disk *rdd = dev->concrete_data;
     int ackwait = rdd->ackwait;
@@ -231,7 +231,7 @@ static void chekpoint_async_call_done(libxl__egc *egc,
 {
     int rc;
     libxl__ao_device *aodev = CONTAINER_OF(child, *aodev, child);
-    libxl__remus_device *dev = CONTAINER_OF(aodev, *dev, aodev);
+    libxl__checkpoint_device *dev = CONTAINER_OF(aodev, *dev, aodev);
     libxl__remus_drbd_disk *rdd = dev->concrete_data;
 
     STATE_AO_GC(aodev->ao);
@@ -249,8 +249,8 @@ out:
     aodev->callback(egc, aodev);
 }
 
-const libxl__remus_device_subkind_ops remus_device_drbd_disk = {
-    .kind = LIBXL__REMUS_DEVICE_DISK,
+const libxl__checkpoint_device_subkind_ops remus_device_drbd_disk = {
+    .kind = LIBXL__CHECKPOINT_DEVICE_DISK,
     .init = drbd_init,
     .cleanup = drbd_cleanup,
     .setup = drbd_setup,
diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl
index 21ac7f6..1945155 100644
--- a/tools/libxl/libxl_types.idl
+++ b/tools/libxl/libxl_types.idl
@@ -58,8 +58,8 @@ libxl_error = Enumeration("error", [
     (-12, "OSEVENT_REG_FAIL"),
     (-13, "BUFFERFULL"),
     (-14, "UNKNOWN_CHILD"),
-    (-15, "REMUS_DEVOPS_DOES_NOT_MATCH"),
-    (-16, "REMUS_DEVICE_NOT_SUPPORTED"),
+    (-15, "CHECKPOINT_DEVOPS_DOES_NOT_MATCH"),
+    (-16, "CHECKPOINT_DEVICE_NOT_SUPPORTED"),
     ], value_namespace = "")
 
 libxl_domain_type = Enumeration("domain_type", [
-- 
1.9.3
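
For readers following the rename, here is a minimal sketch of the pattern
the renamed field takes part in: the devices state is embedded in the
suspend state, the caller sets cds->callback before starting an op, and the
callback steps back from the embedded member to the enclosing structure.
All type, function and macro names below are hypothetical stand-ins, not
the libxl definitions; the macro is a simplified offsetof-based variant and
does not share the argument form of libxl's CONTAINER_OF.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the libxl types; only the shape matters. */
typedef struct devices_state devices_state;
typedef void devices_callback(devices_state *cds, int rc);

struct devices_state {
    uint32_t domid;
    devices_callback *callback;   /* invoked when an op finishes */
};

struct suspend_state {
    int last_rc;                  /* written by the completion callback */
    devices_state cds;            /* embedded, like dss->cds in the patch */
};

/* Simplified container-of: step back from a member to its enclosing struct. */
#define CONTAINER_OF_SKETCH(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Completion callback: recovers the outer state from the embedded member. */
static void setup_done(devices_state *cds, int rc)
{
    struct suspend_state *dss =
        CONTAINER_OF_SKETCH(cds, struct suspend_state, cds);
    dss->last_rc = rc;
}

/* Stand-in for a devices op: it reports its result through cds->callback. */
static void devices_setup_sketch(devices_state *cds, int rc)
{
    assert(cds->callback != NULL);
    cds->callback(cds, rc);
}

/* Caller side: fill in the state, set the callback, start the op. */
int run_setup_sketch(uint32_t domid, int rc)
{
    struct suspend_state dss = { .last_rc = 1 };

    dss.cds.domid = domid;
    dss.cds.callback = setup_done;
    devices_setup_sketch(&dss.cds, rc);
    return dss.last_rc;
}
```

In the patch itself the same step is written as CONTAINER_OF(cds, *dss, cds),
so renaming the member from rds to cds has to be applied at every such call
site as well as in the structure definition.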


* [RFC Patch v2 14/45] adjust the indentation
From: Wen Congyang @ 2014-08-08  7:01 UTC (permalink / raw)
  To: xen devel
  Cc: Ian Campbell, Wen Congyang, Ian Jackson, Jiang Yunhong,
	Dong Eddie, Yang Hongyang, Lai Jiangshan

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
 tools/libxl/libxl_checkpoint_device.c | 31 ++++++++++++++++++-------------
 tools/libxl/libxl_internal.h          | 18 ++++++++++--------
 tools/libxl/libxl_remus.c             | 12 ++++++++----
 3 files changed, 36 insertions(+), 25 deletions(-)

diff --git a/tools/libxl/libxl_checkpoint_device.c b/tools/libxl/libxl_checkpoint_device.c
index 0036858..a57b9e4 100644
--- a/tools/libxl/libxl_checkpoint_device.c
+++ b/tools/libxl/libxl_checkpoint_device.c
@@ -66,9 +66,9 @@ static void devices_teardown_cb(libxl__egc *egc,
 /* checkpoint device setup and teardown */
 
 static libxl__checkpoint_device* checkpoint_device_init(libxl__egc *egc,
-                                              libxl__checkpoint_devices_state *cds,
-                                              libxl__checkpoint_device_kind kind,
-                                              void *libxl_dev)
+                libxl__checkpoint_devices_state *cds,
+                libxl__checkpoint_device_kind kind,
+                void *libxl_dev)
 {
     libxl__checkpoint_device *dev = NULL;
 
@@ -83,9 +83,10 @@ static libxl__checkpoint_device* checkpoint_device_init(libxl__egc *egc,
 }
 
 static void checkpoint_devices_setup(libxl__egc *egc,
-                                libxl__checkpoint_devices_state *cds);
+                                     libxl__checkpoint_devices_state *cds);
 
-void libxl__checkpoint_devices_setup(libxl__egc *egc, libxl__checkpoint_devices_state *cds)
+void libxl__checkpoint_devices_setup(libxl__egc *egc,
+                                     libxl__checkpoint_devices_state *cds)
 {
     int i, rc;
 
@@ -131,7 +132,7 @@ out:
 }
 
 static void checkpoint_devices_setup(libxl__egc *egc,
-                                libxl__checkpoint_devices_state *cds)
+                                     libxl__checkpoint_devices_state *cds)
 {
     int i, rc;
     libxl__checkpoint_device *dev;
@@ -180,14 +181,18 @@ static void devices_setup_cb(libxl__egc *egc,
     for (i = 0; i < cds->num_devices; i++) {
         dev = cds->dev[i];
 
-        if (!dev->aodev.rc || dev->aodev.rc == ERROR_CHECKPOINT_DEVOPS_DOES_NOT_MATCH)
+        if (!dev->aodev.rc ||
+            dev->aodev.rc == ERROR_CHECKPOINT_DEVOPS_DOES_NOT_MATCH)
             continue;
 
         rc = dev->aodev.rc;
         goto out;
     }
 
-    /* if the error is still ERROR_CHECKPOINT_DEVOPS_DOES_NOT_MATCH, begin next iter */
+    /*
+     * if the error is still ERROR_CHECKPOINT_DEVOPS_DOES_NOT_MATCH,
+     * begin next iter
+     */
     if (rc == ERROR_CHECKPOINT_DEVOPS_DOES_NOT_MATCH) {
         checkpoint_devices_setup(egc, cds);
         return;
@@ -198,7 +203,7 @@ out:
 }
 
 void libxl__checkpoint_devices_teardown(libxl__egc *egc,
-                                   libxl__checkpoint_devices_state *cds)
+                                        libxl__checkpoint_devices_state *cds)
 {
     int i;
     libxl__checkpoint_device *dev;
@@ -260,12 +265,12 @@ static void devices_checkpoint_cb(libxl__egc *egc,
 
 /* API implementations */
 
-#define define_checkpoint_api(api)                                \
-void libxl__checkpoint_devices_##api(libxl__egc *egc,                        \
-                                libxl__checkpoint_devices_state *cds)        \
+#define define_checkpoint_api(api)                                      \
+void libxl__checkpoint_devices_##api(libxl__egc *egc,                   \
+                                libxl__checkpoint_devices_state *cds)   \
 {                                                                       \
     int i;                                                              \
-    libxl__checkpoint_device *dev;                                           \
+    libxl__checkpoint_device *dev;                                      \
                                                                         \
     STATE_AO_GC(cds->ao);                                               \
                                                                         \
diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
index b3b726c..766868c 100644
--- a/tools/libxl/libxl_internal.h
+++ b/tools/libxl/libxl_internal.h
@@ -2541,7 +2541,8 @@ typedef struct libxl__save_helper_state {
  * Each device type needs to implement the interfaces specified in
  * the libxl__checkpoint_device_subkind_ops if it wishes to support Remus.
  *
- * The high-level control flow through the checkpoint device layer is shown below:
+ * The high-level control flow through the checkpoint device layer is shown
+ * below:
  *
  * xl remus
  *  |->  libxl_domain_remus_start
@@ -2610,7 +2611,8 @@ struct libxl__checkpoint_device_subkind_ops {
 };
 
 typedef void libxl__checkpoint_callback(libxl__egc *,
-                                   libxl__checkpoint_devices_state *, int rc);
+                                        libxl__checkpoint_devices_state *,
+                                        int rc);
 
 /*
  * State associated with a checkpoint invocation, including parameters
@@ -2618,7 +2620,7 @@ typedef void libxl__checkpoint_callback(libxl__egc *,
  * save/restore machinery.
  */
 struct libxl__checkpoint_devices_state {
-    /*---- must be set by caller of libxl__checkpoint_device_(setup|teardown) ----*/
+    /*-- must be set by caller of libxl__checkpoint_device_(setup|teardown) --*/
 
     libxl__ao *ao;
     libxl__egc *egc;
@@ -2693,15 +2695,15 @@ struct libxl__checkpoint_device {
 
 /* the following 5 APIs are async ops, call cds->callback when done */
 _hidden void libxl__checkpoint_devices_setup(libxl__egc *egc,
-                                        libxl__checkpoint_devices_state *cds);
+                libxl__checkpoint_devices_state *cds);
 _hidden void libxl__checkpoint_devices_teardown(libxl__egc *egc,
-                                           libxl__checkpoint_devices_state *cds);
+                libxl__checkpoint_devices_state *cds);
 _hidden void libxl__checkpoint_devices_postsuspend(libxl__egc *egc,
-                                              libxl__checkpoint_devices_state *cds);
+                libxl__checkpoint_devices_state *cds);
 _hidden void libxl__checkpoint_devices_preresume(libxl__egc *egc,
-                                            libxl__checkpoint_devices_state *cds);
+                libxl__checkpoint_devices_state *cds);
 _hidden void libxl__checkpoint_devices_commit(libxl__egc *egc,
-                                         libxl__checkpoint_devices_state *cds);
+                libxl__checkpoint_devices_state *cds);
 _hidden int libxl__netbuffer_enabled(libxl__gc *gc);
 
 /*----- Domain suspend (save) state structure -----*/
diff --git a/tools/libxl/libxl_remus.c b/tools/libxl/libxl_remus.c
index a04c05e..ca205a7 100644
--- a/tools/libxl/libxl_remus.c
+++ b/tools/libxl/libxl_remus.c
@@ -21,9 +21,11 @@
 
 /*----- remus: setup the environment -----*/
 static void libxl__remus_setup_done(libxl__egc *egc,
-                                    libxl__checkpoint_devices_state *cds, int rc);
+                                    libxl__checkpoint_devices_state *cds,
+                                    int rc);
 static void libxl__remus_setup_failed(libxl__egc *egc,
-                                      libxl__checkpoint_devices_state *cds, int rc);
+                                      libxl__checkpoint_devices_state *cds,
+                                      int rc);
 
 void libxl__remus_setup(libxl__egc *egc,
                         libxl__domain_suspend_state *dss)
@@ -58,7 +60,8 @@ out:
 }
 
 static void libxl__remus_setup_done(libxl__egc *egc,
-                                    libxl__checkpoint_devices_state *cds, int rc)
+                                    libxl__checkpoint_devices_state *cds,
+                                    int rc)
 {
     libxl__domain_suspend_state *dss = CONTAINER_OF(cds, *dss, cds);
     STATE_AO_GC(dss->ao);
@@ -220,7 +223,8 @@ out:
 
 /*----- remus: wait a new checkpoint -----*/
 static void remus_checkpoint_dm_saved(libxl__egc *egc,
-                                      libxl__domain_suspend_state *dss, int rc);
+                                      libxl__domain_suspend_state *dss,
+                                      int rc);
 static void remus_devices_commit_cb(libxl__egc *egc,
                                     libxl__checkpoint_devices_state *cds,
                                     int rc);
-- 
1.9.3


* [RFC Patch v2 15/45] don't touch remus in checkpoint_device
From: Wen Congyang @ 2014-08-08  7:01 UTC (permalink / raw)
  To: xen devel
  Cc: Ian Campbell, Wen Congyang, Ian Jackson, Jiang Yunhong,
	Dong Eddie, Yang Hongyang, Lai Jiangshan

The checkpoint device is an abstraction layer for taking checkpoints,
and COLO can use it as well. However, some code in the checkpoint
device layer still touches remus:
1. remus_ops: the checkpoint device layer uses the remus ops
   directly. Store the ops in the checkpoint device state instead.
2. The concrete layer's private members: add a new structure,
   libxl__remus_state, and move them into it.
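
The two changes above can be sketched in isolation. This is a minimal
illustration, not the libxl code: the struct names, the `container_of`
macro and the `*_ready` flags are hypothetical stand-ins, and the real
structures carry far more state. It shows the abstract layer walking a
NULL-terminated ops table that the caller supplies, while each concrete
op recovers its private state from the embedded abstract state.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for libxl's CONTAINER_OF (assumption: plain offsetof). */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct cds;                         /* abstract checkpoint-devices state */

typedef struct {
    const char *name;
    int (*init)(struct cds *cds);
} subkind_ops;

struct cds {
    const subkind_ops **ops;        /* NULL-terminated, set by the caller */
};

/* Concrete (Remus-like) state: owns the private members and embeds cds. */
struct remus_state {
    int interval;
    struct cds cds;
    int nic_ready;                  /* hypothetical private members */
    int disk_ready;
};

static int nic_init(struct cds *cds)
{
    struct remus_state *rs = container_of(cds, struct remus_state, cds);
    rs->nic_ready = 1;
    return 0;
}

static int disk_init(struct cds *cds)
{
    struct remus_state *rs = container_of(cds, struct remus_state, cds);
    rs->disk_ready = 1;
    return 0;
}

static const subkind_ops nic_ops  = { "nic",  nic_init  };
static const subkind_ops disk_ops = { "disk", disk_init };
static const subkind_ops *remus_ops[] = { &nic_ops, &disk_ops, NULL };

/* Abstract layer: iterates cds->ops and never names a Remus symbol. */
static int init_device_subkind(struct cds *cds)
{
    const subkind_ops **ops;

    for (ops = cds->ops; *ops; ops++) {
        int rc = (*ops)->init(cds);
        if (rc)
            return rc;
    }
    return 0;
}

/* Concrete layer: plugs its own ops table into the abstract state. */
static int remus_setup(struct remus_state *rs)
{
    rs->cds.ops = remus_ops;
    return init_device_subkind(&rs->cds);
}
```

With this split, another user of the abstract layer would only need its
own state structure and its own ops table; `init_device_subkind` stays
unchanged.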

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
 tools/libxl/libxl.c                   |  2 +-
 tools/libxl/libxl_checkpoint_device.c | 14 +++-------
 tools/libxl/libxl_dom.c               |  3 +--
 tools/libxl/libxl_internal.h          | 37 ++++++++++++++++---------
 tools/libxl/libxl_netbuffer.c         | 51 ++++++++++++++++++++---------------
 tools/libxl/libxl_remus.c             | 50 +++++++++++++++++++++-------------
 tools/libxl/libxl_remus.h             |  5 ++--
 tools/libxl/libxl_remus_disk_drbd.c   |  9 ++++---
 8 files changed, 97 insertions(+), 74 deletions(-)

diff --git a/tools/libxl/libxl.c b/tools/libxl/libxl.c
index ff93af3..9a8fd16 100644
--- a/tools/libxl/libxl.c
+++ b/tools/libxl/libxl.c
@@ -820,7 +820,7 @@ int libxl_domain_remus_start(libxl_ctx *ctx, libxl_domain_remus_info *info,
     assert(info);
 
     /* Point of no return */
-    libxl__remus_setup(egc, dss);
+    libxl__remus_setup(egc, &dss->rs);
     return AO_INPROGRESS;
 
  out:
diff --git a/tools/libxl/libxl_checkpoint_device.c b/tools/libxl/libxl_checkpoint_device.c
index a57b9e4..f73db0e 100644
--- a/tools/libxl/libxl_checkpoint_device.c
+++ b/tools/libxl/libxl_checkpoint_device.c
@@ -17,14 +17,6 @@
 
 #include "libxl_internal.h"
 
-extern const libxl__checkpoint_device_subkind_ops remus_device_nic;
-extern const libxl__checkpoint_device_subkind_ops remus_device_drbd_disk;
-static const libxl__checkpoint_device_subkind_ops *remus_ops[] = {
-    &remus_device_nic,
-    &remus_device_drbd_disk,
-    NULL,
-};
-
 /*----- helper functions -----*/
 
 static int init_device_subkind(libxl__checkpoint_devices_state *cds)
@@ -32,7 +24,7 @@ static int init_device_subkind(libxl__checkpoint_devices_state *cds)
     int rc;
     const libxl__checkpoint_device_subkind_ops **ops;
 
-    for (ops = remus_ops; *ops; ops++) {
+    for (ops = cds->ops; *ops; ops++) {
         rc = (*ops)->init(cds);
         if (rc)
             goto out;
@@ -48,7 +40,7 @@ static void cleanup_device_subkind(libxl__checkpoint_devices_state *cds)
 {
     const libxl__checkpoint_device_subkind_ops **ops;
 
-    for (ops = remus_ops; *ops; ops++)
+    for (ops = cds->ops; *ops; ops++)
         (*ops)->cleanup(cds);
 }
 
@@ -148,7 +140,7 @@ static void checkpoint_devices_setup(libxl__egc *egc,
 
         /* find avaliable ops */
         do {
-            dev->ops = remus_ops[++dev->ops_index];
+            dev->ops = cds->ops[++dev->ops_index];
             if (!dev->ops) {
                 rc = ERROR_CHECKPOINT_DEVICE_NOT_SUPPORTED;
                 goto out;
diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
index f819846..4e71ec5 100644
--- a/tools/libxl/libxl_dom.c
+++ b/tools/libxl/libxl_dom.c
@@ -1605,7 +1605,6 @@ void libxl__domain_suspend(libxl__egc *egc, libxl__domain_suspend_state *dss)
     dss2->save_dm = 1;
 
     if (r_info != NULL) {
-        dss->interval = r_info->interval;
         if (r_info->compression)
             dss->xcflags |= XCFLAGS_CHECKPOINT_COMPRESS;
     }
@@ -1787,7 +1786,7 @@ static void domain_suspend_done(libxl__egc *egc,
                            dss2->guest_evtchn.port, &dss2->guest_evtchn_lockfd);
 
     if (dss->remus) {
-        libxl__remus_teardown(egc, dss, rc);
+        libxl__remus_teardown(egc, &dss->rs, rc);
         return;
     }
 
diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
index 766868c..4d37cb4 100644
--- a/tools/libxl/libxl_internal.h
+++ b/tools/libxl/libxl_internal.h
@@ -2627,6 +2627,8 @@ struct libxl__checkpoint_devices_state {
     uint32_t domid;
     libxl__checkpoint_callback *callback;
     int device_kind_flags;
+    /* ops must be a NULL-terminated array of pointers */
+    const libxl__checkpoint_device_subkind_ops **ops;
 
     /*----- private for abstract layer only -----*/
 
@@ -2645,16 +2647,6 @@ struct libxl__checkpoint_devices_state {
     int num_disks;
 
     libxl__multidev multidev;
-
-    /*----- private for concrete (device-specific) layer only -----*/
-
-    /* private for nic device subkind ops */
-    char *netbufscript;
-    struct nl_sock *nlsock;
-    struct nl_cache *qdisc_cache;
-
-    /* private for drbd disk subkind ops */
-    char *drbd_probe_script;
 };
 
 /*
@@ -2704,6 +2696,27 @@ _hidden void libxl__checkpoint_devices_preresume(libxl__egc *egc,
                 libxl__checkpoint_devices_state *cds);
 _hidden void libxl__checkpoint_devices_commit(libxl__egc *egc,
                 libxl__checkpoint_devices_state *cds);
+
+/*----- Remus related state structure -----*/
+typedef struct libxl__remus_state libxl__remus_state;
+struct libxl__remus_state {
+    /* private */
+    libxl__ev_time checkpoint_timeout; /* used for Remus checkpoint */
+    int interval; /* checkpoint interval */
+
+    /* abstract layer */
+    libxl__checkpoint_devices_state cds;
+
+    /*----- private for concrete (device-specific) layer only -----*/
+    /* private for nic device subkind ops */
+    char *netbufscript;
+    struct nl_sock *nlsock;
+    struct nl_cache *qdisc_cache;
+
+    /* private for drbd disk subkind ops */
+    char *drbd_probe_script;
+};
+
 _hidden int libxl__netbuffer_enabled(libxl__gc *gc);
 
 /*----- Domain suspend (save) state structure -----*/
@@ -2767,9 +2780,7 @@ struct libxl__domain_suspend_state {
     libxl__domain_suspend_state2 dss2;
     int hvm;
     int xcflags;
-    libxl__checkpoint_devices_state cds;
-    libxl__ev_time checkpoint_timeout; /* used for Remus checkpoint */
-    int interval; /* checkpoint interval (for Remus) */
+    libxl__remus_state rs;
     libxl__save_helper_state shs;
     libxl__logdirty_switch logdirty;
     /* private for libxl__domain_save_device_model */
diff --git a/tools/libxl/libxl_netbuffer.c b/tools/libxl/libxl_netbuffer.c
index 385922f..7944d43 100644
--- a/tools/libxl/libxl_netbuffer.c
+++ b/tools/libxl/libxl_netbuffer.c
@@ -43,18 +43,19 @@ int libxl__netbuffer_enabled(libxl__gc *gc)
 static int nic_init(libxl__checkpoint_devices_state *cds)
 {
     int rc, ret;
-    libxl__domain_suspend_state *dss = CONTAINER_OF(cds, *dss, cds);
+    libxl__remus_state *rs = CONTAINER_OF(cds, *rs, cds);
+    libxl__domain_suspend_state *dss = CONTAINER_OF(rs, *dss, rs);
 
     STATE_AO_GC(cds->ao);
 
-    cds->nlsock = nl_socket_alloc();
-    if (!cds->nlsock) {
+    rs->nlsock = nl_socket_alloc();
+    if (!rs->nlsock) {
         LOG(ERROR, "cannot allocate nl socket");
         rc = ERROR_FAIL;
         goto out;
     }
 
-    ret = nl_connect(cds->nlsock, NETLINK_ROUTE);
+    ret = nl_connect(rs->nlsock, NETLINK_ROUTE);
     if (ret) {
         LOG(ERROR, "failed to open netlink socket: %s",
             nl_geterror(ret));
@@ -63,7 +64,7 @@ static int nic_init(libxl__checkpoint_devices_state *cds)
     }
 
     /* get list of all qdiscs installed on network devs. */
-    ret = rtnl_qdisc_alloc_cache(cds->nlsock, &cds->qdisc_cache);
+    ret = rtnl_qdisc_alloc_cache(rs->nlsock, &rs->qdisc_cache);
     if (ret) {
         LOG(ERROR, "failed to allocate qdisc cache: %s",
             nl_geterror(ret));
@@ -72,10 +73,10 @@ static int nic_init(libxl__checkpoint_devices_state *cds)
     }
 
     if (dss->remus->netbufscript) {
-        cds->netbufscript = libxl__strdup(gc, dss->remus->netbufscript);
+        rs->netbufscript = libxl__strdup(gc, dss->remus->netbufscript);
     } else {
-        cds->netbufscript = GCSPRINTF("%s/remus-netbuf-setup",
-                                      libxl__xen_script_dir_path());
+        rs->netbufscript = GCSPRINTF("%s/remus-netbuf-setup",
+                                     libxl__xen_script_dir_path());
     }
 
     rc = 0;
@@ -86,20 +87,22 @@ out:
 
 static void nic_cleanup(libxl__checkpoint_devices_state *cds)
 {
+    libxl__remus_state *rs = CONTAINER_OF(cds, *rs, cds);
+
     STATE_AO_GC(cds->ao);
 
     /* free qdisc cache */
-    if (cds->qdisc_cache) {
-        nl_cache_clear(cds->qdisc_cache);
-        nl_cache_free(cds->qdisc_cache);
-        cds->qdisc_cache = NULL;
+    if (rs->qdisc_cache) {
+        nl_cache_clear(rs->qdisc_cache);
+        nl_cache_free(rs->qdisc_cache);
+        rs->qdisc_cache = NULL;
     }
 
     /* close & free nlsock */
-    if (cds->nlsock) {
-        nl_close(cds->nlsock);
-        nl_socket_free(cds->nlsock);
-        cds->nlsock = NULL;
+    if (rs->nlsock) {
+        nl_close(rs->nlsock);
+        nl_socket_free(rs->nlsock);
+        rs->nlsock = NULL;
     }
 }
 
@@ -152,13 +155,14 @@ static int init_qdisc(libxl__checkpoint_devices_state *cds,
     int rc, ret, ifindex;
     struct rtnl_link *ifb = NULL;
     struct rtnl_qdisc *qdisc = NULL;
+    libxl__remus_state *rs = CONTAINER_OF(cds, *rs, cds);
 
     STATE_AO_GC(cds->ao);
 
     /* Now that we have brought up REMUS_IFB device with plug qdisc for
      * this vif, so we need to refill the qdisc cache.
      */
-    ret = nl_cache_refill(cds->nlsock, cds->qdisc_cache);
+    ret = nl_cache_refill(rs->nlsock, rs->qdisc_cache);
     if (ret) {
         LOG(ERROR, "cannot refill qdisc cache: %s", nl_geterror(ret));
         rc = ERROR_FAIL;
@@ -166,7 +170,7 @@ static int init_qdisc(libxl__checkpoint_devices_state *cds,
     }
 
     /* get a handle to the REMUS_IFB interface */
-    ret = rtnl_link_get_kernel(cds->nlsock, 0, remus_nic->ifb, &ifb);
+    ret = rtnl_link_get_kernel(rs->nlsock, 0, remus_nic->ifb, &ifb);
     if (ret) {
         LOG(ERROR, "cannot obtain handle for %s: %s", remus_nic->ifb,
             nl_geterror(ret));
@@ -189,7 +193,7 @@ static int init_qdisc(libxl__checkpoint_devices_state *cds,
      * There is no need to explicitly free this qdisc as its just a
      * reference from the qdisc cache we allocated earlier.
      */
-    qdisc = rtnl_qdisc_get_by_parent(cds->qdisc_cache, ifindex, TC_H_ROOT);
+    qdisc = rtnl_qdisc_get_by_parent(rs->qdisc_cache, ifindex, TC_H_ROOT);
     if (qdisc) {
         const char *tc_kind = rtnl_tc_get_kind(TC_CAST(qdisc));
         /* Sanity check: Ensure that the root qdisc is a plug qdisc. */
@@ -240,11 +244,12 @@ static void setup_async_exec(libxl__checkpoint_device *dev, char *op)
     libxl__remus_device_nic *remus_nic = dev->concrete_data;
     libxl__checkpoint_devices_state *cds = dev->cds;
     libxl__async_exec_state *aes = &dev->aodev.aes;
+    libxl__remus_state *rs = CONTAINER_OF(cds, *rs, cds);
 
     STATE_AO_GC(cds->ao);
 
     /* Convenience aliases */
-    char *const script = libxl__strdup(gc, cds->netbufscript);
+    char *const script = libxl__strdup(gc, rs->netbufscript);
     const uint32_t domid = cds->domid;
     const int dev_id = remus_nic->devid;
     const char *const vif = remus_nic->vif;
@@ -335,6 +340,7 @@ static void netbuf_setup_script_cb(libxl__egc *egc,
     libxl__checkpoint_device *dev = CONTAINER_OF(aodev, *dev, aodev);
     libxl__remus_device_nic *remus_nic = dev->concrete_data;
     libxl__checkpoint_devices_state *cds = dev->cds;
+    libxl__remus_state *rs = CONTAINER_OF(cds, *rs, cds);
     const char *out_path_base, *hotplug_error = NULL;
     int rc;
 
@@ -375,7 +381,7 @@ static void netbuf_setup_script_cb(libxl__egc *egc,
 
     if (hotplug_error) {
         LOG(ERROR, "netbuf script %s setup failed for vif %s: %s",
-            cds->netbufscript, vif, hotplug_error);
+            rs->netbufscript, vif, hotplug_error);
         rc = ERROR_FAIL;
         goto out;
     }
@@ -446,6 +452,7 @@ static int remus_netbuf_op(libxl__remus_device_nic *remus_nic,
                            int buffer_op)
 {
     int rc, ret;
+    libxl__remus_state *rs = CONTAINER_OF(cds, *rs, cds);
 
     STATE_AO_GC(cds->ao);
 
@@ -459,7 +466,7 @@ static int remus_netbuf_op(libxl__remus_device_nic *remus_nic,
         goto out;
     }
 
-    ret = rtnl_qdisc_add(cds->nlsock, remus_nic->qdisc, NLM_F_REQUEST);
+    ret = rtnl_qdisc_add(rs->nlsock, remus_nic->qdisc, NLM_F_REQUEST);
     if (ret) {
         rc = ERROR_FAIL;
         goto out;
diff --git a/tools/libxl/libxl_remus.c b/tools/libxl/libxl_remus.c
index ca205a7..383b1d2 100644
--- a/tools/libxl/libxl_remus.c
+++ b/tools/libxl/libxl_remus.c
@@ -18,6 +18,13 @@
 #include "libxl_internal.h"
 #include "libxl_remus.h"
 
+extern const libxl__checkpoint_device_subkind_ops remus_device_nic;
+extern const libxl__checkpoint_device_subkind_ops remus_device_drbd_disk;
+static const libxl__checkpoint_device_subkind_ops *remus_ops[] = {
+    &remus_device_nic,
+    &remus_device_drbd_disk,
+    NULL,
+};
 
 /*----- remus: setup the environment -----*/
 static void libxl__remus_setup_done(libxl__egc *egc,
@@ -27,11 +34,12 @@ static void libxl__remus_setup_failed(libxl__egc *egc,
                                       libxl__checkpoint_devices_state *cds,
                                       int rc);
 
-void libxl__remus_setup(libxl__egc *egc,
-                        libxl__domain_suspend_state *dss)
+void libxl__remus_setup(libxl__egc *egc, libxl__remus_state *rs)
 {
+    libxl__domain_suspend_state *dss = CONTAINER_OF(rs, *dss, rs);
+
     /* Convenience aliases */
-    libxl__checkpoint_devices_state *const cds = &dss->cds;
+    libxl__checkpoint_devices_state *const cds = &rs->cds;
     const libxl_domain_remus_info *const info = dss->remus;
 
     STATE_AO_GC(dss->ao);
@@ -51,19 +59,21 @@ void libxl__remus_setup(libxl__egc *egc,
     cds->egc = egc;
     cds->domid = dss->domid;
     cds->callback = libxl__remus_setup_done;
+    cds->ops = remus_ops;
+    rs->interval = info->interval;
 
     libxl__checkpoint_devices_setup(egc, cds);
     return;
 
 out:
-    libxl__remus_setup_failed(egc, cds, ERROR_FAIL);
+    dss->callback(egc, dss, ERROR_FAIL);
 }
 
 static void libxl__remus_setup_done(libxl__egc *egc,
                                     libxl__checkpoint_devices_state *cds,
                                     int rc)
 {
-    libxl__domain_suspend_state *dss = CONTAINER_OF(cds, *dss, cds);
+    libxl__domain_suspend_state *dss = CONTAINER_OF(cds, *dss, rs.cds);
     STATE_AO_GC(dss->ao);
 
     if (!rc) {
@@ -81,7 +91,7 @@ static void libxl__remus_setup_failed(libxl__egc *egc,
                                       libxl__checkpoint_devices_state *cds,
                                       int rc)
 {
-    libxl__domain_suspend_state *dss = CONTAINER_OF(cds, *dss, cds);
+    libxl__domain_suspend_state *dss = CONTAINER_OF(cds, *dss, rs.cds);
     STATE_AO_GC(dss->ao);
 
     if (rc)
@@ -98,9 +108,11 @@ static void remus_teardown_done(libxl__egc *egc,
                                 int rc);
 
 void libxl__remus_teardown(libxl__egc *egc,
-                           libxl__domain_suspend_state *dss,
+                           libxl__remus_state *rs,
                            int rc)
 {
+    libxl__domain_suspend_state *dss = CONTAINER_OF(rs, *dss, rs);
+
     EGC_GC;
 
     /*
@@ -111,15 +123,15 @@ void libxl__remus_teardown(libxl__egc *egc,
      */
     LOG(WARN, "Remus: Domain suspend terminated with rc %d,"
         " teardown Remus devices...", rc);
-    dss->cds.callback = remus_teardown_done;
-    libxl__checkpoint_devices_teardown(egc, &dss->cds);
+    dss->rs.cds.callback = remus_teardown_done;
+    libxl__checkpoint_devices_teardown(egc, &dss->rs.cds);
 }
 
 static void remus_teardown_done(libxl__egc *egc,
                                 libxl__checkpoint_devices_state *cds,
                                 int rc)
 {
-    libxl__domain_suspend_state *dss = CONTAINER_OF(cds, *dss, cds);
+    libxl__domain_suspend_state *dss = CONTAINER_OF(cds, *dss, rs.cds);
     STATE_AO_GC(dss->ao);
 
     if (rc)
@@ -158,7 +170,7 @@ static void remus_domain_suspend_callback_common_done(libxl__egc *egc,
     if (!ok)
         goto out;
 
-    libxl__checkpoint_devices_state *const cds = &dss->cds;
+    libxl__checkpoint_devices_state *const cds = &dss->rs.cds;
     cds->callback = remus_devices_postsuspend_cb;
     libxl__checkpoint_devices_postsuspend(egc, cds);
     return;
@@ -172,7 +184,7 @@ static void remus_devices_postsuspend_cb(libxl__egc *egc,
                                          int rc)
 {
     int ok = 0;
-    libxl__domain_suspend_state *dss = CONTAINER_OF(cds, *dss, cds);
+    libxl__domain_suspend_state *dss = CONTAINER_OF(cds, *dss, rs.cds);
 
     if (rc)
         goto out;
@@ -196,7 +208,7 @@ void libxl__remus_domain_resume_callback(void *data)
     libxl__domain_suspend_state *dss = CONTAINER_OF(shs, *dss, shs);
     STATE_AO_GC(dss->ao);
 
-    libxl__checkpoint_devices_state *const cds = &dss->cds;
+    libxl__checkpoint_devices_state *const cds = &dss->rs.cds;
     cds->callback = remus_devices_preresume_cb;
     libxl__checkpoint_devices_preresume(egc, cds);
 }
@@ -206,7 +218,7 @@ static void remus_devices_preresume_cb(libxl__egc *egc,
                                        int rc)
 {
     int ok = 0;
-    libxl__domain_suspend_state *dss = CONTAINER_OF(cds, *dss, cds);
+    libxl__domain_suspend_state *dss = CONTAINER_OF(cds, *dss, rs.cds);
     STATE_AO_GC(dss->ao);
 
     if (rc)
@@ -250,7 +262,7 @@ static void remus_checkpoint_dm_saved(libxl__egc *egc,
                                       libxl__domain_suspend_state *dss, int rc)
 {
     /* Convenience aliases */
-    libxl__checkpoint_devices_state *const cds = &dss->cds;
+    libxl__checkpoint_devices_state *const cds = &dss->rs.cds;
 
     STATE_AO_GC(dss->ao);
 
@@ -272,7 +284,7 @@ static void remus_devices_commit_cb(libxl__egc *egc,
                                     libxl__checkpoint_devices_state *cds,
                                     int rc)
 {
-    libxl__domain_suspend_state *dss = CONTAINER_OF(cds, *dss, cds);
+    libxl__domain_suspend_state *dss = CONTAINER_OF(cds, *dss, rs.cds);
 
     STATE_AO_GC(dss->ao);
 
@@ -290,9 +302,9 @@ static void remus_devices_commit_cb(libxl__egc *egc,
      */
 
     /* Set checkpoint interval timeout */
-    rc = libxl__ev_time_register_rel(gc, &dss->checkpoint_timeout,
+    rc = libxl__ev_time_register_rel(gc, &dss->rs.checkpoint_timeout,
                                      remus_next_checkpoint,
-                                     dss->interval);
+                                     dss->rs.interval);
 
     if (rc) {
         LOG(ERROR, "unable to register timeout for next epoch."
@@ -310,7 +322,7 @@ static void remus_next_checkpoint(libxl__egc *egc, libxl__ev_time *ev,
                                   const struct timeval *requested_abs)
 {
     libxl__domain_suspend_state *dss =
-                            CONTAINER_OF(ev, *dss, checkpoint_timeout);
+                            CONTAINER_OF(ev, *dss, rs.checkpoint_timeout);
 
     STATE_AO_GC(dss->ao);
 
diff --git a/tools/libxl/libxl_remus.h b/tools/libxl/libxl_remus.h
index 53e5e81..15bbbe8 100644
--- a/tools/libxl/libxl_remus.h
+++ b/tools/libxl/libxl_remus.h
@@ -16,10 +16,9 @@
 #ifndef LIBXL_REMUS_H
 #define LIBXL_REMUS_H
 
-void libxl__remus_setup(libxl__egc *egc,
-                        libxl__domain_suspend_state *dss);
+void libxl__remus_setup(libxl__egc *egc, libxl__remus_state *rs);
 void libxl__remus_teardown(libxl__egc *egc,
-                           libxl__domain_suspend_state *dss,
+                           libxl__remus_state *rs,
                            int rc);
 void libxl__remus_domain_suspend_callback(void *data);
 void libxl__remus_domain_resume_callback(void *data);
diff --git a/tools/libxl/libxl_remus_disk_drbd.c b/tools/libxl/libxl_remus_disk_drbd.c
index 5928187..89465b6 100644
--- a/tools/libxl/libxl_remus_disk_drbd.c
+++ b/tools/libxl/libxl_remus_disk_drbd.c
@@ -60,10 +60,12 @@ out:
 /*----- init() and cleanup() -----*/
 static int drbd_init(libxl__checkpoint_devices_state *cds)
 {
+    libxl__remus_state *rs = CONTAINER_OF(cds, *rs, cds);
+
     STATE_AO_GC(cds->ao);
 
-    cds->drbd_probe_script = GCSPRINTF("%s/block-drbd-probe",
-                                       libxl__xen_script_dir_path());
+    rs->drbd_probe_script = GCSPRINTF("%s/block-drbd-probe",
+                                      libxl__xen_script_dir_path());
 
     return 0;
 }
@@ -96,6 +98,7 @@ static void match_async_exec(libxl__egc *egc, libxl__checkpoint_device *dev)
     int arraysize, nr = 0, rc;
     const libxl_device_disk *disk = dev->backend_dev;
     libxl__async_exec_state *aes = &dev->aodev.aes;
+    libxl__remus_state *rs = CONTAINER_OF(dev->cds, *rs, cds);
     STATE_AO_GC(dev->cds->ao);
 
     /* setup env & args */
@@ -107,7 +110,7 @@ static void match_async_exec(libxl__egc *egc, libxl__checkpoint_device *dev)
     arraysize = 3;
     nr = 0;
     GCNEW_ARRAY(aes->args, arraysize);
-    aes->args[nr++] = dev->cds->drbd_probe_script;
+    aes->args[nr++] = rs->drbd_probe_script;
     aes->args[nr++] = disk->pdev_path;
     aes->args[nr++] = NULL;
     assert(nr <= arraysize);
-- 
1.9.3


* [RFC Patch v2 16/45] Update libxl_save_msgs_gen.pl to support return data from xl to xc
From: Wen Congyang @ 2014-08-08  7:01 UTC (permalink / raw)
  To: xen devel
  Cc: Ian Campbell, Wen Congyang, Ian Jackson, Jiang Yunhong,
	Dong Eddie, Yang Hongyang, Lai Jiangshan

Currently, every callback returns either an integer or void, so a
callback cannot return a block of data to xc. Update
libxl_save_msgs_gen.pl to support callbacks that return data.
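
The reply added by this patch is framed as a uint64_t length followed by
the payload, written on one side and read back on the other. The sketch
below illustrates only that framing over a plain pipe. The names
write_all(), read_all(), send_reply_data() and get_reply_data() are
hypothetical stand-ins, not the libxl functions, and the receiver uses
malloc() here instead of the helper's own allocator.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Write exactly len bytes to fd; returns 0 on success, -1 on error. */
static int write_all(int fd, const void *buf, size_t len)
{
    const uint8_t *p = buf;

    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0 && errno == EINTR)
            continue;
        if (n <= 0)
            return -1;
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Read exactly len bytes from fd; returns 0 on success, -1 on error/EOF. */
static int read_all(int fd, void *buf, size_t len)
{
    uint8_t *p = buf;

    while (len > 0) {
        ssize_t n = read(fd, p, len);
        if (n < 0 && errno == EINTR)
            continue;
        if (n <= 0)
            return -1;
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Sender side (hypothetical): length header first, then the payload. */
static int send_reply_data(int fd, const void *data, uint64_t size)
{
    if (write_all(fd, &size, sizeof(size)))
        return -1;
    return write_all(fd, data, (size_t)size);
}

/* Receiver side (hypothetical): returns a malloc'd buffer, NULL on failure. */
static uint8_t *get_reply_data(int fd, uint64_t *size_out)
{
    uint64_t size;
    uint8_t *data;

    assert(size_out != NULL);
    if (read_all(fd, &size, sizeof(size)))
        return NULL;

    data = malloc(size ? (size_t)size : 1);
    if (!data)
        return NULL;

    if (read_all(fd, data, (size_t)size)) {
        free(data);
        return NULL;
    }
    *size_out = size;
    return data;
}
```

Sending the length first lets the reader size its buffer before reading
the payload, which is why the size and the data travel as two separate
writes.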

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
 tools/libxl/libxl_internal.h       |  3 ++
 tools/libxl/libxl_save_callout.c   | 31 ++++++++++++++++++
 tools/libxl/libxl_save_helper.c    | 17 ++++++++++
 tools/libxl/libxl_save_msgs_gen.pl | 65 ++++++++++++++++++++++++++++++++++----
 4 files changed, 109 insertions(+), 7 deletions(-)

diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
index 4d37cb4..e3a8947 100644
--- a/tools/libxl/libxl_internal.h
+++ b/tools/libxl/libxl_internal.h
@@ -3076,6 +3076,9 @@ _hidden void libxl__xc_domain_save_done(libxl__egc*, void *dss_void,
  * When they are ready to indicate completion, they call this. */
 void libxl__xc_domain_saverestore_async_callback_done(libxl__egc *egc,
                            libxl__save_helper_state *shs, int return_value);
+void libxl__xc_domain_saverestore_async_callback_done_with_data(libxl__egc *egc,
+                           libxl__save_helper_state *shs,
+                           const void *data, uint64_t size);
 
 
 _hidden void libxl__domain_suspend_common_switch_qemu_logdirty
diff --git a/tools/libxl/libxl_save_callout.c b/tools/libxl/libxl_save_callout.c
index 1c9f806..0c09d94 100644
--- a/tools/libxl/libxl_save_callout.c
+++ b/tools/libxl/libxl_save_callout.c
@@ -145,6 +145,15 @@ void libxl__xc_domain_saverestore_async_callback_done(libxl__egc *egc,
     shs->egc = 0;
 }
 
+void libxl__xc_domain_saverestore_async_callback_done_with_data(libxl__egc *egc,
+                           libxl__save_helper_state *shs,
+                           const void *data, uint64_t size)
+{
+    shs->egc = egc;
+    libxl__srm_callout_sendreply_data(data, size, shs);
+    shs->egc = 0;
+}
+
 /*----- helper execution -----*/
 
 static void run_helper(libxl__egc *egc, libxl__save_helper_state *shs,
@@ -370,6 +379,28 @@ void libxl__srm_callout_sendreply(int r, void *user)
         helper_failed(egc, shs, ERROR_FAIL);
 }
 
+void libxl__srm_callout_sendreply_data(const void *data, uint64_t size, void *user)
+{
+    libxl__save_helper_state *shs = user;
+    libxl__egc *egc = shs->egc;
+    STATE_AO_GC(shs->ao);
+    int errnoval;
+
+    errnoval = libxl_write_exactly(CTX, libxl__carefd_fd(shs->pipes[0]),
+                                   &size, sizeof(size), shs->stdin_what,
+                                   "callback return data length");
+    if (errnoval)
+        goto out;
+
+    errnoval = libxl_write_exactly(CTX, libxl__carefd_fd(shs->pipes[0]),
+                                   data, size, shs->stdin_what,
+                                   "callback return data");
+
+out:
+    if (errnoval)
+        helper_failed(egc, shs, ERROR_FAIL);
+}
+
 void libxl__srm_callout_callback_log(uint32_t level, uint32_t errnoval,
                   const char *context, const char *formatted, void *user)
 {
diff --git a/tools/libxl/libxl_save_helper.c b/tools/libxl/libxl_save_helper.c
index 74826a1..44c5807 100644
--- a/tools/libxl/libxl_save_helper.c
+++ b/tools/libxl/libxl_save_helper.c
@@ -155,6 +155,23 @@ int helper_getreply(void *user)
     return v;
 }
 
+uint8_t *helper_getreply_data(void *user)
+{
+    uint64_t size;
+    int r = read_exactly(0, &size, sizeof(size));
+    uint8_t *data;
+
+    if (r <= 0)
+        exit(-2);
+
+    data = helper_allocbuf(size, user);
+    r = read_exactly(0, data, size);
+    if (r <= 0)
+        exit(-2);
+
+    return data;
+}
+
 /*----- other callbacks -----*/
 
 static int toolstack_save_fd;
diff --git a/tools/libxl/libxl_save_msgs_gen.pl b/tools/libxl/libxl_save_msgs_gen.pl
index 6b4b65e..41ee000 100755
--- a/tools/libxl/libxl_save_msgs_gen.pl
+++ b/tools/libxl/libxl_save_msgs_gen.pl
@@ -15,6 +15,7 @@ our @msgs = (
     #         and its null-ness needs to be passed through to the helper's xc
     #   W  - needs a return value; callback is synchronous
     #   A  - needs a return value; callback is asynchronous
+    #   B  - return value is a pointer
     [  1, 'sr',     "log",                   [qw(uint32_t level
                                                  uint32_t errnoval
                                                  STRING context
@@ -99,23 +100,28 @@ our $libxl = "libxl__srm";
 our $callback = "${libxl}_callout_callback";
 our $receiveds = "${libxl}_callout_received";
 our $sendreply = "${libxl}_callout_sendreply";
+our $sendreply_data = "${libxl}_callout_sendreply_data";
 our $getcallbacks = "${libxl}_callout_get_callbacks";
 our $enumcallbacks = "${libxl}_callout_enumcallbacks";
 sub cbtype ($) { "${libxl}_".$_[0]."_autogen_callbacks"; };
 
 f_decl($sendreply, 'callout', 'void', "(int r, void *user)");
+f_decl($sendreply_data, 'callout', 'void',
+       "(const void *data, uint64_t size, void *user)");
 
 our $helper = "helper";
 our $encode = "${helper}_stub";
 our $allocbuf = "${helper}_allocbuf";
 our $transmit = "${helper}_transmitmsg";
 our $getreply = "${helper}_getreply";
+our $getreply_data = "${helper}_getreply_data";
 our $setcallbacks = "${helper}_setcallbacks";
 
 f_decl($allocbuf, 'helper', 'unsigned char *', '(int len, void *user)');
 f_decl($transmit, 'helper', 'void',
        '(unsigned char *msg_freed, int len, void *user)');
 f_decl($getreply, 'helper', 'int', '(void *user)');
+f_decl($getreply_data, 'helper', 'uint8_t *', '(void *user)');
 
 sub typeid ($) { my ($t) = @_; $t =~ s/\W/_/; return $t; };
 
@@ -259,12 +265,36 @@ foreach my $msginfo (@msgs) {
 
     $f_more_sr->("    case $msgnum: { /* $name */\n");
     if ($flags =~ m/W/) {
-        $f_more_sr->("        int r;\n");
+        if ($flags =~ m/B/) {
+            $f_more_sr->("        uint8_t *data;\n".
+                         "        uint64_t size;\n");
+        } else {
+            $f_more_sr->("        int r;\n");
+        }
     }
 
-    my $c_rtype_helper = $flags =~ m/[WA]/ ? 'int' : 'void';
-    my $c_rtype_callout = $flags =~ m/W/ ? 'int' : 'void';
+    my $c_rtype_helper;
+    if ($flags =~ m/[WA]/) {
+        if ($flags =~ m/B/) {
+            $c_rtype_helper = 'uint8_t *'
+        } else {
+            $c_rtype_helper = 'int'
+        }
+    } else {
+        $c_rtype_helper = 'void';
+    }
+    my $c_rtype_callout;
+    if ($flags =~ m/W/) {
+        if ($flags =~ m/B/) {
+            $c_rtype_callout = 'uint8_t *';
+        } else {
+            $c_rtype_callout = 'int';
+        }
+    } else {
+        $c_rtype_callout = 'void';
+    }
     my $c_decl = '(';
+    my $c_helper_decl = '';
     my $c_callback_args = '';
 
     f_more("${encode}_$name",
@@ -305,7 +335,15 @@ END_ALWAYS
         f_more("${encode}_$name", "	${typeid}_put(buf, &len, $c_args);\n");
     }
     $f_more_sr->($c_recv);
+    $c_helper_decl = $c_decl;
+    if ($flags =~ m/W/ and $flags =~ m/B/) {
+        $c_decl .= "uint64_t *size, "
+    }
     $c_decl .= "void *user)";
+    $c_helper_decl .= "void *user)";
+    if ($flags =~ m/W/ and $flags =~ m/B/) {
+        $c_callback_args .= "&size, "
+    }
     $c_callback_args .= "user";
 
     $f_more_sr->("        if (msg != endmsg) return 0;\n");
@@ -326,10 +364,12 @@ END_ALWAYS
     my $c_make_callback = "$c_callback($c_callback_args)";
     if ($flags !~ m/W/) {
 	$f_more_sr->("        $c_make_callback;\n");
+    } elsif ($flags =~ m/B/) {
+        $f_more_sr->("        data = $c_make_callback;\n".
+                     "        $sendreply_data(data, size, user);\n");
     } else {
         $f_more_sr->("        r = $c_make_callback;\n".
                      "        $sendreply(r, user);\n");
-	f_decl($sendreply, 'callout', 'void', '(int r, void *user)');
     }
     if ($flags =~ m/x/) {
         my $c_v = "(1u<<$msgnum)";
@@ -340,7 +380,7 @@ END_ALWAYS
     }
     $f_more_sr->("        return 1;\n    }\n\n");
     f_decl("${callback}_$name", 'callout', $c_rtype_callout, $c_decl);
-    f_decl("${encode}_$name", 'helper', $c_rtype_helper, $c_decl);
+    f_decl("${encode}_$name", 'helper', $c_rtype_helper, $c_helper_decl);
     f_more("${encode}_$name",
 "        if (buf) break;
         buf = ${helper}_allocbuf(len, user);
@@ -352,12 +392,23 @@ END_ALWAYS
     ${transmit}(buf, len, user);
 ");
     if ($flags =~ m/[WA]/) {
-	f_more("${encode}_$name",
-               (<<END_ALWAYS.($debug ? <<END_DEBUG : '').<<END_ALWAYS));
+        if ($flags =~ m/B/) {
+            f_more("${encode}_$name",
+                   (<<END_ALWAYS.($debug ? <<END_DEBUG : '')));
+    uint8_t *r = ${helper}_getreply_data(user);
+END_ALWAYS
+    fprintf(stderr,"libxl-save-helper: $name got reply data\\n");
+END_DEBUG
+        } else {
+            f_more("${encode}_$name",
+                   (<<END_ALWAYS.($debug ? <<END_DEBUG : '')));
     int r = ${helper}_getreply(user);
 END_ALWAYS
     fprintf(stderr,"libxl-save-helper: $name got reply %d\\n",r);
 END_DEBUG
+        }
+
+        f_more("${encode}_$name", (<<END_ALWAYS));
     return r;
 END_ALWAYS
     }
-- 
1.9.3

* [RFC Patch v2 17/45] Allow slave sends data to master
  2014-08-08  7:00 [RFC Patch v2 00/45] COarse-grain LOck-stepping Virtual Machines for Non-stop Service Wen Congyang
                   ` (15 preceding siblings ...)
  2014-08-08  7:01 ` [RFC Patch v2 16/45] Update libxl_save_msgs_gen.pl to support return data from xl to xc Wen Congyang
@ 2014-08-08  7:01 ` Wen Congyang
  2014-08-08  7:01 ` [RFC Patch v2 18/45] secondary vm suspend/resume/checkpoint code Wen Congyang
                   ` (29 subsequent siblings)
  46 siblings, 0 replies; 64+ messages in thread
From: Wen Congyang @ 2014-08-08  7:01 UTC (permalink / raw)
  To: xen devel
  Cc: Ian Campbell, Wen Congyang, Ian Jackson, Jiang Yunhong,
	Dong Eddie, Yang Hongyang, Lai Jiangshan

In colo mode, the slave needs to send data to the master, but io_fd
can only be written by the master and only be read by the slave.
Save recv_fd in domain_suspend_state and send_fd in
domain_create_state so that data can flow in the reverse direction.
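
On the restore side, the reverse-direction descriptor is only honoured
for a COLO stream; every other caller ends up with -1. A minimal sketch
of that selection, assuming simplified stand-in types (enum stream_kind,
struct restore_params and effective_send_fd() are hypothetical names,
not the libxl definitions):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the NONE/REMUS/COLO stream kinds. */
enum stream_kind {
    STREAM_NONE  = 0,
    STREAM_REMUS = 1,
    STREAM_COLO  = 2,
};

/* Simplified stand-in for the restore parameters. */
struct restore_params {
    enum stream_kind checkpointed_stream;
    int send_fd;                /* -1 means none */
};

/*
 * Return the fd the restoring side may write to, or -1 if the stream
 * kind does not use a reverse channel.
 */
static int effective_send_fd(const struct restore_params *p)
{
    assert(p != NULL);

    if (p->checkpointed_stream == STREAM_COLO)
        return p->send_fd;

    return -1;
}
```

Defaulting to -1 keeps plain restore and Remus callers unaffected even
if they leave the field set to some other value.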

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
 tools/libxl/libxl.c          |  2 +-
 tools/libxl/libxl.h          |  3 ++-
 tools/libxl/libxl_create.c   | 14 ++++++++++----
 tools/libxl/libxl_internal.h |  2 ++
 tools/libxl/libxl_types.idl  |  7 +++++++
 tools/libxl/xl_cmdimpl.c     |  7 +++++++
 6 files changed, 29 insertions(+), 6 deletions(-)

diff --git a/tools/libxl/libxl.c b/tools/libxl/libxl.c
index 9a8fd16..f35029a 100644
--- a/tools/libxl/libxl.c
+++ b/tools/libxl/libxl.c
@@ -811,7 +811,7 @@ int libxl_domain_remus_start(libxl_ctx *ctx, libxl_domain_remus_info *info,
     dss->callback = remus_failover_cb;
     dss->domid = domid;
     dss->fd = send_fd;
-    /* TODO do something with recv_fd */
+    dss->recv_fd = recv_fd;
     dss->type = type;
     dss->live = 1;
     dss->debug = 0;
diff --git a/tools/libxl/libxl.h b/tools/libxl/libxl.h
index 0495db7..c9f6ec0 100644
--- a/tools/libxl/libxl.h
+++ b/tools/libxl/libxl.h
@@ -812,7 +812,8 @@ int static inline libxl_domain_create_restore_0x040200(
     LIBXL_EXTERNAL_CALLERS_ONLY
 {
     libxl_domain_restore_params params;
-    params.checkpointed_stream = 0;
+    params.checkpointed_stream = LIBXL_CHECKPOINTED_STREAM_NONE;
+    params.send_fd = -1;
 
     return libxl_domain_create_restore(
         ctx, d_config, domid, restore_fd, &params, ao_how, aop_console_how);
diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c
index 9b66294..e29a107 100644
--- a/tools/libxl/libxl_create.c
+++ b/tools/libxl/libxl_create.c
@@ -1398,8 +1398,8 @@ static void domain_create_cb(libxl__egc *egc,
                              int rc, uint32_t domid);
 
 static int do_domain_create(libxl_ctx *ctx, libxl_domain_config *d_config,
-                            uint32_t *domid,
-                            int restore_fd, int checkpointed_stream,
+                            uint32_t *domid, int restore_fd,
+                            int send_fd, int checkpointed_stream,
                             const libxl_asyncop_how *ao_how,
                             const libxl_asyncprogress_how *aop_console_how)
 {
@@ -1410,6 +1410,7 @@ static int do_domain_create(libxl_ctx *ctx, libxl_domain_config *d_config,
     cdcs->dcs.ao = ao;
     cdcs->dcs.guest_config = d_config;
     cdcs->dcs.restore_fd = restore_fd;
+    cdcs->dcs.send_fd = send_fd;
     cdcs->dcs.callback = domain_create_cb;
     cdcs->dcs.checkpointed_stream = checkpointed_stream;
     libxl__ao_progress_gethow(&cdcs->dcs.aop_console_how, aop_console_how);
@@ -1438,7 +1439,7 @@ int libxl_domain_create_new(libxl_ctx *ctx, libxl_domain_config *d_config,
                             const libxl_asyncop_how *ao_how,
                             const libxl_asyncprogress_how *aop_console_how)
 {
-    return do_domain_create(ctx, d_config, domid, -1, 0,
+    return do_domain_create(ctx, d_config, domid, -1, -1, 0,
                             ao_how, aop_console_how);
 }
 
@@ -1448,7 +1449,12 @@ int libxl_domain_create_restore(libxl_ctx *ctx, libxl_domain_config *d_config,
                                 const libxl_asyncop_how *ao_how,
                                 const libxl_asyncprogress_how *aop_console_how)
 {
-    return do_domain_create(ctx, d_config, domid, restore_fd,
+    int send_fd = -1;
+
+    if (params->checkpointed_stream == LIBXL_CHECKPOINTED_STREAM_COLO)
+        send_fd = params->send_fd;
+
+    return do_domain_create(ctx, d_config, domid, restore_fd, send_fd,
                             params->checkpointed_stream, ao_how, aop_console_how);
 }
 
diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
index e3a8947..b04e4b9 100644
--- a/tools/libxl/libxl_internal.h
+++ b/tools/libxl/libxl_internal.h
@@ -2772,6 +2772,7 @@ struct libxl__domain_suspend_state {
 
     uint32_t domid;
     int fd;
+    int recv_fd;
     libxl_domain_type type;
     int live;
     int debug;
@@ -3036,6 +3037,7 @@ struct libxl__domain_create_state {
     libxl__ao *ao;
     libxl_domain_config *guest_config;
     int restore_fd;
+    int send_fd;
     libxl__domain_create_cb *callback;
     libxl_asyncprogress_how aop_console_how;
     /* private to domain_create */
diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl
index 1945155..ea51d1a 100644
--- a/tools/libxl/libxl_types.idl
+++ b/tools/libxl/libxl_types.idl
@@ -177,6 +177,12 @@ libxl_vendor_device = Enumeration("vendor_device", [
     (0, "NONE"),
     (1, "XENSERVER"),
     ])
+
+libxl_checkpointed_stream = Enumeration("checkpointed_stream", [
+    (0, "NONE"),
+    (1, "REMUS"),
+    (2, "COLO"),
+    ], init_val = 0)
 #
 # Complex libxl types
 #
@@ -303,6 +309,7 @@ libxl_domain_create_info = Struct("domain_create_info",[
 
 libxl_domain_restore_params = Struct("domain_restore_params", [
     ("checkpointed_stream", integer),
+    ("send_fd", integer),
     ])
 
 libxl_domain_sched_params = Struct("domain_sched_params",[
diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
index b859f31..ccb46ab 100644
--- a/tools/libxl/xl_cmdimpl.c
+++ b/tools/libxl/xl_cmdimpl.c
@@ -151,6 +151,7 @@ struct domain_create {
     const char *extra_config; /* extra config string */
     const char *restore_file;
     int migrate_fd; /* -1 means none */
+    int send_fd; /* -1 means none */
     char **migration_domname_r; /* from malloc */
 };
 
@@ -2033,6 +2034,7 @@ static uint32_t create_domain(struct domain_create *dom_info)
     void *config_data = 0;
     int config_len = 0;
     int restore_fd = -1;
+    int send_fd = -1;
     const libxl_asyncprogress_how *autoconnect_console_how;
     struct save_file_header hdr;
 
@@ -2049,6 +2051,7 @@ static uint32_t create_domain(struct domain_create *dom_info)
         if (migrate_fd >= 0) {
             restore_source = "<incoming migration stream>";
             restore_fd = migrate_fd;
+            send_fd = dom_info->send_fd;
         } else {
             restore_source = restore_file;
             restore_fd = open(restore_file, O_RDONLY);
@@ -2210,6 +2213,7 @@ start:
     if ( restoring ) {
         libxl_domain_restore_params params;
         params.checkpointed_stream = dom_info->checkpointed_stream;
+        params.send_fd = send_fd;
         ret = libxl_domain_create_restore(ctx, &d_config,
                                           &domid, restore_fd,
                                           &params,
@@ -3747,6 +3751,7 @@ static void migrate_receive(int debug, int daemonize, int monitor,
     dom_info.monitor = monitor;
     dom_info.paused = 1;
     dom_info.migrate_fd = recv_fd;
+    dom_info.send_fd = send_fd;
     dom_info.migration_domname_r = &migration_domname;
     dom_info.checkpointed_stream = remus;
 
@@ -3917,6 +3922,7 @@ int main_restore(int argc, char **argv)
     dom_info.config_file = config_file;
     dom_info.restore_file = checkpoint_file;
     dom_info.migrate_fd = -1;
+    dom_info.send_fd = -1;
     dom_info.vnc = vnc;
     dom_info.vncautopass = vncautopass;
     dom_info.console_autoconnect = console_autoconnect;
@@ -4356,6 +4362,7 @@ int main_create(int argc, char **argv)
     dom_info.config_file = filename;
     dom_info.extra_config = extra_config;
     dom_info.migrate_fd = -1;
+    dom_info.send_fd = -1;
     dom_info.vnc = vnc;
     dom_info.vncautopass = vncautopass;
     dom_info.console_autoconnect = console_autoconnect;
-- 
1.9.3

* [RFC Patch v2 18/45] secondary vm suspend/resume/checkpoint code
  2014-08-08  7:00 [RFC Patch v2 00/45] COarse-grain LOck-stepping Virtual Machines for Non-stop Service Wen Congyang
                   ` (16 preceding siblings ...)
  2014-08-08  7:01 ` [RFC Patch v2 17/45] Allow slave sends data to master Wen Congyang
@ 2014-08-08  7:01 ` Wen Congyang
  2014-08-08  7:01 ` [RFC Patch v2 19/45] primary vm suspend/get_dirty_pfn/resume/checkpoint code Wen Congyang
                   ` (28 subsequent siblings)
  46 siblings, 0 replies; 64+ messages in thread
From: Wen Congyang @ 2014-08-08  7:01 UTC (permalink / raw)
  To: xen devel
  Cc: Ian Campbell, Wen Congyang, Ian Jackson, Jiang Yunhong,
	Dong Eddie, Yang Hongyang, Lai Jiangshan

The secondary vm runs in colo mode, so we repeat the following
steps for every checkpoint:
1. Resume the secondary vm
   a. Send LIBXL_COLO_SVM_READY to the master
   b. If this is the first resume, call libxl__xc_domain_restore_done()
      to build the secondary vm, and enable the secondary vm's logdirty.
      Otherwise, call libxl__domain_resume() to resume the secondary vm.
   c. Send LIBXL_COLO_SVM_RESUMED to the master
2. Wait for a new checkpoint
   a. Read LIBXL_COLO_NEW_CHECKPOINT from the master
3. Suspend the secondary vm
   a. Suspend the secondary vm
   b. Get the secondary vm's dirty page information
   c. Send LIBXL_COLO_SVM_SUSPENDED to the master
   d. Send the secondary vm's dirty page information to the master
      (count + pfn list)
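
The message order of one round can be sketched as follows. This is an
illustration only: the enum mirrors the values of the LIBXL_COLO_*
control codes in libxl_colo.h, while struct svm_log, send_to_master()
and svm_round() are hypothetical helpers that record messages in memory
instead of writing them to the master, and the actual resume/suspend
work is left as comments.

```c
#include <assert.h>
#include <stddef.h>

/* Same numbering as the control values in libxl_colo.h. */
enum {
    COLO_NEW_CHECKPOINT = 1,
    COLO_SVM_SUSPENDED,
    COLO_SVM_READY,
    COLO_SVM_RESUMED,
};

/* Hypothetical in-memory record of what the secondary sent. */
struct svm_log {
    int sent[8];
    size_t n;
};

static void send_to_master(struct svm_log *log, int msg)
{
    assert(log->n < sizeof(log->sent) / sizeof(log->sent[0]));
    log->sent[log->n++] = msg;
}

/*
 * Run one round on the secondary side. from_master is the control value
 * read while waiting for a checkpoint. Returns 0 on success, -1 if the
 * master sent something other than COLO_NEW_CHECKPOINT.
 */
static int svm_round(struct svm_log *log, int from_master)
{
    /* 1. Resume the secondary vm. */
    send_to_master(log, COLO_SVM_READY);
    /* ... build the vm on the first round, resume it afterwards ... */
    send_to_master(log, COLO_SVM_RESUMED);

    /* 2. Wait for a new checkpoint. */
    if (from_master != COLO_NEW_CHECKPOINT)
        return -1;

    /* 3. Suspend the secondary vm and report its dirty pages. */
    /* ... suspend, then collect the dirty page information ... */
    send_to_master(log, COLO_SVM_SUSPENDED);
    /* ... the count and pfn list would follow here ... */

    return 0;
}
```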

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
 tools/libxc/xenguest.h             |  20 +
 tools/libxl/Makefile               |   1 +
 tools/libxl/libxl_colo.h           |  38 ++
 tools/libxl/libxl_colo_restore.c   | 883 +++++++++++++++++++++++++++++++++++++
 tools/libxl/libxl_create.c         | 116 ++++-
 tools/libxl/libxl_dom.c            |   2 +-
 tools/libxl/libxl_internal.h       |  22 +
 tools/libxl/libxl_save_callout.c   |   6 +-
 tools/libxl/libxl_save_msgs_gen.pl |   6 +-
 9 files changed, 1087 insertions(+), 7 deletions(-)
 create mode 100644 tools/libxl/libxl_colo.h
 create mode 100644 tools/libxl/libxl_colo_restore.c

diff --git a/tools/libxc/xenguest.h b/tools/libxc/xenguest.h
index 40bbac8..d3061c7 100644
--- a/tools/libxc/xenguest.h
+++ b/tools/libxc/xenguest.h
@@ -91,6 +91,26 @@ int xc_domain_save(xc_interface *xch, int io_fd, uint32_t dom, uint32_t max_iter
 
 /* callbacks provided by xc_domain_restore */
 struct restore_callbacks {
+    /* Called after a new checkpoint to suspend the guest.
+     */
+    int (*suspend)(void* data);
+
+    /* Called after the secondary vm is ready to resume.
+     * Callback function resumes the guest & the device model,
+     *  returns to xc_domain_restore.
+     */
+    int (*postcopy)(void* data);
+
+    /* callback to wait for a new checkpoint
+     *
+     * returns:
+     * 0: terminate checkpointing gracefully
+     * 1: take another checkpoint */
+    int (*checkpoint)(void* data);
+
+    /* Enable qemu-dm logging dirty pages to xen */
+    int (*switch_qemu_logdirty)(int domid, unsigned enable, void *data); /* HVM only */
+
     /* callback to restore toolstack specific data */
     int (*toolstack_restore)(uint32_t domid, const uint8_t *buf,
             uint32_t size, void* data);
diff --git a/tools/libxl/Makefile b/tools/libxl/Makefile
index 5427461..c026bdd 100644
--- a/tools/libxl/Makefile
+++ b/tools/libxl/Makefile
@@ -57,6 +57,7 @@ LIBXL_OBJS-y += libxl_nonetbuffer.o
 endif
 
 LIBXL_OBJS-y += libxl_remus.o libxl_checkpoint_device.o libxl_remus_disk_drbd.o
+LIBXL_OBJS-y += libxl_colo_restore.o
 
 LIBXL_OBJS-$(CONFIG_X86) += libxl_cpuid.o libxl_x86.o
 LIBXL_OBJS-$(CONFIG_ARM) += libxl_nocpuid.o libxl_arm.o
diff --git a/tools/libxl/libxl_colo.h b/tools/libxl/libxl_colo.h
new file mode 100644
index 0000000..91df275
--- /dev/null
+++ b/tools/libxl/libxl_colo.h
@@ -0,0 +1,38 @@
+/*
+ * Copyright (C) 2014 FUJITSU LIMITED
+ * Author: Wen Congyang <wency@cn.fujitsu.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU Lesser General Public License as published
+ * by the Free Software Foundation; version 2.1 only. with the special
+ * exception on linking described in file LICENSE.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU Lesser General Public License for more details.
+ */
+
+#ifndef LIBXL_COLO_H
+#define LIBXL_COLO_H
+
+/*
+ * values to control suspend/resume primary vm and secondary vm
+ * at the same time
+ */
+enum {
+    LIBXL_COLO_NEW_CHECKPOINT = 1,
+    LIBXL_COLO_SVM_SUSPENDED,
+    LIBXL_COLO_SVM_READY,
+    LIBXL_COLO_SVM_RESUMED,
+};
+
+extern void libxl__colo_restore_done(libxl__egc *egc, void *dcs_void,
+                                     int ret, int retval, int errnoval);
+extern void libxl__colo_restore_setup(libxl__egc *egc,
+                                      libxl__colo_restore_state *crs);
+extern void libxl__colo_restore_teardown(libxl__egc *egc,
+                                         libxl__colo_restore_state *crs,
+                                         int rc);
+
+#endif
diff --git a/tools/libxl/libxl_colo_restore.c b/tools/libxl/libxl_colo_restore.c
new file mode 100644
index 0000000..ebbd6b9
--- /dev/null
+++ b/tools/libxl/libxl_colo_restore.c
@@ -0,0 +1,883 @@
+/*
+ * Copyright (C) 2014 FUJITSU LIMITED
+ * Author: Wen Congyang <wency@cn.fujitsu.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU Lesser General Public License as published
+ * by the Free Software Foundation; version 2.1 only. with the special
+ * exception on linking described in file LICENSE.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU Lesser General Public License for more details.
+ */
+
+#include "libxl_osdeps.h" /* must come before any other headers */
+
+#include "libxl_internal.h"
+#include "libxl_colo.h"
+#include "xg_private.h"
+#include "xc_bitops.h"
+
+enum {
+    LIBXL_COLO_SETUPED,
+    LIBXL_COLO_SUSPENDED,
+    LIBXL_COLO_RESUMED,
+};
+
+typedef struct libxl__colo_restore_checkpoint_state libxl__colo_restore_checkpoint_state;
+struct libxl__colo_restore_checkpoint_state {
+    xc_hypercall_buffer_t _dirty_bitmap;
+    xc_hypercall_buffer_t *dirty_bitmap;
+    unsigned long p2m_size;
+    libxl__domain_suspend_state2 dss2;
+    /* for sending data to master */
+    libxl__datacopier_state dc;
+    /* for reading data from master */
+    libxl__datareader_state drs;
+    uint8_t section;
+    libxl__logdirty_switch lds;
+    libxl__colo_restore_state *crs;
+    int status;
+
+    void (*callback)(libxl__egc *,
+                     libxl__colo_restore_checkpoint_state *,
+                     int);
+
+    /*
+     * 0: secondary vm's dirty bitmap for domain @domid
+     * 1: secondary vm is ready(domain @domid)
+     * 2: secondary vm is resumed(domain @domid)
+     */
+    const char *copywhat[3];
+};
+
+
+static void libxl__colo_restore_domain_resume_callback(void *data);
+static void libxl__colo_restore_domain_checkpoint_callback(void *data);
+static void libxl__colo_restore_domain_suspend_callback(void *data);
+
+/* ===================== colo: common functions ===================== */
+static void colo_enable_logdirty(libxl__colo_restore_state *crs, libxl__egc *egc)
+{
+    libxl__domain_create_state *dcs = CONTAINER_OF(crs, *dcs, crs);
+    libxl__colo_restore_checkpoint_state *crcs = crs->crcs;
+
+    /* Convenience aliases */
+    const uint32_t domid = crs->domid;
+    libxl__logdirty_switch *const lds = &crcs->lds;
+
+    STATE_AO_GC(crs->ao);
+
+    /* we need to know which pages are dirty to restore the guest */
+    if (xc_shadow_control(CTX->xch, domid,
+                          XEN_DOMCTL_SHADOW_OP_ENABLE_LOGDIRTY,
+                          NULL, 0, NULL, 0, NULL) < 0) {
+        LOG(ERROR, "cannot enable secondary vm's logdirty");
+        lds->callback(egc, lds, ERROR_FAIL);
+        return;
+    }
+
+    if (crs->hvm) {
+        libxl__domain_common_switch_qemu_logdirty(domid, 1, lds, egc);
+        return;
+    }
+
+    lds->callback(egc, lds, 0);
+}
+
+static void colo_disable_logdirty(libxl__colo_restore_state *crs,
+                                  libxl__egc *egc)
+{
+    libxl__domain_create_state *dcs = CONTAINER_OF(crs, *dcs, crs);
+    libxl__colo_restore_checkpoint_state *crcs = crs->crcs;
+
+    /* Convenience aliases */
+    const uint32_t domid = crs->domid;
+    libxl__logdirty_switch *const lds = &crcs->lds;
+
+    STATE_AO_GC(crs->ao);
+
+    /* dirty page tracking is no longer needed, so turn logdirty off */
+    if (xc_shadow_control(CTX->xch, domid, XEN_DOMCTL_SHADOW_OP_OFF,
+                          NULL, 0, NULL, 0, NULL) < 0)
+        LOG(WARN, "cannot disable secondary vm's logdirty");
+
+    if (crs->hvm) {
+        libxl__domain_common_switch_qemu_logdirty(domid, 0, lds, egc);
+        return;
+    }
+
+    lds->callback(egc, lds, 0);
+}
+
+static void colo_resume_vm(libxl__egc *egc,
+                          libxl__colo_restore_checkpoint_state *crcs)
+{
+    libxl__domain_create_state *dcs = CONTAINER_OF(crcs->crs, *dcs, crs);
+    int rc;
+
+    /* Convenience aliases */
+    libxl__colo_restore_state *const crs = crcs->crs;
+
+    STATE_AO_GC(crs->ao);
+
+    if (!crs->saved_cb) {
+        /* TODO: sync mmu for hvm? */
+        rc = libxl__domain_resume(gc, crs->domid, 0, 1);
+        if (rc)
+            LOG(ERROR, "cannot resume secondary vm");
+
+        crcs->callback(egc, crcs, rc);
+        return;
+    }
+
+    /*
+     * TODO: get store mfn and console mfn
+     *  We should call the callback restore_results in
+     *  xc_domain_restore() before resuming the guest.
+     */
+    libxl__xc_domain_restore_done(egc, dcs, 0, 0, 0);
+
+    return;
+}
+
+
+/* ================ colo: setup restore environment ================ */
+static void libxl__colo_domain_create_cb(libxl__egc *egc,
+                                         libxl__domain_create_state *dcs,
+                                         int rc, uint32_t domid);
+
+static int init_dss2(libxl__domain_suspend_state2 *dss2)
+{
+    int rc = ERROR_FAIL;
+    libxl_domain_type type;
+
+    STATE_AO_GC(dss2->ao);
+
+    type = libxl__domain_type(gc, dss2->domid);
+    if (type == LIBXL_DOMAIN_TYPE_INVALID)
+        goto out;
+
+    libxl__xswait_init(&dss2->pvcontrol);
+    libxl__ev_evtchn_init(&dss2->guest_evtchn);
+    libxl__ev_xswatch_init(&dss2->guest_watch);
+    libxl__ev_time_init(&dss2->guest_timeout);
+
+    if (type == LIBXL_DOMAIN_TYPE_HVM)
+        dss2->hvm = 1;
+    else
+        dss2->hvm = 0;
+
+    dss2->guest_evtchn.port = -1;
+    dss2->guest_evtchn_lockfd = -1;
+    dss2->guest_responded = 0;
+    dss2->dm_savefile = libxl__device_model_savefile(gc, dss2->domid);
+    dss2->save_dm = 0;
+
+    /* Secondary vm is not created, so we cannot get evtchn port */
+
+    rc = 0;
+
+out:
+    return rc;
+}
+
+void libxl__colo_restore_setup(libxl__egc *egc,
+                               libxl__colo_restore_state *crs)
+{
+    libxl__domain_create_state *dcs = CONTAINER_OF(crs, *dcs, crs);
+    libxl__colo_restore_checkpoint_state *crcs;
+    DECLARE_HYPERCALL_BUFFER(unsigned long, dirty_bitmap);
+    int rc = ERROR_FAIL;
+    int bsize;
+
+    /* Convenience aliases */
+    libxl__srm_restore_autogen_callbacks *const callbacks =
+        &dcs->shs.callbacks.restore.a;
+    const int domid = crs->domid;
+
+    STATE_AO_GC(crs->ao);
+
+    GCNEW(crcs);
+    crs->crcs = crcs;
+    crcs->crs = crs;
+
+    crcs->p2m_size = xc_domain_maximum_gpfn(CTX->xch, domid) + 1;
+
+    crcs->copywhat[0] = GCSPRINTF("secondary vm's dirty bitmap for domain %"PRIu32,
+                                  domid);
+    crcs->copywhat[1] = GCSPRINTF("secondary vm is ready(domain %"PRIu32")",
+                                  domid);
+    crcs->copywhat[2] = GCSPRINTF("secondary vm is resumed(domain %"PRIu32")",
+                                  domid);
+
+    bsize = bitmap_size(crcs->p2m_size);
+    dirty_bitmap = xc_hypercall_buffer_alloc_pages(CTX->xch, dirty_bitmap,
+                                                   NRPAGES(bsize));
+    if (!dirty_bitmap) {
+        rc = ERROR_NOMEM;
+        goto err;
+    }
+    memset(dirty_bitmap, 0, bsize);
+    crcs->_dirty_bitmap = *HYPERCALL_BUFFER(dirty_bitmap);
+    crcs->dirty_bitmap = &crcs->_dirty_bitmap;
+
+    /* setup dss2 */
+    crcs->dss2.ao = ao;
+    crcs->dss2.domid = domid;
+    if (init_dss2(&crcs->dss2))
+        goto err_init_dss2;
+
+    callbacks->suspend = libxl__colo_restore_domain_suspend_callback;
+    callbacks->postcopy = libxl__colo_restore_domain_resume_callback;
+    callbacks->checkpoint = libxl__colo_restore_domain_checkpoint_callback;
+
+    /*
+     * Secondary vm is running in colo mode, so we need to call
+     * libxl__xc_domain_restore_done() to create secondary vm.
+     * But we will exit in domain_create_cb(). So replace the
+     * callback here.
+     */
+    crs->saved_cb = dcs->callback;
+    dcs->callback = libxl__colo_domain_create_cb;
+    crcs->status = LIBXL_COLO_SETUPED;
+
+    logdirty_init(&crcs->lds);
+    crcs->lds.ao = ao;
+
+    rc = 0;
+
+out:
+    crs->callback(egc, crs, rc);
+    return;
+
+err_init_dss2:
+    xc_hypercall_buffer_free_pages(CTX->xch, dirty_bitmap, NRPAGES(bsize));
+    crcs->dirty_bitmap = NULL;
+err:
+    goto out;
+}
+
+static void libxl__colo_domain_create_cb(libxl__egc *egc,
+                                         libxl__domain_create_state *dcs,
+                                         int rc, uint32_t domid)
+{
+    libxl__colo_restore_checkpoint_state *crcs = dcs->crs.crcs;
+
+    crcs->callback(egc, crcs, rc);
+}
+
+
+/* ================ colo: teardown restore environment ================ */
+static void do_failover_done(libxl__egc *egc,
+                             libxl__colo_restore_checkpoint_state* crcs,
+                             int rc);
+static void colo_disable_logdirty_done(libxl__egc *egc,
+                                       libxl__logdirty_switch *lds,
+                                       int rc);
+
+static void do_failover(libxl__egc *egc, libxl__colo_restore_state *crs)
+{
+    libxl__colo_restore_checkpoint_state *crcs = crs->crcs;
+
+    /* Convenience aliases */
+    const int status = crcs->status;
+    libxl__logdirty_switch *const lds = &crcs->lds;
+
+    STATE_AO_GC(crs->ao);
+
+    switch(status) {
+    case LIBXL_COLO_SETUPED:
+        /* We don't enable logdirty now */
+        colo_resume_vm(egc, crcs);
+        return;
+    case LIBXL_COLO_SUSPENDED:
+    case LIBXL_COLO_RESUMED:
+        /* disable logdirty first */
+        lds->callback = colo_disable_logdirty_done;
+        colo_disable_logdirty(crs, egc);
+        return;
+    default:
+        LOG(ERROR, "invalid status: %d", status);
+        crcs->callback(egc, crcs, ERROR_FAIL);
+    }
+}
+
+void libxl__colo_restore_teardown(libxl__egc *egc,
+                                  libxl__colo_restore_state *crs,
+                                  int rc)
+{
+    libxl__colo_restore_checkpoint_state *crcs = crs->crcs;
+    DECLARE_HYPERCALL_BUFFER_SHADOW(unsigned long, dirty_bitmap, crcs->dirty_bitmap);
+    int bsize = bitmap_size(crcs->p2m_size);
+    libxl__domain_create_state *dcs = CONTAINER_OF(crs, *dcs, crs);
+
+    EGC_GC;
+
+    if (!dirty_bitmap)
+        goto do_failover;
+
+    xc_hypercall_buffer_free_pages(CTX->xch, dirty_bitmap, NRPAGES(bsize));
+
+do_failover:
+    if (!rc) {
+        crcs->callback = do_failover_done;
+        do_failover(egc, crs);
+        return;
+    }
+
+    if (crs->saved_cb) {
+        dcs->callback = crs->saved_cb;
+        crs->saved_cb = NULL;
+    }
+    crs->callback(egc, crs, rc);
+}
+
+static void do_failover_done(libxl__egc *egc,
+                             libxl__colo_restore_checkpoint_state* crcs,
+                             int rc)
+{
+    libxl__domain_create_state *dcs = CONTAINER_OF(crcs->crs, *dcs, crs);
+
+    /* Convenience aliases */
+    libxl__colo_restore_state *const crs = crcs->crs;
+
+    STATE_AO_GC(crs->ao);
+
+    if (rc)
+        LOG(ERROR, "cannot do failover");
+
+    if (crs->saved_cb) {
+        dcs->callback = crs->saved_cb;
+        crs->saved_cb = NULL;
+    }
+
+    crs->callback(egc, crs, rc);
+}
+
+static void colo_disable_logdirty_done(libxl__egc *egc,
+                                       libxl__logdirty_switch *lds,
+                                       int rc)
+{
+    libxl__colo_restore_checkpoint_state *crcs = CONTAINER_OF(lds, *crcs, lds);
+
+    STATE_AO_GC(lds->ao);
+
+    if (rc)
+        LOG(WARN, "cannot disable logdirty");
+
+    if (crcs->status == LIBXL_COLO_SUSPENDED) {
+        colo_resume_vm(egc, crcs);
+        return;
+    }
+
+    /* If we cannot disable logdirty, we still can do failover */
+    crcs->callback(egc, crcs, 0);
+}
+
+/*
+ * checkpoint callbacks are called in the following order:
+ * 1. resume
+ * 2. checkpoint
+ * 3. suspend
+ */
+static void colo_common_send_data_done(libxl__egc *egc,
+                                       libxl__datacopier_state *dc,
+                                       int onwrite, int errnoval);
+/* ===================== colo: resume secondary vm ===================== */
+/*
+ * Do the following things when resuming secondary vm:
+ *  1. write LIBXL_COLO_SVM_READY
+ *  2. resume secondary vm
+ *  3. write LIBXL_COLO_SVM_RESUMED
+ */
+static void colo_send_svm_ready_done(libxl__egc *egc,
+                                     libxl__colo_restore_checkpoint_state *crcs,
+                                     int rc);
+static void colo_resume_vm_done(libxl__egc *egc,
+                                libxl__colo_restore_checkpoint_state *crcs,
+                                int rc);
+static void colo_write_svm_resumed(libxl__egc *egc,
+                                   libxl__colo_restore_checkpoint_state *crcs);
+static void colo_enable_logdirty_done(libxl__egc *egc,
+                                      libxl__logdirty_switch *lds,
+                                      int retval);
+static void colo_reenable_logdirty(libxl__egc *egc,
+                                   libxl__logdirty_switch *lds,
+                                   int rc);
+static void colo_reenable_logdirty_done(libxl__egc *egc,
+                                        libxl__logdirty_switch *lds,
+                                        int rc);
+
+static void libxl__colo_restore_domain_resume_callback(void *data)
+{
+    libxl__save_helper_state *shs = data;
+    libxl__domain_create_state *dcs = CONTAINER_OF(shs, *dcs, shs);
+    libxl__colo_restore_checkpoint_state *crcs = dcs->crs.crcs;
+    uint8_t section = LIBXL_COLO_SVM_READY;
+    int rc;
+
+    /* Convenience aliases */
+    libxl__colo_restore_state *const crs = &dcs->crs;
+    const int send_fd = crs->send_fd;
+    libxl__datacopier_state *const dc = &crcs->dc;
+
+    STATE_AO_GC(crs->ao);
+
+    memset(dc, 0, sizeof(*dc));
+    dc->ao = ao;
+    dc->readfd = -1;
+    dc->writefd = send_fd;
+    dc->maxsz = INT_MAX;
+    dc->copywhat = crcs->copywhat[1];
+    dc->writewhat = "colo stream";
+    dc->callback = colo_common_send_data_done;
+    crcs->callback = colo_send_svm_ready_done;
+
+    rc = libxl__datacopier_start(dc);
+    if (rc) {
+        LOG(ERROR, "libxl__datacopier_start() fails");
+        goto out;
+    }
+
+    /* tell master that secondary vm is ready */
+    libxl__datacopier_prefixdata(shs->egc, dc, &section, sizeof(section));
+
+    return;
+
+out:
+    libxl__xc_domain_saverestore_async_callback_done(shs->egc, shs, 0);
+}
+
+static void colo_send_svm_ready_done(libxl__egc *egc,
+                                     libxl__colo_restore_checkpoint_state *crcs,
+                                     int rc)
+{
+    crcs->callback = colo_resume_vm_done;
+    colo_resume_vm(egc, crcs);
+
+    return;
+}
+
+static void colo_resume_vm_done(libxl__egc *egc,
+                                libxl__colo_restore_checkpoint_state *crcs,
+                                int rc)
+{
+    libxl__domain_create_state *dcs = CONTAINER_OF(crcs->crs, *dcs, crs);
+
+    /* Convenience aliases */
+    libxl__colo_restore_state *const crs = crcs->crs;
+    libxl__logdirty_switch *const lds = &crcs->lds;
+    libxl__save_helper_state *const shs = &dcs->shs;
+
+    STATE_AO_GC(crs->ao);
+
+    if (rc) {
+        LOG(ERROR, "cannot resume secondary vm");
+        goto out;
+    }
+
+    crcs->status = LIBXL_COLO_RESUMED;
+
+    /* avoid calling libxl__xc_domain_restore_done() more than once */
+    if (crs->saved_cb) {
+        dcs->callback = crs->saved_cb;
+        crs->saved_cb = NULL;
+
+        lds->callback = colo_enable_logdirty_done;
+        colo_enable_logdirty(crs, egc);
+        return;
+    }
+
+    colo_write_svm_resumed(egc, crcs);
+    return;
+
+out:
+    libxl__xc_domain_saverestore_async_callback_done(shs->egc, shs, 0);
+}
+
+static void colo_write_svm_resumed(libxl__egc *egc,
+                                   libxl__colo_restore_checkpoint_state *crcs)
+{
+    libxl__domain_create_state *dcs = CONTAINER_OF(crcs->crs, *dcs, crs);
+    uint8_t section = LIBXL_COLO_SVM_RESUMED;
+    int rc;
+
+    /* Convenience aliases */
+    libxl__colo_restore_state *const crs = crcs->crs;
+    const int send_fd = crs->send_fd;
+    libxl__datacopier_state *const dc = &crcs->dc;
+    libxl__save_helper_state *const shs = &dcs->shs;
+
+    STATE_AO_GC(crs->ao);
+
+    memset(dc, 0, sizeof(*dc));
+    dc->ao = ao;
+    dc->readfd = -1;
+    dc->writefd = send_fd;
+    dc->maxsz = INT_MAX;
+    dc->copywhat = crcs->copywhat[2];
+    dc->writewhat = "colo stream";
+    dc->callback = colo_common_send_data_done;
+    /* TODO: configure network */
+    crcs->callback = NULL;
+
+    rc = libxl__datacopier_start(dc);
+    if (rc) {
+        LOG(ERROR, "libxl__datacopier_start() fails");
+        goto out;
+    }
+
+    /* tell master that secondary vm is resumed */
+    libxl__datacopier_prefixdata(egc, dc, &section, sizeof(section));
+
+    return;
+
+out:
+    libxl__xc_domain_saverestore_async_callback_done(shs->egc, shs, 0);
+}
+
+static void colo_enable_logdirty_done(libxl__egc *egc,
+                                      libxl__logdirty_switch *lds,
+                                      int rc)
+{
+    libxl__colo_restore_checkpoint_state *crcs = CONTAINER_OF(lds, *crcs, lds);
+    libxl__domain_create_state *dcs = CONTAINER_OF(crcs->crs, *dcs, crs);
+
+    /* Convenience aliases */
+    libxl__colo_restore_state *const crs = crcs->crs;
+    libxl__save_helper_state *const shs = &dcs->shs;
+    const uint32_t domid = crs->domid;
+
+    STATE_AO_GC(crs->ao);
+
+    if (rc) {
+        /*
+         * log-dirty already enabled? There's no test op,
+         * so attempt to disable then reenable it
+         */
+        lds->callback = colo_reenable_logdirty;
+        colo_disable_logdirty(crs, egc);
+        return;
+    }
+
+    /* We have enabled secondary vm's logdirty, so we can unpause it now */
+    rc = libxl__domain_unpause(gc, domid);
+    if (rc) {
+        LOG(ERROR, "cannot unpause secondary vm");
+        goto out;
+    }
+
+    colo_write_svm_resumed(egc, crcs);
+
+    return;
+
+out:
+    libxl__xc_domain_saverestore_async_callback_done(shs->egc, shs, 0);
+}
+
+static void colo_reenable_logdirty(libxl__egc *egc,
+                                   libxl__logdirty_switch *lds,
+                                   int rc)
+{
+    libxl__colo_restore_checkpoint_state *crcs = CONTAINER_OF(lds, *crcs, lds);
+    libxl__domain_create_state *dcs = CONTAINER_OF(crcs->crs, *dcs, crs);
+
+    /* Convenience aliases */
+    libxl__colo_restore_state *const crs = crcs->crs;
+    libxl__save_helper_state *const shs = &dcs->shs;
+
+    STATE_AO_GC(crs->ao);
+
+    if (rc) {
+        LOG(ERROR, "cannot disable logdirty");
+        goto out;
+    }
+
+    lds->callback = colo_reenable_logdirty_done;
+    colo_enable_logdirty(crs, egc);
+
+    return;
+
+out:
+    libxl__xc_domain_saverestore_async_callback_done(shs->egc, shs, 0);
+}
+
+static void colo_reenable_logdirty_done(libxl__egc *egc,
+                                        libxl__logdirty_switch *lds,
+                                        int rc)
+{
+    libxl__colo_restore_checkpoint_state *crcs = CONTAINER_OF(lds, *crcs, lds);
+    libxl__domain_create_state *dcs = CONTAINER_OF(crcs->crs, *dcs, crs);
+
+    /* Convenience aliases */
+    libxl__save_helper_state *const shs = &dcs->shs;
+    const uint32_t domid = crcs->crs->domid;
+
+    STATE_AO_GC(crcs->crs->ao);
+
+    if (rc) {
+        LOG(ERROR, "cannot enable logdirty");
+        goto out;
+    }
+
+    /* We have enabled secondary vm's logdirty, so we can unpause it now */
+    rc = libxl__domain_unpause(gc, domid);
+    if (rc) {
+        LOG(ERROR, "cannot unpause secondary vm");
+        goto out;
+    }
+
+    colo_write_svm_resumed(egc, crcs);
+
+    return;
+
+out:
+    libxl__xc_domain_saverestore_async_callback_done(shs->egc, shs, 0);
+}
+
+
+/* ===================== colo: wait new checkpoint ===================== */
+static void colo_stream_read_done(libxl__egc *egc,
+                                  libxl__datareader_state *drs,
+                                  ssize_t real_size, int errnoval);
+
+static void libxl__colo_restore_domain_checkpoint_callback(void *data)
+{
+    libxl__save_helper_state *shs = data;
+    libxl__domain_create_state *dcs = CONTAINER_OF(shs, *dcs, shs);
+    libxl__colo_restore_checkpoint_state *crcs = dcs->crs.crcs;
+
+    /* Convenience aliases */
+    const int recv_fd = dcs->crs.recv_fd;
+    libxl__datareader_state *const drs = &crcs->drs;
+
+    STATE_AO_GC(dcs->crs.ao);
+
+    memset(drs, 0, sizeof(*drs));
+    drs->ao = ao;
+    drs->readfd = recv_fd;
+    drs->readsize = sizeof(crcs->section);
+    drs->readwhat = "colo stream";
+    drs->callback = colo_stream_read_done;
+    drs->buf = &crcs->section;
+
+    if (libxl__datareader_start(drs)) {
+        LOG(ERROR, "libxl__datareader_start() fails");
+        goto out;
+    }
+
+    return;
+
+out:
+    libxl__xc_domain_saverestore_async_callback_done(shs->egc, shs, 0);
+}
+
+static void colo_stream_read_done(libxl__egc *egc,
+                                  libxl__datareader_state *drs,
+                                  ssize_t real_size, int errnoval)
+{
+    libxl__colo_restore_checkpoint_state *crcs = CONTAINER_OF(drs, *crcs, drs);
+    libxl__domain_create_state *dcs = CONTAINER_OF(crcs->crs, *dcs, crs);
+    int ok = 0;
+
+    /* Convenience aliases */
+    libxl__save_helper_state *const shs = &dcs->shs;
+
+    STATE_AO_GC(drs->ao);
+
+    if (real_size < drs->readsize) {
+        LOG(ERROR, "reading data fails: %lld", (long long)real_size);
+        goto out;
+    }
+
+    if (crcs->section != LIBXL_COLO_NEW_CHECKPOINT) {
+        LOG(ERROR, "invalid section: %d", crcs->section);
+        goto out;
+    }
+
+    ok = 1;
+
+out:
+    libxl__xc_domain_saverestore_async_callback_done(shs->egc, shs, ok);
+}
+
+
+/* ===================== colo: suspend secondary vm ===================== */
+/*
+ * Do the following things when suspending secondary vm:
+ *  1. suspend secondary vm
+ *  2. get secondary vm's dirty page information
+ *  3. send LIBXL_COLO_SVM_SUSPENDED
+ *  4. send secondary vm's dirty page information(count + pfn list)
+ */
+static void colo_suspend_vm_done(libxl__egc *egc,
+                                 libxl__domain_suspend_state2 *dss2,
+                                 int ok);
+static void colo_append_pfn_type(libxl__egc *egc,
+                                 libxl__datacopier_state *dc,
+                                 unsigned long *dirty_bitmap,
+                                 unsigned long p2m_size);
+
+static void libxl__colo_restore_domain_suspend_callback(void *data)
+{
+    libxl__save_helper_state *shs = data;
+    libxl__domain_create_state *dcs = CONTAINER_OF(shs, *dcs, shs);
+    libxl__colo_restore_checkpoint_state *crcs = dcs->crs.crcs;
+
+    STATE_AO_GC(dcs->ao);
+
+    /* Convenience aliases */
+    libxl__domain_suspend_state2 *const dss2 = &crcs->dss2;
+
+    /* suspend secondary vm */
+    dss2->callback_common_done = colo_suspend_vm_done;
+
+    libxl__domain_suspend2(shs->egc, dss2);
+}
+
+static void colo_suspend_vm_done(libxl__egc *egc,
+                                 libxl__domain_suspend_state2 *dss2,
+                                 int ok)
+{
+    libxl__colo_restore_checkpoint_state *crcs = CONTAINER_OF(dss2, *crcs, dss2);
+    libxl__colo_restore_state *crs = crcs->crs;
+    libxl__domain_create_state *dcs = CONTAINER_OF(crs, *dcs, crs);
+    DECLARE_HYPERCALL_BUFFER_SHADOW(unsigned long, dirty_bitmap, crcs->dirty_bitmap);
+    uint8_t section = LIBXL_COLO_SVM_SUSPENDED;
+    int i, rc;
+    uint64_t count;
+
+    /* Convenience aliases */
+    const int send_fd = crs->send_fd;
+    const unsigned long p2m_size = crcs->p2m_size;
+    const uint32_t domid = crs->domid;
+    libxl__datacopier_state *const dc = &crcs->dc;
+
+    STATE_AO_GC(crs->ao);
+
+    if (!ok) {
+        LOG(ERROR, "cannot suspend secondary vm");
+        goto out;
+    }
+
+    crcs->status = LIBXL_COLO_SUSPENDED;
+
+    /*
+     * The secondary vm has been running since the last checkpoint,
+     * so some pages are dirty here but clean on the master. Get the
+     * dirty bitmap and send it to the master.
+     */
+    if (xc_shadow_control(CTX->xch, domid, XEN_DOMCTL_SHADOW_OP_CLEAN,
+                          HYPERCALL_BUFFER(dirty_bitmap), p2m_size,
+                          NULL, 0, NULL) != p2m_size) {
+        LOG(ERROR, "getting secondary vm's dirty bitmap fails");
+        goto out;
+    }
+
+    count = 0;
+    for (i = 0; i < p2m_size; i++) {
+        if (test_bit(i, dirty_bitmap))
+            count++;
+    }
+
+    memset(dc, 0, sizeof(*dc));
+    dc->ao = ao;
+    dc->readfd = -1;
+    dc->writefd = send_fd;
+    dc->maxsz = INT_MAX;
+    dc->copywhat = crcs->copywhat[0];
+    dc->writewhat = "colo stream";
+    dc->callback = colo_common_send_data_done;
+    crcs->callback = NULL;
+
+    rc = libxl__datacopier_start(dc);
+    if (rc) {
+        LOG(ERROR, "libxl__datacopier_start() fails");
+        goto out;
+    }
+
+    /* tell master that secondary vm is suspended */
+    libxl__datacopier_prefixdata(egc, dc, &section, sizeof(section));
+
+    /* send dirty pages to master */
+    libxl__datacopier_prefixdata(egc, dc, &count, sizeof(count));
+    colo_append_pfn_type(egc, dc, dirty_bitmap, p2m_size);
+    return;
+
+out:
+    ok = 0;
+    libxl__xc_domain_saverestore_async_callback_done(egc, &dcs->shs, ok);
+}
+
+static void colo_append_pfn_type(libxl__egc *egc,
+                                 libxl__datacopier_state *dc,
+                                 unsigned long *dirty_bitmap,
+                                 unsigned long p2m_size)
+{
+    int i, count;
+    /* Hack, buf->buf is private member... */
+    libxl__datacopier_buf *buf = NULL;
+    int max_batch = sizeof(buf->buf) / sizeof(uint64_t);
+    int buf_size = max_batch * sizeof(uint64_t);
+    uint64_t *pfn;
+
+    STATE_AO_GC(dc->ao);
+
+    pfn = libxl__zalloc(NOGC, buf_size);
+
+    count = 0;
+    for (i = 0; i < p2m_size; i++) {
+        if (!test_bit(i, dirty_bitmap))
+            continue;
+
+        pfn[count++] = i;
+        if (count == max_batch) {
+            libxl__datacopier_prefixdata(egc, dc, pfn, buf_size);
+            count = 0;
+        }
+    }
+
+    if (count)
+        libxl__datacopier_prefixdata(egc, dc, pfn, count * sizeof(uint64_t));
+
+    free(pfn);
+}
+
+
+/* ===================== colo: common callback ===================== */
+static void colo_common_send_data_done(libxl__egc *egc,
+                                       libxl__datacopier_state *dc,
+                                       int onwrite, int errnoval)
+{
+    libxl__colo_restore_checkpoint_state *crcs = CONTAINER_OF(dc, *crcs, dc);
+    libxl__domain_create_state *dcs = CONTAINER_OF(crcs->crs, *dcs, crs);
+    int ok;
+    STATE_AO_GC(dc->ao);
+
+    if (onwrite == -1) {
+        LOG(ERROR, "sending data fails");
+        ok = 0;
+        goto out;
+    }
+
+    if (errnoval) {
+        /* failure happens when reading/writing, do failover? */
+        ok = 2;
+        goto out;
+    }
+
+    if (!crcs->callback) {
+        /* Everything is OK */
+        ok = 1;
+        goto out;
+    }
+
+    crcs->callback(egc, crcs, 0);
+    return;
+
+out:
+    libxl__xc_domain_saverestore_async_callback_done(egc, &dcs->shs, ok);
+}
diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c
index e29a107..fef9b36 100644
--- a/tools/libxl/libxl_create.c
+++ b/tools/libxl/libxl_create.c
@@ -19,6 +19,7 @@
 
 #include "libxl_internal.h"
 #include "libxl_arch.h"
+#include "libxl_colo.h"
 
 #include <xc_dom.h>
 #include <xenguest.h>
@@ -898,6 +899,96 @@ static void domcreate_console_available(libxl__egc *egc,
                                         dcs->aop_console_how.for_event));
 }
 
+static void libxl__colo_restore_teardown_done(libxl__egc *egc,
+                                              libxl__colo_restore_state *crs,
+                                              int rc)
+{
+    libxl__domain_create_state *dcs = CONTAINER_OF(crs, *dcs, crs);
+    STATE_AO_GC(crs->ao);
+
+    /* convenience aliases */
+    libxl__save_helper_state *const shs = &dcs->shs;
+    const int domid = crs->domid;
+    const libxl_ctx *const ctx = libxl__gc_owner(gc);
+    xc_interface *const xch = ctx->xch;
+
+    if (!rc)
+        /* failover, no need to destroy the secondary vm */
+        goto out;
+
+    if (shs->retval)
+        /*
+         * shs->retval stores the return value of xc_domain_restore().
+         * If it is not 0, we have destroyed the secondary vm in
+         * xc_domain_restore();
+         */
+        goto out;
+
+    xc_domain_destroy(xch, domid);
+
+out:
+    dcs->callback(egc, dcs, rc, crs->domid);
+}
+
+void libxl__colo_restore_done(libxl__egc *egc, void *dcs_void,
+                              int ret, int retval, int errnoval)
+{
+    libxl__domain_create_state *dcs = dcs_void;
+    int rc = 1;
+
+    /* convenience aliases */
+    libxl__colo_restore_state *const crs = &dcs->crs;
+    STATE_AO_GC(crs->ao);
+
+    /* teardown and failover */
+    crs->callback = libxl__colo_restore_teardown_done;
+
+    if (ret == 0 && retval == 0)
+        rc = 0;
+
+    LOG(INFO, "%s", rc ? "colo fails" : "failover");
+    libxl__colo_restore_teardown(egc, crs, rc);
+}
+
+static void libxl__colo_restore_cp_done(libxl__egc *egc,
+                                        libxl__colo_restore_state *crs,
+                                        int rc)
+{
+    libxl__domain_create_state *dcs = CONTAINER_OF(crs, *dcs, crs);
+    int ok = 0;
+
+    /* convenience aliases */
+    libxl__save_helper_state *const shs = &dcs->shs;
+
+    if (!rc)
+        ok = 1;
+
+    libxl__xc_domain_saverestore_async_callback_done(shs->egc, shs, ok);
+}
+
+static void libxl__colo_restore_setup_done(libxl__egc *egc,
+                                           libxl__colo_restore_state *crs,
+                                           int rc)
+{
+    libxl__domain_create_state *dcs = CONTAINER_OF(crs, *dcs, crs);
+
+    /* convenience aliases */
+    const int hvm = crs->hvm;
+    const int superpages = crs->superpages;
+    const int pae = crs->pae;
+    STATE_AO_GC(crs->ao);
+
+    if (rc) {
+        LOG(ERROR, "colo restore setup fails: %d", rc);
+        libxl__xc_domain_restore_done(egc, dcs, rc, 0, 0);
+        return;
+    }
+
+    crs->callback = libxl__colo_restore_cp_done;
+    libxl__xc_domain_restore(egc, dcs,
+                             hvm, pae, superpages);
+}
+
 static void domcreate_bootloader_done(libxl__egc *egc,
                                       libxl__bootloader_state *bl,
                                       int rc)
@@ -913,6 +1004,8 @@ static void domcreate_bootloader_done(libxl__egc *egc,
     libxl__domain_build_state *const state = &dcs->build_state;
     libxl__srm_restore_autogen_callbacks *const callbacks =
         &dcs->shs.callbacks.restore.a;
+    const int checkpointed_stream = dcs->checkpointed_stream;
+    libxl__colo_restore_state *const crs = &dcs->crs;
 
     if (rc) {
         domcreate_rebuild_done(egc, dcs, rc);
@@ -941,6 +1034,13 @@ static void domcreate_bootloader_done(libxl__egc *egc,
 
     /* Restore */
 
+    /* COLO only supports HVM now */
+    if (info->type != LIBXL_DOMAIN_TYPE_HVM &&
+        checkpointed_stream == LIBXL_CHECKPOINTED_STREAM_COLO) {
+        rc = ERROR_FAIL;
+        goto out;
+    }
+
     rc = libxl__build_pre(gc, domid, d_config, state);
     if (rc)
         goto out;
@@ -963,8 +1063,20 @@ static void domcreate_bootloader_done(libxl__egc *egc,
         rc = ERROR_INVAL;
         goto out;
     }
-    libxl__xc_domain_restore(egc, dcs,
-                             hvm, pae, superpages);
+
+    if (checkpointed_stream == LIBXL_CHECKPOINTED_STREAM_COLO) {
+        crs->ao = ao;
+        crs->domid = domid;
+        crs->send_fd = dcs->send_fd;
+        crs->recv_fd = restore_fd;
+        crs->hvm = hvm;
+        crs->superpages = superpages;
+        crs->pae = pae;
+        crs->callback = libxl__colo_restore_setup_done;
+        libxl__colo_restore_setup(egc, crs);
+    } else
+        libxl__xc_domain_restore(egc, dcs,
+                                 hvm, pae, superpages);
     return;
 
  out:
diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
index 4e71ec5..769952c 100644
--- a/tools/libxl/libxl_dom.c
+++ b/tools/libxl/libxl_dom.c
@@ -862,7 +862,7 @@ static void switch_logdirty_xswatch(libxl__egc *egc, libxl__ev_xswatch*,
 static void switch_logdirty_done(libxl__egc *egc,
                                  libxl__logdirty_switch *lds, int ok);
 
-static void logdirty_init(libxl__logdirty_switch *lds)
+void logdirty_init(libxl__logdirty_switch *lds)
 {
     lds->cmd_path = 0;
     libxl__ev_xswatch_init(&lds->watch);
diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
index b04e4b9..d2e3176 100644
--- a/tools/libxl/libxl_internal.h
+++ b/tools/libxl/libxl_internal.h
@@ -2741,6 +2741,7 @@ struct libxl__logdirty_switch {
     libxl__ev_xswatch watch;
     libxl__ev_time timeout;
 };
+_hidden void logdirty_init(libxl__logdirty_switch *lds);
 
 /*
  * libxl__domain_suspend_state is for saving guest, not
@@ -3032,6 +3033,26 @@ typedef void libxl__domain_create_cb(libxl__egc *egc,
                                      libxl__domain_create_state*,
                                      int rc, uint32_t domid);
 
+/* colo related structure */
+typedef struct libxl__colo_restore_state libxl__colo_restore_state;
+typedef void libxl__colo_callback(libxl__egc *,
+                                  libxl__colo_restore_state *, int rc);
+struct libxl__colo_restore_state {
+    /* must be set by the caller of libxl__colo_(setup|teardown) */
+    libxl__ao *ao;
+    uint32_t domid;
+    int send_fd;
+    int recv_fd;
+    int hvm;
+    int pae;
+    int superpages;
+    libxl__colo_callback *callback;
+
+    /* private, colo restore checkpoint state */
+    libxl__domain_create_cb *saved_cb;
+    void *crcs;
+};
+
 struct libxl__domain_create_state {
     /* filled in by user */
     libxl__ao *ao;
@@ -3044,6 +3065,7 @@ struct libxl__domain_create_state {
     int guest_domid;
     int checkpointed_stream;
     libxl__domain_build_state build_state;
+    libxl__colo_restore_state crs;
     libxl__bootloader_state bl;
     libxl__stub_dm_spawn_state dmss;
         /* If we're not doing stubdom, we use only dmss.dm,
diff --git a/tools/libxl/libxl_save_callout.c b/tools/libxl/libxl_save_callout.c
index 0c09d94..e251181 100644
--- a/tools/libxl/libxl_save_callout.c
+++ b/tools/libxl/libxl_save_callout.c
@@ -15,6 +15,7 @@
 #include "libxl_osdeps.h"
 
 #include "libxl_internal.h"
+#include "libxl_colo.h"
 
 /* stream_fd is as from the caller (eventually, the application).
  * It may be 0, 1 or 2, in which case we need to dup it elsewhere.
@@ -65,7 +66,10 @@ void libxl__xc_domain_restore(libxl__egc *egc, libxl__domain_create_state *dcs,
     dcs->shs.ao = ao;
     dcs->shs.domid = domid;
     dcs->shs.recv_callback = libxl__srm_callout_received_restore;
-    dcs->shs.completion_callback = libxl__xc_domain_restore_done;
+    if (dcs->checkpointed_stream == LIBXL_CHECKPOINTED_STREAM_COLO)
+        dcs->shs.completion_callback = libxl__colo_restore_done;
+    else
+        dcs->shs.completion_callback = libxl__xc_domain_restore_done;
     dcs->shs.caller_state = dcs;
     dcs->shs.need_results = 1;
     dcs->shs.toolstack_data_file = 0;
diff --git a/tools/libxl/libxl_save_msgs_gen.pl b/tools/libxl/libxl_save_msgs_gen.pl
index 41ee000..0239cac 100755
--- a/tools/libxl/libxl_save_msgs_gen.pl
+++ b/tools/libxl/libxl_save_msgs_gen.pl
@@ -24,9 +24,9 @@ our @msgs = (
                                                  STRING doing_what),
                                                 'unsigned long', 'done',
                                                 'unsigned long', 'total'] ],
-    [  3, 'scxA',   "suspend", [] ],
-    [  4, 'scxA',   "postcopy", [] ],
-    [  5, 'scxA',   "checkpoint", [] ],
+    [  3, 'srcxA',   "suspend", [] ],
+    [  4, 'srcxA',   "postcopy", [] ],
+    [  5, 'srcxA',   "checkpoint", [] ],
     [  6, 'scxA',   "switch_qemu_logdirty",  [qw(int domid
                                               unsigned enable)] ],
     #                toolstack_save          done entirely `by hand'
-- 
1.9.3


* [RFC Patch v2 19/45] primary vm suspend/get_dirty_pfn/resume/checkpoint code
  2014-08-08  7:00 [RFC Patch v2 00/45] COarse-grain LOck-stepping Virtual Machines for Non-stop Service Wen Congyang
                   ` (17 preceding siblings ...)
  2014-08-08  7:01 ` [RFC Patch v2 18/45] secondary vm suspend/resume/checkpoint code Wen Congyang
@ 2014-08-08  7:01 ` Wen Congyang
  2014-08-08  7:01 ` [RFC Patch v2 20/45] xc_domain_save: flush cache before calling callbacks->postcopy() in colo mode Wen Congyang
                   ` (27 subsequent siblings)
  46 siblings, 0 replies; 64+ messages in thread
From: Wen Congyang @ 2014-08-08  7:01 UTC (permalink / raw)
  To: xen devel
  Cc: Ian Campbell, Wen Congyang, Ian Jackson, Jiang Yunhong,
	Dong Eddie, Yang Hongyang, Lai Jiangshan

The primary side repeats the following steps for every checkpoint:
1. Suspend primary vm
   a. Suspend primary vm
   b. Do postsuspend
   c. Read LIBXL_COLO_SVM_SUSPENDED from slave
   d. Read secondary vm's dirty page information from slave (count + pfn list)
2. Get dirty pfn list
   a. Return secondary vm's dirty pfn list
3. Resume primary vm
   a. Read LIBXL_COLO_SVM_READY from slave
   b. Do preresume
   c. Resume primary vm
   d. Read LIBXL_COLO_SVM_RESUMED from slave
4. Wait for a new checkpoint
   a. Wait for a new checkpoint (not implemented)
   b. Send LIBXL_COLO_NEW_CHECKPOINT to slave
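
A minimal sketch of the ordering above, as seen from the primary, may help
reviewers. It is not code from this series: the enum values, the phase names
and both helper functions are hypothetical, and the real section constants
are the LIBXL_COLO_* ones from libxl_colo.h. It only models which control
section the primary expects at each point of one round.

```c
#include <stdint.h>

/* Hypothetical values; the series defines the real LIBXL_COLO_* constants. */
enum colo_section {
    COLO_NEW_CHECKPOINT = 1,
    COLO_SVM_SUSPENDED,
    COLO_SVM_READY,
    COLO_SVM_RESUMED,
};

/* Where the primary is within one checkpoint round. */
enum colo_phase {
    PHASE_WAIT_SUSPENDED,   /* step 1c: expect SVM_SUSPENDED */
    PHASE_WAIT_READY,       /* step 3a: expect SVM_READY */
    PHASE_WAIT_RESUMED,     /* step 3d: expect SVM_RESUMED */
    PHASE_SEND_CHECKPOINT,  /* step 4b: primary sends NEW_CHECKPOINT */
};

/*
 * Advance the phase when `section` arrives from the secondary.
 * Returns 0 on success, -1 if the section is not expected in this phase
 * (the phase is left unchanged in that case).
 */
int colo_primary_on_section(enum colo_phase *phase, uint8_t section)
{
    switch (*phase) {
    case PHASE_WAIT_SUSPENDED:
        if (section != COLO_SVM_SUSPENDED)
            return -1;
        *phase = PHASE_WAIT_READY;
        return 0;
    case PHASE_WAIT_READY:
        if (section != COLO_SVM_READY)
            return -1;
        *phase = PHASE_WAIT_RESUMED;
        return 0;
    case PHASE_WAIT_RESUMED:
        if (section != COLO_SVM_RESUMED)
            return -1;
        *phase = PHASE_SEND_CHECKPOINT;
        return 0;
    default:
        /* The primary is the sender in PHASE_SEND_CHECKPOINT. */
        return -1;
    }
}

/* Step 4b: return the section to send and start the next round. */
uint8_t colo_primary_send_checkpoint(enum colo_phase *phase)
{
    *phase = PHASE_WAIT_SUSPENDED;
    return COLO_NEW_CHECKPOINT;
}
```

Rejecting an unexpected section without changing the phase mirrors the
"invalid section" check that the secondary already performs on its side.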

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
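Note for reviewers: the get_dirty_pfn callback added to xenguest.h returns a
buffer laid out as { uint64_t count; uint64_t pfn[]; }. The sketch below shows
one way such a buffer could be built from a dirty bitmap. It is illustrative
only: pack_dirty_pfns() and bit_is_set() are hypothetical names, the values
are in host byte order, and a malloc()ed buffer is assumed because the caller
must free the return value.

```c
#include <limits.h>
#include <stdint.h>
#include <stdlib.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* Test bit `nr` in a bitmap stored as an array of unsigned long. */
static int bit_is_set(const unsigned long *bitmap, unsigned long nr)
{
    return (bitmap[nr / BITS_PER_WORD] >> (nr % BITS_PER_WORD)) & 1UL;
}

/*
 * Pack the dirty pfns of `bitmap` into one malloc()ed buffer:
 *   uint64_t count; uint64_t pfn[count];
 * Returns NULL if the allocation fails. The caller frees the buffer.
 */
uint8_t *pack_dirty_pfns(const unsigned long *bitmap, unsigned long nr_pfns)
{
    uint64_t count = 0;
    uint64_t *out;
    unsigned long i;
    size_t n = 0;

    /* First pass: count the dirty pages so the buffer can be sized. */
    for (i = 0; i < nr_pfns; i++)
        if (bit_is_set(bitmap, i))
            count++;

    out = malloc((count + 1) * sizeof(uint64_t));
    if (!out)
        return NULL;

    /* Second pass: store the count, then the pfns in ascending order. */
    out[0] = count;
    for (i = 0; i < nr_pfns; i++)
        if (bit_is_set(bitmap, i))
            out[1 + n++] = i;

    return (uint8_t *)out;
}
```

A consumer would read the first uint64_t as the count and then walk that many
pfn entries before freeing the buffer.
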
 tools/libxc/xenguest.h             |  12 +
 tools/libxl/Makefile               |   2 +-
 tools/libxl/libxl.c                |   6 +-
 tools/libxl/libxl_colo.h           |  10 +
 tools/libxl/libxl_colo_save.c      | 608 +++++++++++++++++++++++++++++++++++++
 tools/libxl/libxl_dom.c            |  13 +-
 tools/libxl/libxl_internal.h       |  32 +-
 tools/libxl/libxl_save_msgs_gen.pl |   1 +
 tools/libxl/libxl_types.idl        |   1 +
 9 files changed, 677 insertions(+), 8 deletions(-)
 create mode 100644 tools/libxl/libxl_colo_save.c

diff --git a/tools/libxc/xenguest.h b/tools/libxc/xenguest.h
index d3061c7..1aeaad2 100644
--- a/tools/libxc/xenguest.h
+++ b/tools/libxc/xenguest.h
@@ -72,6 +72,18 @@ struct save_callbacks {
      */
     int (*toolstack_save)(uint32_t domid, uint8_t **buf, uint32_t *len, void *data);
 
+    /* Called after the guest is suspended.
+     *
+     * returns the list of dirty pfn:
+     *  struct {
+     *      uint64_t count;
+     *      uint64_t pfn[];
+     *  };
+     *
+     *  Note: the caller must free the return value.
+     */
+    uint8_t *(*get_dirty_pfn)(void *data);
+
     /* to be provided as the last argument to each callback function */
     void* data;
 };
diff --git a/tools/libxl/Makefile b/tools/libxl/Makefile
index c026bdd..1c32ae2 100644
--- a/tools/libxl/Makefile
+++ b/tools/libxl/Makefile
@@ -57,7 +57,7 @@ LIBXL_OBJS-y += libxl_nonetbuffer.o
 endif
 
 LIBXL_OBJS-y += libxl_remus.o libxl_checkpoint_device.o libxl_remus_disk_drbd.o
-LIBXL_OBJS-y += libxl_colo_restore.o
+LIBXL_OBJS-y += libxl_colo_restore.o libxl_colo_save.o
 
 LIBXL_OBJS-$(CONFIG_X86) += libxl_cpuid.o libxl_x86.o
 LIBXL_OBJS-$(CONFIG_ARM) += libxl_nocpuid.o libxl_arm.o
diff --git a/tools/libxl/libxl.c b/tools/libxl/libxl.c
index f35029a..e0817e8 100644
--- a/tools/libxl/libxl.c
+++ b/tools/libxl/libxl.c
@@ -18,6 +18,7 @@
 
 #include "libxl_internal.h"
 #include "libxl_remus.h"
+#include "libxl_colo.h"
 
 #define PAGE_TO_MEMKB(pages) ((pages) * 4)
 #define BACKEND_STRING_SIZE 5
@@ -820,7 +821,10 @@ int libxl_domain_remus_start(libxl_ctx *ctx, libxl_domain_remus_info *info,
     assert(info);
 
     /* Point of no return */
-    libxl__remus_setup(egc, &dss->rs);
+    if (info->colo)
+        libxl__colo_save_setup(egc, &dss->css);
+    else
+        libxl__remus_setup(egc, &dss->rs);
     return AO_INPROGRESS;
 
  out:
diff --git a/tools/libxl/libxl_colo.h b/tools/libxl/libxl_colo.h
index 91df275..26a2563 100644
--- a/tools/libxl/libxl_colo.h
+++ b/tools/libxl/libxl_colo.h
@@ -35,4 +35,14 @@ extern void libxl__colo_restore_teardown(libxl__egc *egc,
                                          libxl__colo_restore_state *crs,
                                          int rc);
 
+extern void libxl__colo_save_domain_suspend_callback(void *data);
+extern void libxl__colo_save_domain_resume_callback(void *data);
+extern void libxl__colo_save_domain_checkpoint_callback(void *data);
+extern void libxl__colo_save_get_dirty_pfn_callback(void *data);
+extern void libxl__colo_save_setup(libxl__egc *egc,
+                                   libxl__colo_save_state *css);
+extern void libxl__colo_save_teardown(libxl__egc *egc,
+                                      libxl__colo_save_state *css,
+                                      int rc);
+
 #endif
diff --git a/tools/libxl/libxl_colo_save.c b/tools/libxl/libxl_colo_save.c
new file mode 100644
index 0000000..6675d3d
--- /dev/null
+++ b/tools/libxl/libxl_colo_save.c
@@ -0,0 +1,608 @@
+/*
+ * Copyright (C) 2014 FUJITSU LIMITED
+ * Author: Wen Congyang <wency@cn.fujitsu.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU Lesser General Public License as published
+ * by the Free Software Foundation; version 2.1 only. with the special
+ * exception on linking described in file LICENSE.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU Lesser General Public License for more details.
+ */
+
+#include "libxl_osdeps.h" /* must come before any other headers */
+
+#include "libxl_internal.h"
+#include "libxl_colo.h"
+
+static const libxl__checkpoint_device_subkind_ops *colo_ops[] = {
+    NULL,
+};
+
+/* ================= colo: setup save environment ================= */
+static void colo_save_setup_done(libxl__egc *egc,
+                                 libxl__checkpoint_devices_state *cds,
+                                 int rc);
+static void colo_save_setup_failed(libxl__egc *egc,
+                                   libxl__checkpoint_devices_state *cds,
+                                   int rc);
+
+void libxl__colo_save_setup(libxl__egc *egc, libxl__colo_save_state *css)
+{
+    libxl__domain_suspend_state *dss = CONTAINER_OF(css, *dss, css);
+
+    /* Convenience aliases */
+    libxl__checkpoint_devices_state *const cds = &css->cds;
+
+    STATE_AO_GC(dss->ao);
+
+    if (dss->type != LIBXL_DOMAIN_TYPE_HVM) {
+        LOG(ERROR, "COLO only supports hvm now");
+        goto out;
+    }
+
+    css->send_fd = dss->fd;
+    css->recv_fd = dss->recv_fd;
+    css->svm_running = false;
+
+    /* TODO: disk/nic support */
+    cds->device_kind_flags = 0;
+    cds->ops = colo_ops;
+    cds->callback = colo_save_setup_done;
+    cds->ao = ao;
+    cds->domid = dss->domid;
+
+    libxl__checkpoint_devices_setup(egc, &css->cds);
+
+    return;
+
+out:
+    libxl__ao_complete(egc, ao, ERROR_FAIL);
+}
+
+static void colo_save_setup_done(libxl__egc *egc,
+                                 libxl__checkpoint_devices_state *cds,
+                                 int rc)
+{
+    libxl__colo_save_state *css = CONTAINER_OF(cds, *css, cds);
+    libxl__domain_suspend_state *dss = CONTAINER_OF(css, *dss, css);
+    STATE_AO_GC(cds->ao);
+
+    if (!rc) {
+        libxl__domain_suspend(egc, dss);
+        return;
+    }
+
+    LOG(ERROR, "COLO: failed to setup device for guest with domid %u",
+        dss->domid);
+    css->cds.callback = colo_save_setup_failed;
+    libxl__checkpoint_devices_teardown(egc, &css->cds);
+}
+
+static void colo_save_setup_failed(libxl__egc *egc,
+                                   libxl__checkpoint_devices_state *cds,
+                                   int rc)
+{
+    STATE_AO_GC(cds->ao);
+
+    if (rc)
+        LOG(ERROR, "COLO: failed to teardown device after setup failed"
+            " for guest with domid %u, rc %d", cds->domid, rc);
+
+    libxl__ao_complete(egc, ao, rc);
+}
+
+
+/* ================= colo: teardown save environment ================= */
+static void colo_teardown_done(libxl__egc *egc,
+                               libxl__checkpoint_devices_state *cds,
+                               int rc);
+
+void libxl__colo_save_teardown(libxl__egc *egc,
+                               libxl__colo_save_state *css,
+                               int rc)
+{
+    libxl__domain_suspend_state *dss = CONTAINER_OF(css, *dss, css);
+
+    STATE_AO_GC(css->cds.ao);
+
+    LOG(WARN, "COLO: Domain suspend terminated with rc %d,"
+        " teardown COLO devices...", rc);
+    dss->css.cds.callback = colo_teardown_done;
+    libxl__checkpoint_devices_teardown(egc, &dss->css.cds);
+    return;
+}
+
+static void colo_teardown_done(libxl__egc *egc,
+                               libxl__checkpoint_devices_state *cds,
+                               int rc)
+{
+    libxl__colo_save_state *css = CONTAINER_OF(cds, *css, cds);
+    libxl__domain_suspend_state *dss = CONTAINER_OF(css, *dss, css);
+    dss->callback(egc, dss, rc);
+}
+
+/*
+ * checkpoint callbacks are called in the following order:
+ * 1. suspend
+ * 2. resume
+ * 3. checkpoint
+ */
+static void colo_common_read_done(libxl__egc *egc,
+                                  libxl__datareader_state *drs,
+                                  ssize_t real_size, int errnoval);
+/* ===================== colo: suspend primary vm ===================== */
+/*
+ * Do the following things when suspending primary vm:
+ * 1. suspend primary vm
+ * 2. do postsuspend
+ * 3. read LIBXL_COLO_SVM_SUSPENDED
+ * 4. read secondary vm's dirty pages
+ */
+static void colo_suspend_primary_vm_done(libxl__egc *egc,
+                                         libxl__domain_suspend_state2 *dss2,
+                                         int ok);
+static void colo_postsuspend_cb(libxl__egc *egc,
+                                libxl__checkpoint_devices_state *cds,
+                                int rc);
+static void colo_read_pfn(libxl__egc *egc, libxl__colo_save_state *css);
+
+void libxl__colo_save_domain_suspend_callback(void *data)
+{
+    libxl__save_helper_state *shs = data;
+    libxl__egc *egc = shs->egc;
+    libxl__domain_suspend_state *dss = CONTAINER_OF(shs, *dss, shs);
+
+    /* Convenience aliases */
+    libxl__domain_suspend_state2 *dss2 = &dss->dss2;
+
+    dss2->callback_common_done = colo_suspend_primary_vm_done;
+    libxl__domain_suspend2(egc, dss2);
+}
+
+static void colo_suspend_primary_vm_done(libxl__egc *egc,
+                                         libxl__domain_suspend_state2 *dss2,
+                                         int ok)
+{
+    libxl__domain_suspend_state *dss = CONTAINER_OF(dss2, *dss, dss2);
+
+    STATE_AO_GC(dss2->ao);
+
+    if (!ok) {
+        LOG(ERROR, "cannot suspend primary vm");
+        goto out;
+    }
+
+    /* Convenience aliases */
+    libxl__checkpoint_devices_state *const cds = &dss->css.cds;
+
+    cds->callback = colo_postsuspend_cb;
+    libxl__checkpoint_devices_postsuspend(egc, cds);
+    return;
+
+out:
+    libxl__xc_domain_saverestore_async_callback_done(egc, &dss->shs, ok);
+}
+
+static void colo_postsuspend_cb(libxl__egc *egc,
+                                libxl__checkpoint_devices_state *cds,
+                                int rc)
+{
+    int ok = 0;
+    libxl__colo_save_state *css = CONTAINER_OF(cds, *css, cds);
+    libxl__domain_suspend_state *dss = CONTAINER_OF(css, *dss, css);
+
+    /* Convenience aliases */
+    libxl__datareader_state *const drs = &css->drs;
+
+    STATE_AO_GC(cds->ao);
+
+    if (rc) {
+        LOG(ERROR, "postsuspend fails");
+        goto out;
+    }
+
+    if (!css->svm_running) {
+        ok = 1;
+        goto out;
+    }
+
+    /*
+     * read LIBXL_COLO_SVM_SUSPENDED and the count of
+     * secondary vm's dirty pages.
+     */
+    memset(drs, 0, sizeof(*drs));
+    drs->ao = ao;
+    drs->readfd = css->recv_fd;
+    drs->readsize = sizeof(css->temp_buff);
+    drs->readwhat = "colo stream";
+    drs->callback = colo_common_read_done;
+    drs->buf = css->temp_buff;
+    css->callback = colo_read_pfn;
+
+    if (libxl__datareader_start(drs)) {
+        LOG(ERROR, "libxl__datareader_start() fails");
+        goto out;
+    }
+
+    return;
+
+out:
+    libxl__xc_domain_saverestore_async_callback_done(egc, &dss->shs, ok);
+}
+
+static void colo_read_pfn(libxl__egc *egc, libxl__colo_save_state *css)
+{
+    int ok = 0;
+    libxl__domain_suspend_state *dss = CONTAINER_OF(css, *dss, css);
+    STATE_AO_GC(css->cds.ao);
+
+    /* Convenience aliases */
+    libxl__datareader_state *const drs = &css->drs;
+
+    assert(!css->buff);
+    css->section = css->temp_buff[0];
+    memcpy(&css->count, &css->temp_buff[1], sizeof(css->count));
+
+    if (css->section != LIBXL_COLO_SVM_SUSPENDED) {
+        LOG(ERROR, "invalid section: %d, expected: %d",
+            css->section, LIBXL_COLO_SVM_SUSPENDED);
+        goto out;
+    }
+
+    css->buff = libxl__zalloc(NOGC, sizeof(uint64_t) * (css->count + 1));
+    css->buff[0] = css->count;
+
+    if (css->count == 0) {
+        /* no dirty pages */
+        ok = 1;
+        goto out;
+    }
+
+    /* read the pfn of secondary vm's dirty pages */
+    memset(drs, 0, sizeof(*drs));
+    drs->ao = ao;
+    drs->readfd = css->recv_fd;
+    drs->readsize = css->count * sizeof(uint64_t);
+    drs->readwhat = "colo stream";
+    drs->callback = colo_common_read_done;
+    drs->buf = css->buff + 1;
+    css->callback = NULL;
+
+    if (libxl__datareader_start(drs)) {
+        LOG(ERROR, "libxl__datareader_start() fails");
+        goto out;
+    }
+
+    return;
+
+out:
+    libxl__xc_domain_saverestore_async_callback_done(egc, &dss->shs, ok);
+}
+
+
+/* ===================== colo: get dirty pfn ===================== */
+void libxl__colo_save_get_dirty_pfn_callback(void *data)
+{
+    libxl__save_helper_state *shs = data;
+    libxl__egc *egc = shs->egc;
+    libxl__domain_suspend_state *dss = CONTAINER_OF(shs, *dss, shs);
+    uint64_t size;
+
+    /* Convenience aliases */
+    libxl__colo_save_state *const css = &dss->css;
+
+    assert(css->buff);
+    size = sizeof(uint64_t) * (css->count + 1);
+
+    libxl__xc_domain_saverestore_async_callback_done_with_data(egc, shs,
+                                                               (uint8_t *)css->buff,
+                                                               size);
+    free(css->buff);
+    css->buff = NULL;
+}
+
+
+/* ===================== colo: resume primary vm ===================== */
+/*
+ * Do the following things when resuming primary vm:
+ *  1. read LIBXL_COLO_SVM_READY
+ *  2. do preresume
+ *  3. resume primary vm
+ *  4. read LIBXL_COLO_SVM_RESUMED
+ */
+static void colo_preresume_dm_saved(libxl__egc *egc,
+                                    libxl__domain_suspend_state *dss, int rc);
+static void colo_read_svm_ready_done(libxl__egc *egc,
+                                     libxl__colo_save_state *css);
+static void colo_preresume_cb(libxl__egc *egc,
+                              libxl__checkpoint_devices_state *cds,
+                              int rc);
+static void colo_read_svm_resumed_done(libxl__egc *egc,
+                                       libxl__colo_save_state *css);
+
+void libxl__colo_save_domain_resume_callback(void *data)
+{
+    libxl__save_helper_state *shs = data;
+    libxl__egc *egc = shs->egc;
+    libxl__domain_suspend_state *dss = CONTAINER_OF(shs, *dss, shs);
+
+    /* This would go into tailbuf. */
+    if (dss->hvm) {
+        libxl__domain_save_device_model(egc, dss, colo_preresume_dm_saved);
+    } else {
+        colo_preresume_dm_saved(egc, dss, 0);
+    }
+
+    return;
+}
+
+static void colo_preresume_dm_saved(libxl__egc *egc,
+                                    libxl__domain_suspend_state *dss, int rc)
+{
+    /* Convenience aliases */
+    libxl__colo_save_state *const css = &dss->css;
+    libxl__datareader_state *const drs = &css->drs;
+
+    STATE_AO_GC(css->cds.ao);
+
+    if (rc) {
+        LOG(ERROR, "Failed to save device model. Terminating COLO..");
+        goto out;
+    }
+
+    /* read LIBXL_COLO_SVM_READY */
+    memset(drs, 0, sizeof(*drs));
+    drs->ao = ao;
+    drs->readfd = css->recv_fd;
+    drs->readsize = sizeof(css->section);
+    drs->readwhat = "colo stream";
+    drs->callback = colo_common_read_done;
+    drs->buf = &css->section;
+    css->callback = colo_read_svm_ready_done;
+
+    if (libxl__datareader_start(drs)) {
+        LOG(ERROR, "libxl__datareader_start() fails");
+        goto out;
+    }
+
+    return;
+
+out:
+    libxl__xc_domain_saverestore_async_callback_done(egc, &dss->shs, 0);
+}
+
+static void colo_read_svm_ready_done(libxl__egc *egc,
+                                     libxl__colo_save_state *css)
+{
+    libxl__domain_suspend_state *dss = CONTAINER_OF(css, *dss, css);
+
+    STATE_AO_GC(css->cds.ao);
+
+    if (css->section != LIBXL_COLO_SVM_READY) {
+        LOG(ERROR, "invalid section: %d, expected: %d",
+            css->section, LIBXL_COLO_SVM_READY);
+        goto out;
+    }
+
+    css->svm_running = true;
+    css->cds.callback = colo_preresume_cb;
+    libxl__checkpoint_devices_preresume(egc, &css->cds);
+
+    return;
+
+out:
+    libxl__xc_domain_saverestore_async_callback_done(egc, &dss->shs, 0);
+}
+
+static void colo_preresume_cb(libxl__egc *egc,
+                              libxl__checkpoint_devices_state *cds,
+                              int rc)
+{
+    libxl__colo_save_state *css = CONTAINER_OF(cds, *css, cds);
+    libxl__domain_suspend_state *dss = CONTAINER_OF(css, *dss, css);
+
+    /* Convenience aliases */
+    libxl__datareader_state *const drs = &css->drs;
+
+    STATE_AO_GC(cds->ao);
+
+    if (rc) {
+        LOG(ERROR, "preresume fails");
+        goto out;
+    }
+
+    /* Resumes the domain and the device model */
+    if (libxl__domain_resume(gc, dss->domid, /* Fast Suspend */1, 0)) {
+        LOG(ERROR, "cannot resume primary vm");
+        goto out;
+    }
+
+    /* read LIBXL_COLO_SVM_RESUMED */
+    memset(drs, 0, sizeof(*drs));
+    drs->ao = ao;
+    drs->readfd = css->recv_fd;
+    drs->readsize = sizeof(css->section);
+    drs->readwhat = "colo stream";
+    drs->callback = colo_common_read_done;
+    drs->buf = &css->section;
+    css->callback = colo_read_svm_resumed_done;
+
+    if (libxl__datareader_start(drs)) {
+        LOG(ERROR, "libxl__datareader_start() fails");
+        goto out;
+    }
+
+    return;
+
+out:
+    libxl__xc_domain_saverestore_async_callback_done(egc, &dss->shs, 0);
+}
+
+static void colo_read_svm_resumed_done(libxl__egc *egc,
+                                       libxl__colo_save_state *css)
+{
+    int ok = 0;
+    libxl__domain_suspend_state *dss = CONTAINER_OF(css, *dss, css);
+
+    STATE_AO_GC(css->cds.ao);
+
+    if (css->section != LIBXL_COLO_SVM_RESUMED) {
+        LOG(ERROR, "invalid section: %d, expected: %d",
+            css->section, LIBXL_COLO_SVM_RESUMED);
+        goto out;
+    }
+
+    ok = 1;
+
+out:
+    libxl__xc_domain_saverestore_async_callback_done(egc, &dss->shs, ok);
+}
+
+
+/* ===================== colo: wait new checkpoint ===================== */
+/*
+ * Do the following things:
+ * 1. do commit
+ * 2. wait for a new checkpoint
+ * 3. write LIBXL_COLO_NEW_CHECKPOINT
+ */
+static void colo_device_commit_cb(libxl__egc *egc,
+                                  libxl__checkpoint_devices_state *cds,
+                                  int rc);
+static void colo_start_new_checkpoint(libxl__egc *egc,
+                                      libxl__checkpoint_devices_state *cds,
+                                      int rc);
+static void colo_send_data_done(libxl__egc *egc,
+                                libxl__datacopier_state *dc,
+                                int onwrite, int errnoval);
+
+void libxl__colo_save_domain_checkpoint_callback(void *data)
+{
+    libxl__save_helper_state *shs = data;
+    libxl__domain_suspend_state *dss = CONTAINER_OF(shs, *dss, shs);
+    libxl__egc *egc = dss->shs.egc;
+
+    /* Convenience aliases */
+    libxl__checkpoint_devices_state *const cds = &dss->css.cds;
+
+    cds->callback = colo_device_commit_cb;
+    libxl__checkpoint_devices_commit(egc, cds);
+}
+
+static void colo_device_commit_cb(libxl__egc *egc,
+                                  libxl__checkpoint_devices_state *cds,
+                                  int rc)
+{
+    libxl__colo_save_state *css = CONTAINER_OF(cds, *css, cds);
+    libxl__domain_suspend_state *dss = CONTAINER_OF(css, *dss, css);
+
+    STATE_AO_GC(cds->ao);
+
+    if (rc) {
+        LOG(ERROR, "commit fails");
+        goto out;
+    }
+
+    /* TODO: wait for a new checkpoint */
+    colo_start_new_checkpoint(egc, cds, 0);
+    return;
+
+out:
+    libxl__xc_domain_saverestore_async_callback_done(egc, &dss->shs, 0);
+}
+
+static void colo_start_new_checkpoint(libxl__egc *egc,
+                                      libxl__checkpoint_devices_state *cds,
+                                      int rc)
+{
+    libxl__colo_save_state *css = CONTAINER_OF(cds, *css, cds);
+    libxl__domain_suspend_state *dss = CONTAINER_OF(css, *dss, css);
+    uint8_t section = LIBXL_COLO_NEW_CHECKPOINT;
+
+    /* Convenience aliases */
+    libxl__datacopier_state *const dc = &css->dc;
+
+    STATE_AO_GC(cds->ao);
+
+    if (rc)
+        goto out;
+
+    /* write LIBXL_COLO_NEW_CHECKPOINT */
+    memset(dc, 0, sizeof(*dc));
+    dc->ao = ao;
+    dc->readfd = -1;
+    dc->writefd = css->send_fd;
+    dc->maxsz = INT_MAX;
+    dc->copywhat = "new checkpoint is triggered";
+    dc->writewhat = "colo stream";
+    dc->callback = colo_send_data_done;
+
+    rc = libxl__datacopier_start(dc);
+    if (rc) {
+        LOG(ERROR, "libxl__datacopier_start() fails");
+        goto out;
+    }
+
+    /* tell slave that a new checkpoint is triggered */
+    libxl__datacopier_prefixdata(egc, dc, &section, sizeof(section));
+    return;
+
+out:
+    libxl__xc_domain_saverestore_async_callback_done(egc, &dss->shs, 0);
+}
+
+static void colo_send_data_done(libxl__egc *egc,
+                                libxl__datacopier_state *dc,
+                                int onwrite, int errnoval)
+{
+    libxl__colo_save_state *css = CONTAINER_OF(dc, *css, dc);
+    libxl__domain_suspend_state *dss = CONTAINER_OF(css, *dss, css);
+    int ok;
+
+    STATE_AO_GC(dc->ao);
+
+    if (onwrite == -1 || errnoval) {
+        LOG(ERROR, "cannot start a new checkpoint");
+        ok = 0;
+        goto out;
+    }
+
+    /* Everything is OK */
+    ok = 1;
+
+out:
+    libxl__xc_domain_saverestore_async_callback_done(egc, &dss->shs, ok);
+}
+
+
+/* ===================== colo: common callback ===================== */
+static void colo_common_read_done(libxl__egc *egc,
+                                  libxl__datareader_state *drs,
+                                  ssize_t real_size, int errnoval)
+{
+    int ok = 0;
+    libxl__colo_save_state *css = CONTAINER_OF(drs, *css, drs);
+    libxl__domain_suspend_state *dss = CONTAINER_OF(css, *dss, css);
+    STATE_AO_GC(drs->ao);
+
+    if (real_size < drs->readsize) {
+        LOG(ERROR, "reading data fails: %lld", (long long)real_size);
+        goto out;
+    }
+
+    if (!css->callback) {
+        /* Everything is OK */
+        ok = 1;
+        goto out;
+    }
+
+    css->callback(egc, css);
+    return;
+
+out:
+    libxl__xc_domain_saverestore_async_callback_done(egc, &dss->shs, ok);
+}
diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
index 769952c..450eb39 100644
--- a/tools/libxl/libxl_dom.c
+++ b/tools/libxl/libxl_dom.c
@@ -20,6 +20,7 @@
 #include "libxl_internal.h"
 #include "libxl_arch.h"
 #include "libxl_remus.h"
+#include "libxl_colo.h"
 
 #include <xc_dom.h>
 #include <xen/hvm/hvm_info_table.h>
@@ -1624,7 +1625,12 @@ void libxl__domain_suspend(libxl__egc *egc, libxl__domain_suspend_state *dss)
     }
 
     memset(callbacks, 0, sizeof(*callbacks));
-    if (r_info != NULL) {
+    if (r_info != NULL && r_info->colo) {
+        callbacks->suspend = libxl__colo_save_domain_suspend_callback;
+        callbacks->postcopy = libxl__colo_save_domain_resume_callback;
+        callbacks->checkpoint = libxl__colo_save_domain_checkpoint_callback;
+        callbacks->get_dirty_pfn = libxl__colo_save_get_dirty_pfn_callback;
+    } else if (r_info != NULL) {
         callbacks->suspend = libxl__remus_domain_suspend_callback;
         callbacks->postcopy = libxl__remus_domain_resume_callback;
         callbacks->checkpoint = libxl__remus_domain_checkpoint_callback;
@@ -1785,7 +1791,10 @@ static void domain_suspend_done(libxl__egc *egc,
         xc_suspend_evtchn_release(CTX->xch, CTX->xce, domid,
                            dss2->guest_evtchn.port, &dss2->guest_evtchn_lockfd);
 
-    if (dss->remus) {
+    if (dss->remus && dss->remus->colo) {
+        libxl__colo_save_teardown(egc, &dss->css, rc);
+        return;
+    } else if (dss->remus) {
         libxl__remus_teardown(egc, &dss->rs, rc);
         return;
     }
diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
index d2e3176..aebc972 100644
--- a/tools/libxl/libxl_internal.h
+++ b/tools/libxl/libxl_internal.h
@@ -2521,7 +2521,7 @@ typedef struct libxl__save_helper_state {
 /*
  * The abstract checkpoint device layer exposes a common
  * set of API to [external] libxl for manipulating devices attached to
- * a guest protected by Remus. The device layer also exposes a set of
+ * a guest protected by Remus/COLO. The device layer also exposes a set of
  * [internal] interfaces that every device type must implement.
  *
  * The following API are exposed to libxl:
@@ -2539,7 +2539,7 @@ typedef struct libxl__save_helper_state {
  *  +libxl__checkpoint_devices_commit
  *
  * Each device type needs to implement the interfaces specified in
- * the libxl__checkpoint_device_subkind_ops if it wishes to support Remus.
+ * the libxl__checkpoint_device_subkind_ops if it wishes to support Remus/COLO.
  *
  * The high-level control flow through the checkpoint device layer is shown
  * below:
@@ -2564,7 +2564,7 @@ typedef struct libxl__checkpoint_device_subkind_ops libxl__checkpoint_device_sub
 
 /*
  * Interfaces to be implemented by every device type that wishes to
- * support Remus. Functions must be implemented unless otherwise
+ * support Remus/COLO. Functions must be implemented unless otherwise
  * stated. Many of these functions are asynchronous. They call
  * dev->aodev.callback when done.  The actual implementations may be
  * synchronous and call dev->aodev.callback directly (as the last
@@ -2719,6 +2719,25 @@ struct libxl__remus_state {
 
 _hidden int libxl__netbuffer_enabled(libxl__gc *gc);
 
+/*----- colo related state structure -----*/
+typedef struct libxl__colo_save_state libxl__colo_save_state;
+struct libxl__colo_save_state {
+    libxl__checkpoint_devices_state cds;
+    int send_fd;
+    int recv_fd;
+
+    /* private */
+    libxl__datacopier_state dc;
+    libxl__datareader_state drs;
+    uint8_t section;
+    uint64_t count;
+    uint64_t *buff;
+    /* read section and count, and then store it in temp_buff */
+    uint8_t temp_buff[9];
+    void (*callback)(libxl__egc *, libxl__colo_save_state *);
+    bool svm_running;
+};
+
 /*----- Domain suspend (save) state structure -----*/
 
 typedef struct libxl__domain_suspend_state libxl__domain_suspend_state;
@@ -2782,7 +2801,12 @@ struct libxl__domain_suspend_state {
     libxl__domain_suspend_state2 dss2;
     int hvm;
     int xcflags;
-    libxl__remus_state rs;
+    union {
+        /* for Remus */
+        libxl__remus_state rs;
+        /* for COLO */
+        libxl__colo_save_state css;
+    };
     libxl__save_helper_state shs;
     libxl__logdirty_switch logdirty;
     /* private for libxl__domain_save_device_model */
diff --git a/tools/libxl/libxl_save_msgs_gen.pl b/tools/libxl/libxl_save_msgs_gen.pl
index 0239cac..fbb2d67 100755
--- a/tools/libxl/libxl_save_msgs_gen.pl
+++ b/tools/libxl/libxl_save_msgs_gen.pl
@@ -36,6 +36,7 @@ our @msgs = (
                                               'unsigned long', 'console_mfn'] ],
     [  9, 'srW',    "complete",              [qw(int retval
                                                  int errnoval)] ],
+    [ 10, 'scxAB',  "get_dirty_pfn", [] ],
 );
 
 #----------------------------------------
diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl
index ea51d1a..599f137 100644
--- a/tools/libxl/libxl_types.idl
+++ b/tools/libxl/libxl_types.idl
@@ -602,6 +602,7 @@ libxl_domain_remus_info = Struct("domain_remus_info",[
     ("netbuf",       bool),
     ("netbufscript", string),
     ("diskbuf",      bool),
+    ("colo",         bool)
     ])
 
 libxl_event_type = Enumeration("event_type", [
-- 
1.9.3

* [RFC Patch v2 20/45] xc_domain_save: flush cache before calling callbacks->postcopy() in colo mode
  2014-08-08  7:00 [RFC Patch v2 00/45] COarse-grain LOck-stepping Virtual Machines for Non-stop Service Wen Congyang
                   ` (18 preceding siblings ...)
  2014-08-08  7:01 ` [RFC Patch v2 19/45] primary vm suspend/get_dirty_pfn/resume/checkpoint code Wen Congyang
@ 2014-08-08  7:01 ` Wen Congyang
  2014-08-08  7:01 ` [RFC Patch v2 21/45] COLO: xc related codes Wen Congyang
                   ` (26 subsequent siblings)
  46 siblings, 0 replies; 64+ messages in thread
From: Wen Congyang @ 2014-08-08  7:01 UTC (permalink / raw)
  To: xen devel
  Cc: Ian Campbell, Wen Congyang, Ian Jackson, Jiang Yunhong,
	Dong Eddie, Yang Hongyang, Lai Jiangshan

In COLO mode the secondary vm is running. We use the io_fd to
ensure that the primary vm and the secondary vm are resumed at the
same time, so postcopy must be called later, after the cache has
been flushed.
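
The ordering described above can be sketched in isolation. This is only an
illustration, not the patch itself: struct save_callbacks, the stub functions
and end_of_checkpoint() are hypothetical names, and the presence of
get_dirty_pfn is assumed to be what marks COLO mode, as in the condition this
patch adds.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the save callbacks; only the fields that
 * decide the ordering are modelled. */
struct save_callbacks {
    int (*postcopy)(void *data);
    uint8_t *(*get_dirty_pfn)(void *data);  /* non-NULL is taken as COLO mode */
    void *data;
};

/* Records the order in which the steps ran. */
static char trace[64];

static int postcopy_stub(void *data)
{
    (void)data;
    strcat(trace, "postcopy ");
    return 1;
}

static uint8_t *dirty_stub(void *data)
{
    (void)data;
    return NULL;
}

static void flush_stub(void)
{
    strcat(trace, "flush ");
}

/* Remus: postcopy runs before the flush. COLO: postcopy runs after it. */
static const char *end_of_checkpoint(const struct save_callbacks *cb, int rc)
{
    trace[0] = '\0';

    if (!rc && cb->postcopy && !cb->get_dirty_pfn)
        cb->postcopy(cb->data);

    flush_stub();

    if (!rc && cb->postcopy && cb->get_dirty_pfn)
        cb->postcopy(cb->data);

    return trace;
}
```

With get_dirty_pfn left NULL the trace reads "postcopy flush "; with it set,
the trace reads "flush postcopy ".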

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
 tools/libxc/xc_domain_save.c | 16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/tools/libxc/xc_domain_save.c b/tools/libxc/xc_domain_save.c
index 254fdb3..61caa47 100644
--- a/tools/libxc/xc_domain_save.c
+++ b/tools/libxc/xc_domain_save.c
@@ -2078,10 +2078,15 @@ int xc_domain_save(xc_interface *xch, int io_fd, uint32_t dom, uint32_t max_iter
  out_rc:
     completed = 1;
 
-    if ( !rc && callbacks->postcopy )
+    /*
+     * COLO: secondary vm is running. We will use the io_fd to
+     * ensure that both primary vm and secondary vm are resumed
+     * at the same time. So we should call postcopy later.
+     */
+    if ( !rc && callbacks->postcopy && !callbacks->get_dirty_pfn )
         callbacks->postcopy(callbacks->data);
 
-    /* guest has been resumed. Now we can compress data
+    /* Remus: guest has been resumed. Now we can compress data
      * at our own pace.
      */
     if (!rc && compressing)
@@ -2109,6 +2114,13 @@ int xc_domain_save(xc_interface *xch, int io_fd, uint32_t dom, uint32_t max_iter
 
     discard_file_cache(xch, io_fd, 1 /* flush */);
 
+    /*
+     * COLO: send qemu device state and resume both
+     * primary vm and secondary vm now.
+     */
+    if ( !rc && callbacks->postcopy && callbacks->get_dirty_pfn )
+        callbacks->postcopy(callbacks->data);
+
     /* Enable compression now, finally */
     compressing = (flags & XCFLAGS_CHECKPOINT_COMPRESS);
 
-- 
1.9.3

* [RFC Patch v2 21/45] COLO: xc related codes
  2014-08-08  7:00 [RFC Patch v2 00/45] COarse-grain LOck-stepping Virtual Machines for Non-stop Service Wen Congyang
                   ` (19 preceding siblings ...)
  2014-08-08  7:01 ` [RFC Patch v2 20/45] xc_domain_save: flush cache before calling callbacks->postcopy() in colo mode Wen Congyang
@ 2014-08-08  7:01 ` Wen Congyang
  2014-08-08  7:01 ` [RFC Patch v2 22/45] send store mfn and console mfn to xl before resuming secondary vm Wen Congyang
                   ` (25 subsequent siblings)
  46 siblings, 0 replies; 64+ messages in thread
From: Wen Congyang @ 2014-08-08  7:01 UTC (permalink / raw)
  To: xen devel
  Cc: Ian Campbell, Wen Congyang, Ian Jackson, Jiang Yunhong,
	Dong Eddie, Yang Hongyang, Lai Jiangshan

Save:
1. send XC_SAVE_ID_LAST_CHECKPOINT so that the secondary vm can be resumed
2. call callbacks->get_dirty_pfn() after suspending the primary vm if we
   are doing a checkpoint.
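
As a rough sketch of how the list returned by get_dirty_pfn() is consumed:
element 0 holds the count and the following elements hold the pfns, each of
which is marked in the to_send bitmap. The helper names below
(set_bit_sketch, test_bit_sketch, mark_dirty_pfns) are hypothetical, and the
sketch assumes p2m_size is an exclusive upper bound on valid pfns.

```c
#include <errno.h>
#include <limits.h>
#include <stdint.h>

#define BITS_PER_LONG_SKETCH (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical bitmap helpers, standing in for the ones used by libxc. */
static void set_bit_sketch(uint64_t nr, unsigned long *bitmap)
{
    bitmap[nr / BITS_PER_LONG_SKETCH] |= 1UL << (nr % BITS_PER_LONG_SKETCH);
}

static int test_bit_sketch(uint64_t nr, const unsigned long *bitmap)
{
    return (int)((bitmap[nr / BITS_PER_LONG_SKETCH]
                  >> (nr % BITS_PER_LONG_SKETCH)) & 1UL);
}

/*
 * pfn_list[0] is the number of entries, pfn_list[1..count] are the pfns
 * dirtied by the secondary vm. Every pfn is added to to_send so that the
 * primary resends those pages. Returns -1 with errno set to EINVAL if a
 * pfn is out of range (assumed here: valid pfns are 0 .. p2m_size - 1).
 */
static int mark_dirty_pfns(const uint64_t *pfn_list, uint64_t p2m_size,
                           unsigned long *to_send)
{
    uint64_t count = pfn_list[0];
    uint64_t i;

    for (i = 0; i < count; i++) {
        uint64_t pfn = pfn_list[i + 1];

        if (pfn >= p2m_size) {
            errno = EINVAL;
            return -1;
        }
        set_bit_sketch(pfn, to_send);
    }

    return 0;
}
```

The caller owns pfn_list in this sketch and is expected to free it on both
the success and the error path.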

Restore:
1. call the resume/checkpoint/suspend callbacks if the secondary vm's
   status is the same as the primary vm's status.
2. zero out tdata, because we will use it to zero out pagebuf.tdata.
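
The restore-side loop can be sketched as follows. This is a simplified model
under stated assumptions, not the patch code: the struct, enum and function
names are hypothetical, and the return-value convention (0 for an internal
error, 2 for a read/write error that triggers failover, anything else to
continue) is taken from the macro added to xc_domain_restore() below.

```c
#include <stddef.h>

/* What the restore loop should do after running the three callbacks. */
enum loop_action {
    LOOP_ABORT,            /* a callback returned 0: internal error */
    LOOP_FAILOVER,         /* a callback returned 2: I/O error, do failover */
    LOOP_NEXT_CHECKPOINT   /* all callbacks succeeded: read next checkpoint */
};

/* Hypothetical stand-in for the restore callbacks. */
struct restore_callbacks_sketch {
    int (*postcopy)(void *data);    /* resume the secondary vm */
    int (*checkpoint)(void *data);  /* wait for a new checkpoint */
    int (*suspend)(void *data);     /* suspend the secondary vm */
    void *data;
};

/* Stubs used to exercise each outcome. */
static int cb_ok(void *data)       { (void)data; return 1; }
static int cb_internal(void *data) { (void)data; return 0; }
static int cb_io_error(void *data) { (void)data; return 2; }

static enum loop_action
after_checkpoint(const struct restore_callbacks_sketch *cb)
{
    int (*steps[3])(void *);
    size_t i;

    steps[0] = cb->postcopy;
    steps[1] = cb->checkpoint;
    steps[2] = cb->suspend;

    /* Run resume, wait, suspend in order; stop at the first failure. */
    for (i = 0; i < 3; i++) {
        int frc = steps[i](cb->data);

        if (frc == 0)
            return LOOP_ABORT;
        if (frc == 2)
            return LOOP_FAILOVER;
    }

    return LOOP_NEXT_CHECKPOINT;
}
```

Only when all three callbacks succeed does the loop go back to read the next
checkpoint; an I/O error leaves the secondary vm in place so that it can take
over.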

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
 tools/libxc/xc_domain_restore.c | 44 ++++++++++++++++++++++++++++++++--
 tools/libxc/xc_domain_save.c    | 52 +++++++++++++++++++++++++++++++++++++++--
 2 files changed, 92 insertions(+), 4 deletions(-)

diff --git a/tools/libxc/xc_domain_restore.c b/tools/libxc/xc_domain_restore.c
index 2d6139c..fe188f4 100644
--- a/tools/libxc/xc_domain_restore.c
+++ b/tools/libxc/xc_domain_restore.c
@@ -1454,7 +1454,7 @@ int xc_domain_restore(xc_interface *xch, int io_fd, uint32_t dom,
     int nraces = 0;
 
     /* The new domain's shared-info frame number. */
-    unsigned long shared_info_frame;
+    unsigned long shared_info_frame = 0;
     unsigned char shared_info_page[PAGE_SIZE]; /* saved contents from file */
     shared_info_any_t *old_shared_info = 
         (shared_info_any_t *)shared_info_page;
@@ -1504,6 +1504,8 @@ int xc_domain_restore(xc_interface *xch, int io_fd, uint32_t dom,
 
     DPRINTF("%s: starting restore of new domid %u", __func__, dom);
 
+    n = m = 0;
+
     pagebuf_init(&pagebuf);
     memset(&tailbuf, 0, sizeof(tailbuf));
     tailbuf.ishvm = hvm;
@@ -1629,7 +1631,6 @@ int xc_domain_restore(xc_interface *xch, int io_fd, uint32_t dom,
      * We uncanonicalise page tables as we go.
      */
 
-    n = m = 0;
  loadpages:
     for ( ; ; )
     {
@@ -1793,6 +1794,7 @@ int xc_domain_restore(xc_interface *xch, int io_fd, uint32_t dom,
         goto finish;
     }
 
+new_checkpoint:
     // DPRINTF("Buffered checkpoint\n");
 
     if ( pagebuf_get(xch, ctx, &pagebuf, io_fd, dom) ) {
@@ -2292,6 +2294,7 @@ int xc_domain_restore(xc_interface *xch, int io_fd, uint32_t dom,
             free(tdata.data);
             goto out;
         }
+        memset(&tdata, 0, sizeof(tdata));
     }
 
     /* Dump the QEMU state to a state file for QEMU to load */
@@ -2357,6 +2360,43 @@ int xc_domain_restore(xc_interface *xch, int io_fd, uint32_t dom,
     rc = 0;
 
  out:
+    if ( !rc && callbacks->checkpoint )
+    {
+#define HANDLE_CALLBACK_RETURN_VALUE(frc)                   \
+    do {                                                    \
+        if ( frc == 0 )                                     \
+        {                                                   \
+            /* Some internal error happens */               \
+            rc = 1;                                         \
+            goto out;                                       \
+        }                                                   \
+        else if ( frc == 2 )                                \
+        {                                                   \
+            /* Reading/writing error, do failover */        \
+            rc = 0;                                         \
+            goto failover;                                  \
+        }                                                   \
+    } while (0)
+        /* COLO */
+
+        /* TODO: call restore_results */
+
+        /* Resume secondary vm */
+        frc = callbacks->postcopy(callbacks->data);
+        HANDLE_CALLBACK_RETURN_VALUE(frc);
+
+        /* wait for new checkpoint */
+        frc = callbacks->checkpoint(callbacks->data);
+        HANDLE_CALLBACK_RETURN_VALUE(frc);
+
+        /* suspend secondary vm */
+        frc = callbacks->suspend(callbacks->data);
+        HANDLE_CALLBACK_RETURN_VALUE(frc);
+
+        goto new_checkpoint;
+    }
+
+failover:
     if ( (rc != 0) && (dom != 0) )
         xc_domain_destroy(xch, dom);
     xc_hypercall_buffer_free(xch, ctxt);
diff --git a/tools/libxc/xc_domain_save.c b/tools/libxc/xc_domain_save.c
index 61caa47..79cc2c8 100644
--- a/tools/libxc/xc_domain_save.c
+++ b/tools/libxc/xc_domain_save.c
@@ -377,6 +377,31 @@ static int suspend_and_state(int (*suspend)(void*), void* data,
     return 0;
 }
 
+static int update_dirty_bitmap(uint8_t *(*get_dirty_pfn)(void *), void *data,
+                               unsigned long p2m_size, unsigned long *to_send)
+{
+    uint64_t *pfn_list;
+    uint64_t count, i;
+    uint64_t pfn;
+
+    pfn_list = (uint64_t *)get_dirty_pfn(data);
+    assert(pfn_list);
+
+    count = pfn_list[0];
+    for (i = 0; i < count; i++) {
+        pfn = pfn_list[i + 1];
+        if (pfn >= p2m_size) {
+            errno = EINVAL;
+            return -1;
+        }
+
+        set_bit(pfn, to_send);
+    }
+
+    free(pfn_list);
+    return 0;
+}
+
 /*
 ** Map the top-level page of MFNs from the guest. The guest might not have
 ** finished resuming from a previous restore operation, so we wait a while for
@@ -1769,11 +1794,14 @@ int xc_domain_save(xc_interface *xch, int io_fd, uint32_t dom, uint32_t max_iter
         free(buf);
     }
 
-    if ( !callbacks->checkpoint )
+    if ( !callbacks->checkpoint || callbacks->get_dirty_pfn )
     {
         /*
          * If this is not a checkpointed save then this must be the first and
          * last checkpoint.
+         *
+         * If we are in colo mode, send last checkpoint to resume secondary
+         * vm.
          */
         i = XC_SAVE_ID_LAST_CHECKPOINT;
         if ( wrexact(io_fd, &i, sizeof(int)) )
@@ -2119,7 +2147,14 @@ int xc_domain_save(xc_interface *xch, int io_fd, uint32_t dom, uint32_t max_iter
      * primary vm and secondary vm now.
      */
     if ( !rc && callbacks->postcopy && callbacks->get_dirty_pfn )
-        callbacks->postcopy(callbacks->data);
+    {
+        if ( !callbacks->postcopy(callbacks->data) )
+        {
+            ERROR("postcopy fails");
+            /* postcopy may be implemented in libxl, no way to get errno */
+            rc = -1;
+        }
+    }
 
     /* Enable compression now, finally */
     compressing = (flags & XCFLAGS_CHECKPOINT_COMPRESS);
@@ -2136,8 +2171,11 @@ int xc_domain_save(xc_interface *xch, int io_fd, uint32_t dom, uint32_t max_iter
                                io_fd, dom, &info) )
         {
             ERROR("Domain appears not to have suspended");
+            /* suspend may be implemented in libxl, no way to get errno */
+            errno = -1;
             goto out;
         }
+
         DPRINTF("SUSPEND shinfo %08lx\n", info.shared_info_frame);
         print_stats(xch, dom, 0, &time_stats, &shadow_stats, 1);
 
@@ -2148,6 +2186,16 @@ int xc_domain_save(xc_interface *xch, int io_fd, uint32_t dom, uint32_t max_iter
             PERROR("Error flushing shadow PT");
         }
 
+        if ( callbacks->get_dirty_pfn )
+        {
+            if ( update_dirty_bitmap(callbacks->get_dirty_pfn, callbacks->data,
+                                     dinfo->p2m_size, to_send) )
+            {
+                ERROR("getting secondary vm's dirty pages failed");
+                goto out;
+            }
+        }
+
         goto copypages;
     }
 
-- 
1.9.3


* [RFC Patch v2 22/45] send store mfn and console mfn to xl before resuming secondary vm
  2014-08-08  7:00 [RFC Patch v2 00/45] COarse-grain LOck-stepping Virtual Machines for Non-stop Service Wen Congyang
                   ` (20 preceding siblings ...)
  2014-08-08  7:01 ` [RFC Patch v2 21/45] COLO: xc related codes Wen Congyang
@ 2014-08-08  7:01 ` Wen Congyang
  2014-08-08  7:01 ` [RFC Patch v2 23/45] implement the cmdline for COLO Wen Congyang
                   ` (24 subsequent siblings)
  46 siblings, 0 replies; 64+ messages in thread
From: Wen Congyang @ 2014-08-08  7:01 UTC (permalink / raw)
  To: xen devel
  Cc: Ian Campbell, Wen Congyang, Ian Jackson, Jiang Yunhong,
	Dong Eddie, Yang Hongyang, Lai Jiangshan

libxl__xc_domain_restore_done() will be called to rebuild the secondary vm,
but rebuilding needs the store mfn and the console mfn. So make
restore_results a function pointer in the callbacks struct and in struct
{save,restore}_callbacks, and use this callback to send the store mfn and
console mfn to xl.
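
The callback flow described above can be sketched roughly as follows. This is
only an illustration, not the libxc/libxl code: struct demo_callbacks,
struct demo_results and the demo_* functions are hypothetical stand-ins, and
only the restore_results signature is taken from the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the relevant part of struct restore_callbacks. */
struct demo_callbacks {
    void (*restore_results)(unsigned long store_mfn,
                            unsigned long console_mfn, void *data);
    void *data;
};

/* Hypothetical toolstack-side state that receives the two MFNs. */
struct demo_results {
    unsigned long store_mfn;
    unsigned long console_mfn;
    int received;
};

/* Toolstack-side handler: record the MFNs so the rebuild step can use them. */
static void demo_restore_results(unsigned long store_mfn,
                                 unsigned long console_mfn, void *data)
{
    struct demo_results *res = data;

    res->store_mfn = store_mfn;
    res->console_mfn = console_mfn;
    res->received = 1;
}

/* Restore-side step: report the MFNs before the secondary vm is resumed. */
static void demo_report_before_resume(const struct demo_callbacks *cb,
                                      unsigned long store_mfn,
                                      unsigned long console_mfn)
{
    assert(cb != NULL);
    if (cb->restore_results)
        cb->restore_results(store_mfn, console_mfn, cb->data);
}
```

The point of the sketch is the ordering: the values are handed over through
the callback first, and only then would the resume step run.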

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
 tools/libxc/xc_domain_restore.c    | 2 +-
 tools/libxc/xenguest.h             | 8 ++++++++
 tools/libxl/libxl_colo_restore.c   | 5 -----
 tools/libxl/libxl_create.c         | 1 +
 tools/libxl/libxl_save_msgs_gen.pl | 2 +-
 5 files changed, 11 insertions(+), 7 deletions(-)

diff --git a/tools/libxc/xc_domain_restore.c b/tools/libxc/xc_domain_restore.c
index fe188f4..3700473 100644
--- a/tools/libxc/xc_domain_restore.c
+++ b/tools/libxc/xc_domain_restore.c
@@ -2379,7 +2379,7 @@ new_checkpoint:
     } while (0)
         /* COLO */
 
-        /* TODO: call restore_results */
+        callbacks->restore_results(*store_mfn, *console_mfn, callbacks->data);
 
         /* Resume secondary vm */
         frc = callbacks->postcopy(callbacks->data);
diff --git a/tools/libxc/xenguest.h b/tools/libxc/xenguest.h
index 1aeaad2..be8afd4 100644
--- a/tools/libxc/xenguest.h
+++ b/tools/libxc/xenguest.h
@@ -123,6 +123,14 @@ struct restore_callbacks {
     /* Enable qemu-dm logging dirty pages to xen */
     int (*switch_qemu_logdirty)(int domid, unsigned enable, void *data); /* HVM only */
 
+    /*
+     * callback to send store mfn and console mfn to xl
+     * if we want to resume vm before xc_domain_save()
+     * exits.
+     */
+    void (*restore_results)(unsigned long store_mfn, unsigned long console_mfn,
+                            void *data);
+
     /* callback to restore toolstack specific data */
     int (*toolstack_restore)(uint32_t domid, const uint8_t *buf,
             uint32_t size, void* data);
diff --git a/tools/libxl/libxl_colo_restore.c b/tools/libxl/libxl_colo_restore.c
index ebbd6b9..aea3feb 100644
--- a/tools/libxl/libxl_colo_restore.c
+++ b/tools/libxl/libxl_colo_restore.c
@@ -133,11 +133,6 @@ static void colo_resume_vm(libxl__egc *egc,
         return;
     }
 
-    /*
-     * TODO: get store mfn and console mfn
-     *  We should call the callback restore_results in
-     *  xc_domain_restore() before resuming the guest.
-     */
     libxl__xc_domain_restore_done(egc, dcs, 0, 0, 0);
 
     return;
diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c
index fef9b36..46bd02d 100644
--- a/tools/libxl/libxl_create.c
+++ b/tools/libxl/libxl_create.c
@@ -1063,6 +1063,7 @@ static void domcreate_bootloader_done(libxl__egc *egc,
         rc = ERROR_INVAL;
         goto out;
     }
+    callbacks->restore_results = libxl__srm_callout_callback_restore_results;
 
     if (checkpointed_stream == LIBXL_CHECKPOINTED_STREAM_COLO) {
         crs->ao = ao;
diff --git a/tools/libxl/libxl_save_msgs_gen.pl b/tools/libxl/libxl_save_msgs_gen.pl
index fbb2d67..2ecd25d 100755
--- a/tools/libxl/libxl_save_msgs_gen.pl
+++ b/tools/libxl/libxl_save_msgs_gen.pl
@@ -32,7 +32,7 @@ our @msgs = (
     #                toolstack_save          done entirely `by hand'
     [  7, 'rcxW',   "toolstack_restore",     [qw(uint32_t domid
                                                 BLOCK tsdata)] ],
-    [  8, 'r',      "restore_results",       ['unsigned long', 'store_mfn',
+    [  8, 'rcx',    "restore_results",       ['unsigned long', 'store_mfn',
                                               'unsigned long', 'console_mfn'] ],
     [  9, 'srW',    "complete",              [qw(int retval
                                                  int errnoval)] ],
-- 
1.9.3


* [RFC Patch v2 23/45] implement the cmdline for COLO
  2014-08-08  7:00 [RFC Patch v2 00/45] COarse-grain LOck-stepping Virtual Machines for Non-stop Service Wen Congyang
                   ` (21 preceding siblings ...)
  2014-08-08  7:01 ` [RFC Patch v2 22/45] send store mfn and console mfn to xl before resuming secondary vm Wen Congyang
@ 2014-08-08  7:01 ` Wen Congyang
  2014-08-08  7:01 ` [RFC Patch v2 24/45] HACK: do checkpoint per 20ms Wen Congyang
                   ` (23 subsequent siblings)
  46 siblings, 0 replies; 64+ messages in thread
From: Wen Congyang @ 2014-08-08  7:01 UTC (permalink / raw)
  To: xen devel
  Cc: Ian Campbell, Wen Congyang, Ian Jackson, Jiang Yunhong,
	Dong Eddie, Yang Hongyang, Lai Jiangshan

Add a new option -c to the command 'xl remus'. To use COLO HA instead
of Remus HA, pass the -c option.

Update man pages to reflect the addition of a new option to
'xl remus' command.

Also add a new option -c to the internal command 'xl migrate-receive'.
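
The conflict rule for -c (it cannot be combined with -i or -b) can be
sketched as below. This is a simplified illustration, not the xl code:
struct demo_remus_opts and demo_check_colo_opts() are hypothetical names,
and the check only approximates the one added to main_remus().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced view of the options that matter for the check. */
struct demo_remus_opts {
    bool colo;          /* -c given */
    bool interval_set;  /* -i given explicitly */
    int interval;       /* checkpoint interval in ms */
    bool blackhole;     /* -b given */
};

/* Returns 0 if the combination is acceptable, -1 if -c conflicts. */
static int demo_check_colo_opts(struct demo_remus_opts *o)
{
    assert(o != NULL);
    if (!o->colo)
        return 0;

    /* In COLO mode the default Remus interval does not apply. */
    if (!o->interval_set)
        o->interval = 0;

    if (o->interval > 0 || o->blackhole)
        return -1;
    return 0;
}
```

With this rule, 'xl remus -c domain-id host' is accepted, while adding an
explicit non-zero -i or -b would be rejected.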

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
 docs/man/xl.pod.1         | 11 +++++++++--
 tools/libxl/xl_cmdimpl.c  | 47 ++++++++++++++++++++++++++++++++++++++---------
 tools/libxl/xl_cmdtable.c |  3 ++-
 3 files changed, 49 insertions(+), 12 deletions(-)

diff --git a/docs/man/xl.pod.1 b/docs/man/xl.pod.1
index bce4bfe..297cd04 100644
--- a/docs/man/xl.pod.1
+++ b/docs/man/xl.pod.1
@@ -427,12 +427,15 @@ Print huge (!) amount of debug during the migration process.
 
 =item B<remus> [I<OPTIONS>] I<domain-id> I<host>
 
-Enable Remus HA for domain. By default B<xl> relies on ssh as a transport
-mechanism between the two hosts.
+Enable Remus HA or COLO HA for domain. By default B<xl> relies on ssh as a
+transport mechanism between the two hosts.
 
 N.B: Remus support in xl is still in experimental (proof-of-concept) phase.
      Disk replication support is limited to DRBD disks.
 
+     COLO support in xl is still in experimental (proof-of-concept) phase.
+     There is no support for network or disk at the moment.
+
 B<OPTIONS>
 
 =over 4
@@ -478,6 +481,10 @@ Disable network output buffering. Requires enabling unsafe mode.
 
 Disable disk replication. Requires enabling unsafe mode.
 
+=item B<-c>
+
+Enable COLO HA. It conflicts with B<-i> and B<-b>.
+
 =back
 
 =item B<pause> I<domain-id>
diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
index ccb46ab..1110d53 100644
--- a/tools/libxl/xl_cmdimpl.c
+++ b/tools/libxl/xl_cmdimpl.c
@@ -3754,6 +3754,9 @@ static void migrate_receive(int debug, int daemonize, int monitor,
     dom_info.send_fd = send_fd;
     dom_info.migration_domname_r = &migration_domname;
     dom_info.checkpointed_stream = remus;
+    if (remus == LIBXL_CHECKPOINTED_STREAM_COLO)
+        /* COLO uses stdout to send control messages to the master */
+        dom_info.quiet = 1;
 
     rc = create_domain(&dom_info);
     if (rc < 0) {
@@ -3768,7 +3771,8 @@ static void migrate_receive(int debug, int daemonize, int monitor,
         /* If we are here, it means that the sender (primary) has crashed.
          * TODO: Split-Brain Check.
          */
-        fprintf(stderr, "migration target: Remus Failover for domain %u\n",
+        fprintf(stderr, "migration target: %s Failover for domain %u\n",
+                remus == LIBXL_CHECKPOINTED_STREAM_COLO ? "COLO" : "Remus",
                 domid);
 
         /*
@@ -3785,15 +3789,21 @@ static void migrate_receive(int debug, int daemonize, int monitor,
             rc = libxl_domain_rename(ctx, domid, migration_domname,
                                      common_domname);
             if (rc)
-                fprintf(stderr, "migration target (Remus): "
+                fprintf(stderr, "migration target (%s): "
                         "Failed to rename domain from %s to %s:%d\n",
+                        remus == LIBXL_CHECKPOINTED_STREAM_COLO ? "COLO" : "Remus",
                         migration_domname, common_domname, rc);
         }
 
+        if (remus == LIBXL_CHECKPOINTED_STREAM_COLO)
+            /* The guest is running after failover in COLO mode */
+            exit(rc ? -ERROR_FAIL: 0);
+
         rc = libxl_domain_unpause(ctx, domid);
         if (rc)
-            fprintf(stderr, "migration target (Remus): "
+            fprintf(stderr, "migration target (%s): "
                     "Failed to unpause domain %s (id: %u):%d\n",
+                    remus == LIBXL_CHECKPOINTED_STREAM_COLO ? "COLO" : "Remus",
                     common_domname, domid, rc);
 
         exit(rc ? -ERROR_FAIL: 0);
@@ -3939,7 +3949,7 @@ int main_migrate_receive(int argc, char **argv)
     int debug = 0, daemonize = 1, monitor = 1, remus = 0;
     int opt;
 
-    SWITCH_FOREACH_OPT(opt, "Fedr", NULL, "migrate-receive", 0) {
+    SWITCH_FOREACH_OPT(opt, "Fedrc", NULL, "migrate-receive", 0) {
     case 'F':
         daemonize = 0;
         break;
@@ -3951,8 +3961,10 @@ int main_migrate_receive(int argc, char **argv)
         debug = 1;
         break;
     case 'r':
-        remus = 1;
+        remus = LIBXL_CHECKPOINTED_STREAM_REMUS;
         break;
+    case 'c':
+        remus = LIBXL_CHECKPOINTED_STREAM_COLO;
     }
 
     if (argc-optind != 0) {
@@ -7253,6 +7265,7 @@ int main_remus(int argc, char **argv)
     pid_t child = -1;
     uint8_t *config_data;
     int config_len;
+    int interval = 0;
 
     memset(&r_info, 0, sizeof(libxl_domain_remus_info));
     /* Defaults */
@@ -7263,9 +7276,10 @@ int main_remus(int argc, char **argv)
     r_info.netbuf = 1;
     r_info.diskbuf = 1;
 
-    SWITCH_FOREACH_OPT(opt, "Fbundi:s:N:e", NULL, "remus", 2) {
+    SWITCH_FOREACH_OPT(opt, "Fbundi:s:N:ec", NULL, "remus", 2) {
     case 'i':
         r_info.interval = atoi(optarg);
+        interval = 1;
         break;
     case 'F':
         r_info.unsafe = 1;
@@ -7291,6 +7305,8 @@ int main_remus(int argc, char **argv)
     case 'e':
         daemonize = 0;
         break;
+    case 'c':
+        r_info.colo = 1;
     }
 
     if (!r_info.unsafe &&
@@ -7303,6 +7319,16 @@ int main_remus(int argc, char **argv)
     domid = find_domain(argv[optind]);
     host = argv[optind + 1];
 
+    if (r_info.colo) {
+        if (!interval)
+            r_info.interval = 0;
+
+        if (r_info.interval + r_info.blackhole > 0) {
+            fprintf(stderr, "option -c conflicts with -i and -b\n");
+            exit(-1);
+        }
+    }
+
     if (!r_info.netbufscript)
         r_info.netbufscript = default_remus_netbufscript;
 
@@ -7317,8 +7343,9 @@ int main_remus(int argc, char **argv)
         if (!ssh_command[0]) {
             rune = host;
         } else {
-            if (asprintf(&rune, "exec %s %s xl migrate-receive -r %s",
+            if (asprintf(&rune, "exec %s %s xl migrate-receive %s %s",
                          ssh_command, host,
+                         r_info.colo ? "-c" : "-r",
                          daemonize ? "" : " -e") < 0)
                 return 1;
         }
@@ -7347,7 +7374,8 @@ int main_remus(int argc, char **argv)
      * domain to force failover
      */
     if (libxl_domain_info(ctx, 0, domid)) {
-        fprintf(stderr, "Remus: Primary domain has been destroyed.\n");
+        fprintf(stderr, "%s: Primary domain has been destroyed.\n",
+                r_info.colo ? "COLO" : "Remus");
         close(send_fd);
         return 0;
     }
@@ -7359,7 +7387,8 @@ int main_remus(int argc, char **argv)
     if (rc == ERROR_GUEST_TIMEDOUT)
         fprintf(stderr, "Failed to suspend domain at primary.\n");
     else {
-        fprintf(stderr, "Remus: Backup failed? resuming domain at primary.\n");
+        fprintf(stderr, "%s: Backup failed? resuming domain at primary.\n",
+                r_info.colo ? "COLO" : "Remus");
         libxl_domain_resume(ctx, domid, 1, 0);
     }
 
diff --git a/tools/libxl/xl_cmdtable.c b/tools/libxl/xl_cmdtable.c
index 6d4596b..22b63db 100644
--- a/tools/libxl/xl_cmdtable.c
+++ b/tools/libxl/xl_cmdtable.c
@@ -498,7 +498,8 @@ struct cmd_spec cmd_table[] = {
       "-b                      Replicate memory checkpoints to /dev/null (blackhole).\n"
       "                        Works only in unsafe mode.\n"
       "-n                      Disable network output buffering. Works only in unsafe mode.\n"
-      "-d                      Disable disk replication. Works only in unsafe mode."
+      "-d                      Disable disk replication. Works only in unsafe mode.\n"
+      "-c                      Enable COLO HA. Conflicts with -i and -b."
     },
 #endif
     { "devd",
-- 
1.9.3


* [RFC Patch v2 24/45] HACK: do checkpoint per 20ms
  2014-08-08  7:00 [RFC Patch v2 00/45] COarse-grain LOck-stepping Virtual Machines for Non-stop Service Wen Congyang
                   ` (22 preceding siblings ...)
  2014-08-08  7:01 ` [RFC Patch v2 23/45] implement the cmdline for COLO Wen Congyang
@ 2014-08-08  7:01 ` Wen Congyang
  2014-08-08  7:01 ` [RFC Patch v2 25/45] colo: dynamic allocate aio_requests to avoid -EBUSY error Wen Congyang
                   ` (22 subsequent siblings)
  46 siblings, 0 replies; 64+ messages in thread
From: Wen Congyang @ 2014-08-08  7:01 UTC (permalink / raw)
  To: xen devel
  Cc: Ian Campbell, Wen Congyang, Ian Jackson, Jiang Yunhong,
	Dong Eddie, Yang Hongyang, Lai Jiangshan

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
 tools/libxl/libxl_colo_save.c | 19 ++++++++++++++++++-
 tools/libxl/libxl_internal.h  |  3 +++
 2 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/tools/libxl/libxl_colo_save.c b/tools/libxl/libxl_colo_save.c
index 6675d3d..75d83c8 100644
--- a/tools/libxl/libxl_colo_save.c
+++ b/tools/libxl/libxl_colo_save.c
@@ -55,6 +55,8 @@ void libxl__colo_save_setup(libxl__egc *egc, libxl__colo_save_state *css)
     cds->ao = ao;
     cds->domid = dss->domid;
 
+    libxl__ev_time_init(&css->timeout);
+
     libxl__checkpoint_devices_setup(egc, &css->cds);
 
     return;
@@ -473,6 +475,8 @@ out:
 static void colo_device_commit_cb(libxl__egc *egc,
                                   libxl__checkpoint_devices_state *cds,
                                   int rc);
+static void colo_next_checkpoint(libxl__egc *egc, libxl__ev_time *ev,
+                                  const struct timeval *requested_abs);
 static void colo_start_new_checkpoint(libxl__egc *egc,
                                       libxl__checkpoint_devices_state *cds,
                                       int rc);
@@ -508,13 +512,26 @@ static void colo_device_commit_cb(libxl__egc *egc,
     }
 
     /* TODO: wait a new checkpoint */
-    colo_start_new_checkpoint(egc, cds, 0);
+    rc = libxl__ev_time_register_rel(gc, &css->timeout,
+                                     colo_next_checkpoint,
+                                     20);
+    if (rc)
+        goto out;
+
     return;
 
 out:
     libxl__xc_domain_saverestore_async_callback_done(egc, &dss->shs, 0);
 }
 
+static void colo_next_checkpoint(libxl__egc *egc, libxl__ev_time *ev,
+                                 const struct timeval *requested_abs)
+{
+    libxl__colo_save_state *css = CONTAINER_OF(ev, *css, timeout);
+
+    colo_start_new_checkpoint(egc, &css->cds, 0);
+}
+
 static void colo_start_new_checkpoint(libxl__egc *egc,
                                       libxl__checkpoint_devices_state *cds,
                                       int rc)
diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
index aebc972..6fc26c9 100644
--- a/tools/libxl/libxl_internal.h
+++ b/tools/libxl/libxl_internal.h
@@ -2736,6 +2736,9 @@ struct libxl__colo_save_state {
     uint8_t temp_buff[9];
     void (*callback)(libxl__egc *, libxl__colo_save_state *);
     bool svm_running;
+
+    /* hack */
+    libxl__ev_time timeout;
 };
 
 /*----- Domain suspend (save) state structure -----*/
-- 
1.9.3


* [RFC Patch v2 25/45] colo: dynamic allocate aio_requests to avoid -EBUSY error
  2014-08-08  7:00 [RFC Patch v2 00/45] COarse-grain LOck-stepping Virtual Machines for Non-stop Service Wen Congyang
                   ` (23 preceding siblings ...)
  2014-08-08  7:01 ` [RFC Patch v2 24/45] HACK: do checkpoint per 20ms Wen Congyang
@ 2014-08-08  7:01 ` Wen Congyang
  2014-08-08  7:01 ` [RFC Patch v2 26/45] fix memory leak in block-remus Wen Congyang
                   ` (21 subsequent siblings)
  46 siblings, 0 replies; 64+ messages in thread
From: Wen Congyang @ 2014-08-08  7:01 UTC (permalink / raw)
  To: xen devel
  Cc: Ian Campbell, Wen Congyang, Ian Jackson, Jiang Yunhong,
	Dong Eddie, Yang Hongyang, Lai Jiangshan

From: Lai Jiangshan <laijs@cn.fujitsu.com>

In the normal case, there are at most TAPDISK_DATA_REQUESTS requests
in flight at the same time. But in remus mode, the write requests are
forwarded from the master side and cached in block-remus. All
cached requests are forwarded to the aio driver when syncing the PVM
and SVM. In this case, the number of requests may exceed
TAPDISK_DATA_REQUESTS, so the aio driver cannot handle them all
at the same time, and tapdisk2 exits.

We don't know how many requests will need to be handled, so dynamically
allocate aio_requests to avoid this error.
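
The idea of growing the request pool on demand can be sketched as below.
This is a minimal illustration under stated assumptions, not the block-aio
code: DEMO_BATCH stands in for MAX_AIO_REQS, and struct demo_req,
struct demo_pool and the demo_pool_* functions are hypothetical. Batches are
never released in this sketch.

```c
#include <assert.h>
#include <stdlib.h>

#define DEMO_BATCH 4   /* hypothetical batch size, stands in for MAX_AIO_REQS */

struct demo_req { int in_use; };

struct demo_pool {
    int max_count;                 /* total requests ever allocated */
    int free_count;                /* entries currently on free_list */
    struct demo_req **free_list;   /* stack of free requests */
};

/* Add one more batch of requests; called only when the free list is empty. */
static int demo_pool_grow(struct demo_pool *p)
{
    int i, max = p->max_count + DEMO_BATCH;
    struct demo_req **list, *batch;

    /* The free list must be able to hold every request once all complete. */
    list = realloc(p->free_list, max * sizeof(*list));
    if (!list)
        return -1;
    p->free_list = list;

    batch = calloc(DEMO_BATCH, sizeof(*batch));
    if (!batch)
        return -1;

    for (i = 0; i < DEMO_BATCH; i++)
        p->free_list[p->free_count++] = &batch[i];
    p->max_count = max;
    return 0;
}

/* Take a free request, growing the pool instead of failing when empty. */
static struct demo_req *demo_pool_get(struct demo_pool *p)
{
    struct demo_req *r;

    assert(p != NULL);
    if (p->free_count == 0 && demo_pool_grow(p) < 0)
        return NULL;
    r = p->free_list[--p->free_count];
    r->in_use = 1;
    return r;
}

/* Return a completed request to the free list. */
static void demo_pool_put(struct demo_pool *p, struct demo_req *r)
{
    assert(p->free_count < p->max_count);
    r->in_use = 0;
    p->free_list[p->free_count++] = r;
}
```

A zero-initialised pool works as a starting point, because realloc() on a
NULL pointer behaves like malloc().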

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Jiang Yunhong <yunhong.jiang@intel.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
 tools/blktap2/drivers/block-aio.c | 36 +++++++++++++++++++++++++++++++++---
 1 file changed, 33 insertions(+), 3 deletions(-)

diff --git a/tools/blktap2/drivers/block-aio.c b/tools/blktap2/drivers/block-aio.c
index f398da2..10ab20b 100644
--- a/tools/blktap2/drivers/block-aio.c
+++ b/tools/blktap2/drivers/block-aio.c
@@ -55,9 +55,10 @@ struct tdaio_state {
 	int                  fd;
 	td_driver_t         *driver;
 
+	int                  aio_max_count;
 	int                  aio_free_count;	
 	struct aio_request   aio_requests[MAX_AIO_REQS];
-	struct aio_request  *aio_free_list[MAX_AIO_REQS];
+	struct aio_request   **aio_free_list;
 };
 
 /*Get Image size, secsize*/
@@ -122,6 +123,11 @@ int tdaio_open(td_driver_t *driver, const char *name, td_flag_t flags)
 
 	memset(prv, 0, sizeof(struct tdaio_state));
 
+	prv->aio_free_list = malloc(MAX_AIO_REQS * sizeof(*prv->aio_free_list));
+	if (!prv->aio_free_list)
+		return -ENOMEM;
+
+	prv->aio_max_count = MAX_AIO_REQS;
 	prv->aio_free_count = MAX_AIO_REQS;
 	for (i = 0; i < MAX_AIO_REQS; i++)
 		prv->aio_free_list[i] = &prv->aio_requests[i];
@@ -159,6 +165,28 @@ done:
 	return ret;	
 }
 
+static int tdaio_refill(struct tdaio_state *prv)
+{
+	struct aio_request **new, *new_req;
+	int i, max = prv->aio_max_count + MAX_AIO_REQS;
+
+	new = realloc(prv->aio_free_list, max * sizeof(*prv->aio_free_list));
+	if (!new)
+		return -1;
+	prv->aio_free_list = new;
+
+	new_req = calloc(MAX_AIO_REQS, sizeof(*new_req));
+	if (!new_req)
+		return -1;
+
+	prv->aio_max_count = max;
+	prv->aio_free_count = MAX_AIO_REQS;
+	for (i = 0; i < MAX_AIO_REQS; i++)
+		prv->aio_free_list[i] = &new_req[i];
+
+	return 0;
+}
+
 void tdaio_complete(void *arg, struct tiocb *tiocb, int err)
 {
 	struct aio_request *aio = (struct aio_request *)arg;
@@ -207,8 +235,10 @@ void tdaio_queue_write(td_driver_t *driver, td_request_t treq)
 	size    = treq.secs * driver->info.sector_size;
 	offset  = treq.sec  * (uint64_t)driver->info.sector_size;
 
-	if (prv->aio_free_count == 0)
-		goto fail;
+	if (prv->aio_free_count == 0) {
+		if (tdaio_refill(prv) < 0)
+			goto fail;
+	}
 
 	aio        = prv->aio_free_list[--prv->aio_free_count];
 	aio->treq  = treq;
-- 
1.9.3


* [RFC Patch v2 26/45] fix memory leak in block-remus
  2014-08-08  7:00 [RFC Patch v2 00/45] COarse-grain LOck-stepping Virtual Machines for Non-stop Service Wen Congyang
                   ` (24 preceding siblings ...)
  2014-08-08  7:01 ` [RFC Patch v2 25/45] colo: dynamic allocate aio_requests to avoid -EBUSY error Wen Congyang
@ 2014-08-08  7:01 ` Wen Congyang
  2014-08-08  7:01 ` [RFC Patch v2 27/45] pass uuid to the callback td_open Wen Congyang
                   ` (20 subsequent siblings)
  46 siblings, 0 replies; 64+ messages in thread
From: Wen Congyang @ 2014-08-08  7:01 UTC (permalink / raw)
  To: xen devel
  Cc: Ian Campbell, Wen Congyang, Ian Jackson, Jiang Yunhong,
	Dong Eddie, Yang Hongyang, Lai Jiangshan

Fix the following two memory leaks:
1. If s->ramdisk.prev is not NULL, we merge the write requests in
   s->ramdisk.h into s->ramdisk.prev, and then destroy s->ramdisk.h.
   But we forget to free the hash values when destroying s->ramdisk.h.
2. When a write request finishes, replicated_write_callback() is
   called. We forget to free the buffer in this function.
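
The first leak comes down to who frees the stored values when a table is
destroyed. A toy sketch of that ownership flag follows; it is not the
hashtable implementation used by block-remus, and struct demo_table,
demo_insert() and demo_destroy() are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical toy table: a singly linked list of heap-allocated values. */
struct demo_entry {
    void *value;
    struct demo_entry *next;
};

struct demo_table {
    struct demo_entry *head;
};

static int demo_insert(struct demo_table *t, void *value)
{
    struct demo_entry *e = malloc(sizeof(*e));

    if (!e)
        return -1;
    e->value = value;
    e->next = t->head;
    t->head = e;
    return 0;
}

/*
 * Free every entry. The stored values are freed only when free_values is
 * non-zero; otherwise the caller still owns them. Returns the number of
 * values freed here.
 */
static int demo_destroy(struct demo_table *t, int free_values)
{
    int freed = 0;

    assert(t != NULL);
    while (t->head) {
        struct demo_entry *e = t->head;

        t->head = e->next;
        if (free_values) {
            free(e->value);
            freed++;
        }
        free(e);
    }
    return freed;
}
```

If nobody else holds the values, destroying with free_values set to 0 leaks
them, which is the situation the first item describes.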

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Jiang Yunhong <yunhong.jiang@intel.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
 tools/blktap2/drivers/block-remus.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/blktap2/drivers/block-remus.c b/tools/blktap2/drivers/block-remus.c
index 079588d..4ce9dbe 100644
--- a/tools/blktap2/drivers/block-remus.c
+++ b/tools/blktap2/drivers/block-remus.c
@@ -602,7 +602,7 @@ static int ramdisk_start_flush(td_driver_t *driver)
 		}
 		free(sectors);
 
-		hashtable_destroy (s->ramdisk.h, 0);
+		hashtable_destroy (s->ramdisk.h, 1);
 	} else
 		s->ramdisk.prev = s->ramdisk.h;
 
-- 
1.9.3


* [RFC Patch v2 27/45] pass uuid to the callback td_open
  2014-08-08  7:00 [RFC Patch v2 00/45] COarse-grain LOck-stepping Virtual Machines for Non-stop Service Wen Congyang
                   ` (25 preceding siblings ...)
  2014-08-08  7:01 ` [RFC Patch v2 26/45] fix memory leak in block-remus Wen Congyang
@ 2014-08-08  7:01 ` Wen Congyang
  2014-08-08  7:01 ` [RFC Patch v2 28/45] return the correct dev path Wen Congyang
                   ` (19 subsequent siblings)
  46 siblings, 0 replies; 64+ messages in thread
From: Wen Congyang @ 2014-08-08  7:01 UTC (permalink / raw)
  To: xen devel
  Cc: Ian Campbell, Wen Congyang, Ian Jackson, Jiang Yunhong,
	Dong Eddie, Yang Hongyang, Lai Jiangshan

remus's td_open callback needs the vbd's uuid, but it is hard-coded as 0.
After commit 4b1af8, the vbd's uuid is the minor number of the blktap
device, not 0. So pass the uuid to td_open.
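
A rough sketch of why the open callback needs the uuid follows. It is an
illustration only: demo_uuid_t, struct demo_vbd, demo_get_vbd() and the
demo_remus_* names are hypothetical stand-ins for td_uuid_t, the vbd type,
tapdisk_server_get_vbd() and the remus driver.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned short demo_uuid_t;   /* hypothetical stand-in for td_uuid_t */

struct demo_vbd {
    demo_uuid_t uuid;
    const char *name;
};

/* Hypothetical registry of vbds, keyed by uuid (the blktap minor). */
static struct demo_vbd demo_vbds[] = {
    { 0, "vbd-minor-0" },
    { 3, "vbd-minor-3" },
};

static struct demo_vbd *demo_get_vbd(demo_uuid_t uuid)
{
    size_t i;

    for (i = 0; i < sizeof(demo_vbds) / sizeof(demo_vbds[0]); i++)
        if (demo_vbds[i].uuid == uuid)
            return &demo_vbds[i];
    return NULL;
}

/* Driver operations: open receives the uuid of the vbd being opened. */
struct demo_ops {
    int (*open)(const char *name, demo_uuid_t uuid, struct demo_vbd **vbd);
};

static int demo_remus_open(const char *name, demo_uuid_t uuid,
                           struct demo_vbd **vbd)
{
    (void)name;
    assert(vbd != NULL);
    /* Look up by the caller's uuid instead of a hard-coded 0. */
    *vbd = demo_get_vbd(uuid);
    return *vbd ? 0 : -1;
}

static const struct demo_ops demo_remus_ops = { demo_remus_open };
```

With a hard-coded 0, the lookup would return the wrong vbd (or none) as soon
as the device minor is not 0.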
---
 tools/blktap2/drivers/block-aio.c         | 3 ++-
 tools/blktap2/drivers/block-cache.c       | 3 ++-
 tools/blktap2/drivers/block-log.c         | 3 ++-
 tools/blktap2/drivers/block-qcow.c        | 3 ++-
 tools/blktap2/drivers/block-ram.c         | 3 ++-
 tools/blktap2/drivers/block-remus.c       | 8 ++------
 tools/blktap2/drivers/block-vhd.c         | 3 ++-
 tools/blktap2/drivers/tapdisk-interface.c | 4 +++-
 tools/blktap2/drivers/tapdisk.h           | 2 +-
 9 files changed, 18 insertions(+), 14 deletions(-)

diff --git a/tools/blktap2/drivers/block-aio.c b/tools/blktap2/drivers/block-aio.c
index 10ab20b..1b560e5 100644
--- a/tools/blktap2/drivers/block-aio.c
+++ b/tools/blktap2/drivers/block-aio.c
@@ -111,7 +111,8 @@ static int tdaio_get_image_info(int fd, td_disk_info_t *info)
 }
 
 /* Open the disk file and initialize aio state. */
-int tdaio_open(td_driver_t *driver, const char *name, td_flag_t flags)
+int tdaio_open(td_driver_t *driver, const char *name, td_flag_t flags,
+	       td_uuid_t uuid)
 {
 	int i, fd, ret, o_flags;
 	struct tdaio_state *prv;
diff --git a/tools/blktap2/drivers/block-cache.c b/tools/blktap2/drivers/block-cache.c
index 1d2f4eb..cd6ea6a 100644
--- a/tools/blktap2/drivers/block-cache.c
+++ b/tools/blktap2/drivers/block-cache.c
@@ -517,7 +517,8 @@ block_cache_put_request(block_cache_t *cache, block_cache_request_t *breq)
 }
 
 static int
-block_cache_open(td_driver_t *driver, const char *name, td_flag_t flags)
+block_cache_open(td_driver_t *driver, const char *name, td_flag_t flags,
+		 td_uuid_t uuid)
 {
 	int i, err;
 	radix_tree_t *tree;
diff --git a/tools/blktap2/drivers/block-log.c b/tools/blktap2/drivers/block-log.c
index 5330cdc..7b33b63 100644
--- a/tools/blktap2/drivers/block-log.c
+++ b/tools/blktap2/drivers/block-log.c
@@ -585,7 +585,8 @@ static void ctl_request(event_id_t id, char mode, void *private)
 
 static int tdlog_close(td_driver_t*);
 
-static int tdlog_open(td_driver_t* driver, const char* name, td_flag_t flags)
+static int tdlog_open(td_driver_t* driver, const char* name, td_flag_t flags,
+		      td_uuid_t uuid)
 {
   struct tdlog_state* s = (struct tdlog_state*)driver->data;
   int rc;
diff --git a/tools/blktap2/drivers/block-qcow.c b/tools/blktap2/drivers/block-qcow.c
index b45bcaa..64dfafc 100644
--- a/tools/blktap2/drivers/block-qcow.c
+++ b/tools/blktap2/drivers/block-qcow.c
@@ -865,7 +865,8 @@ out:
 }
 
 /* Open the disk file and initialize qcow state. */
-int tdqcow_open (td_driver_t *driver, const char *name, td_flag_t flags)
+int tdqcow_open (td_driver_t *driver, const char *name, td_flag_t flags,
+		 td_uuid_t uuid)
 {
 	int fd, len, i, ret, size, o_flags;
 	td_disk_info_t *bs = &(driver->info);
diff --git a/tools/blktap2/drivers/block-ram.c b/tools/blktap2/drivers/block-ram.c
index a859481..b64a194 100644
--- a/tools/blktap2/drivers/block-ram.c
+++ b/tools/blktap2/drivers/block-ram.c
@@ -108,7 +108,8 @@ static int get_image_info(int fd, td_disk_info_t *info)
 }
 
 /* Open the disk file and initialize ram state. */
-int tdram_open (td_driver_t *driver, const char *name, td_flag_t flags)
+int tdram_open (td_driver_t *driver, const char *name, td_flag_t flags,
+		td_uuid_t uuid)
 {
 	char *p;
 	uint64_t size;
diff --git a/tools/blktap2/drivers/block-remus.c b/tools/blktap2/drivers/block-remus.c
index 4ce9dbe..504f6b4 100644
--- a/tools/blktap2/drivers/block-remus.c
+++ b/tools/blktap2/drivers/block-remus.c
@@ -1633,18 +1633,14 @@ static int ctl_register(struct tdremus_state *s)
 /* interface */
 
 static int tdremus_open(td_driver_t *driver, const char *name,
-			td_flag_t flags)
+			td_flag_t flags, td_uuid_t uuid)
 {
 	struct tdremus_state *s = (struct tdremus_state *)driver->data;
 	int rc;
 
 	RPRINTF("opening %s\n", name);
 
-	/* first we need to get the underlying vbd for this driver stack. To do so we
-	 * need to know the vbd's id. Fortunately, for tapdisk2 this is hard-coded as
-	 * 0 (see tapdisk2.c)
-	 */
-	device_vbd = tapdisk_server_get_vbd(0);
+	device_vbd = tapdisk_server_get_vbd(uuid);
 
 	memset(s, 0, sizeof(*s));
 	s->server_fd.fd = -1;
diff --git a/tools/blktap2/drivers/block-vhd.c b/tools/blktap2/drivers/block-vhd.c
index 76ea5bd..06e9c89 100644
--- a/tools/blktap2/drivers/block-vhd.c
+++ b/tools/blktap2/drivers/block-vhd.c
@@ -675,7 +675,8 @@ __vhd_open(td_driver_t *driver, const char *name, vhd_flag_t flags)
 }
 
 static int
-_vhd_open(td_driver_t *driver, const char *name, td_flag_t flags)
+_vhd_open(td_driver_t *driver, const char *name, td_flag_t flags,
+	  td_uuid_t uuid)
 {
 	vhd_flag_t vhd_flags = 0;
 
diff --git a/tools/blktap2/drivers/tapdisk-interface.c b/tools/blktap2/drivers/tapdisk-interface.c
index 2e51883..36b5393 100644
--- a/tools/blktap2/drivers/tapdisk-interface.c
+++ b/tools/blktap2/drivers/tapdisk-interface.c
@@ -63,6 +63,7 @@ __td_open(td_image_t *image, td_disk_info_t *info)
 {
 	int err;
 	td_driver_t *driver;
+	td_vbd_t *vbd = image->private;
 
 	driver = image->driver;
 	if (!driver) {
@@ -78,7 +79,8 @@ __td_open(td_image_t *image, td_disk_info_t *info)
 	}
 
 	if (!td_flag_test(driver->state, TD_DRIVER_OPEN)) {
-		err = driver->ops->td_open(driver, image->name, image->flags);
+		err = driver->ops->td_open(driver, image->name, image->flags,
+					   vbd->uuid);
 		if (err) {
 			if (!image->driver)
 				tapdisk_driver_free(driver);
diff --git a/tools/blktap2/drivers/tapdisk.h b/tools/blktap2/drivers/tapdisk.h
index 66d508e..459eaec 100644
--- a/tools/blktap2/drivers/tapdisk.h
+++ b/tools/blktap2/drivers/tapdisk.h
@@ -157,7 +157,7 @@ struct tap_disk {
 	const char                  *disk_type;
 	td_flag_t                    flags;
 	int                          private_data_size;
-	int (*td_open)               (td_driver_t *, const char *, td_flag_t);
+	int (*td_open)               (td_driver_t *, const char *, td_flag_t, td_uuid_t);
 	int (*td_close)              (td_driver_t *);
 	int (*td_get_parent_id)      (td_driver_t *, td_disk_id_t *);
 	int (*td_validate_parent)    (td_driver_t *, td_driver_t *, td_flag_t);
-- 
1.9.3


* [RFC Patch v2 28/45] return the correct dev path
  2014-08-08  7:00 [RFC Patch v2 00/45] COarse-grain LOck-stepping Virtual Machines for Non-stop Service Wen Congyang
                   ` (26 preceding siblings ...)
  2014-08-08  7:01 ` [RFC Patch v2 27/45] pass uuid to the callback td_open Wen Congyang
@ 2014-08-08  7:01 ` Wen Congyang
  2014-08-08  7:01 ` [RFC Patch v2 29/45] blktap2: use correct way to get remus_image Wen Congyang
                   ` (18 subsequent siblings)
  46 siblings, 0 replies; 64+ messages in thread
From: Wen Congyang @ 2014-08-08  7:01 UTC (permalink / raw)
  To: xen devel
  Cc: Ian Campbell, Wen Congyang, Ian Jackson, Jiang Yunhong,
	Dong Eddie, Yang Hongyang, Lai Jiangshan

The user passes the devpath to tapdisk2 with TAPDISK_MESSAGE_OPEN,
and later uses TAPDISK_MESSAGE_LIST to look up the pid of that
tapdisk2.

The devpath's format is: driver:params[|driver:params[...]].
The first image of the vbd holds only the first params, so the list
response currently returns the first driver:params pair, not the
devpath. The devpath is stored in vbd->name, so return vbd->name
instead of image->name.
---
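Note for readers, not part of the patch: a minimal C sketch of how a
devpath of the form driver:params[|driver:params[...]] splits into
stages. The helper name first_stage() and the example devpath are made
up for this note; tapdisk2 has its own parsing code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: copy the first "driver:params" stage of a devpath
 * into out (always NUL-terminated on success). Returns the stage length,
 * or -1 if out is too small.
 */
int first_stage(const char *devpath, char *out, size_t outlen)
{
	size_t len;

	assert(devpath != NULL && out != NULL);

	/* stages are separated by '|'; strcspn stops at '|' or at the end */
	len = strcspn(devpath, "|");
	if (len >= outlen)
		return -1;

	memcpy(out, devpath, len);
	out[len] = '\0';
	return (int)len;
}
```

For the made-up devpath "remus:host:9000|aio:/tmp/disk.img" the helper
yields only "remus:host:9000". That is the kind of truncated answer the
list response gave before this change, while vbd->name holds the whole
string.
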
 tools/blktap2/drivers/tapdisk-control.c | 9 ++-------
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/tools/blktap2/drivers/tapdisk-control.c b/tools/blktap2/drivers/tapdisk-control.c
index 0b5cf3c..3a4ec8e 100644
--- a/tools/blktap2/drivers/tapdisk-control.c
+++ b/tools/blktap2/drivers/tapdisk-control.c
@@ -270,15 +270,10 @@ tapdisk_control_list(struct tapdisk_control_connection *connection,
 		response.u.list.state   = vbd->state;
 		response.u.list.path[0] = 0;
 
-		if (!list_empty(&vbd->images)) {
-			td_image_t *image = list_entry(vbd->images.next,
-						       td_image_t, next);
+		if (vbd->name)
 			snprintf(response.u.list.path,
 				 sizeof(response.u.list.path),
-				 "%s:%s",
-				 tapdisk_disk_types[image->type]->name,
-				 image->name);
-		}
+				 "%s", vbd->name);
 
 		tapdisk_control_write_message(connection->socket, &response, 2);
 	}
-- 
1.9.3

* [RFC Patch v2 29/45] blktap2: use correct way to get remus_image
  2014-08-08  7:00 [RFC Patch v2 00/45] COarse-grain LOck-stepping Virtual Machines for Non-stop Service Wen Congyang
                   ` (27 preceding siblings ...)
  2014-08-08  7:01 ` [RFC Patch v2 28/45] return the correct dev path Wen Congyang
@ 2014-08-08  7:01 ` Wen Congyang
  2014-08-08  7:01 ` [RFC Patch v2 30/45] don't call client_flush() when switching to unprotected mode Wen Congyang
                   ` (17 subsequent siblings)
  46 siblings, 0 replies; 64+ messages in thread
From: Wen Congyang @ 2014-08-08  7:01 UTC (permalink / raw)
  To: xen devel
  Cc: Ian Campbell, Wen Congyang, Ian Jackson, Jiang Yunhong,
	Dong Eddie, Yang Hongyang, Lai Jiangshan

We set remus_image in backup_queue_read(). If a flush happens
before the first read operation, remus_image is still NULL.
Pass the image to remus via the td_open() callback instead.

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
 tools/blktap2/drivers/block-aio.c         | 6 ++++--
 tools/blktap2/drivers/block-cache.c       | 5 +++--
 tools/blktap2/drivers/block-log.c         | 5 +++--
 tools/blktap2/drivers/block-qcow.c        | 6 ++++--
 tools/blktap2/drivers/block-ram.c         | 6 ++++--
 tools/blktap2/drivers/block-remus.c       | 8 ++++----
 tools/blktap2/drivers/block-vhd.c         | 6 ++++--
 tools/blktap2/drivers/tapdisk-interface.c | 3 +--
 tools/blktap2/drivers/tapdisk.h           | 2 +-
 9 files changed, 28 insertions(+), 19 deletions(-)

diff --git a/tools/blktap2/drivers/block-aio.c b/tools/blktap2/drivers/block-aio.c
index 1b560e5..27ba07d 100644
--- a/tools/blktap2/drivers/block-aio.c
+++ b/tools/blktap2/drivers/block-aio.c
@@ -40,6 +40,7 @@
 #include "tapdisk.h"
 #include "tapdisk-driver.h"
 #include "tapdisk-interface.h"
+#include "tapdisk-image.h"
 
 #define MAX_AIO_REQS         TAPDISK_DATA_REQUESTS
 
@@ -111,11 +112,12 @@ static int tdaio_get_image_info(int fd, td_disk_info_t *info)
 }
 
 /* Open the disk file and initialize aio state. */
-int tdaio_open(td_driver_t *driver, const char *name, td_flag_t flags,
-	       td_uuid_t uuid)
+int tdaio_open(td_driver_t *driver, td_image_t *image, td_uuid_t uuid)
 {
 	int i, fd, ret, o_flags;
 	struct tdaio_state *prv;
+	const char *name = image->name;
+	td_flag_t flags = image->flags;
 
 	ret = 0;
 	prv = (struct tdaio_state *)driver->data;
diff --git a/tools/blktap2/drivers/block-cache.c b/tools/blktap2/drivers/block-cache.c
index cd6ea6a..ff2c773 100644
--- a/tools/blktap2/drivers/block-cache.c
+++ b/tools/blktap2/drivers/block-cache.c
@@ -517,12 +517,13 @@ block_cache_put_request(block_cache_t *cache, block_cache_request_t *breq)
 }
 
 static int
-block_cache_open(td_driver_t *driver, const char *name, td_flag_t flags,
-		 td_uuid_t uuid)
+block_cache_open(td_driver_t *driver, td_image_t *image, td_uuid_t uuid)
 {
 	int i, err;
 	radix_tree_t *tree;
 	block_cache_t *cache;
+	const char *name = image->name;
+	td_flag_t flags = image->flags;
 
 	if (!td_flag_test(flags, TD_OPEN_RDONLY))
 		return -EINVAL;
diff --git a/tools/blktap2/drivers/block-log.c b/tools/blktap2/drivers/block-log.c
index 7b33b63..80351d3 100644
--- a/tools/blktap2/drivers/block-log.c
+++ b/tools/blktap2/drivers/block-log.c
@@ -585,11 +585,12 @@ static void ctl_request(event_id_t id, char mode, void *private)
 
 static int tdlog_close(td_driver_t*);
 
-static int tdlog_open(td_driver_t* driver, const char* name, td_flag_t flags,
-		      td_uuid_t uuid)
+static int tdlog_open(td_driver_t* driver, td_image_t *image, td_uuid_t uuid)
 {
   struct tdlog_state* s = (struct tdlog_state*)driver->data;
   int rc;
+  const char *name = image->name;
+  td_flag_t flags = image->flags;
 
   memset(s, 0, sizeof(*s));
 
diff --git a/tools/blktap2/drivers/block-qcow.c b/tools/blktap2/drivers/block-qcow.c
index 64dfafc..c63bd9d 100644
--- a/tools/blktap2/drivers/block-qcow.c
+++ b/tools/blktap2/drivers/block-qcow.c
@@ -45,6 +45,7 @@
 #include "qcow.h"
 #include "blk.h"
 #include "atomicio.h"
+#include "tapdisk-image.h"
 
 /* *BSD has no O_LARGEFILE */
 #ifndef O_LARGEFILE
@@ -865,14 +866,15 @@ out:
 }
 
 /* Open the disk file and initialize qcow state. */
-int tdqcow_open (td_driver_t *driver, const char *name, td_flag_t flags,
-		 td_uuid_t uuid)
+int tdqcow_open (td_driver_t *driver, td_image_t *image, td_uuid_t uuid)
 {
 	int fd, len, i, ret, size, o_flags;
 	td_disk_info_t *bs = &(driver->info);
 	struct tdqcow_state   *s  = (struct tdqcow_state *)driver->data;
 	QCowHeader header;
 	uint64_t final_cluster = 0;
+	const char *name = image->name;
+	td_flag_t flags = image->flags;
 
  	DPRINTF("QCOW: Opening %s\n", name);
 
diff --git a/tools/blktap2/drivers/block-ram.c b/tools/blktap2/drivers/block-ram.c
index b64a194..3e148ab 100644
--- a/tools/blktap2/drivers/block-ram.c
+++ b/tools/blktap2/drivers/block-ram.c
@@ -40,6 +40,7 @@
 #include "tapdisk.h"
 #include "tapdisk-driver.h"
 #include "tapdisk-interface.h"
+#include "tapdisk-image.h"
 
 char *img;
 long int   disksector_size;
@@ -108,13 +109,14 @@ static int get_image_info(int fd, td_disk_info_t *info)
 }
 
 /* Open the disk file and initialize ram state. */
-int tdram_open (td_driver_t *driver, const char *name, td_flag_t flags,
-		td_uuid_t uuid)
+int tdram_open (td_driver_t *driver, td_image_t *image, td_uuid_t uuid)
 {
 	char *p;
 	uint64_t size;
 	int i, fd, ret = 0, count = 0, o_flags;
 	struct tdram_state *prv = (struct tdram_state *)driver->data;
+	const char *name = image->name;
+	td_flag_t flags = image->flags;
 
 	connections++;
 
diff --git a/tools/blktap2/drivers/block-remus.c b/tools/blktap2/drivers/block-remus.c
index 504f6b4..23a908a 100644
--- a/tools/blktap2/drivers/block-remus.c
+++ b/tools/blktap2/drivers/block-remus.c
@@ -1152,8 +1152,6 @@ void backup_queue_read(td_driver_t *driver, td_request_t treq)
 {
 	struct tdremus_state *s = (struct tdremus_state *)driver->data;
 	int i;
-	if(!remus_image)
-		remus_image = treq.image;
 	
 	/* check if this read is queued in any currently ongoing flush */
 	if (ramdisk_read(&s->ramdisk, treq.sec, treq.secs, treq.buf)) {
@@ -1632,15 +1630,17 @@ static int ctl_register(struct tdremus_state *s)
 
 /* interface */
 
-static int tdremus_open(td_driver_t *driver, const char *name,
-			td_flag_t flags, td_uuid_t uuid)
+static int tdremus_open(td_driver_t *driver, td_image_t *image, td_uuid_t uuid)
 {
 	struct tdremus_state *s = (struct tdremus_state *)driver->data;
 	int rc;
+	const char *name = image->name;
+	td_flag_t flags = image->flags;
 
 	RPRINTF("opening %s\n", name);
 
 	device_vbd = tapdisk_server_get_vbd(uuid);
+	remus_image = image;
 
 	memset(s, 0, sizeof(*s));
 	s->server_fd.fd = -1;
diff --git a/tools/blktap2/drivers/block-vhd.c b/tools/blktap2/drivers/block-vhd.c
index 06e9c89..b20f724 100644
--- a/tools/blktap2/drivers/block-vhd.c
+++ b/tools/blktap2/drivers/block-vhd.c
@@ -59,6 +59,7 @@
 #include "tapdisk-driver.h"
 #include "tapdisk-interface.h"
 #include "tapdisk-disktype.h"
+#include "tapdisk-image.h"
 
 unsigned int SPB;
 
@@ -675,10 +676,11 @@ __vhd_open(td_driver_t *driver, const char *name, vhd_flag_t flags)
 }
 
 static int
-_vhd_open(td_driver_t *driver, const char *name, td_flag_t flags,
-	  td_uuid_t uuid)
+_vhd_open(td_driver_t *driver, td_image_t *image, td_uuid_t uuid)
 {
 	vhd_flag_t vhd_flags = 0;
+	const char *name = image->name;
+	td_flag_t flags = image->flags;
 
 	if (flags & TD_OPEN_RDONLY)
 		vhd_flags |= VHD_FLAG_OPEN_RDONLY;
diff --git a/tools/blktap2/drivers/tapdisk-interface.c b/tools/blktap2/drivers/tapdisk-interface.c
index 36b5393..a29de64 100644
--- a/tools/blktap2/drivers/tapdisk-interface.c
+++ b/tools/blktap2/drivers/tapdisk-interface.c
@@ -79,8 +79,7 @@ __td_open(td_image_t *image, td_disk_info_t *info)
 	}
 
 	if (!td_flag_test(driver->state, TD_DRIVER_OPEN)) {
-		err = driver->ops->td_open(driver, image->name, image->flags,
-					   vbd->uuid);
+		err = driver->ops->td_open(driver, image, vbd->uuid);
 		if (err) {
 			if (!image->driver)
 				tapdisk_driver_free(driver);
diff --git a/tools/blktap2/drivers/tapdisk.h b/tools/blktap2/drivers/tapdisk.h
index 459eaec..3c3b51d 100644
--- a/tools/blktap2/drivers/tapdisk.h
+++ b/tools/blktap2/drivers/tapdisk.h
@@ -157,7 +157,7 @@ struct tap_disk {
 	const char                  *disk_type;
 	td_flag_t                    flags;
 	int                          private_data_size;
-	int (*td_open)               (td_driver_t *, const char *, td_flag_t, td_uuid_t);
+	int (*td_open)               (td_driver_t *, td_image_t *, td_uuid_t);
 	int (*td_close)              (td_driver_t *);
 	int (*td_get_parent_id)      (td_driver_t *, td_disk_id_t *);
 	int (*td_validate_parent)    (td_driver_t *, td_driver_t *, td_flag_t);
-- 
1.9.3

* [RFC Patch v2 30/45] don't call client_flush() when switching to unprotected mode
  2014-08-08  7:00 [RFC Patch v2 00/45] COarse-grain LOck-stepping Virtual Machines for Non-stop Service Wen Congyang
                   ` (28 preceding siblings ...)
  2014-08-08  7:01 ` [RFC Patch v2 29/45] blktap2: use correct way to get remus_image Wen Congyang
@ 2014-08-08  7:01 ` Wen Congyang
  2014-08-08  7:01 ` [RFC Patch v2 31/45] remus: fix bug in tdremus_close() Wen Congyang
                   ` (16 subsequent siblings)
  46 siblings, 0 replies; 64+ messages in thread
From: Wen Congyang @ 2014-08-08  7:01 UTC (permalink / raw)
  To: xen devel
  Cc: Ian Campbell, Wen Congyang, Ian Jackson, Jiang Yunhong,
	Dong Eddie, Yang Hongyang, Lai Jiangshan

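Once we switch from primary to unprotected mode, the connection
between primary and backup can no longer be used, so client_flush()
must not be called. Handle "flush" only in primary mode and fail
the request otherwise.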

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
 tools/blktap2/drivers/block-remus.c | 28 ++++++++++++++++++----------
 1 file changed, 18 insertions(+), 10 deletions(-)

diff --git a/tools/blktap2/drivers/block-remus.c b/tools/blktap2/drivers/block-remus.c
index 23a908a..b15c966 100644
--- a/tools/blktap2/drivers/block-remus.c
+++ b/tools/blktap2/drivers/block-remus.c
@@ -881,6 +881,7 @@ static void primary_queue_write(td_driver_t *driver, td_request_t treq)
 }
 
 
+/* It is called when the user writes "flush" to control file */
 static int client_flush(td_driver_t *driver)
 {
 	struct tdremus_state *s = (struct tdremus_state *)driver->data;
@@ -891,7 +892,8 @@ static int client_flush(td_driver_t *driver)
 		/* connection not yet established, nothing to flush */
 		return 0;
 
-	if (mwrite(s->stream_fd.fd, TDREMUS_COMMIT, strlen(TDREMUS_COMMIT)) < 0) {
+	if (mwrite(s->stream_fd.fd, TDREMUS_COMMIT,
+	    strlen(TDREMUS_COMMIT)) < 0) {
 		RPRINTF("error flushing output");
 		close_stream_fd(s);
 		return -1;
@@ -920,7 +922,6 @@ static int primary_start(td_driver_t *driver)
 
 	tapdisk_remus.td_queue_read = primary_queue_read;
 	tapdisk_remus.td_queue_write = primary_queue_write;
-	s->queue_flush = client_flush;
 
 	s->stream_fd.fd = -1;
 	s->stream_fd.id = -1;
@@ -1504,16 +1505,23 @@ static void ctl_request(event_id_t id, char mode, void *private)
 		return;
 	}
 
-	/* TODO: need to get driver somehow */
 	msg[rc] = '\0';
-	if (!strncmp(msg, "flush", 5)) {
-		if (s->queue_flush)
-			if ((rc = s->queue_flush(driver))) {
-				RPRINTF("error passing flush request to backup");
-				ctl_respond(s, TDREMUS_FAIL);
-			}
-	} else {
+	if (strncmp(msg, "flush", 5)) {
 		RPRINTF("unknown command: %s\n", msg);
+		ctl_respond(s, TDREMUS_FAIL);
+		return;
+	}
+
+	if (s->mode != mode_primary) {
+		RPRINTF("We are not in primary mode\n");
+		ctl_respond(s, TDREMUS_FAIL);
+		return;
+	}
+
+	rc = client_flush(driver);
+	if (rc) {
+		RPRINTF("error passing flush request to backup");
+		ctl_respond(s, TDREMUS_FAIL);
 	}
 }
 
-- 
1.9.3

* [RFC Patch v2 31/45] remus: fix bug in tdremus_close()
  2014-08-08  7:00 [RFC Patch v2 00/45] COarse-grain LOck-stepping Virtual Machines for Non-stop Service Wen Congyang
                   ` (29 preceding siblings ...)
  2014-08-08  7:01 ` [RFC Patch v2 30/45] don't call client_flush() when switching to unprotected mode Wen Congyang
@ 2014-08-08  7:01 ` Wen Congyang
  2014-08-08  7:01 ` [RFC Patch v2 32/45] blktap2: use correct way to get free event id Wen Congyang
                   ` (15 subsequent siblings)
  46 siblings, 0 replies; 64+ messages in thread
From: Wen Congyang @ 2014-08-08  7:01 UTC (permalink / raw)
  To: xen devel
  Cc: Ian Campbell, Wen Congyang, Ian Jackson, Jiang Yunhong,
	Dong Eddie, Yang Hongyang, Lai Jiangshan

We close ctl_fd.fd, but we don't unregister ctl_fd.id. This
causes select() to fail, and the user can no longer talk to
tapdisk2.

This patch also does some cleanup.

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
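Note for readers, not part of the patch: a small standalone C sketch of
the failure described above. It assumes only POSIX pipe() and select();
the function name is made up, and it does not use the tapdisk scheduler.
A descriptor that is closed but still present in the read set makes
select() fail with EBADF.

```c
#include <assert.h>
#include <errno.h>
#include <sys/select.h>
#include <unistd.h>

/*
 * Returns the errno that select() reports when its read set still
 * contains a descriptor that has already been closed, 0 if select()
 * unexpectedly succeeds, or -1 if the pipe cannot be created.
 */
int select_on_closed_fd(void)
{
	int fds[2];
	fd_set rfds;
	struct timeval tv = { 0, 0 };	/* poll, do not block */
	int rc;

	if (pipe(fds) < 0)
		return -1;
	assert(fds[0] < FD_SETSIZE);

	/* close the read end but keep it in the set, like a stale event */
	close(fds[0]);

	FD_ZERO(&rfds);
	FD_SET(fds[0], &rfds);

	rc = select(fds[0] + 1, &rfds, NULL, NULL, &tv);
	rc = (rc < 0) ? errno : 0;	/* save errno before the next call */

	close(fds[1]);
	return rc;
}
```

Unregistering the event before closing the descriptor, as ctl_unregister()
followed by ctl_close() does in this patch, keeps the stale fd out of the
set.
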
 tools/blktap2/drivers/block-remus.c | 90 ++++++++++++++++++++++---------------
 1 file changed, 53 insertions(+), 37 deletions(-)

diff --git a/tools/blktap2/drivers/block-remus.c b/tools/blktap2/drivers/block-remus.c
index b15c966..d358b44 100644
--- a/tools/blktap2/drivers/block-remus.c
+++ b/tools/blktap2/drivers/block-remus.c
@@ -151,9 +151,6 @@ typedef struct poll_fd {
 } poll_fd_t;
 
 struct tdremus_state {
-//  struct tap_disk* driver;
-	void* driver_data;
-
   /* XXX: this is needed so that the server can perform operations on
    * the driver from the stream_fd event handler. fix this. */
 	td_driver_t *tdremus_driver;
@@ -731,12 +728,26 @@ static int mwrite(int fd, void* buf, size_t len)
 
 static void inline close_stream_fd(struct tdremus_state *s)
 {
+	if (s->stream_fd.fd < 0)
+		return;
+
 	/* XXX: -2 is magic. replace with macro perhaps? */
 	tapdisk_server_unregister_event(s->stream_fd.id);
 	close(s->stream_fd.fd);
 	s->stream_fd.fd = -2;
 }
 
+static void close_server_fd(struct tdremus_state *s)
+{
+	if (s->server_fd.fd < 0)
+		return;
+
+	tapdisk_server_unregister_event(s->server_fd.id);
+	s->server_fd.id = -1;
+	close(s->stream_fd.fd);
+	s->stream_fd.fd = -1;
+}
+
 /* primary functions */
 static void remus_client_event(event_id_t, char mode, void *private);
 static void remus_connect_event(event_id_t id, char mode, void *private);
@@ -1348,12 +1359,7 @@ static int unprotected_start(td_driver_t *driver)
 	/* close the server socket */
 	close_stream_fd(s);
 
-	/* unregister the replication stream */
-	tapdisk_server_unregister_event(s->server_fd.id);
-
-	/* close the replication stream */
-	close(s->server_fd.fd);
-	s->server_fd.fd = -1;
+	close_server_fd(s);
 
 	/* install the unprotected read/write handlers */
 	tapdisk_remus.td_queue_read = unprotected_queue_read;
@@ -1561,27 +1567,27 @@ static int ctl_open(td_driver_t *driver, const char* name)
 			s->ctl_path[i] = '_';
 	}
 	if (asprintf(&s->msg_path, "%s.msg", s->ctl_path) < 0)
-		goto err_ctlfifo;
+		goto err_setmsgfifo;
 
 	if (mkfifo(s->ctl_path, S_IRWXU|S_IRWXG|S_IRWXO) && errno != EEXIST) {
 		RPRINTF("error creating control FIFO %s: %d\n", s->ctl_path, errno);
-		goto err_msgfifo;
+		goto err_mkctlfifo;
 	}
 
 	if (mkfifo(s->msg_path, S_IRWXU|S_IRWXG|S_IRWXO) && errno != EEXIST) {
 		RPRINTF("error creating message FIFO %s: %d\n", s->msg_path, errno);
-		goto err_msgfifo;
+		goto err_mkmsgfifo;
 	}
 
 	/* RDWR so that fd doesn't block select when no writer is present */
 	if ((s->ctl_fd.fd = open(s->ctl_path, O_RDWR)) < 0) {
 		RPRINTF("error opening control FIFO %s: %d\n", s->ctl_path, errno);
-		goto err_msgfifo;
+		goto err_openctlfifo;
 	}
 
 	if ((s->msg_fd.fd = open(s->msg_path, O_RDWR)) < 0) {
 		RPRINTF("error opening message FIFO %s: %d\n", s->msg_path, errno);
-		goto err_openctlfifo;
+		goto err_openmsgfifo;
 	}
 
 	RPRINTF("control FIFO %s\n", s->ctl_path);
@@ -1589,36 +1595,45 @@ static int ctl_open(td_driver_t *driver, const char* name)
 
 	return 0;
 
- err_openctlfifo:
+err_openmsgfifo:
 	close(s->ctl_fd.fd);
- err_msgfifo:
+	s->ctl_fd.fd = -1;
+err_openctlfifo:
+	unlink(s->ctl_path);
+err_mkmsgfifo:
+	unlink(s->msg_path);
+err_mkctlfifo:
 	free(s->msg_path);
 	s->msg_path = NULL;
- err_ctlfifo:
+err_setmsgfifo:
 	free(s->ctl_path);
 	s->ctl_path = NULL;
 	return -1;
 }
 
-static void ctl_close(td_driver_t *driver)
+static void ctl_close(struct tdremus_state *s)
 {
-	struct tdremus_state *s = (struct tdremus_state *)driver->data;
-
-	/* TODO: close *all* connections */
-
-	if(s->ctl_fd.fd)
+	if(s->ctl_fd.fd) {
 		close(s->ctl_fd.fd);
+		s->ctl_fd.fd = -1;
+	}
 
 	if (s->ctl_path) {
 		unlink(s->ctl_path);
 		free(s->ctl_path);
 		s->ctl_path = NULL;
 	}
+
 	if (s->msg_path) {
 		unlink(s->msg_path);
 		free(s->msg_path);
 		s->msg_path = NULL;
 	}
+
+	if (s->msg_fd.fd) {
+		close(s->msg_fd.fd);
+		s->msg_fd.fd = -1;
+	}
 }
 
 static int ctl_register(struct tdremus_state *s)
@@ -1636,6 +1651,16 @@ static int ctl_register(struct tdremus_state *s)
 	return 0;
 }
 
+static void ctl_unregister(struct tdremus_state *s)
+{
+	RPRINTF("unregistering ctl fifo\n");
+
+	if (s->ctl_fd.id >= 0) {
+		tapdisk_server_unregister_event(s->ctl_fd.id);
+		s->ctl_fd.id = -1;
+	}
+}
+
 /* interface */
 
 static int tdremus_open(td_driver_t *driver, td_image_t *image, td_uuid_t uuid)
@@ -1666,13 +1691,12 @@ static int tdremus_open(td_driver_t *driver, td_image_t *image, td_uuid_t uuid)
 
 	if ((rc = ctl_open(driver, name))) {
 		RPRINTF("error setting up control channel\n");
-		free(s->driver_data);
 		return rc;
 	}
 
 	if ((rc = ctl_register(s))) {
 		RPRINTF("error registering control channel\n");
-		free(s->driver_data);
+		ctl_close(s);
 		return rc;
 	}
 
@@ -1695,19 +1719,11 @@ static int tdremus_close(td_driver_t *driver)
 	RPRINTF("closing\n");
 	if (s->ramdisk.inprogress)
 		hashtable_destroy(s->ramdisk.inprogress, 0);
-	
-	if (s->driver_data) {
-		free(s->driver_data);
-		s->driver_data = NULL;
-	}
-	if (s->server_fd.fd >= 0) {
-		close(s->server_fd.fd);
-		s->server_fd.fd = -1;
-	}
-	if (s->stream_fd.fd >= 0)
-		close_stream_fd(s);
 
-	ctl_close(driver);
+	close_server_fd(s);
+	close_stream_fd(s);
+	ctl_unregister(s);
+	ctl_close(s);
 
 	return 0;
 }
-- 
1.9.3

* [RFC Patch v2 32/45] blktap2: use correct way to get free event id
  2014-08-08  7:00 [RFC Patch v2 00/45] COarse-grain LOck-stepping Virtual Machines for Non-stop Service Wen Congyang
                   ` (30 preceding siblings ...)
  2014-08-08  7:01 ` [RFC Patch v2 31/45] remus: fix bug in tdremus_close() Wen Congyang
@ 2014-08-08  7:01 ` Wen Congyang
  2014-08-08  7:01 ` [RFC Patch v2 33/45] blktap2: don't return negative " Wen Congyang
                   ` (14 subsequent siblings)
  46 siblings, 0 replies; 64+ messages in thread
From: Wen Congyang @ 2014-08-08  7:01 UTC (permalink / raw)
  To: xen devel
  Cc: Ian Campbell, Wen Congyang, Ian Jackson, Jiang Yunhong,
	Dong Eddie, Yang Hongyang, Lai Jiangshan

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
 tools/blktap2/drivers/scheduler.c | 33 ++++++++++++++++++++++++++++++---
 1 file changed, 30 insertions(+), 3 deletions(-)

diff --git a/tools/blktap2/drivers/scheduler.c b/tools/blktap2/drivers/scheduler.c
index 6b8d009..dd608dd 100644
--- a/tools/blktap2/drivers/scheduler.c
+++ b/tools/blktap2/drivers/scheduler.c
@@ -160,6 +160,31 @@ scheduler_run_events(scheduler_t *s)
 	}
 }
 
+static int
+get_free_id(scheduler_t *s)
+{
+	event_t *event, *tmp;
+	int old_uuid = s->uuid;
+	int id = s->uuid++;
+
+	if (!s->uuid)
+		s->uuid++;
+
+retry:
+	scheduler_for_each_event(s, event, tmp)
+		if (event->id == id) {
+			id = s->uuid++;
+			if (!s->uuid)
+				s->uuid++;
+			if (id == old_uuid)
+				return 0;
+
+			goto retry;
+		}
+
+	return id;
+}
+
 int
 scheduler_register_event(scheduler_t *s, char mode, int fd,
 			 int timeout, event_cb_t cb, void *private)
@@ -187,10 +212,12 @@ scheduler_register_event(scheduler_t *s, char mode, int fd,
 	event->deadline = now.tv_sec + timeout;
 	event->cb       = cb;
 	event->private  = private;
-	event->id       = s->uuid++;
+	event->id       = get_free_id(s);
 
-	if (!s->uuid)
-		s->uuid++;
+	if (!event->id) {
+		free(event);
+		return -EBUSY;
+	}
 
 	list_add_tail(&event->next, &s->events);
 
-- 
1.9.3

* [RFC Patch v2 33/45] blktap2: don't return negative event id
  2014-08-08  7:00 [RFC Patch v2 00/45] COarse-grain LOck-stepping Virtual Machines for Non-stop Service Wen Congyang
                   ` (31 preceding siblings ...)
  2014-08-08  7:01 ` [RFC Patch v2 32/45] blktap2: use correct way to get free event id Wen Congyang
@ 2014-08-08  7:01 ` Wen Congyang
  2014-08-08  7:01 ` [RFC Patch v2 34/45] blktap2: use correct way to define array Wen Congyang
                   ` (13 subsequent siblings)
  46 siblings, 0 replies; 64+ messages in thread
From: Wen Congyang @ 2014-08-08  7:01 UTC (permalink / raw)
  To: xen devel
  Cc: Ian Campbell, Wen Congyang, Ian Jackson, Jiang Yunhong,
	Dong Eddie, Yang Hongyang, Lai Jiangshan

If registering a new event fails, we return a negative value, so a
valid event id must never be negative. Skip negative ids when
allocating a new one.

Also fix the check of the return value in tapdisk_control_accept(),
which only tested for -1.

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
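Note for readers, not part of the patch: a minimal C sketch of an id
allocator that never hands out 0 or a negative value and that skips ids
still in use. The function name and the flat array of used ids are
assumptions made for this note; the scheduler walks its own event list.
The sketch wraps explicitly at INT_MAX instead of relying on the counter
going negative, because signed overflow is undefined in C.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/*
 * Hypothetical allocator: ids[] holds the n ids currently in use and
 * *next is the next candidate. Returns a free id in [1, INT_MAX], or 0
 * if every candidate is taken. 0 and negative values are never handed
 * out, so callers can keep using them for "invalid" and "error".
 */
int alloc_event_id(const int *ids, size_t n, int *next)
{
	int start, id;
	size_t i;

	assert(next != NULL && *next >= 1);
	start = *next;

	for (;;) {
		id = *next;
		/* advance with an explicit wrap back to 1 */
		*next = (id == INT_MAX) ? 1 : id + 1;

		for (i = 0; i < n; i++)
			if (ids[i] == id)
				break;
		if (i == n)
			return id;	/* nobody uses this id */

		if (*next == start)
			return 0;	/* tried every candidate */
	}
}
```

With ids 5, 6 and INT_MAX in use, a candidate of 5 yields 7, and a
candidate of INT_MAX wraps around and yields 1.
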
 tools/blktap2/drivers/scheduler.c       | 8 ++++----
 tools/blktap2/drivers/tapdisk-control.c | 2 +-
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/tools/blktap2/drivers/scheduler.c b/tools/blktap2/drivers/scheduler.c
index dd608dd..e07528b 100644
--- a/tools/blktap2/drivers/scheduler.c
+++ b/tools/blktap2/drivers/scheduler.c
@@ -167,15 +167,15 @@ get_free_id(scheduler_t *s)
 	int old_uuid = s->uuid;
 	int id = s->uuid++;
 
-	if (!s->uuid)
-		s->uuid++;
+	if (s->uuid < 0)
+		s->uuid = 1;
 
 retry:
 	scheduler_for_each_event(s, event, tmp)
 		if (event->id == id) {
 			id = s->uuid++;
-			if (!s->uuid)
-				s->uuid++;
+			if (s->uuid < 0)
+				s->uuid = 1;
 			if (id == old_uuid)
 				return 0;
 
diff --git a/tools/blktap2/drivers/tapdisk-control.c b/tools/blktap2/drivers/tapdisk-control.c
index 3a4ec8e..4e5f748 100644
--- a/tools/blktap2/drivers/tapdisk-control.c
+++ b/tools/blktap2/drivers/tapdisk-control.c
@@ -700,7 +700,7 @@ tapdisk_control_accept(event_id_t id, char mode, void *private)
 					    connection->socket, 0,
 					    tapdisk_control_handle_request,
 					    connection);
-	if (err == -1) {
+	if (err < 0) {
 		close(fd);
 		free(connection);
 		EPRINTF("failed to register new control event: %d\n", err);
-- 
1.9.3

* [RFC Patch v2 34/45] blktap2: use correct way to define array.
  2014-08-08  7:00 [RFC Patch v2 00/45] COarse-grain LOck-stepping Virtual Machines for Non-stop Service Wen Congyang
                   ` (32 preceding siblings ...)
  2014-08-08  7:01 ` [RFC Patch v2 33/45] blktap2: don't return negative " Wen Congyang
@ 2014-08-08  7:01 ` Wen Congyang
  2014-08-08  7:01 ` [RFC Patch v2 35/45] blktap2: connect to backup asynchronously Wen Congyang
                   ` (12 subsequent siblings)
  46 siblings, 0 replies; 64+ messages in thread
From: Wen Congyang @ 2014-08-08  7:01 UTC (permalink / raw)
  To: xen devel
  Cc: Ian Campbell, Wen Congyang, Ian Jackson, Jiang Yunhong,
	Dong Eddie, Yang Hongyang, Lai Jiangshan

Currently, we define arrays like this:
type array[] = {
    [index] = xxx,
    0,
};
The trailing 0 is stored in array[index+1], so that element will be
NULL. If index is not the highest index in the array, the 0 overrides
another entry.

tapdisk_vhd_index is not defined, but array[DISK_TYPE_VINDEX]
is overridden, so the problem does not show up when building
the source.

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
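Note for readers, not part of the patch: a standalone C sketch of the
initializer pitfall described above. The enum and array names are made
up for this note. A positional initializer that follows a designator
continues at the next index, so it can silently replace an entry that
was set earlier in the list (some compilers warn about this).

```c
#include <assert.h>
#include <stddef.h>

enum { TYPE_A = 0, TYPE_B = 1, TYPE_C = 2, TYPE_MAX = 3 };

/*
 * Mirrors the old pattern: the highest index is not listed last, so the
 * trailing positional 0 lands on index TYPE_B + 1 == TYPE_C and replaces
 * "c". The array has 3 elements and no terminator slot of its own.
 */
const char *old_style[] = {
	[TYPE_A] = "a",
	[TYPE_C] = "c",
	[TYPE_B] = "b",
	0,
};

/* Pattern used by the patch: the terminator gets an explicit index. */
const char *new_style[] = {
	[TYPE_A]   = "a",
	[TYPE_C]   = "c",
	[TYPE_B]   = "b",
	[TYPE_MAX] = NULL,
};

/* The explicit terminator makes the array size independent of ordering. */
static_assert(sizeof(new_style) / sizeof(new_style[0]) == TYPE_MAX + 1,
	      "new_style must have a slot for every type plus a terminator");
```

Here old_style[TYPE_C] ends up NULL, while new_style keeps all three
entries and has its terminator at TYPE_MAX.
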
 tools/blktap2/drivers/tapdisk-disktype.c | 12 ++----------
 tools/blktap2/drivers/tapdisk-disktype.h |  2 +-
 2 files changed, 3 insertions(+), 11 deletions(-)

diff --git a/tools/blktap2/drivers/tapdisk-disktype.c b/tools/blktap2/drivers/tapdisk-disktype.c
index e9a6890..8d1383b 100644
--- a/tools/blktap2/drivers/tapdisk-disktype.c
+++ b/tools/blktap2/drivers/tapdisk-disktype.c
@@ -82,12 +82,6 @@ static const disk_info_t block_cache_disk = {
        1,
 };
 
-static const disk_info_t vhd_index_disk = {
-       "vhdi",
-       "vhd index image (vhdi)",
-       1,
-};
-
 static const disk_info_t log_disk = {
 	"log",
 	"write logger (log)",
@@ -110,9 +104,8 @@ const disk_info_t *tapdisk_disk_types[] = {
 	[DISK_TYPE_QCOW]	= &qcow_disk,
 	[DISK_TYPE_BLOCK_CACHE] = &block_cache_disk,
 	[DISK_TYPE_LOG]	= &log_disk,
-	[DISK_TYPE_VINDEX]	= &vhd_index_disk,
 	[DISK_TYPE_REMUS]	= &remus_disk,
-	0,
+	[DISK_TYPE_MAX]		= NULL,
 };
 
 extern struct tap_disk tapdisk_aio;
@@ -137,10 +130,9 @@ const struct tap_disk *tapdisk_disk_drivers[] = {
 	[DISK_TYPE_RAM]         = &tapdisk_ram,
 	[DISK_TYPE_QCOW]        = &tapdisk_qcow,
 	[DISK_TYPE_BLOCK_CACHE] = &tapdisk_block_cache,
-	[DISK_TYPE_VINDEX]      = &tapdisk_vhd_index,
 	[DISK_TYPE_LOG]         = &tapdisk_log,
 	[DISK_TYPE_REMUS]       = &tapdisk_remus,
-	0,
+	[DISK_TYPE_MAX]         = NULL,
 };
 
 int
diff --git a/tools/blktap2/drivers/tapdisk-disktype.h b/tools/blktap2/drivers/tapdisk-disktype.h
index b697eea..c574990 100644
--- a/tools/blktap2/drivers/tapdisk-disktype.h
+++ b/tools/blktap2/drivers/tapdisk-disktype.h
@@ -39,7 +39,7 @@
 #define DISK_TYPE_BLOCK_CACHE 7
 #define DISK_TYPE_LOG         8
 #define DISK_TYPE_REMUS       9
-#define DISK_TYPE_VINDEX      10
+#define DISK_TYPE_MAX         10
 
 #define DISK_TYPE_NAME_MAX    32
 
-- 
1.9.3

* [RFC Patch v2 35/45] blktap2: connect to backup asynchronously
  2014-08-08  7:00 [RFC Patch v2 00/45] COarse-grain LOck-stepping Virtual Machines for Non-stop Service Wen Congyang
                   ` (33 preceding siblings ...)
  2014-08-08  7:01 ` [RFC Patch v2 34/45] blktap2: use correct way to define array Wen Congyang
@ 2014-08-08  7:01 ` Wen Congyang
  2014-08-08  7:01 ` [RFC Patch v2 36/45] switch to unprotected mode before closing Wen Congyang
                   ` (11 subsequent siblings)
  46 siblings, 0 replies; 64+ messages in thread
From: Wen Congyang @ 2014-08-08  7:01 UTC (permalink / raw)
  To: xen devel
  Cc: Ian Campbell, Wen Congyang, Ian Jackson, Jiang Yunhong,
	Dong Eddie, Yang Hongyang, Lai Jiangshan

tapdisk2 is a single-threaded process. If we use remus, it
blocks in primary_blocking_connect(), and the user has no
chance to talk to tapdisk2 in the meantime. So connect to
the backup asynchronously. Until the connection is
established, queue all I/O requests, and handle them once
it is established.

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
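Note for readers, not part of the patch: a simplified C sketch of the
queueing idea described above, using a fixed ring that wastes one slot
to tell "empty" from "full". The names, the int payload and the ring
size are assumptions made for this note. The patch stores td_request_t
entries and treats a full ring as a bug, whereas this sketch just
returns an error.

```c
#include <assert.h>
#include <stddef.h>

#define RING_SLOTS 4	/* illustrative; holds RING_SLOTS - 1 entries */

struct ring {
	int slots[RING_SLOTS];
	unsigned int prod;	/* next slot to fill */
	unsigned int cons;	/* next slot to drain */
};

unsigned int ring_next(unsigned int pos)
{
	return (pos + 1 == RING_SLOTS) ? 0 : pos + 1;
}

int ring_empty(const struct ring *r)
{
	return r->cons == r->prod;
}

int ring_full(const struct ring *r)
{
	return ring_next(r->prod) == r->cons;
}

/* Queue one request while the connection is still pending. */
int ring_put(struct ring *r, int req)
{
	assert(r != NULL);
	if (ring_full(r))
		return -1;
	r->slots[r->prod] = req;
	r->prod = ring_next(r->prod);
	return 0;
}

/* Once connected, take queued requests back out in arrival order. */
int ring_get(struct ring *r, int *req)
{
	assert(r != NULL && req != NULL);
	if (ring_empty(r))
		return -1;
	*req = r->slots[r->cons];
	r->cons = ring_next(r->cons);
	return 0;
}
```

A caller would ring_put() each request until the connect completes and
then loop on ring_get() to forward them in order.
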
 tools/blktap2/drivers/block-remus.c | 760 +++++++++++++++++++++++-------------
 1 file changed, 479 insertions(+), 281 deletions(-)

diff --git a/tools/blktap2/drivers/block-remus.c b/tools/blktap2/drivers/block-remus.c
index d358b44..c21f851 100644
--- a/tools/blktap2/drivers/block-remus.c
+++ b/tools/blktap2/drivers/block-remus.c
@@ -63,10 +63,28 @@
 #define RAMDISK_HASHSIZE 128
 
 /* connect retry timeout (seconds) */
-#define REMUS_CONNRETRY_TIMEOUT 10
+#define REMUS_CONNRETRY_TIMEOUT 1
 
 #define RPRINTF(_f, _a...) syslog (LOG_DEBUG, "remus: " _f, ## _a)
 
+#define UNREGISTER_EVENT(id)					\
+	do {							\
+		if (id >= 0) {					\
+			tapdisk_server_unregister_event(id);	\
+			id = -1;				\
+		}						\
+	} while (0)
+
+#define CLOSE_FD(fd)			\
+	do {				\
+		if (fd >= 0) {		\
+			close(fd);	\
+			fd = -1;	\
+		}			\
+	} while (0)
+
+#define MAX_REMUS_REQUEST       TAPDISK_DATA_REQUESTS
+
 enum tdremus_mode {
 	mode_invalid = 0,
 	mode_unprotected,
@@ -74,17 +92,21 @@ enum tdremus_mode {
 	mode_backup
 };
 
+enum {
+	ERROR_INTERNAL = -1,
+	ERROR_IO = -2,
+	ERROR_CONNECTION = -3,
+};
+
 struct tdremus_req {
-	uint64_t sector;
-	int nb_sectors;
-	char buf[4096];
+	td_request_t treq;
 };
 
 struct req_ring {
 	/* waste one slot to distinguish between empty and full */
-	struct tdremus_req requests[MAX_REQUESTS * 2 + 1];
-	unsigned int head;
-	unsigned int tail;
+	struct tdremus_req pending_requests[MAX_REMUS_REQUEST + 1];
+	unsigned int prod;
+	unsigned int cons;
 };
 
 /* TODO: This isn't very pretty, but to properly generate our own treqs (needed
@@ -144,10 +166,21 @@ struct ramdisk_write_cbdata {
 
 typedef void (*queue_rw_t) (td_driver_t *driver, td_request_t treq);
 
-/* poll_fd type for blktap2 fd system. taken from block_log.c */
+/*
+ * If cid, rid and wid are -1, fd must be -1. It means that
+ * we are in unprotected mode or we have not started to connect
+ * to the backup.
+ * If fd is a valid fd:
+ *  cid is valid, rid and wid must be invalid. It means that
+ *      the connection is in progress.
+ *  cid is invalid. rid or wid must be valid. It means that
+ *      the connection is established.
+ */
 typedef struct poll_fd {
 	int        fd;
-	event_id_t id;
+	event_id_t cid;
+	event_id_t rid;
+	event_id_t wid;
 } poll_fd_t;
 
 struct tdremus_state {
@@ -166,8 +199,11 @@ struct tdremus_state {
 	poll_fd_t server_fd;    /* server listen port */
 	poll_fd_t stream_fd;     /* replication channel */
 
-	/* queue write requests, batch-replicate at submit */
-	struct req_ring write_ring;
+	/*
+	 * queue I/O requests, batch-replicate when
+	 * the connection is established.
+	 */
+	struct req_ring queued_io;
 
 	/* ramdisk data*/
 	struct ramdisk ramdisk;
@@ -207,11 +243,13 @@ static int tdremus_close(td_driver_t *driver);
 
 static int switch_mode(td_driver_t *driver, enum tdremus_mode mode);
 static int ctl_respond(struct tdremus_state *s, const char *response);
+static int ctl_register(struct tdremus_state *s);
+static void ctl_unregister(struct tdremus_state *s);
 
 /* ring functions */
-static inline unsigned int ring_next(struct req_ring* ring, unsigned int pos)
+static inline unsigned int ring_next(unsigned int pos)
 {
-	if (++pos >= MAX_REQUESTS * 2 + 1)
+	if (++pos >= MAX_REMUS_REQUEST + 1)
 		return 0;
 
 	return pos;
@@ -219,13 +257,26 @@ static inline unsigned int ring_next(struct req_ring* ring, unsigned int pos)
 
 static inline int ring_isempty(struct req_ring* ring)
 {
-	return ring->head == ring->tail;
+	return ring->cons == ring->prod;
 }
 
 static inline int ring_isfull(struct req_ring* ring)
 {
-	return ring_next(ring, ring->tail) == ring->head;
+	return ring_next(ring->prod) == ring->cons;
+}
+
+static void ring_add_request(struct req_ring *ring, const td_request_t *treq)
+{
+	/* If ring is full, it means that tapdisk2 has some bug */
+	if (ring_isfull(ring)) {
+		RPRINTF("OOPS, ring is full\n");
+		exit(1);
+	}
+
+	ring->pending_requests[ring->prod].treq = *treq;
+	ring->prod = ring_next(ring->prod);
 }
+
 /* Prototype declarations */
 static int ramdisk_flush(td_driver_t *driver, struct tdremus_state* s);
 
@@ -728,30 +779,39 @@ static int mwrite(int fd, void* buf, size_t len)
 
 static void inline close_stream_fd(struct tdremus_state *s)
 {
-	if (s->stream_fd.fd < 0)
-		return;
 
-	/* XXX: -2 is magic. replace with macro perhaps? */
-	tapdisk_server_unregister_event(s->stream_fd.id);
-	close(s->stream_fd.fd);
-	s->stream_fd.fd = -2;
+	UNREGISTER_EVENT(s->stream_fd.cid);
+	UNREGISTER_EVENT(s->stream_fd.rid);
+	UNREGISTER_EVENT(s->stream_fd.wid);
+
+	/* close the connection */
+	CLOSE_FD(s->stream_fd.fd);
 }
 
 static void close_server_fd(struct tdremus_state *s)
 {
-	if (s->server_fd.fd < 0)
-		return;
-
-	tapdisk_server_unregister_event(s->server_fd.id);
-	s->server_fd.id = -1;
-	close(s->stream_fd.fd);
-	s->stream_fd.fd = -1;
+	UNREGISTER_EVENT(s->server_fd.cid);
+	CLOSE_FD(s->server_fd.fd);
 }
 
 /* primary functions */
 static void remus_client_event(event_id_t, char mode, void *private);
 static void remus_connect_event(event_id_t id, char mode, void *private);
 static void remus_retry_connect_event(event_id_t id, char mode, void *private);
+static int primary_forward_request(struct tdremus_state *s,
+				   const td_request_t *treq);
+
+/*
+ * Called when we cannot connect to the backup, or when we hit an I/O
+ * error while reading/writing.
+ */
+static void primary_failed(struct tdremus_state *s, int rc)
+{
+	close_stream_fd(s);
+	if (rc == ERROR_INTERNAL)
+		RPRINTF("switch to unprotected mode due to internal error");
+	switch_mode(s->tdremus_driver, mode_unprotected);
+}
 
 static int primary_do_connect(struct tdremus_state *state)
 {
@@ -760,281 +820,247 @@ static int primary_do_connect(struct tdremus_state *state)
 	int rc;
 	int flags;
 
-	RPRINTF("client connecting to %s:%d...\n", inet_ntoa(state->sa.sin_addr), ntohs(state->sa.sin_port));
+	RPRINTF("client connecting to %s:%d...\n",
+		inet_ntoa(state->sa.sin_addr), ntohs(state->sa.sin_port));
 
 	if ((fd = socket(PF_INET, SOCK_STREAM, 0)) < 0) {
 		RPRINTF("could not create client socket: %d\n", errno);
-		return -1;
+		return ERROR_INTERNAL;
 	}
+	state->stream_fd.fd = fd;
 
 	/* make socket nonblocking */
 	if ((flags = fcntl(fd, F_GETFL, 0)) == -1)
 		flags = 0;
-	if (fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1)
-		return -1;
+	if (fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1) {
+		RPRINTF("error setting fd %d to non block mode\n", fd);
+		return ERROR_INTERNAL;
+	}
 
-	/* once we have created the socket and populated the address, we can now start
-	 * our non-blocking connect. rather than duplicating code we trigger a timeout
-	 * on the socket fd, which calls out nonblocking connect code
+	/*
+	 * once we have created the socket and populated the address,
+	 * we can now start our non-blocking connect. rather than
+	 * duplicating code we trigger a timeout on the socket fd,
+	 * which calls our nonblocking connect code
 	 */
-	if((id = tapdisk_server_register_event(SCHEDULER_POLL_TIMEOUT, fd, 0, remus_retry_connect_event, state)) < 0) {
-		RPRINTF("error registering timeout client connection event handler: %s\n", strerror(id));
-		/* TODO: we leak a fd here */
-		return -1;
+	if((id = tapdisk_server_register_event(SCHEDULER_POLL_TIMEOUT, fd, 0,
+					       remus_retry_connect_event,
+					       state)) < 0) {
+		RPRINTF("error registering timeout client connection event handler: %s\n",
+			strerror(id));
+		return ERROR_INTERNAL;
 	}
-	state->stream_fd.fd = fd;
-	state->stream_fd.id = id;
+
+	state->stream_fd.cid = id;
 	return 0;
 }
 
-static int primary_blocking_connect(struct tdremus_state *state)
+static int remus_handle_queued_io(struct tdremus_state *s)
 {
-	int fd;
-	int id;
+	struct req_ring *queued_io = &s->queued_io;
+	unsigned int cons;
+	td_request_t *treq;
 	int rc;
-	int flags;
-
-	RPRINTF("client connecting to %s:%d...\n", inet_ntoa(state->sa.sin_addr), ntohs(state->sa.sin_port));
 
-	if ((fd = socket(PF_INET, SOCK_STREAM, 0)) < 0) {
-		RPRINTF("could not create client socket: %d\n", errno);
-		return -1;
-	}
+	while (!ring_isempty(queued_io)) {
+		cons = queued_io->cons;
+		treq = &queued_io->pending_requests[cons].treq;
 
-	do {
-		if ((rc = connect(fd, (struct sockaddr *)&state->sa,
-		    sizeof(state->sa))) < 0)
-		{
-			if (errno == ECONNREFUSED) {
-				RPRINTF("connection refused -- retrying in 1 second\n");
-				sleep(1);
-			} else {
-				RPRINTF("connection failed: %d\n", errno);
-				close(fd);
-				return -1;
-			}
+		if (treq->op == TD_OP_WRITE) {
+			rc = primary_forward_request(s, treq);
+			if (rc)
+				return rc;
 		}
-	} while (rc < 0);
-
-	RPRINTF("client connected\n");
-
-	/* make socket nonblocking */
-	if ((flags = fcntl(fd, F_GETFL, 0)) == -1)
-		flags = 0;
-	if (fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1)
-	{
-		RPRINTF("error making socket nonblocking\n");
-		close(fd);
-		return -1;
-	}
 
-	if((id = tapdisk_server_register_event(SCHEDULER_POLL_READ_FD, fd, 0, remus_client_event, state)) < 0) {
-		RPRINTF("error registering client event handler: %s\n", strerror(id));
-		close(fd);
-		return -1;
+		td_forward_request(*treq);
+		queued_io->cons = ring_next(cons);
 	}
 
-	state->stream_fd.fd = fd;
-	state->stream_fd.id = id;
 	return 0;
 }
 
-/* on read, just pass request through */
-static void primary_queue_read(td_driver_t *driver, td_request_t treq)
-{
-	/* just pass read through */
-	td_forward_request(treq);
-}
-
-/* TODO:
- * The primary uses mwrite() to write the contents of a write request to the
- * backup. This effectively blocks until all data has been copied into a system
- * buffer or a timeout has occured. We may wish to instead use tapdisk's
- * nonblocking i/o interface, tapdisk_server_register_event(), to set timeouts
- * and write data in an asynchronous fashion.
- */
-static void primary_queue_write(td_driver_t *driver, td_request_t treq)
+static int remus_connection_done(struct tdremus_state *s)
 {
-	struct tdremus_state *s = (struct tdremus_state *)driver->data;
-
-	char header[sizeof(uint32_t) + sizeof(uint64_t)];
-	uint32_t *sectors = (uint32_t *)header;
-	uint64_t *sector = (uint64_t *)(header + sizeof(uint32_t));
+	event_id_t id;
 
-	// RPRINTF("write: stream_fd.fd: %d\n", s->stream_fd.fd);
+	/* the connect succeeded */
+	/* unregister this function and register a new event handler */
+	tapdisk_server_unregister_event(s->stream_fd.cid);
+	s->stream_fd.cid = -1;
 
-	/* -1 means we haven't connected yet, -2 means the connection was lost */
-	if(s->stream_fd.fd == -1) {
-		RPRINTF("connecting to backup...\n");
-		primary_blocking_connect(s);
+	id = tapdisk_server_register_event(SCHEDULER_POLL_READ_FD, s->stream_fd.fd,
+					   0, remus_client_event, s);
+	if(id < 0) {
+		RPRINTF("error registering client event handler: %s\n",
+			strerror(id));
+		return ERROR_INTERNAL;
 	}
+	s->stream_fd.rid = id;
 
-	*sectors = treq.secs;
-	*sector = treq.sec;
-
-	if (mwrite(s->stream_fd.fd, TDREMUS_WRITE, strlen(TDREMUS_WRITE)) < 0)
-		goto fail;
-	if (mwrite(s->stream_fd.fd, header, sizeof(header)) < 0)
-		goto fail;
+	/* handle the queued requests */
+	return remus_handle_queued_io(s);
+}
 
-	if (mwrite(s->stream_fd.fd, treq.buf, treq.secs * driver->info.sector_size) < 0)
-		goto fail;
+static int remus_retry_connect(struct tdremus_state *s)
+{
+	event_id_t id;
 
-	td_forward_request(treq);
+	tapdisk_server_unregister_event(s->stream_fd.cid);
+	s->stream_fd.cid = -1;
 
-	return;
+	RPRINTF("connect to backup 1 second later");
+	id = tapdisk_server_register_event(SCHEDULER_POLL_TIMEOUT,
+					   s->stream_fd.fd,
+					   REMUS_CONNRETRY_TIMEOUT,
+					   remus_retry_connect_event, s);
+	if (id < 0) {
+		RPRINTF("error registering timeout client connection event handler: %s\n",
+			strerror(id));
+		return ERROR_INTERNAL;
+	}
 
- fail:
-	/* switch to unprotected mode and tell tapdisk to retry */
-	RPRINTF("write request replication failed, switching to unprotected mode");
-	switch_mode(s->tdremus_driver, mode_unprotected);
-	td_complete_request(treq, -EBUSY);
+	s->stream_fd.cid = id;
+	return 0;
 }
 
-
-/* It is called when the user writes "flush" to control file */
-static int client_flush(td_driver_t *driver)
+static int remus_wait_connect_done(struct tdremus_state *s)
 {
-	struct tdremus_state *s = (struct tdremus_state *)driver->data;
-
-	// RPRINTF("committing output\n");
+	event_id_t id;
 
-	if (s->stream_fd.fd == -1)
-		/* connection not yet established, nothing to flush */
-		return 0;
+	tapdisk_server_unregister_event(s->stream_fd.cid);
+	s->stream_fd.cid = -1;
 
-	if (mwrite(s->stream_fd.fd, TDREMUS_COMMIT,
-	    strlen(TDREMUS_COMMIT)) < 0) {
-		RPRINTF("error flushing output");
-		close_stream_fd(s);
-		return -1;
+	id = tapdisk_server_register_event(SCHEDULER_POLL_WRITE_FD,
+					   s->stream_fd.fd, 0,
+					   remus_connect_event, s);
+	if (id < 0) {
+		RPRINTF("error registering client connection event handler: %s\n",
+			strerror(id));
+		return ERROR_INTERNAL;
 	}
+	s->stream_fd.cid = id;
 
 	return 0;
 }
 
-static int server_flush(td_driver_t *driver)
+/* return 1 if we need to reconnect to backup */
+static int check_connect_errno(int err)
 {
-	struct tdremus_state *s = (struct tdremus_state *)driver->data;
-	/* 
-	 * Nothing to flush in beginning.
+	/*
+	 * The fd is non-blocking, so we will not get ETIMEDOUT
+	 * after calling connect(). We can only get this errno
+	 * from getsockopt().
 	 */
-	if (!s->ramdisk.prev)
-		return 0;
-	/* Try to flush any remaining requests */
-	return ramdisk_flush(driver, s);	
-}
-
-static int primary_start(td_driver_t *driver)
-{
-	struct tdremus_state *s = (struct tdremus_state *)driver->data;
-
-	RPRINTF("activating client mode\n");
-
-	tapdisk_remus.td_queue_read = primary_queue_read;
-	tapdisk_remus.td_queue_write = primary_queue_write;
-
-	s->stream_fd.fd = -1;
-	s->stream_fd.id = -1;
+	if (err == ECONNREFUSED || err == ENETUNREACH ||
+	    err == EAGAIN || err == ECONNABORTED ||
+	    err == ETIMEDOUT)
+	    return 1;
 
 	return 0;
 }
 
-/* timeout callback */
 static void remus_retry_connect_event(event_id_t id, char mode, void *private)
 {
 	struct tdremus_state *s = (struct tdremus_state *)private;
+	int rc, ret;
 
 	/* do a non-blocking connect */
-	if (connect(s->stream_fd.fd, (struct sockaddr *)&s->sa, sizeof(s->sa))
-	    && errno != EINPROGRESS)
-	{
-		if(errno == ECONNREFUSED || errno == ENETUNREACH || errno == EAGAIN || errno == ECONNABORTED)
-		{
-			/* try again in a second */
-			tapdisk_server_unregister_event(s->stream_fd.id);
-			if((id = tapdisk_server_register_event(SCHEDULER_POLL_TIMEOUT, s->stream_fd.fd, REMUS_CONNRETRY_TIMEOUT, remus_retry_connect_event, s)) < 0) {
-				RPRINTF("error registering timeout client connection event handler: %s\n", strerror(id));
-				return;
-			}
-			s->stream_fd.id = id;
-		}
-		else
-		{
-			/* not recoverable */
-			RPRINTF("error connection to server %s\n", strerror(errno));
+	ret = connect(s->stream_fd.fd,
+		      (struct sockaddr *)&s->sa,
+		      sizeof(s->sa));
+	if (ret) {
+		if (errno == EINPROGRESS) {
+			/*
+			 * the connect returned EINPROGRESS (nonblocking
+			 * connect) we must wait for the fd to be writeable
+			 * to determine if the connect worked
+			 */
+			rc = remus_wait_connect_done(s);
+			if (rc)
+				goto fail;
 			return;
 		}
-	}
-	else
-	{
-		/* the connect returned EINPROGRESS (nonblocking connect) we must wait for the fd to be writeable to determine if the connect worked */
 
-		tapdisk_server_unregister_event(s->stream_fd.id);
-		if((id = tapdisk_server_register_event(SCHEDULER_POLL_WRITE_FD, s->stream_fd.fd, 0, remus_connect_event, s)) < 0) {
-			RPRINTF("error registering client connection event handler: %s\n", strerror(id));
+		if (check_connect_errno(errno)) {
+			rc = remus_retry_connect(s);
+			if (rc)
+				goto fail;
 			return;
 		}
-		s->stream_fd.id = id;
+
+		/* not recoverable */
+		RPRINTF("error connection to server %s\n", strerror(errno));
+		rc = ERROR_CONNECTION;
+		goto fail;
 	}
+
+	/* The connection is established unexpectedly */
+	rc = remus_connection_done(s);
+	if (rc)
+		goto fail;
+
+	return;
+
+fail:
+	primary_failed(s, rc);
+	return;
 }
 
 /* callback when nonblocking connect() is finished */
-/* called only by primary in unprotected state */
 static void remus_connect_event(event_id_t id, char mode, void *private)
 {
 	int socket_errno;
 	socklen_t socket_errno_size;
 	struct tdremus_state *s = (struct tdremus_state *)private;
+	int rc;
 
-	/* check to se if the connect succeeded */
+	/* check to see if the connect succeeded */
 	socket_errno_size = sizeof(socket_errno);
-	if (getsockopt(s->stream_fd.fd, SOL_SOCKET, SO_ERROR, &socket_errno, &socket_errno_size)) {
+	if (getsockopt(s->stream_fd.fd, SOL_SOCKET, SO_ERROR,
+		       &socket_errno, &socket_errno_size)) {
 		RPRINTF("error getting socket errno\n");
 		return;
 	}
 
 	RPRINTF("socket connect returned %d\n", socket_errno);
 
-	if(socket_errno)
-	{
+	if (socket_errno) {
 		/* the connect did not succeed */
+		if (check_connect_errno(socket_errno)) {
+			/*
+			 * we can probably assume that the backup is down.
+			 * just try again later
+			 */
+			rc = remus_retry_connect(s);
+			if (rc)
+				goto fail;
 
-		if(socket_errno == ECONNREFUSED || socket_errno == ENETUNREACH || socket_errno == ETIMEDOUT
-		   || socket_errno == ECONNABORTED || socket_errno == EAGAIN)
-		{
-			/* we can probably assume that the backup is down. just try again later */
-			tapdisk_server_unregister_event(s->stream_fd.id);
-			if((id = tapdisk_server_register_event(SCHEDULER_POLL_TIMEOUT, s->stream_fd.fd, REMUS_CONNRETRY_TIMEOUT, remus_retry_connect_event, s)) < 0) {
-				RPRINTF("error registering timeout client connection event handler: %s\n", strerror(id));
-				return;
-			}
-			s->stream_fd.id = id;
-		}
-		else
-		{
-			RPRINTF("socket connect returned %d, giving up\n", socket_errno);
-		}
-	}
-	else
-	{
-		/* the connect succeeded */
-
-		/* unregister this function and register a new event handler */
-		tapdisk_server_unregister_event(s->stream_fd.id);
-		if((id = tapdisk_server_register_event(SCHEDULER_POLL_READ_FD, s->stream_fd.fd, 0, remus_client_event, s)) < 0) {
-			RPRINTF("error registering client event handler: %s\n", strerror(id));
 			return;
+		} else {
+			RPRINTF("socket connect returned %d, giving up\n",
+				socket_errno);
+			rc = ERROR_CONNECTION;
+			goto fail;
 		}
-		s->stream_fd.id = id;
 
-		/* switch from unprotected to protected client */
-		switch_mode(s->tdremus_driver, mode_primary);
+		return;
 	}
+
+	rc = remus_connection_done(s);
+	if (rc)
+		goto fail;
+
+	return;
+
+fail:
+	primary_failed(s, rc);
 }
 
 
-/* we install this event handler on the primary once we have connected to the backup */
+/*
+ * we install this event handler on the primary once we have
+ * connected to the backup.
+ */
 /* wait for "done" message to commit checkpoint */
 static void remus_client_event(event_id_t id, char mode, void *private)
 {
@@ -1043,9 +1069,12 @@ static void remus_client_event(event_id_t id, char mode, void *private)
 	int rc;
 
 	if (mread(s->stream_fd.fd, req, sizeof(req) - 1) < 0) {
-		/* replication stream closed or otherwise broken (timeout, reset, &c) */
+		/*
+		 * replication stream closed or otherwise broken
+		 * (timeout, reset, &c)
+		 */
 		RPRINTF("error reading from backup\n");
-		close_stream_fd(s);
+		primary_failed(s, ERROR_IO);
 		return;
 	}
 
@@ -1056,22 +1085,169 @@ static void remus_client_event(event_id_t id, char mode, void *private)
 		ctl_respond(s, TDREMUS_DONE);
 	else {
 		RPRINTF("received unknown message: %s\n", req);
-		close_stream_fd(s);
+		primary_failed(s, ERROR_IO);
+	}
+
+	return;
+}
+
+static void primary_queue_read(td_driver_t *driver, td_request_t treq)
+{
+	struct tdremus_state *s = (struct tdremus_state *)driver->data;
+	struct req_ring *ring = &s->queued_io;
+
+	if (ring_isempty(ring)) {
+		/* just pass read through */
+		td_forward_request(treq);
+		return;
+	}
+
+	ring_add_request(ring, &treq);
+}
+
+static int primary_forward_request(struct tdremus_state *s,
+				   const td_request_t *treq)
+{
+	char header[sizeof(uint32_t) + sizeof(uint64_t)];
+	uint32_t *sectors = (uint32_t *)header;
+	uint64_t *sector = (uint64_t *)(header + sizeof(uint32_t));
+	td_driver_t *driver = s->tdremus_driver;
+
+	*sectors = treq->secs;
+	*sector = treq->sec;
+
+	if (mwrite(s->stream_fd.fd, TDREMUS_WRITE, strlen(TDREMUS_WRITE)) < 0)
+		return ERROR_IO;
+
+	if (mwrite(s->stream_fd.fd, header, sizeof(header)) < 0)
+		return ERROR_IO;
+
+	if (mwrite(s->stream_fd.fd, treq->buf,
+	    treq->secs * driver->info.sector_size) < 0)
+		return ERROR_IO;
+
+	return 0;
+}
+
+/* TODO:
+ * The primary uses mwrite() to write the contents of a write request to the
+ * backup. This effectively blocks until all data has been copied into a system
+ * buffer or a timeout has occurred. We may wish to instead use tapdisk's
+ * nonblocking i/o interface, tapdisk_server_register_event(), to set timeouts
+ * and write data in an asynchronous fashion.
+ */
+static void primary_queue_write(td_driver_t *driver, td_request_t treq)
+{
+	struct tdremus_state *s = (struct tdremus_state *)driver->data;
+	int rc;
+
+	// RPRINTF("write: stream_fd.fd: %d\n", s->stream_fd.fd);
+
+	if(s->stream_fd.fd < 0) {
+		RPRINTF("connecting to backup...\n");
+		rc = primary_do_connect(s);
+		if (rc)
+			goto fail;
+	}
+
+	/* The connection is not established, just queue the request */
+	if (s->stream_fd.cid >= 0) {
+		ring_add_request(&s->queued_io, &treq);
+		return;
 	}
 
+	/* The connection is established */
+	rc = primary_forward_request(s, &treq);
+	if (rc)
+		goto fail;
+
+	td_forward_request(treq);
+
 	return;
+
+fail:
+	/* switch to unprotected mode and forward the request */
+	RPRINTF("write request replication failed, switching to unprotected mode");
+	primary_failed(s, rc);
+	td_forward_request(treq);
+}
+
+/* Called when the user writes "flush" to the control file. */
+static int client_flush(td_driver_t *driver)
+{
+	struct tdremus_state *s = (struct tdremus_state *)driver->data;
+
+	// RPRINTF("committing output\n");
+
+	if (s->stream_fd.fd == -1)
+		/* connection not yet established, nothing to flush */
+		return 0;
+
+	if (mwrite(s->stream_fd.fd, TDREMUS_COMMIT,
+	    strlen(TDREMUS_COMMIT)) < 0) {
+		RPRINTF("error flushing output");
+		primary_failed(s, ERROR_IO);
+		return -1;
+	}
+
+	return 0;
+}
+
+/* It is called when switching the mode from primary to unprotected */
+static int primary_flush(td_driver_t *driver)
+{
+	struct tdremus_state *s = driver->data;
+	struct req_ring *ring = &s->queued_io;
+	unsigned int cons;
+
+	if (ring_isempty(ring))
+		return 0;
+
+	while (!ring_isempty(ring)) {
+		cons = ring->cons;
+		ring->cons = ring_next(cons);
+
+		td_forward_request(ring->pending_requests[cons].treq);
+	}
+
+	return 0;
+}
+
+static int primary_start(td_driver_t *driver)
+{
+	struct tdremus_state *s = (struct tdremus_state *)driver->data;
+
+	RPRINTF("activating client mode\n");
+
+	tapdisk_remus.td_queue_read = primary_queue_read;
+	tapdisk_remus.td_queue_write = primary_queue_write;
+	s->queue_flush = primary_flush;
+
+	s->stream_fd.fd = -1;
+	s->stream_fd.cid = -1;
+	s->stream_fd.rid = -1;
+	s->stream_fd.wid = -1;
+
+	return 0;
 }
 
 /* backup functions */
 static void remus_server_event(event_id_t id, char mode, void *private);
 
+/* It is called when we find some I/O error */
+static void backup_failed(struct tdremus_state *s, int rc)
+{
+	close_stream_fd(s);
+	close_server_fd(s);
+	/* We will switch to unprotected mode in backup_queue_write() */
+}
+
 /* returns the socket that receives write requests */
 static void remus_server_accept(event_id_t id, char mode, void* private)
 {
 	struct tdremus_state* s = (struct tdremus_state *) private;
 
 	int stream_fd;
-	event_id_t cid;
 
 	/* XXX: add address-based black/white list */
 	if ((stream_fd = accept(s->server_fd.fd, NULL, NULL)) < 0) {
@@ -1079,68 +1255,80 @@ static void remus_server_accept(event_id_t id, char mode, void* private)
 		return;
 	}
 
-	/* TODO: check to see if we are already replicating. if so just close the
-	 * connection (or do something smarter) */
+	/*
+	 * TODO: check to see if we are already replicating.
+	 * if so just close the connection (or do something
+	 * smarter)
+	 */
 	RPRINTF("server accepted connection\n");
 
 	/* add tapdisk event for replication stream */
-	cid = tapdisk_server_register_event(SCHEDULER_POLL_READ_FD, stream_fd, 0,
-					    remus_server_event, s);
+	id = tapdisk_server_register_event(SCHEDULER_POLL_READ_FD, stream_fd, 0,
+					   remus_server_event, s);
 
-	if(cid < 0) {
-		RPRINTF("error registering connection event handler: %s\n", strerror(errno));
+	if (id < 0) {
+		RPRINTF("error registering connection event handler: %s\n",
+			strerror(errno));
 		close(stream_fd);
 		return;
 	}
 
 	/* store replication file descriptor */
 	s->stream_fd.fd = stream_fd;
-	s->stream_fd.id = cid;
+	s->stream_fd.rid = id;
 }
 
 /* returns -2 if EADDRNOTAVAIL */
 static int remus_bind(struct tdremus_state* s)
 {
-//  struct sockaddr_in sa;
 	int opt;
 	int rc = -1;
+	event_id_t id;
 
 	if ((s->server_fd.fd = socket(AF_INET, SOCK_STREAM, 0)) < 0) {
 		RPRINTF("could not create server socket: %d\n", errno);
 		return rc;
 	}
-	opt = 1;
-	if (setsockopt(s->server_fd.fd, SOL_SOCKET, SO_REUSEADDR, &opt, sizeof(opt)) < 0)
-		RPRINTF("Error setting REUSEADDR on %d: %d\n", s->server_fd.fd, errno);
 
-	if (bind(s->server_fd.fd, (struct sockaddr *)&s->sa, sizeof(s->sa)) < 0) {
-		RPRINTF("could not bind server socket %d to %s:%d: %d %s\n", s->server_fd.fd,
-			inet_ntoa(s->sa.sin_addr), ntohs(s->sa.sin_port), errno, strerror(errno));
-		if (errno != EADDRINUSE)
+	opt = 1;
+	if (setsockopt(s->server_fd.fd, SOL_SOCKET,
+		       SO_REUSEADDR, &opt, sizeof(opt)) < 0)
+		RPRINTF("Error setting REUSEADDR on %d: %d\n",
+			s->server_fd.fd, errno);
+
+	if (bind(s->server_fd.fd, (struct sockaddr *)&s->sa,
+		 sizeof(s->sa)) < 0) {
+		RPRINTF("could not bind server socket %d to %s:%d: %d %s\n",
+			s->server_fd.fd, inet_ntoa(s->sa.sin_addr),
+			ntohs(s->sa.sin_port), errno, strerror(errno));
+		if (errno == EADDRNOTAVAIL)
 			rc = -2;
 		goto err_sfd;
 	}
+
 	if (listen(s->server_fd.fd, 10)) {
 		RPRINTF("could not listen on socket: %d\n", errno);
 		goto err_sfd;
 	}
 
-	/* The socket s now bound to the address and listening so we may now register
-   * the fd with tapdisk */
-
-	if((s->server_fd.id = tapdisk_server_register_event(SCHEDULER_POLL_READ_FD,
-							    s->server_fd.fd, 0,
-							    remus_server_accept, s)) < 0) {
+	/*
+	 * The socket is now bound to the address and listening, so we
+	 * may now register the fd with tapdisk
+	 */
+	id =  tapdisk_server_register_event(SCHEDULER_POLL_READ_FD,
+					    s->server_fd.fd, 0,
+					    remus_server_accept, s);
+	if (id < 0) {
 		RPRINTF("error registering server connection event handler: %s",
-			strerror(s->server_fd.id));
+			strerror(id));
 		goto err_sfd;
 	}
+	s->server_fd.cid = id;
 
 	return 0;
 
- err_sfd:
-	close(s->server_fd.fd);
-	s->server_fd.fd = -1;
+err_sfd:
+	CLOSE_FD(s->server_fd.fd);
 
 	return rc;
 }
@@ -1190,10 +1378,21 @@ void backup_queue_write(td_driver_t *driver, td_request_t treq)
 	td_complete_request(treq, -EBUSY);
 }
 
+static int server_flush(td_driver_t *driver)
+{
+	struct tdremus_state *s = (struct tdremus_state *)driver->data;
+	/*
+	 * Nothing to flush in beginning.
+	 */
+	if (!s->ramdisk.prev)
+		return 0;
+	/* Try to flush any remaining requests */
+	return ramdisk_flush(driver, s);
+}
+
 static int backup_start(td_driver_t *driver)
 {
 	struct tdremus_state *s = (struct tdremus_state *)driver->data;
-	int fd;
 
 	if (ramdisk_start(driver) < 0)
 		return -1;
@@ -1201,16 +1400,15 @@ static int backup_start(td_driver_t *driver)
 	tapdisk_remus.td_queue_read = backup_queue_read;
 	tapdisk_remus.td_queue_write = backup_queue_write;
 	s->queue_flush = server_flush;
-	/* TODO set flush function */
 	return 0;
 }
 
-static int server_do_wreq(td_driver_t *driver)
+static void server_do_wreq(td_driver_t *driver)
 {
 	struct tdremus_state *s = (struct tdremus_state *)driver->data;
 	static tdremus_wire_t twreq;
 	char buf[4096];
-	int len, rc;
+	int len, rc = ERROR_IO;
 
 	char header[sizeof(uint32_t) + sizeof(uint64_t)];
 	uint32_t *sectors = (uint32_t *) header;
@@ -1227,39 +1425,40 @@ static int server_do_wreq(td_driver_t *driver)
 	// *sector);
 
 	if (len > sizeof(buf)) {
-		/* freak out! */
-		RPRINTF("write request too large: %d/%u\n", len, (unsigned)sizeof(buf));
-		return -1;
+		/* freak out! How to handle the remaining data from primary */
+		RPRINTF("write request too large: %d/%u\n",
+			len, (unsigned)sizeof(buf));
+		goto err;
 	}
 
 	if (mread(s->stream_fd.fd, buf, len) < 0)
 		goto err;
 
-	if (ramdisk_write(&s->ramdisk, *sector, *sectors, buf) < 0)
+	if (ramdisk_write(&s->ramdisk, *sector, *sectors, buf) < 0) {
+		rc = ERROR_INTERNAL;
 		goto err;
+	}
 
-	return 0;
+	return;
 
  err:
 	/* should start failover */
 	RPRINTF("backup write request error\n");
-	close_stream_fd(s);
-
-	return -1;
+	backup_failed(s, rc);
 }
 
-static int server_do_sreq(td_driver_t *driver)
+static void server_do_sreq(td_driver_t *driver)
 {
 	/*
 	  RPRINTF("submit request received\n");
   */
 
-	return 0;
+	return;
 }
 
 /* at this point, the server can start applying the most recent
  * ramdisk. */
-static int server_do_creq(td_driver_t *driver)
+static void server_do_creq(td_driver_t *driver)
 {
 	struct tdremus_state *s = (struct tdremus_state *)driver->data;
 
@@ -1269,9 +1468,7 @@ static int server_do_creq(td_driver_t *driver)
 
 	/* XXX this message should not be sent until flush completes! */
 	if (write(s->stream_fd.fd, TDREMUS_DONE, strlen(TDREMUS_DONE)) != 4)
-		return -1;
-
-	return 0;
+		backup_failed(s, ERROR_IO);
 }
 
 
@@ -1356,10 +1553,6 @@ static int unprotected_start(td_driver_t *driver)
 
 	RPRINTF("failure detected, activating passthrough\n");
 
-	/* close the server socket */
-	close_stream_fd(s);
-
-	close_server_fd(s);
 
 	/* install the unprotected read/write handlers */
 	tapdisk_remus.td_queue_read = unprotected_queue_read;
@@ -1486,6 +1679,19 @@ static int switch_mode(td_driver_t *driver, enum tdremus_mode mode)
 	return rc;
 }
 
+static void ctl_reopen(struct tdremus_state *s)
+{
+	ctl_unregister(s);
+	CLOSE_FD(s->ctl_fd.fd);
+	RPRINTF("FIFO closed\n");
+
+	if ((s->ctl_fd.fd = open(s->ctl_path, O_RDWR)) < 0) {
+		RPRINTF("error reopening FIFO: %d\n", errno);
+		return;
+	}
+	ctl_register(s);
+}
+
 static void ctl_request(event_id_t id, char mode, void *private)
 {
 	struct tdremus_state *s = (struct tdremus_state *)private;
@@ -1497,12 +1703,6 @@ static void ctl_request(event_id_t id, char mode, void *private)
 
 	if (!(rc = read(s->ctl_fd.fd, msg, sizeof(msg) - 1 /* append nul */))) {
 		RPRINTF("0-byte read received, reopening FIFO\n");
-		/*TODO: we may have to unregister/re-register with tapdisk_server */
-		close(s->ctl_fd.fd);
-		RPRINTF("FIFO closed\n");
-		if ((s->ctl_fd.fd = open(s->ctl_path, O_RDWR)) < 0) {
-			RPRINTF("error reopening FIFO: %d\n", errno);
-		}
 		return;
 	}
 
@@ -1641,10 +1841,11 @@ static int ctl_register(struct tdremus_state *s)
 	RPRINTF("registering ctl fifo\n");
 
 	/* register ctl fd */
-	s->ctl_fd.id = tapdisk_server_register_event(SCHEDULER_POLL_READ_FD, s->ctl_fd.fd, 0, ctl_request, s);
+	s->ctl_fd.cid = tapdisk_server_register_event(SCHEDULER_POLL_READ_FD, s->ctl_fd.fd, 0, ctl_request, s);
 
-	if (s->ctl_fd.id < 0) {
-		RPRINTF("error registering ctrl FIFO %s: %d\n", s->ctl_path, s->ctl_fd.id);
+	if (s->ctl_fd.cid < 0) {
+		RPRINTF("error registering ctrl FIFO %s: %d\n",
+			s->ctl_path, s->ctl_fd.cid);
 		return -1;
 	}
 
@@ -1655,10 +1856,7 @@ static void ctl_unregister(struct tdremus_state *s)
 {
 	RPRINTF("unregistering ctl fifo\n");
 
-	if (s->ctl_fd.id >= 0) {
-		tapdisk_server_unregister_event(s->ctl_fd.id);
-		s->ctl_fd.id = -1;
-	}
+	UNREGISTER_EVENT(s->ctl_fd.cid);
 }
 
 /* interface */
-- 
1.9.3


* [RFC Patch v2 36/45] switch to unprotected mode before closing
  2014-08-08  7:00 [RFC Patch v2 00/45] COarse-grain LOck-stepping Virtual Machines for Non-stop Service Wen Congyang
                   ` (34 preceding siblings ...)
  2014-08-08  7:01 ` [RFC Patch v2 35/45] blktap2: connect to backup asynchronously Wen Congyang
@ 2014-08-08  7:01 ` Wen Congyang
  2014-08-08  7:01 ` [RFC Patch v2 37/45] blktap2: move async connect related codes to block-replication.c Wen Congyang
                   ` (10 subsequent siblings)
  46 siblings, 0 replies; 64+ messages in thread
From: Wen Congyang @ 2014-08-08  7:01 UTC (permalink / raw)
  To: xen devel
  Cc: Ian Campbell, Wen Congyang, Ian Jackson, Jiang Yunhong,
	Dong Eddie, Yang Hongyang, Lai Jiangshan

To stop tapdisk2, the user does the following:
1. close the image
2. detach from the blktap device

If any I/O request is still pending, the close fails. In
primary mode, however, remus holds I/O requests in its queue
until the connection to the backup is established, so they
stay pending. Introduce a new callback, td_pre_close(), to
flush these queued I/O requests before closing.
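
The idea can be sketched in isolation as follows. This is only an
illustration, not the tapdisk code: struct request, struct driver,
queue_request(), flush_queued() and close_image() are hypothetical
stand-ins for td_request_t, td_driver_t, the remus request ring and
the close path in tapdisk-control.c, and QUEUE_CAP is an arbitrary
capacity. It shows an optional pre-close hook draining the driver's
queue, so that the "pending requests" check no longer fails.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define QUEUE_CAP 8	/* hypothetical capacity, not MAX_REMUS_REQUEST */

/* Hypothetical stand-in for td_request_t. */
struct request {
	int id;
	int completed;
};

struct driver;

/* Hypothetical stand-in for struct tap_disk: pre_close is optional. */
struct driver_ops {
	int (*pre_close)(struct driver *);	/* may be NULL */
	int (*close)(struct driver *);
};

/* Hypothetical stand-in for td_driver_t plus the remus request queue. */
struct driver {
	const struct driver_ops *ops;
	struct request *queued[QUEUE_CAP];
	size_t nr_queued;
};

/* Hold a request in the driver until the connection is established. */
static void queue_request(struct driver *d, struct request *req)
{
	assert(d->nr_queued < QUEUE_CAP);
	d->queued[d->nr_queued++] = req;
}

/* Pre-close hook: complete every queued request and empty the queue. */
static int flush_queued(struct driver *d)
{
	for (size_t i = 0; i < d->nr_queued; i++)
		d->queued[i]->completed = 1;
	d->nr_queued = 0;
	return 0;
}

static int do_close(struct driver *d)
{
	(void)d;
	return 0;
}

/*
 * Close path: run the optional pre-close hook first, then refuse to
 * close while any request is still pending.
 */
static int close_image(struct driver *d)
{
	if (d->ops->pre_close)
		d->ops->pre_close(d);

	if (d->nr_queued)
		return -EAGAIN;

	return d->ops->close(d);
}
```

With a pre_close hook installed, close_image() succeeds even if
requests were queued; without one, it keeps returning -EAGAIN for as
long as the queue is non-empty, which is the failure described above.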

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
 tools/blktap2/drivers/block-remus.c       | 15 +++++++++++++++
 tools/blktap2/drivers/tapdisk-control.c   |  6 ++++++
 tools/blktap2/drivers/tapdisk-interface.c | 18 ++++++++++++++++++
 tools/blktap2/drivers/tapdisk-interface.h |  1 +
 tools/blktap2/drivers/tapdisk-vbd.c       |  9 +++++++++
 tools/blktap2/drivers/tapdisk-vbd.h       |  1 +
 tools/blktap2/drivers/tapdisk.h           |  1 +
 7 files changed, 51 insertions(+)

diff --git a/tools/blktap2/drivers/block-remus.c b/tools/blktap2/drivers/block-remus.c
index c21f851..5d27d41 100644
--- a/tools/blktap2/drivers/block-remus.c
+++ b/tools/blktap2/drivers/block-remus.c
@@ -96,6 +96,7 @@ enum {
 	ERROR_INTERNAL = -1,
 	ERROR_IO = -2,
 	ERROR_CONNECTION = -3,
+	ERROR_CLOSE = -4,
 };
 
 struct tdremus_req {
@@ -810,6 +811,8 @@ static void primary_failed(struct tdremus_state *s, int rc)
 	close_stream_fd(s);
 	if (rc == ERROR_INTERNAL)
 		RPRINTF("switch to unprotected mode due to internal error");
+	if (rc == ERROR_CLOSE)
+		RPRINTF("switch to unprotected mode before closing");
 	switch_mode(s->tdremus_driver, mode_unprotected);
 }
 
@@ -1910,6 +1913,17 @@ static int tdremus_open(td_driver_t *driver, td_image_t *image, td_uuid_t uuid)
 	return -EIO;
 }
 
+static int tdremus_pre_close(td_driver_t *driver)
+{
+	struct tdremus_state *s = (struct tdremus_state *)driver->data;
+
+	if (s->mode != mode_primary)
+		return 0;
+
+	primary_failed(s, ERROR_CLOSE);
+	return 0;
+}
+
 static int tdremus_close(td_driver_t *driver)
 {
 	struct tdremus_state *s = (struct tdremus_state *)driver->data;
@@ -1944,6 +1958,7 @@ struct tap_disk tapdisk_remus = {
 	.td_open            = tdremus_open,
 	.td_queue_read      = unprotected_queue_read,
 	.td_queue_write     = unprotected_queue_write,
+	.td_pre_close       = tdremus_pre_close,
 	.td_close           = tdremus_close,
 	.td_get_parent_id   = tdremus_get_parent_id,
 	.td_validate_parent = tdremus_validate_parent,
diff --git a/tools/blktap2/drivers/tapdisk-control.c b/tools/blktap2/drivers/tapdisk-control.c
index 4e5f748..2fa4cbe 100644
--- a/tools/blktap2/drivers/tapdisk-control.c
+++ b/tools/blktap2/drivers/tapdisk-control.c
@@ -508,6 +508,12 @@ tapdisk_control_close_image(struct tapdisk_control_connection *connection,
 		goto out;
 	}
 
+	/*
+	 * Some I/O requests may still be queued in the driver;
+	 * flush them first.
+	 */
+	tapdisk_vbd_pre_close_vdi(vbd);
+
 	if (!list_empty(&vbd->pending_requests)) {
 		err = -EAGAIN;
 		goto out;
diff --git a/tools/blktap2/drivers/tapdisk-interface.c b/tools/blktap2/drivers/tapdisk-interface.c
index a29de64..ed92e12 100644
--- a/tools/blktap2/drivers/tapdisk-interface.c
+++ b/tools/blktap2/drivers/tapdisk-interface.c
@@ -105,6 +105,24 @@ td_open(td_image_t *image)
 }
 
 int
+td_pre_close(td_image_t *image)
+{
+	td_driver_t *driver;
+
+	driver = image->driver;
+	if (!driver)
+		return -ENODEV;
+
+	if (!driver->ops->td_pre_close)
+		return 0;
+
+	if (driver->refcnt && td_flag_test(driver->state, TD_DRIVER_OPEN))
+		driver->ops->td_pre_close(driver);
+
+	return 0;
+}
+
+int
 td_close(td_image_t *image)
 {
 	td_driver_t *driver;
diff --git a/tools/blktap2/drivers/tapdisk-interface.h b/tools/blktap2/drivers/tapdisk-interface.h
index adc4376..ba9b3ea 100644
--- a/tools/blktap2/drivers/tapdisk-interface.h
+++ b/tools/blktap2/drivers/tapdisk-interface.h
@@ -34,6 +34,7 @@
 int td_open(td_image_t *);
 int __td_open(td_image_t *, td_disk_info_t *);
 int td_load(td_image_t *);
+int td_pre_close(td_image_t *);
 int td_close(td_image_t *);
 int td_get_parent_id(td_image_t *, td_disk_id_t *);
 int td_validate_parent(td_image_t *, td_image_t *);
diff --git a/tools/blktap2/drivers/tapdisk-vbd.c b/tools/blktap2/drivers/tapdisk-vbd.c
index c665f27..aba545b 100644
--- a/tools/blktap2/drivers/tapdisk-vbd.c
+++ b/tools/blktap2/drivers/tapdisk-vbd.c
@@ -180,6 +180,15 @@ tapdisk_vbd_validate_chain(td_vbd_t *vbd)
 }
 
 void
+tapdisk_vbd_pre_close_vdi(td_vbd_t *vbd)
+{
+	td_image_t *image, *tmp;
+
+	tapdisk_vbd_for_each_image(vbd, image, tmp)
+		td_pre_close(image);
+}
+
+void
 tapdisk_vbd_close_vdi(td_vbd_t *vbd)
 {
 	td_image_t *image, *tmp;
diff --git a/tools/blktap2/drivers/tapdisk-vbd.h b/tools/blktap2/drivers/tapdisk-vbd.h
index be084b2..040f2b8 100644
--- a/tools/blktap2/drivers/tapdisk-vbd.h
+++ b/tools/blktap2/drivers/tapdisk-vbd.h
@@ -181,6 +181,7 @@ void tapdisk_vbd_free_stack(td_vbd_t *);
 int tapdisk_vbd_open_stack(td_vbd_t *, uint16_t, td_flag_t);
 int tapdisk_vbd_open_vdi(td_vbd_t *, const char *,
 			 uint16_t, uint16_t, td_flag_t);
+void tapdisk_vbd_pre_close_vdi(td_vbd_t *);
 void tapdisk_vbd_close_vdi(td_vbd_t *);
 
 int tapdisk_vbd_attach(td_vbd_t *, const char *, int);
diff --git a/tools/blktap2/drivers/tapdisk.h b/tools/blktap2/drivers/tapdisk.h
index 3c3b51d..16efd07 100644
--- a/tools/blktap2/drivers/tapdisk.h
+++ b/tools/blktap2/drivers/tapdisk.h
@@ -158,6 +158,7 @@ struct tap_disk {
 	td_flag_t                    flags;
 	int                          private_data_size;
 	int (*td_open)               (td_driver_t *, td_image_t *, td_uuid_t);
+	int (*td_pre_close)          (td_driver_t *);
 	int (*td_close)              (td_driver_t *);
 	int (*td_get_parent_id)      (td_driver_t *, td_disk_id_t *);
 	int (*td_validate_parent)    (td_driver_t *, td_driver_t *, td_flag_t);
-- 
1.9.3

* [RFC Patch v2 37/45] blktap2: move async connect related codes to block-replication.c
  2014-08-08  7:00 [RFC Patch v2 00/45] COarse-grain LOck-stepping Virtual Machines for Non-stop Service Wen Congyang
                   ` (35 preceding siblings ...)
  2014-08-08  7:01 ` [RFC Patch v2 36/45] switch to unprotected mode before closing Wen Congyang
@ 2014-08-08  7:01 ` Wen Congyang
  2014-08-08  7:01 ` [RFC Patch v2 38/45] blktap2: move ramdisk " Wen Congyang
                   ` (9 subsequent siblings)
  46 siblings, 0 replies; 64+ messages in thread
From: Wen Congyang @ 2014-08-08  7:01 UTC (permalink / raw)
  To: xen devel
  Cc: Ian Campbell, Wen Congyang, Ian Jackson, Jiang Yunhong,
	Dong Eddie, Yang Hongyang, Lai Jiangshan

Move the async connect code out of block-remus.c into block-replication.c
so that COLO can reuse it.

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
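Note on the connection state: with this patch the callers stop looking at
cid/rid/wid and instead use td_replication_connect_status(), which reports
-1 when no connection exists (never started, or closed), 0 while a
connection is in progress and 1 once it is established. Below is a minimal
standalone sketch of that mapping and of the decision primary_queue_write()
takes from it. The enum mirrors the one in block-replication.c;
sketch_status(), sketch_action() and the ACTION_* names are hypothetical and
only serve this illustration.

```c
#include <assert.h>

/* Mirrors the connection states introduced in block-replication.c. */
enum conn_state {
	connection_none,
	connection_in_progress,
	connection_established,
	connection_closed,
};

/* Hypothetical names: what the write path does with a request. */
enum action {
	ACTION_CONNECT_AND_QUEUE,	/* start connecting, queue the request */
	ACTION_QUEUE,			/* connect pending, queue the request */
	ACTION_FORWARD,			/* connection is up, forward the request */
};

/* Same mapping as td_replication_connect_status(): -1, 0, 1, or -2. */
static int sketch_status(int state)
{
	switch (state) {
	case connection_none:
	case connection_closed:
		return -1;
	case connection_in_progress:
		return 0;
	case connection_established:
		return 1;
	default:
		return -2;	/* unknown state */
	}
}

/* The decision primary_queue_write() makes from the status value. */
static enum action sketch_action(int state)
{
	int ret = sketch_status(state);

	if (ret == -1)
		return ACTION_CONNECT_AND_QUEUE;
	if (ret != 1)
		return ACTION_QUEUE;
	return ACTION_FORWARD;
}
```

In this sketch an unknown state (-2) also ends up queuing the request,
which matches the "ret != 1" test in primary_queue_write().
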
 tools/blktap2/drivers/Makefile            |   2 +-
 tools/blktap2/drivers/block-remus.c       | 494 +++---------------------------
 tools/blktap2/drivers/block-replication.c | 468 ++++++++++++++++++++++++++++
 tools/blktap2/drivers/block-replication.h | 113 +++++++
 4 files changed, 630 insertions(+), 447 deletions(-)
 create mode 100644 tools/blktap2/drivers/block-replication.c
 create mode 100644 tools/blktap2/drivers/block-replication.h
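
The client side moved into block-replication.c follows the usual
non-blocking connect sequence: make the socket non-blocking, call connect(),
and if it reports EINPROGRESS wait for the fd to become writable and read
the final result with getsockopt(SO_ERROR). A rough sketch of that sequence,
assuming plain poll() in place of the tapdisk scheduler; all sketch_* names
are hypothetical and are not part of this series:

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/*
 * Start a non-blocking connect. Returns the fd, or -1 on a local error.
 * *in_progress is set to 1 when the result must be awaited.
 */
static int sketch_connect_start(const struct sockaddr_in *sa, int *in_progress)
{
	int fd, flags;

	assert(sa != NULL && in_progress != NULL);

	fd = socket(AF_INET, SOCK_STREAM, 0);
	if (fd < 0)
		return -1;

	/* make socket nonblocking */
	flags = fcntl(fd, F_GETFL, 0);
	if (flags == -1)
		flags = 0;
	if (fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1) {
		close(fd);
		return -1;
	}

	*in_progress = 0;
	if (connect(fd, (const struct sockaddr *)sa, sizeof(*sa)) < 0) {
		if (errno != EINPROGRESS) {
			close(fd);
			return -1;
		}
		*in_progress = 1;
	}
	return fd;
}

/*
 * Wait until fd is writable, then fetch the connect result.
 * Returns 0 on success, the errno value reported by SO_ERROR on a
 * failed connect, or -1 on timeout or local error.
 */
static int sketch_connect_finish(int fd, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLOUT };
	int err = 0;
	socklen_t len = sizeof(err);

	if (poll(&pfd, 1, timeout_ms) <= 0)
		return -1;
	if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) < 0)
		return -1;
	return err;
}

/* Demo helper: listen on an ephemeral loopback port, fill in *sa. */
static int sketch_listen_loopback(struct sockaddr_in *sa)
{
	socklen_t len = sizeof(*sa);
	int fd = socket(AF_INET, SOCK_STREAM, 0);

	if (fd < 0)
		return -1;
	memset(sa, 0, sizeof(*sa));
	sa->sin_family = AF_INET;
	sa->sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	if (bind(fd, (struct sockaddr *)sa, sizeof(*sa)) < 0 ||
	    listen(fd, 1) < 0 ||
	    getsockname(fd, (struct sockaddr *)sa, &len) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}
```

A caller could retry when sketch_connect_finish() returns ECONNREFUSED,
ENETUNREACH, EAGAIN, ECONNABORTED or ETIMEDOUT, which is the set that
check_connect_errno() treats as recoverable.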

diff --git a/tools/blktap2/drivers/Makefile b/tools/blktap2/drivers/Makefile
index 1129ca1..c7a2ca4 100644
--- a/tools/blktap2/drivers/Makefile
+++ b/tools/blktap2/drivers/Makefile
@@ -23,7 +23,7 @@ endif
 
 VHDLIBS    := -L$(LIBVHDDIR) -lvhd
 
-REMUS-OBJS  := block-remus.o
+REMUS-OBJS  := block-remus.o block-replication.o
 REMUS-OBJS  += hashtable.o
 REMUS-OBJS  += hashtable_itr.o
 REMUS-OBJS  += hashtable_utility.o
diff --git a/tools/blktap2/drivers/block-remus.c b/tools/blktap2/drivers/block-remus.c
index 5d27d41..8b6f157 100644
--- a/tools/blktap2/drivers/block-remus.c
+++ b/tools/blktap2/drivers/block-remus.c
@@ -40,6 +40,7 @@
 #include "hashtable.h"
 #include "hashtable_itr.h"
 #include "hashtable_utility.h"
+#include "block-replication.h"
 
 #include <errno.h>
 #include <inttypes.h>
@@ -49,10 +50,7 @@
 #include <string.h>
 #include <sys/time.h>
 #include <sys/types.h>
-#include <sys/socket.h>
-#include <netdb.h>
 #include <netinet/in.h>
-#include <arpa/inet.h>
 #include <sys/param.h>
 #include <sys/sysctl.h>
 #include <unistd.h>
@@ -67,22 +65,6 @@
 
 #define RPRINTF(_f, _a...) syslog (LOG_DEBUG, "remus: " _f, ## _a)
 
-#define UNREGISTER_EVENT(id)					\
-	do {							\
-		if (id >= 0) {					\
-			tapdisk_server_unregister_event(id);	\
-			id = -1;				\
-		}						\
-	} while (0)
-
-#define CLOSE_FD(fd)			\
-	do {				\
-		if (fd >= 0) {		\
-			close(fd);	\
-			fd = -1;	\
-		}			\
-	} while (0)
-
 #define MAX_REMUS_REQUEST       TAPDISK_DATA_REQUESTS
 
 enum tdremus_mode {
@@ -92,13 +74,6 @@ enum tdremus_mode {
 	mode_backup
 };
 
-enum {
-	ERROR_INTERNAL = -1,
-	ERROR_IO = -2,
-	ERROR_CONNECTION = -3,
-	ERROR_CLOSE = -4,
-};
-
 struct tdremus_req {
 	td_request_t treq;
 };
@@ -167,21 +142,9 @@ struct ramdisk_write_cbdata {
 
 typedef void (*queue_rw_t) (td_driver_t *driver, td_request_t treq);
 
-/*
- * If cid, rid and wid are -1, fd must be -1. It means that
- * we are in unpritected mode or we don't start to connect
- * to backup.
- * If fd is an valid fd:
- *  cid is valid, rid and wid must be invalid. It means that
- *      the connection is in progress.
- *  cid is invalid. rid or wid must be valid. It means that
- *      the connection is established.
- */
 typedef struct poll_fd {
 	int        fd;
-	event_id_t cid;
-	event_id_t rid;
-	event_id_t wid;
+	event_id_t id;
 } poll_fd_t;
 
 struct tdremus_state {
@@ -195,9 +158,7 @@ struct tdremus_state {
 	char*     msg_path; /* output completion message here */
 	poll_fd_t msg_fd;
 
-  /* replication host */
-	struct sockaddr_in sa;
-	poll_fd_t server_fd;    /* server listen port */
+	td_replication_connect_t t;
 	poll_fd_t stream_fd;     /* replication channel */
 
 	/*
@@ -777,28 +738,8 @@ static int mwrite(int fd, void* buf, size_t len)
 	select(fd + 1, NULL, &wfds, NULL, &tv);
 }
 
-
-static void inline close_stream_fd(struct tdremus_state *s)
-{
-
-	UNREGISTER_EVENT(s->stream_fd.cid);
-	UNREGISTER_EVENT(s->stream_fd.rid);
-	UNREGISTER_EVENT(s->stream_fd.wid);
-
-	/* close the connection */
-	CLOSE_FD(s->stream_fd.fd);
-}
-
-static void close_server_fd(struct tdremus_state *s)
-{
-	UNREGISTER_EVENT(s->server_fd.cid);
-	CLOSE_FD(s->server_fd.fd);
-}
-
 /* primary functions */
-static void remus_client_event(event_id_t, char mode, void *private);
-static void remus_connect_event(event_id_t id, char mode, void *private);
-static void remus_retry_connect_event(event_id_t id, char mode, void *private);
+static void remus_client_event(event_id_t id, char mode, void *private);
 static int primary_forward_request(struct tdremus_state *s,
 				   const td_request_t *treq);
 
@@ -808,56 +749,15 @@ static int primary_forward_request(struct tdremus_state *s,
  */
 static void primary_failed(struct tdremus_state *s, int rc)
 {
-	close_stream_fd(s);
+	td_replication_connect_kill(&s->t);
 	if (rc == ERROR_INTERNAL)
 		RPRINTF("switch to unprotected mode due to internal error");
 	if (rc == ERROR_CLOSE)
 		RPRINTF("switch to unprotected mode before closing");
+	UNREGISTER_EVENT(s->stream_fd.id);
 	switch_mode(s->tdremus_driver, mode_unprotected);
 }
 
-static int primary_do_connect(struct tdremus_state *state)
-{
-	event_id_t id;
-	int fd;
-	int rc;
-	int flags;
-
-	RPRINTF("client connecting to %s:%d...\n",
-		inet_ntoa(state->sa.sin_addr), ntohs(state->sa.sin_port));
-
-	if ((fd = socket(PF_INET, SOCK_STREAM, 0)) < 0) {
-		RPRINTF("could not create client socket: %d\n", errno);
-		return ERROR_INTERNAL;
-	}
-	state->stream_fd.fd = fd;
-
-	/* make socket nonblocking */
-	if ((flags = fcntl(fd, F_GETFL, 0)) == -1)
-		flags = 0;
-	if (fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1) {
-		RPRINTF("error setting fd %d to non block mode\n", fd);
-		return ERROR_INTERNAL;
-	}
-
-	/*
-	 * once we have created the socket and populated the address,
-	 * we can now start our non-blocking connect. rather than
-	 * duplicating code we trigger a timeout on the socket fd,
-	 * which calls out nonblocking connect code
-	 */
-	if((id = tapdisk_server_register_event(SCHEDULER_POLL_TIMEOUT, fd, 0,
-					       remus_retry_connect_event,
-					       state)) < 0) {
-		RPRINTF("error registering timeout client connection event handler: %s\n",
-			strerror(id));
-		return ERROR_INTERNAL;
-	}
-
-	state->stream_fd.cid = id;
-	return 0;
-}
-
 static int remus_handle_queued_io(struct tdremus_state *s)
 {
 	struct req_ring *queued_io = &s->queued_io;
@@ -882,184 +782,35 @@ static int remus_handle_queued_io(struct tdremus_state *s)
 	return 0;
 }
 
-static int remus_connection_done(struct tdremus_state *s)
+static void remus_client_established(td_replication_connect_t *t, int rc)
 {
+	struct tdremus_state *s = CONTAINER_OF(t, *s, t);
 	event_id_t id;
 
-	/* the connect succeeded */
-	/* unregister this function and register a new event handler */
-	tapdisk_server_unregister_event(s->stream_fd.cid);
-	s->stream_fd.cid = -1;
+	if (rc) {
+		primary_failed(s, rc);
+		return;
+	}
 
-	id = tapdisk_server_register_event(SCHEDULER_POLL_READ_FD, s->stream_fd.fd,
+	/* the connect succeeded */
+	id = tapdisk_server_register_event(SCHEDULER_POLL_READ_FD, t->fd,
 					   0, remus_client_event, s);
 	if(id < 0) {
 		RPRINTF("error registering client event handler: %s\n",
 			strerror(id));
-		return ERROR_INTERNAL;
-	}
-	s->stream_fd.rid = id;
-
-	/* handle the queued requests */
-	return remus_handle_queued_io(s);
-}
-
-static int remus_retry_connect(struct tdremus_state *s)
-{
-	event_id_t id;
-
-	tapdisk_server_unregister_event(s->stream_fd.cid);
-	s->stream_fd.cid = -1;
-
-	RPRINTF("connect to backup 1 second later");
-	id = tapdisk_server_register_event(SCHEDULER_POLL_TIMEOUT,
-					   s->stream_fd.fd,
-					   REMUS_CONNRETRY_TIMEOUT,
-					   remus_retry_connect_event, s);
-	if (id < 0) {
-		RPRINTF("error registering timeout client connection event handler: %s\n",
-			strerror(id));
-		return ERROR_INTERNAL;
-	}
-
-	s->stream_fd.cid = id;
-	return 0;
-}
-
-static int remus_wait_connect_done(struct tdremus_state *s)
-{
-	event_id_t id;
-
-	tapdisk_server_unregister_event(s->stream_fd.cid);
-	s->stream_fd.cid = -1;
-
-	id = tapdisk_server_register_event(SCHEDULER_POLL_WRITE_FD,
-					   s->stream_fd.fd, 0,
-					   remus_connect_event, s);
-	if (id < 0) {
-		RPRINTF("error registering client connection event handler: %s\n",
-			strerror(id));
-		return ERROR_INTERNAL;
-	}
-	s->stream_fd.cid = id;
-
-	return 0;
-}
-
-/* return 1 if we need to reconnect to backup */
-static int check_connect_errno(int err)
-{
-	/*
-	 * The fd is non-block, so we will not get ETIMEDOUT
-	 * after calling connect(). We only can get this errno
-	 * by getsockopt().
-	 */
-	if (err == ECONNREFUSED || err == ENETUNREACH ||
-	    err == EAGAIN || err == ECONNABORTED ||
-	    err == ETIMEDOUT)
-	    return 1;
-
-	return 0;
-}
-
-static void remus_retry_connect_event(event_id_t id, char mode, void *private)
-{
-	struct tdremus_state *s = (struct tdremus_state *)private;
-	int rc, ret;
-
-	/* do a non-blocking connect */
-	ret = connect(s->stream_fd.fd,
-		      (struct sockaddr *)&s->sa,
-		      sizeof(s->sa));
-	if (ret) {
-		if (errno == EINPROGRESS) {
-			/*
-			 * the connect returned EINPROGRESS (nonblocking
-			 * connect) we must wait for the fd to be writeable
-			 * to determine if the connect worked
-			 */
-			rc = remus_wait_connect_done(s);
-			if (rc)
-				goto fail;
-			return;
-		}
-
-		if (check_connect_errno(errno)) {
-			rc = remus_retry_connect(s);
-			if (rc)
-				goto fail;
-			return;
-		}
-
-		/* not recoverable */
-		RPRINTF("error connection to server %s\n", strerror(errno));
-		rc = ERROR_CONNECTION;
-		goto fail;
-	}
-
-	/* The connection is established unexpectedly */
-	rc = remus_connection_done(s);
-	if (rc)
-		goto fail;
-
-	return;
-
-fail:
-	primary_failed(s, rc);
-	return;
-}
-
-/* callback when nonblocking connect() is finished */
-static void remus_connect_event(event_id_t id, char mode, void *private)
-{
-	int socket_errno;
-	socklen_t socket_errno_size;
-	struct tdremus_state *s = (struct tdremus_state *)private;
-	int rc;
-
-	/* check to see if the connect succeeded */
-	socket_errno_size = sizeof(socket_errno);
-	if (getsockopt(s->stream_fd.fd, SOL_SOCKET, SO_ERROR,
-		       &socket_errno, &socket_errno_size)) {
-		RPRINTF("error getting socket errno\n");
+		primary_failed(s, ERROR_INTERNAL);
 		return;
 	}
 
-	RPRINTF("socket connect returned %d\n", socket_errno);
+	s->stream_fd.fd = t->fd;
+	s->stream_fd.id = id;
 
-	if (socket_errno) {
-		/* the connect did not succeed */
-		if (check_connect_errno(socket_errno)) {
-			/*
-			 * we can probably assume that the backup is down.
-			 * just try again later
-			 */
-			rc = remus_retry_connect(s);
-			if (rc)
-				goto fail;
-
-			return;
-		} else {
-			RPRINTF("socket connect returned %d, giving up\n",
-				socket_errno);
-			rc = ERROR_CONNECTION;
-			goto fail;
-		}
-
-		return;
-	}
-
-	rc = remus_connection_done(s);
+	/* handle the queued requests */
+	rc = remus_handle_queued_io(s);
 	if (rc)
-		goto fail;
-
-	return;
-
-fail:
-	primary_failed(s, rc);
+		primary_failed(s, rc);
 }
 
-
 /*
  * we install this event handler on the primary once we have
  * connected to the backup.
@@ -1142,19 +893,21 @@ static int primary_forward_request(struct tdremus_state *s,
 static void primary_queue_write(td_driver_t *driver, td_request_t treq)
 {
 	struct tdremus_state *s = (struct tdremus_state *)driver->data;
-	int rc;
+	int rc, ret;
 
 	// RPRINTF("write: stream_fd.fd: %d\n", s->stream_fd.fd);
 
-	if(s->stream_fd.fd < 0) {
+	ret = td_replication_connect_status(&s->t);
+	if(ret == -1) {
 		RPRINTF("connecting to backup...\n");
-		rc = primary_do_connect(s);
+		s->t.callback = remus_client_established;
+		rc = td_replication_client_start(&s->t);
 		if (rc)
 			goto fail;
 	}
 
 	/* The connection is not established, just queue the request */
-	if (s->stream_fd.cid >= 0) {
+	if (ret != 1) {
 		ring_add_request(&s->queued_io, &treq);
 		return;
 	}
@@ -1227,9 +980,7 @@ static int primary_start(td_driver_t *driver)
 	s->queue_flush = primary_flush;
 
 	s->stream_fd.fd = -1;
-	s->stream_fd.cid = -1;
-	s->stream_fd.rid = -1;
-	s->stream_fd.wid = -1;
+	s->stream_fd.id = -1;
 
 	return 0;
 }
@@ -1240,100 +991,32 @@ static void remus_server_event(event_id_t id, char mode, void *private);
 /* It is called when we find some I/O error */
 static void backup_failed(struct tdremus_state *s, int rc)
 {
-	close_stream_fd(s);
-	close_server_fd(s);
+	td_replication_connect_kill(&s->t);
 	/* We will switch to unprotected mode in backup_queue_write() */
 }
 
 /* returns the socket that receives write requests */
-static void remus_server_accept(event_id_t id, char mode, void* private)
+static void remus_server_established(td_replication_connect_t *t, int rc)
 {
-	struct tdremus_state* s = (struct tdremus_state *) private;
-
-	int stream_fd;
-
-	/* XXX: add address-based black/white list */
-	if ((stream_fd = accept(s->server_fd.fd, NULL, NULL)) < 0) {
-		RPRINTF("error accepting connection: %d\n", errno);
-		return;
-	}
+	struct tdremus_state *s = CONTAINER_OF(t, *s, t);
+	event_id_t id;
 
-	/*
-	 * TODO: check to see if we are already replicating.
-	 * if so just close the connection (or do something
-	 * smarter)
-	 */
-	RPRINTF("server accepted connection\n");
+	/* rc is always 0 */
 
 	/* add tapdisk event for replication stream */
-	id = tapdisk_server_register_event(SCHEDULER_POLL_READ_FD, stream_fd, 0,
+	id = tapdisk_server_register_event(SCHEDULER_POLL_READ_FD, t->fd, 0,
 					   remus_server_event, s);
 
 	if (id < 0) {
 		RPRINTF("error registering connection event handler: %s\n",
 			strerror(errno));
-		close(stream_fd);
+		td_replication_server_restart(t);
 		return;
 	}
 
 	/* store replication file descriptor */
-	s->stream_fd.fd = stream_fd;
-	s->stream_fd.rid = id;
-}
-
-/* returns -2 if EADDRNOTAVAIL */
-static int remus_bind(struct tdremus_state* s)
-{
-	int opt;
-	int rc = -1;
-	event_id_t id;
-
-	if ((s->server_fd.fd = socket(AF_INET, SOCK_STREAM, 0)) < 0) {
-		RPRINTF("could not create server socket: %d\n", errno);
-		return rc;
-	}
-
-	opt = 1;
-	if (setsockopt(s->server_fd.fd, SOL_SOCKET,
-		       SO_REUSEADDR, &opt, sizeof(opt)) < 0)
-		RPRINTF("Error setting REUSEADDR on %d: %d\n",
-			s->server_fd.fd, errno);
-
-	if (bind(s->server_fd.fd, (struct sockaddr *)&s->sa,
-		 sizeof(s->sa)) < 0) {
-		RPRINTF("could not bind server socket %d to %s:%d: %d %s\n",
-			s->server_fd.fd, inet_ntoa(s->sa.sin_addr),
-			ntohs(s->sa.sin_port), errno, strerror(errno));
-		if (errno == EADDRNOTAVAIL)
-			rc = -2;
-		goto err_sfd;
-	}
-
-	if (listen(s->server_fd.fd, 10)) {
-		RPRINTF("could not listen on socket: %d\n", errno);
-		goto err_sfd;
-	}
-
-	/*
-	 * The socket s now bound to the address and listening so we
-	 * may now register the fd with tapdisk
-	 */
-	id =  tapdisk_server_register_event(SCHEDULER_POLL_READ_FD,
-					    s->server_fd.fd, 0,
-					    remus_server_accept, s);
-	if (id < 0) {
-		RPRINTF("error registering server connection event handler: %s",
-			strerror(id));
-		goto err_sfd;
-	}
-	s->server_fd.cid = id;
-
-	return 0;
-
-err_sfd:
-	CLOSE_FD(s->server_fd.fd);
-
-	return rc;
+	s->stream_fd.fd = t->fd;
+	s->stream_fd.id = id;
 }
 
 /* wait for latest checkpoint to be applied */
@@ -1566,90 +1249,6 @@ static int unprotected_start(td_driver_t *driver)
 
 
 /* control */
-
-static inline int resolve_address(const char* addr, struct in_addr* ia)
-{
-	struct hostent* he;
-	uint32_t ip;
-
-	if (!(he = gethostbyname(addr))) {
-		RPRINTF("error resolving %s: %d\n", addr, h_errno);
-		return -1;
-	}
-
-	if (!he->h_addr_list[0]) {
-		RPRINTF("no address found for %s\n", addr);
-		return -1;
-	}
-
-	/* network byte order */
-	ip = *((uint32_t**)he->h_addr_list)[0];
-	ia->s_addr = ip;
-
-	return 0;
-}
-
-static int get_args(td_driver_t *driver, const char* name)
-{
-	struct tdremus_state *state = (struct tdremus_state *)driver->data;
-	char* host;
-	char* port;
-//  char* driver_str;
-//  char* parent;
-//  int type;
-//  char* path;
-//  unsigned long ulport;
-//  int i;
-//  struct sockaddr_in server_addr_in;
-
-	int gai_status;
-	int valid_addr;
-	struct addrinfo gai_hints;
-	struct addrinfo *servinfo, *servinfo_itr;
-
-	memset(&gai_hints, 0, sizeof gai_hints);
-	gai_hints.ai_family = AF_UNSPEC;
-	gai_hints.ai_socktype = SOCK_STREAM;
-
-	port = strchr(name, ':');
-	if (!port) {
-		RPRINTF("missing host in %s\n", name);
-		return -ENOENT;
-	}
-	if (!(host = strndup(name, port - name))) {
-		RPRINTF("unable to allocate host\n");
-		return -ENOMEM;
-	}
-	port++;
-
-	if ((gai_status = getaddrinfo(host, port, &gai_hints, &servinfo)) != 0) {
-		RPRINTF("getaddrinfo error: %s\n", gai_strerror(gai_status));
-		return -ENOENT;
-	}
-
-	/* TODO: do something smarter here */
-	valid_addr = 0;
-	for(servinfo_itr = servinfo; servinfo_itr != NULL; servinfo_itr = servinfo_itr->ai_next) {
-		void *addr;
-		char *ipver;
-
-		if (servinfo_itr->ai_family == AF_INET) {
-			valid_addr = 1;
-			memset(&state->sa, 0, sizeof(state->sa));
-			state->sa = *(struct sockaddr_in *)servinfo_itr->ai_addr;
-			break;
-		}
-	}
-	freeaddrinfo(servinfo);
-
-	if (!valid_addr)
-		return -ENOENT;
-
-	RPRINTF("host: %s, port: %d\n", inet_ntoa(state->sa.sin_addr), ntohs(state->sa.sin_port));
-
-	return 0;
-}
-
 static int switch_mode(td_driver_t *driver, enum tdremus_mode mode)
 {
 	struct tdremus_state *s = (struct tdremus_state *)driver->data;
@@ -1844,11 +1443,11 @@ static int ctl_register(struct tdremus_state *s)
 	RPRINTF("registering ctl fifo\n");
 
 	/* register ctl fd */
-	s->ctl_fd.cid = tapdisk_server_register_event(SCHEDULER_POLL_READ_FD, s->ctl_fd.fd, 0, ctl_request, s);
+	s->ctl_fd.id = tapdisk_server_register_event(SCHEDULER_POLL_READ_FD, s->ctl_fd.fd, 0, ctl_request, s);
 
-	if (s->ctl_fd.cid < 0) {
+	if (s->ctl_fd.id < 0) {
 		RPRINTF("error registering ctrl FIFO %s: %d\n",
-			s->ctl_path, s->ctl_fd.cid);
+			s->ctl_path, s->ctl_fd.id);
 		return -1;
 	}
 
@@ -1859,7 +1458,7 @@ static void ctl_unregister(struct tdremus_state *s)
 {
 	RPRINTF("unregistering ctl fifo\n");
 
-	UNREGISTER_EVENT(s->ctl_fd.cid);
+	UNREGISTER_EVENT(s->ctl_fd.id);
 }
 
 /* interface */
@@ -1867,6 +1466,7 @@ static void ctl_unregister(struct tdremus_state *s)
 static int tdremus_open(td_driver_t *driver, td_image_t *image, td_uuid_t uuid)
 {
 	struct tdremus_state *s = (struct tdremus_state *)driver->data;
+	td_replication_connect_t *t = &s->t;
 	int rc;
 	const char *name = image->name;
 	td_flag_t flags = image->flags;
@@ -1877,7 +1477,6 @@ static int tdremus_open(td_driver_t *driver, td_image_t *image, td_uuid_t uuid)
 	remus_image = image;
 
 	memset(s, 0, sizeof(*s));
-	s->server_fd.fd = -1;
 	s->stream_fd.fd = -1;
 	s->ctl_fd.fd = -1;
 	s->msg_fd.fd = -1;
@@ -1886,8 +1485,12 @@ static int tdremus_open(td_driver_t *driver, td_image_t *image, td_uuid_t uuid)
 	 * the driver stack from the stream_fd event handler */
 	s->tdremus_driver = driver;
 
+	t->log_prefix = "remus";
+	t->retry_timeout_s = REMUS_CONNRETRY_TIMEOUT;
+	t->max_connections = 10;
+	t->callback = remus_server_established;
 	/* parse name to get info etc */
-	if ((rc = get_args(driver, name)))
+	if ((rc = td_replication_connect_init(t, name)))
 		return rc;
 
 	if ((rc = ctl_open(driver, name))) {
@@ -1901,7 +1504,7 @@ static int tdremus_open(td_driver_t *driver, td_image_t *image, td_uuid_t uuid)
 		return rc;
 	}
 
-	if (!(rc = remus_bind(s)))
+	if (!(rc = td_replication_server_start(t)))
 		rc = switch_mode(driver, mode_backup);
 	else if (rc == -2)
 		rc = switch_mode(driver, mode_primary);
@@ -1932,8 +1535,7 @@ static int tdremus_close(td_driver_t *driver)
 	if (s->ramdisk.inprogress)
 		hashtable_destroy(s->ramdisk.inprogress, 0);
 
-	close_server_fd(s);
-	close_stream_fd(s);
+	td_replication_connect_kill(&s->t);
 	ctl_unregister(s);
 	ctl_close(s);
 
diff --git a/tools/blktap2/drivers/block-replication.c b/tools/blktap2/drivers/block-replication.c
new file mode 100644
index 0000000..e4b2679
--- /dev/null
+++ b/tools/blktap2/drivers/block-replication.c
@@ -0,0 +1,468 @@
+/*
+ * Copyright (C) 2014 FUJITSU LIMITED
+ * Author: Wen Congyang <wency@cn.fujitsu.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU Lesser General Public License as published
+ * by the Free Software Foundation; version 2.1 only. with the special
+ * exception on linking described in file LICENSE.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU Lesser General Public License for more details.
+ */
+
+#include "tapdisk-server.h"
+#include "block-replication.h"
+
+#include <string.h>
+#include <errno.h>
+#include <sys/types.h>
+#include <unistd.h>
+#include <fcntl.h>
+#include <syslog.h>
+#include <stdlib.h>
+#include <arpa/inet.h>
+
+#undef DPRINTF
+#undef EPRINTF
+#define DPRINTF(_f, _a...) syslog (LOG_DEBUG, "%s: " _f, log_prefix, ## _a)
+#define EPRINTF(_f, _a...) syslog (LOG_ERR, "%s: " _f, log_prefix, ## _a)
+
+/* connection status */
+enum {
+	connection_none,
+	connection_in_progress,
+	connection_established,
+	connection_closed,
+};
+
+/* common functions */
+/* args should be host:port */
+static int get_args(td_replication_connect_t *t, const char* name)
+{
+	char* host;
+	const char* port;
+	int gai_status;
+	int valid_addr;
+	struct addrinfo gai_hints;
+	struct addrinfo *servinfo, *servinfo_itr;
+	const char *log_prefix = t->log_prefix;
+
+	memset(&gai_hints, 0, sizeof gai_hints);
+	gai_hints.ai_family = AF_UNSPEC;
+	gai_hints.ai_socktype = SOCK_STREAM;
+
+	port = strchr(name, ':');
+	if (!port) {
+		EPRINTF("missing port in %s\n", name);
+		return -ENOENT;
+	}
+	if (!(host = strndup(name, port - name))) {
+		EPRINTF("unable to allocate host\n");
+		return -ENOMEM;
+	}
+	port++;
+	if ((gai_status = getaddrinfo(host, port,
+				      &gai_hints, &servinfo)) != 0) {
+		EPRINTF("getaddrinfo error: %s\n", gai_strerror(gai_status));
+		free(host);
+		return -ENOENT;
+	}
+	free(host);
+
+	/* TODO: do something smarter here */
+	valid_addr = 0;
+	for (servinfo_itr = servinfo; servinfo_itr != NULL;
+	     servinfo_itr = servinfo_itr->ai_next) {
+		if (servinfo_itr->ai_family == AF_INET) {
+			valid_addr = 1;
+			memset(&t->sa, 0, sizeof(t->sa));
+			t->sa = *(struct sockaddr_in *)servinfo_itr->ai_addr;
+			break;
+		}
+	}
+	freeaddrinfo(servinfo);
+
+	if (!valid_addr)
+		return -ENOENT;
+
+	DPRINTF("host: %s, port: %d\n", inet_ntoa(t->sa.sin_addr),
+		ntohs(t->sa.sin_port));
+
+	return 0;
+}
+
+int td_replication_connect_init(td_replication_connect_t *t, const char *name)
+{
+	int rc;
+
+	rc = get_args(t, name);
+	if (rc)
+		return rc;
+
+	t->fd = t->listen_fd = -1;
+	t->id = -1;
+	t->status = connection_none;
+	return 0;
+}
+
+int td_replication_connect_status(td_replication_connect_t *t)
+{
+	const char *log_prefix = t->log_prefix;
+
+	switch (t->status) {
+	case connection_none:
+	case connection_closed:
+		return -1;
+	case connection_in_progress:
+		return 0;
+	case connection_established:
+		return 1;
+	default:
+		EPRINTF("td_replication_connect is corrupted\n");
+		return -2;
+	}
+}
+
+void td_replication_connect_kill(td_replication_connect_t *t)
+{
+	if (t->status != connection_in_progress &&
+	    t->status != connection_established)
+		return;
+
+	UNREGISTER_EVENT(t->id);
+	CLOSE_FD(t->fd);
+	CLOSE_FD(t->listen_fd);
+	t->status = connection_closed;
+}
+
+/* server */
+static void td_replication_server_accept(event_id_t id, char mode,
+					 void *private);
+
+int td_replication_server_start(td_replication_connect_t *t)
+{
+	int opt;
+	int rc = -1;
+	event_id_t id;
+	int fd;
+	const char *log_prefix = t->log_prefix;
+
+	if (t->status == connection_in_progress ||
+	    t->status == connection_established)
+		return rc;
+
+	fd = socket(AF_INET, SOCK_STREAM, 0);
+	if (fd < 0) {
+		EPRINTF("could not create server socket: %d\n", errno);
+		return rc;
+	}
+
+	opt = 1;
+	if (setsockopt(fd, SOL_SOCKET,
+		       SO_REUSEADDR, &opt, sizeof(opt)) < 0)
+		DPRINTF("Error setting REUSEADDR on %d: %d\n", fd, errno);
+
+	if (bind(fd, (struct sockaddr *)&t->sa, sizeof(t->sa)) < 0) {
+		DPRINTF("could not bind server socket %d to %s:%d: %d %s\n",
+			fd, inet_ntoa(t->sa.sin_addr),
+			ntohs(t->sa.sin_port), errno, strerror(errno));
+		if (errno == EADDRNOTAVAIL)
+			rc = -2;
+		goto err;
+	}
+
+	if (listen(fd, t->max_connections)) {
+		EPRINTF("could not listen on socket: %d\n", errno);
+		goto err;
+	}
+
+	/*
+	 * The socket is now bound to the address and listening so we
+	 * may now register the fd with tapdisk
+	 */
+	id =  tapdisk_server_register_event(SCHEDULER_POLL_READ_FD,
+					    fd, 0,
+					    td_replication_server_accept, t);
+	if (id < 0) {
+		EPRINTF("error registering server connection event handler: %s",
+			strerror(id));
+		goto err;
+	}
+	t->listen_fd = fd;
+	t->id = id;
+	t->status = connection_in_progress;
+
+	return 0;
+
+err:
+	close(fd);
+	return rc;
+}
+
+static void td_replication_server_accept(event_id_t id, char mode,
+					 void *private)
+{
+	td_replication_connect_t *t = private;
+	int fd;
+	const char *log_prefix = t->log_prefix;
+
+	/* XXX: add address-based black/white list */
+	fd = accept(t->listen_fd, NULL, NULL);
+	if (fd < 0) {
+		EPRINTF("error accepting connection: %d\n", errno);
+		return;
+	}
+
+	if (t->status == connection_established) {
+		EPRINTF("connection is already established\n");
+		close(fd);
+		return;
+	}
+
+	DPRINTF("server accepted connection\n");
+	t->fd = fd;
+	t->status = connection_established;
+	t->callback(t, 0);
+}
+
+int td_replication_server_restart(td_replication_connect_t *t)
+{
+	switch (t->status) {
+	case connection_in_progress:
+		return 0;
+	case connection_established:
+		CLOSE_FD(t->fd);
+		t->status = connection_in_progress;
+		return 0;
+	case connection_none:
+	case connection_closed:
+		return td_replication_server_start(t);
+	default:
+		/* not reached */
+		return -1;
+	}
+}
+
+/* client */
+static void td_replication_retry_connect_event(event_id_t id, char mode,
+					       void *private);
+static void td_replication_connect_event(event_id_t id, char mode,
+					 void *private);
+int td_replication_client_start(td_replication_connect_t *t)
+{
+	event_id_t id;
+	int fd;
+	int rc;
+	int flags;
+	const char *log_prefix = t->log_prefix;
+
+	if (t->status == connection_in_progress ||
+	    t->status == connection_established)
+		return ERROR_INTERNAL;
+
+	DPRINTF("client connecting to %s:%d...\n",
+		inet_ntoa(t->sa.sin_addr), ntohs(t->sa.sin_port));
+
+	if ((fd = socket(PF_INET, SOCK_STREAM, 0)) < 0) {
+		EPRINTF("could not create client socket: %d\n", errno);
+		return ERROR_INTERNAL;
+	}
+
+	/* make socket nonblocking */
+	if ((flags = fcntl(fd, F_GETFL, 0)) == -1)
+		flags = 0;
+	if (fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1) {
+		EPRINTF("error setting fd %d to non block mode\n", fd);
+		goto err;
+	}
+
+	/*
+	 * once we have created the socket and populated the address,
+	 * we can now start our non-blocking connect. rather than
+	 * duplicating code we trigger a timeout on the socket fd,
+	 * which calls our nonblocking connect code
+	 */
+	id = tapdisk_server_register_event(SCHEDULER_POLL_TIMEOUT, fd, 0,
+					   td_replication_retry_connect_event,
+					   t);
+	if(id < 0) {
+		EPRINTF("error registering timeout client connection event handler: %s\n",
+			strerror(id));
+		goto err;
+	}
+
+	t->fd = fd;
+	t->id = id;
+	t->status = connection_in_progress;
+	return 0;
+
+err:
+	close(fd);
+	return ERROR_INTERNAL;
+}
+
+static void td_replication_client_failed(td_replication_connect_t *t, int rc)
+{
+	td_replication_connect_kill(t);
+	t->callback(t, rc);
+}
+
+static void td_replication_client_done(td_replication_connect_t *t)
+{
+	UNREGISTER_EVENT(t->id);
+	t->status = connection_established;
+	t->callback(t, 0);
+}
+
+static int td_replication_retry_connect(td_replication_connect_t *t)
+{
+	event_id_t id;
+	const char *log_prefix = t->log_prefix;
+
+	UNREGISTER_EVENT(t->id);
+
+	DPRINTF("connect to server 1 second later");
+	id = tapdisk_server_register_event(SCHEDULER_POLL_TIMEOUT,
+					   t->fd, t->retry_timeout_s,
+					   td_replication_retry_connect_event,
+					   t);
+	if (id < 0) {
+		EPRINTF("error registering timeout client connection event handler: %s\n",
+			strerror(id));
+		return ERROR_INTERNAL;
+	}
+
+	t->id = id;
+	return 0;
+}
+
+static int td_replication_wait_connect_done(td_replication_connect_t *t)
+{
+	event_id_t id;
+	const char *log_prefix = t->log_prefix;
+
+	UNREGISTER_EVENT(t->id);
+
+	id = tapdisk_server_register_event(SCHEDULER_POLL_WRITE_FD,
+					   t->fd, 0,
+					   td_replication_connect_event, t);
+	if (id < 0) {
+		EPRINTF("error registering client connection event handler: %s\n",
+			strerror(id));
+		return ERROR_INTERNAL;
+	}
+	t->id = id;
+
+	return 0;
+}
+
+/* return 1 if we need to reconnect to backup server */
+static int check_connect_errno(int err)
+{
+	/*
+	 * The fd is non-blocking, so connect() itself never returns
+	 * ETIMEDOUT; that errno can only be reported later through
+	 * getsockopt(SO_ERROR).
+	 */
+	if (err == ECONNREFUSED || err == ENETUNREACH ||
+	    err == EAGAIN || err == ECONNABORTED ||
+	    err == ETIMEDOUT)
+		return 1;
+
+	return 0;
+}
+
+static void td_replication_retry_connect_event(event_id_t id, char mode,
+					       void *private)
+{
+	td_replication_connect_t *t = private;
+	int rc, ret;
+	const char *log_prefix = t->log_prefix;
+
+	/* do a non-blocking connect */
+	ret = connect(t->fd, (struct sockaddr *)&t->sa, sizeof(t->sa));
+	if (ret) {
+		if (errno == EINPROGRESS) {
+			/*
+			 * the connect returned EINPROGRESS (nonblocking
+			 * connect) we must wait for the fd to be writeable
+			 * to determine if the connect worked
+			 */
+			rc = td_replication_wait_connect_done(t);
+			if (rc)
+				goto fail;
+			return;
+		}
+
+		if (check_connect_errno(errno)) {
+			rc = td_replication_retry_connect(t);
+			if (rc)
+				goto fail;
+			return;
+		}
+
+		/* not recoverable */
+		EPRINTF("error connecting to server: %s\n", strerror(errno));
+		rc = ERROR_CONNECTION;
+		goto fail;
+	}
+
+	/* connect() succeeded immediately */
+	td_replication_client_done(t);
+
+	return;
+
+fail:
+	td_replication_client_failed(t, rc);
+}
+
+/* callback when nonblocking connect() is finished */
+static void td_replication_connect_event(event_id_t id, char mode,
+					 void *private)
+{
+	int socket_errno;
+	socklen_t socket_errno_size;
+	td_replication_connect_t *t = private;
+	int rc;
+	const char *log_prefix = t->log_prefix;
+
+	/* check to see if the connect succeeded */
+	socket_errno_size = sizeof(socket_errno);
+	if (getsockopt(t->fd, SOL_SOCKET, SO_ERROR,
+		       &socket_errno, &socket_errno_size)) {
+		EPRINTF("error getting socket errno\n");
+		return;
+	}
+
+	DPRINTF("socket connect returned %d\n", socket_errno);
+
+	if (socket_errno) {
+		/* the connect did not succeed */
+		if (check_connect_errno(socket_errno)) {
+			/*
+			 * we can probably assume that the backup is down.
+			 * just try again later
+			 */
+			rc = td_replication_retry_connect(t);
+			if (rc)
+				goto fail;
+
+			return;
+		} else {
+			EPRINTF("socket connect returned %d, giving up\n",
+				socket_errno);
+			rc = ERROR_CONNECTION;
+			goto fail;
+		}
+	}
+
+	td_replication_client_done(t);
+
+	return;
+
+fail:
+	td_replication_client_failed(t, rc);
+}
diff --git a/tools/blktap2/drivers/block-replication.h b/tools/blktap2/drivers/block-replication.h
new file mode 100644
index 0000000..0bd6e71
--- /dev/null
+++ b/tools/blktap2/drivers/block-replication.h
@@ -0,0 +1,113 @@
+/*
+ * Copyright (C) 2014 FUJITSU LIMITED
+ * Author: Wen Congyang <wency@cn.fujitsu.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU Lesser General Public License as published
+ * by the Free Software Foundation; version 2.1 only. with the special
+ * exception on linking described in file LICENSE.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU Lesser General Public License for more details.
+ */
+
+#ifndef BLOCK_REPLICATION_H
+#define BLOCK_REPLICATION_H
+
+#include "scheduler.h"
+#include <sys/socket.h>
+#include <netdb.h>
+
+#define CONTAINER_OF(inner_ptr, outer, member_name)			\
+	({								\
+		typeof(outer) *container_of_;				\
+		container_of_ = (void*)((char*)(inner_ptr) -		\
+				offsetof(typeof(outer), member_name));	\
+		(void)(&container_of_->member_name ==			\
+		       (typeof(inner_ptr))0) /* type check */;		\
+		container_of_;						\
+	})
+
+#define UNREGISTER_EVENT(id)					\
+	do {							\
+		if (id >= 0) {					\
+			tapdisk_server_unregister_event(id);	\
+			id = -1;				\
+		}						\
+	} while (0)
+#define CLOSE_FD(fd)			\
+	do {				\
+		if (fd >= 0) {		\
+			close(fd);	\
+			fd = -1;	\
+		}			\
+	} while (0)
+
+enum {
+	ERROR_INTERNAL = -1,
+	ERROR_IO = -2,
+	ERROR_CONNECTION = -3,
+	ERROR_CLOSE = -4,
+};
+
+typedef struct td_replication_connect td_replication_connect_t;
+typedef void td_replication_callback(td_replication_connect_t *r, int rc);
+
+struct td_replication_connect {
+	/*
+	 * caller must fill these in before calling
+	 * td_replication_connect_init()
+	 */
+	const char *log_prefix;
+	td_replication_callback *callback;
+	int retry_timeout_s;
+	int max_connections;
+	/*
+	 * The caller uses this fd to read/write after
+	 * the connection is established
+	 */
+	int fd;
+
+	/* private */
+	struct sockaddr_in sa;
+	int listen_fd;
+	event_id_t id;
+
+	int status;
+};
+
+/* return -errno if failure happened, otherwise return 0 */
+int td_replication_connect_init(td_replication_connect_t *t, const char *name);
+/*
+ * Return value:
+ *   -1: connection is closed or not connected
+ *    0: connection is in progress
+ *    1: connection is established
+ */
+int td_replication_connect_status(td_replication_connect_t *t);
+void td_replication_connect_kill(td_replication_connect_t *t);
+
+/*
+ * Return value:
+ *   -2: the caller should act as the client
+ *   -1: error
+ *    0: connection is in progress
+ */
+int td_replication_server_start(td_replication_connect_t *t);
+/*
+ * Return value:
+ *   -2: the caller should act as the client
+ *   -1: error
+ *    0: connection is in progress
+ */
+int td_replication_server_restart(td_replication_connect_t *t);
+/*
+ * Return value:
+ *   -1: error
+ *    0: connection is in progress
+ */
+int td_replication_client_start(td_replication_connect_t *t);
+
+#endif
-- 
1.9.3
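
For readers following the connect path in this patch, the sequence (non-blocking connect(), wait for the fd to become writable, then read SO_ERROR) can be sketched outside tapdisk. This is only a minimal sketch under stated assumptions: it uses poll() in place of the tapdisk scheduler events, and the names connect_nonblocking() and is_retryable_connect_error() are hypothetical, not part of blktap2.

```c
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <unistd.h>

/* Hypothetical helper: 1 if the error is worth retrying later, else 0. */
static int is_retryable_connect_error(int err)
{
	return err == ECONNREFUSED || err == ENETUNREACH ||
	       err == EAGAIN || err == ECONNABORTED || err == ETIMEDOUT;
}

/*
 * Hypothetical helper: start a non-blocking connect and wait up to
 * timeout_ms for it to finish. Returns the connected fd, or -errno.
 */
static int connect_nonblocking(const struct sockaddr_in *sa, int timeout_ms)
{
	struct pollfd pfd;
	socklen_t len;
	int fd, flags, err, n;

	fd = socket(AF_INET, SOCK_STREAM, 0);
	if (fd < 0)
		return -errno;

	flags = fcntl(fd, F_GETFL, 0);
	if (flags < 0 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0) {
		err = errno;
		close(fd);
		return -err;
	}

	if (connect(fd, (const struct sockaddr *)sa, sizeof(*sa)) == 0)
		return fd;	/* connected immediately */

	if (errno != EINPROGRESS) {
		err = errno;
		close(fd);
		return -err;
	}

	/* The fd becomes writable once the connect attempt has finished. */
	pfd.fd = fd;
	pfd.events = POLLOUT;
	n = poll(&pfd, 1, timeout_ms);
	if (n <= 0) {
		err = n == 0 ? ETIMEDOUT : errno;
		close(fd);
		return -err;
	}

	/* Writable does not mean success: SO_ERROR holds the real result. */
	err = 0;
	len = sizeof(err);
	if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) < 0)
		err = errno;
	if (err) {
		close(fd);
		return -err;
	}
	return fd;
}
```

A caller would pass the negated return value to is_retryable_connect_error() to decide between scheduling another attempt and giving up, which mirrors the role of check_connect_errno() in the patch.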


* [RFC Patch v2 38/45] blktap2: move ramdisk related codes to block-replication.c
  2014-08-08  7:00 [RFC Patch v2 00/45] COarse-grain LOck-stepping Virtual Machines for Non-stop Service Wen Congyang
                   ` (36 preceding siblings ...)
  2014-08-08  7:01 ` [RFC Patch v2 37/45] blktap2: move async connect related codes to block-replication.c Wen Congyang
@ 2014-08-08  7:01 ` Wen Congyang
  2014-08-08  7:01 ` [RFC Patch v2 39/45] block-colo: implement colo disk replication Wen Congyang
                   ` (8 subsequent siblings)
  46 siblings, 0 replies; 64+ messages in thread
From: Wen Congyang @ 2014-08-08  7:01 UTC (permalink / raw)
  To: xen devel
  Cc: Ian Campbell, Wen Congyang, Ian Jackson, Jiang Yunhong,
	Dong Eddie, Yang Hongyang, Lai Jiangshan

COLO will reuse the ramdisk code, so move it out of block-remus.c
into block-replication.c.

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
 tools/blktap2/drivers/block-remus.c       | 485 ++----------------------------
 tools/blktap2/drivers/block-replication.c | 452 ++++++++++++++++++++++++++++
 tools/blktap2/drivers/block-replication.h |  48 +++
 3 files changed, 523 insertions(+), 462 deletions(-)
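
As background for the ramdisk code being moved, ramdisk_flush() sorts the queued sector numbers and merges consecutive sectors into one write request. The following is a minimal standalone sketch of that sort-and-merge step only. The names struct sector_run and coalesce_sectors() are hypothetical, and it assumes the sector numbers are unique, as hashtable keys are.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical type: one contiguous batch of sectors. */
struct sector_run {
	uint64_t base;	/* first sector of the run */
	uint64_t len;	/* number of consecutive sectors */
};

static int cmp_u64(const void *a, const void *b)
{
	uint64_t u1 = *(const uint64_t *)a;
	uint64_t u2 = *(const uint64_t *)b;

	/* compare explicitly: u1 - u2 would wrap around */
	return u1 < u2 ? -1 : u1 > u2 ? 1 : 0;
}

/*
 * Sort sectors[] in place and store each contiguous run in runs[],
 * which must have room for count entries. Returns the number of runs.
 */
static size_t coalesce_sectors(uint64_t *sectors, size_t count,
			       struct sector_run *runs)
{
	size_t i = 0, n = 0;

	qsort(sectors, count, sizeof(*sectors), cmp_u64);

	while (i < count) {
		uint64_t base = sectors[i++];

		/* extend the run while the next sector is adjacent */
		while (i < count && sectors[i] == sectors[i - 1] + 1)
			i++;

		runs[n].base = base;
		runs[n].len = sectors[i - 1] - base + 1;
		n++;
	}

	return n;
}
```

For the input {7, 3, 4, 10, 5, 8} this yields three runs: (base 3, len 3), (base 7, len 2) and (base 10, len 1). In the patch each such run becomes one call to merge_requests() followed by create_write_request().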

diff --git a/tools/blktap2/drivers/block-remus.c b/tools/blktap2/drivers/block-remus.c
index 8b6f157..2713af1 100644
--- a/tools/blktap2/drivers/block-remus.c
+++ b/tools/blktap2/drivers/block-remus.c
@@ -37,9 +37,6 @@
 #include "tapdisk-server.h"
 #include "tapdisk-driver.h"
 #include "tapdisk-interface.h"
-#include "hashtable.h"
-#include "hashtable_itr.h"
-#include "hashtable_utility.h"
 #include "block-replication.h"
 
 #include <errno.h>
@@ -58,7 +55,6 @@
 
 /* timeout for reads and writes in ms */
 #define HEARTBEAT_MS 1000
-#define RAMDISK_HASHSIZE 128
 
 /* connect retry timeout (seconds) */
 #define REMUS_CONNRETRY_TIMEOUT 1
@@ -97,51 +93,6 @@ td_vbd_t *device_vbd = NULL;
 td_image_t *remus_image = NULL;
 struct tap_disk tapdisk_remus;
 
-struct ramdisk {
-	size_t sector_size;
-	struct hashtable* h;
-	/* when a ramdisk is flushed, h is given a new empty hash for writes
-	 * while the old ramdisk (prev) is drained asynchronously.
-	 */
-	struct hashtable* prev;
-	/* count of outstanding requests to the base driver */
-	size_t inflight;
-	/* prev holds the requests to be flushed, while inprogress holds
-	 * requests being flushed. When requests complete, they are removed
-	 * from inprogress.
-	 * Whenever a new flush is merged with ongoing flush (i.e, prev),
-	 * we have to make sure that none of the new requests overlap with
-	 * ones in "inprogress". If it does, keep it back in prev and dont issue
-	 * IO until the current one finishes. If we allow this IO to proceed,
-	 * we might end up with two "overlapping" requests in the disk's queue and
-	 * the disk may not offer any guarantee on which one is written first.
-	 * IOW, make sure we dont create a write-after-write time ordering constraint.
-	 * 
-	 */
-	struct hashtable* inprogress;
-};
-
-/* the ramdisk intercepts the original callback for reads and writes.
- * This holds the original data. */
-/* Might be worth making this a static array in struct ramdisk to avoid
- * a malloc per request */
-
-struct tdremus_state;
-
-struct ramdisk_cbdata {
-	td_callback_t cb;
-	void* private;
-	char* buf;
-	struct tdremus_state* state;
-};
-
-struct ramdisk_write_cbdata {
-	struct tdremus_state* state;
-	char* buf;
-};
-
-typedef void (*queue_rw_t) (td_driver_t *driver, td_request_t treq);
-
 typedef struct poll_fd {
 	int        fd;
 	event_id_t id;
@@ -167,8 +118,14 @@ struct tdremus_state {
 	 */
 	struct req_ring queued_io;
 
-	/* ramdisk data*/
+	/* ramdisk data */
 	struct ramdisk ramdisk;
+	/*
+	 * Write requests from the primary are queued in
+	 * this hashtable and flushed to the ramdisk when
+	 * the checkpoint finishes.
+	 */
+	struct hashtable *h;
 
 	/* mode methods */
 	enum tdremus_mode mode;
@@ -239,404 +196,20 @@ static void ring_add_request(struct req_ring *ring, const td_request_t *treq)
 	ring->prod = ring_next(ring->prod);
 }
 
-/* Prototype declarations */
-static int ramdisk_flush(td_driver_t *driver, struct tdremus_state* s);
-
-/* functions to create and sumbit treq's */
-
-static void
-replicated_write_callback(td_request_t treq, int err)
-{
-	struct tdremus_state *s = (struct tdremus_state *) treq.cb_data;
-	td_vbd_request_t *vreq;
-	int i;
-	uint64_t start;
-	vreq = (td_vbd_request_t *) treq.private;
-
-	/* the write failed for now, lets panic. this is very bad */
-	if (err) {
-		RPRINTF("ramdisk write failed, disk image is not consistent\n");
-		exit(-1);
-	}
-
-	/* The write succeeded. let's pull the vreq off whatever request list
-	 * it is on and free() it */
-	list_del(&vreq->next);
-	free(vreq);
-
-	s->ramdisk.inflight--;
-	start = treq.sec;
-	for (i = 0; i < treq.secs; i++) {
-		hashtable_remove(s->ramdisk.inprogress, &start);
-		start++;
-	}
-	free(treq.buf);
-
-	if (!s->ramdisk.inflight && !s->ramdisk.prev) {
-		/* TODO: the ramdisk has been flushed */
-	}
-}
-
-static inline int
-create_write_request(struct tdremus_state *state, td_sector_t sec, int secs, char *buf)
-{
-	td_request_t treq;
-	td_vbd_request_t *vreq;
-
-	treq.op      = TD_OP_WRITE;
-	treq.buf     = buf;
-	treq.sec     = sec;
-	treq.secs    = secs;
-	treq.image   = remus_image;
-	treq.cb      = replicated_write_callback;
-	treq.cb_data = state;
-	treq.id      = 0;
-	treq.sidx    = 0;
-
-	vreq         = calloc(1, sizeof(td_vbd_request_t));
-	treq.private = vreq;
-
-	if(!vreq)
-		return -1;
-
-	vreq->submitting = 1;
-	INIT_LIST_HEAD(&vreq->next);
-	tapdisk_vbd_move_request(treq.private, &device_vbd->pending_requests);
-
-	/* TODO:
-	 * we should probably leave it up to the caller to forward the request */
-	td_forward_request(treq);
-
-	vreq->submitting--;
-
-	return 0;
-}
-
-
-/* http://www.concentric.net/~Ttwang/tech/inthash.htm */
-static unsigned int uint64_hash(void* k)
-{
-	uint64_t key = *(uint64_t*)k;
-
-	key = (~key) + (key << 18);
-	key = key ^ (key >> 31);
-	key = key * 21;
-	key = key ^ (key >> 11);
-	key = key + (key << 6);
-	key = key ^ (key >> 22);
-
-	return (unsigned int)key;
-}
-
-static int rd_hash_equal(void* k1, void* k2)
-{
-	uint64_t key1, key2;
-
-	key1 = *(uint64_t*)k1;
-	key2 = *(uint64_t*)k2;
-
-	return key1 == key2;
-}
-
-static int ramdisk_read(struct ramdisk* ramdisk, uint64_t sector,
-			int nb_sectors, char* buf)
-{
-	int i;
-	char* v;
-	uint64_t key;
-
-	for (i = 0; i < nb_sectors; i++) {
-		key = sector + i;
-		/* check whether it is queued in a previous flush request */
-		if (!(ramdisk->prev && (v = hashtable_search(ramdisk->prev, &key)))) {
-			/* check whether it is an ongoing flush */
-			if (!(ramdisk->inprogress && (v = hashtable_search(ramdisk->inprogress, &key))))
-				return -1;
-		}
-		memcpy(buf + i * ramdisk->sector_size, v, ramdisk->sector_size);
-	}
-
-	return 0;
-}
-
-static int ramdisk_write_hash(struct hashtable* h, uint64_t sector, char* buf,
-			      size_t len)
-{
-	char* v;
-	uint64_t* key;
-
-	if ((v = hashtable_search(h, &sector))) {
-		memcpy(v, buf, len);
-		return 0;
-	}
-
-	if (!(v = malloc(len))) {
-		DPRINTF("ramdisk_write_hash: malloc failed\n");
-		return -1;
-	}
-	memcpy(v, buf, len);
-	if (!(key = malloc(sizeof(*key)))) {
-		DPRINTF("ramdisk_write_hash: error allocating key\n");
-		free(v);
-		return -1;
-	}
-	*key = sector;
-	if (!hashtable_insert(h, key, v)) {
-		DPRINTF("ramdisk_write_hash failed on sector %" PRIu64 "\n", sector);
-		free(key);
-		free(v);
-		return -1;
-	}
-
-	return 0;
-}
-
-static inline int ramdisk_write(struct ramdisk* ramdisk, uint64_t sector,
-				int nb_sectors, char* buf)
-{
-	int i, rc;
-
-	for (i = 0; i < nb_sectors; i++) {
-		rc = ramdisk_write_hash(ramdisk->h, sector + i,
-					buf + i * ramdisk->sector_size,
-					ramdisk->sector_size);
-		if (rc)
-			return rc;
-	}
-
-	return 0;
-}
-
-static int uint64_compare(const void* k1, const void* k2)
-{
-	uint64_t u1 = *(uint64_t*)k1;
-	uint64_t u2 = *(uint64_t*)k2;
-
-	/* u1 - u2 is unsigned */
-	return u1 < u2 ? -1 : u1 > u2 ? 1 : 0;
-}
-
-/* set psectors to an array of the sector numbers in the hash, returning
- * the number of entries (or -1 on error) */
-static int ramdisk_get_sectors(struct hashtable* h, uint64_t** psectors)
-{
-	struct hashtable_itr* itr;
-	uint64_t* sectors;
-	int count;
-
-	if (!(count = hashtable_count(h)))
-		return 0;
-
-	if (!(*psectors = malloc(count * sizeof(uint64_t)))) {
-		DPRINTF("ramdisk_get_sectors: error allocating sector map\n");
-		return -1;
-	}
-	sectors = *psectors;
-
-	itr = hashtable_iterator(h);
-	count = 0;
-	do {
-		sectors[count++] = *(uint64_t*)hashtable_iterator_key(itr);
-	} while (hashtable_iterator_advance(itr));
-	free(itr);
-
-	return count;
-}
-
-/*
-  return -1 for OOM
-  return -2 for merge lookup failure
-  return -3 for WAW race
-  return 0 on success.
-*/
-static int merge_requests(struct ramdisk* ramdisk, uint64_t start,
-			size_t count, char **mergedbuf)
-{
-	char* buf;
-	char* sector;
-	int i;
-	uint64_t *key;
-	int rc = 0;
-
-	if (!(buf = valloc(count * ramdisk->sector_size))) {
-		DPRINTF("merge_request: allocation failed\n");
-		return -1;
-	}
-
-	for (i = 0; i < count; i++) {
-		if (!(sector = hashtable_search(ramdisk->prev, &start))) {
-			DPRINTF("merge_request: lookup failed on %"PRIu64"\n", start);
-			free(buf);
-			rc = -2;
-			goto fail;
-		}
-
-		/* Check inprogress requests to avoid waw non-determinism */
-		if (hashtable_search(ramdisk->inprogress, &start)) {
-			DPRINTF("merge_request: WAR RACE on %"PRIu64"\n", start);
-			free(buf);
-			rc = -3;
-			goto fail;
-		}
-		/* Insert req into inprogress (brief period of duplication of hash entries until
-		 * they are removed from prev. Read tracking would not be reading wrong entries)
-		 */
-		if (!(key = malloc(sizeof(*key)))) {
-			DPRINTF("%s: error allocating key\n", __FUNCTION__);
-			free(buf);			
-			rc = -1;
-			goto fail;
-		}
-		*key = start;
-		if (!hashtable_insert(ramdisk->inprogress, key, NULL)) {
-			DPRINTF("%s failed to insert sector %" PRIu64 " into inprogress hash\n", 
-				__FUNCTION__, start);
-			free(key);
-			free(buf);
-			rc = -1;
-			goto fail;
-		}
-		memcpy(buf + i * ramdisk->sector_size, sector, ramdisk->sector_size);
-		start++;
-	}
-
-	*mergedbuf = buf;
-	return 0;
-fail:
-	for (start--; i >0; i--, start--)
-		hashtable_remove(ramdisk->inprogress, &start);
-	return rc;
-}
-
-/* The underlying driver may not handle having the whole ramdisk queued at
- * once. We queue what we can and let the callbacks attempt to queue more. */
-/* NOTE: may be called from callback, while dd->private still belongs to
- * the underlying driver */
-static int ramdisk_flush(td_driver_t *driver, struct tdremus_state* s)
-{
-	uint64_t* sectors;
-	char* buf = NULL;
-	uint64_t base, batchlen;
-	int i, j, count = 0;
-
-	// RPRINTF("ramdisk flush\n");
-
-	if ((count = ramdisk_get_sectors(s->ramdisk.prev, &sectors)) <= 0)
-		return count;
-
-	/* Create the inprogress table if empty */
-	if (!s->ramdisk.inprogress)
-		s->ramdisk.inprogress = create_hashtable(RAMDISK_HASHSIZE,
-							uint64_hash,
-							rd_hash_equal);
-	
-	/*
-	  RPRINTF("ramdisk: flushing %d sectors\n", count);
-	*/
-
-	/* sort and merge sectors to improve disk performance */
-	qsort(sectors, count, sizeof(*sectors), uint64_compare);
-
-	for (i = 0; i < count;) {
-		base = sectors[i++];
-		while (i < count && sectors[i] == sectors[i-1] + 1)
-			i++;
-		batchlen = sectors[i-1] - base + 1;
-
-		j = merge_requests(&s->ramdisk, base, batchlen, &buf);
-			
-		if (j) {
-			RPRINTF("ramdisk_flush: merge_requests failed:%s\n",
-				j == -1? "OOM": (j==-2? "missing sector" : "WAW race"));
-			if (j == -3) continue;
-			free(sectors);
-			return -1;
-		}
-
-		/* NOTE: create_write_request() creates a treq AND forwards it down
-		 * the driver chain */
-		// RPRINTF("forwarding write request at %" PRIu64 ", length: %" PRIu64 "\n", base, batchlen);
-		create_write_request(s, base, batchlen, buf);
-		//RPRINTF("write request at %" PRIu64 ", length: %" PRIu64 " forwarded\n", base, batchlen);
-
-		s->ramdisk.inflight++;
-
-		for (j = 0; j < batchlen; j++) {
-			buf = hashtable_search(s->ramdisk.prev, &base);
-			free(buf);
-			hashtable_remove(s->ramdisk.prev, &base);
-			base++;
-		}
-	}
-
-	if (!hashtable_count(s->ramdisk.prev)) {
-		/* everything is in flight */
-		hashtable_destroy(s->ramdisk.prev, 0);
-		s->ramdisk.prev = NULL;
-	}
-
-	free(sectors);
-
-	// RPRINTF("ramdisk flush done\n");
-	return 0;
-}
-
-/* flush ramdisk contents to disk */
-static int ramdisk_start_flush(td_driver_t *driver)
-{
-	struct tdremus_state *s = (struct tdremus_state *)driver->data;
-	uint64_t* key;
-	char* buf;
-	int rc = 0;
-	int i, j, count, batchlen;
-	uint64_t* sectors;
-
-	if (!hashtable_count(s->ramdisk.h)) {
-		/*
-		  RPRINTF("Nothing to flush\n");
-		*/
-		return 0;
-	}
-
-	if (s->ramdisk.prev) {
-		/* a flush request issued while a previous flush is still in progress
-		 * will merge with the previous request. If you want the previous
-		 * request to be consistent, wait for it to complete. */
-		if ((count = ramdisk_get_sectors(s->ramdisk.h, &sectors)) < 0)
-			return count;
-
-		for (i = 0; i < count; i++) {
-			buf = hashtable_search(s->ramdisk.h, sectors + i);
-			ramdisk_write_hash(s->ramdisk.prev, sectors[i], buf,
-					   s->ramdisk.sector_size);
-		}
-		free(sectors);
-
-		hashtable_destroy (s->ramdisk.h, 1);
-	} else
-		s->ramdisk.prev = s->ramdisk.h;
-
-	/* We create a new hashtable so that new writes can be performed before
-	 * the old hashtable is completely drained. */
-	s->ramdisk.h = create_hashtable(RAMDISK_HASHSIZE, uint64_hash,
-					rd_hash_equal);
-
-	return ramdisk_flush(driver, s);
-}
-
-
 static int ramdisk_start(td_driver_t *driver)
 {
 	struct tdremus_state *s = (struct tdremus_state *)driver->data;
 
-	if (s->ramdisk.h) {
+	if (s->h) {
 		RPRINTF("ramdisk already allocated\n");
 		return 0;
 	}
 
 	s->ramdisk.sector_size = driver->info.sector_size;
-	s->ramdisk.h = create_hashtable(RAMDISK_HASHSIZE, uint64_hash,
-					rd_hash_equal);
+	s->ramdisk.log_prefix = "remus";
+	s->ramdisk.image = remus_image;
+	ramdisk_init(&s->ramdisk);
+	s->h = ramdisk_new_hashtable();
 
 	DPRINTF("Ramdisk started, %zu bytes/sector\n", s->ramdisk.sector_size);
 
@@ -1024,10 +597,7 @@ static inline int server_writes_inflight(td_driver_t *driver)
 {
 	struct tdremus_state *s = (struct tdremus_state *)driver->data;
 
-	if (!s->ramdisk.inflight && !s->ramdisk.prev)
-		return 0;
-
-	return 1;
+	return ramdisk_writes_inflight(&s->ramdisk);
 }
 
 /* Due to block device prefetching this code may be called on the server side
@@ -1067,13 +637,9 @@ void backup_queue_write(td_driver_t *driver, td_request_t treq)
 static int server_flush(td_driver_t *driver)
 {
 	struct tdremus_state *s = (struct tdremus_state *)driver->data;
-	/*
-	 * Nothing to flush in beginning.
-	 */
-	if (!s->ramdisk.prev)
-		return 0;
+
 	/* Try to flush any remaining requests */
-	return ramdisk_flush(driver, s);
+	return ramdisk_flush(&s->ramdisk);
 }
 
 static int backup_start(td_driver_t *driver)
@@ -1120,7 +686,9 @@ static void server_do_wreq(td_driver_t *driver)
 	if (mread(s->stream_fd.fd, buf, len) < 0)
 		goto err;
 
-	if (ramdisk_write(&s->ramdisk, *sector, *sectors, buf) < 0) {
+	if (ramdisk_write_to_hashtable(s->h, *sector, *sectors,
+				       driver->info.sector_size, buf,
+				       "remus") < 0) {
 		rc = ERROR_INTERNAL;
 		goto err;
 	}
@@ -1150,7 +718,7 @@ static void server_do_creq(td_driver_t *driver)
 
 	// RPRINTF("committing buffer\n");
 
-	ramdisk_start_flush(driver);
+	ramdisk_start_flush(&s->ramdisk, &s->h);
 
 	/* XXX this message should not be sent until flush completes! */
 	if (write(s->stream_fd.fd, TDREMUS_DONE, strlen(TDREMUS_DONE)) != 4)
@@ -1199,12 +767,7 @@ void unprotected_queue_read(td_driver_t *driver, td_request_t treq)
 
 	/* wait for previous ramdisk to flush  before servicing reads */
 	if (server_writes_inflight(driver)) {
-		/* for now lets just return EBUSY.
-		 * if there are any left-over requests in prev,
-		 * kick em again.
-		 */
-		if(!s->ramdisk.inflight) /* nothing in inprogress */
-			ramdisk_flush(driver, s);
+		ramdisk_flush(&s->ramdisk);
 
 		td_complete_request(treq, -EBUSY);
 	}
@@ -1222,8 +785,7 @@ void unprotected_queue_write(td_driver_t *driver, td_request_t treq)
 	/* wait for previous ramdisk to flush */
 	if (server_writes_inflight(driver)) {
 		RPRINTF("queue_write: waiting for queue to drain");
-		if(!s->ramdisk.inflight) /* nothing in inprogress. Kick prev */
-			ramdisk_flush(driver, s);
+		ramdisk_flush(&s->ramdisk);
 		td_complete_request(treq, -EBUSY);
 	}
 	else {
@@ -1532,9 +1094,8 @@ static int tdremus_close(td_driver_t *driver)
 	struct tdremus_state *s = (struct tdremus_state *)driver->data;
 
 	RPRINTF("closing\n");
-	if (s->ramdisk.inprogress)
-		hashtable_destroy(s->ramdisk.inprogress, 0);
-
+	ramdisk_destroy(&s->ramdisk);
+	ramdisk_destroy_hashtable(s->h);
 	td_replication_connect_kill(&s->t);
 	ctl_unregister(s);
 	ctl_close(s);
diff --git a/tools/blktap2/drivers/block-replication.c b/tools/blktap2/drivers/block-replication.c
index e4b2679..30eba8f 100644
--- a/tools/blktap2/drivers/block-replication.c
+++ b/tools/blktap2/drivers/block-replication.c
@@ -15,6 +15,10 @@
 
 #include "tapdisk-server.h"
 #include "block-replication.h"
+#include "tapdisk-interface.h"
+#include "hashtable.h"
+#include "hashtable_itr.h"
+#include "hashtable_utility.h"
 
 #include <string.h>
 #include <errno.h>
@@ -30,6 +34,8 @@
 #define DPRINTF(_f, _a...) syslog (LOG_DEBUG, "%s: " _f, log_prefix, ## _a)
 #define EPRINTF(_f, _a...) syslog (LOG_ERR, "%s: " _f, log_prefix, ## _a)
 
+#define RAMDISK_HASHSIZE 128
+
 /* connection status */
 enum {
 	connection_none,
@@ -466,3 +472,449 @@ static void td_replication_connect_event(event_id_t id, char mode,
 fail:
 	td_replication_client_failed(t, rc);
 }
+
+
+/* I/O replication */
+static void replicated_write_callback(td_request_t treq, int err)
+{
+	ramdisk_t *ramdisk = treq.cb_data;
+	td_vbd_request_t *vreq = treq.private;
+	int i;
+	uint64_t start;
+	const char *log_prefix = ramdisk->log_prefix;
+
+	/* the write failed. For now, let's panic: this is very bad */
+	if (err) {
+		EPRINTF("ramdisk write failed, disk image is not consistent\n");
+		exit(-1);
+	}
+
+	/*
+	 * The write succeeded. let's pull the vreq off whatever request list
+	 * it is on and free() it
+	 */
+	list_del(&vreq->next);
+	free(vreq);
+
+	ramdisk->inflight--;
+	start = treq.sec;
+	for (i = 0; i < treq.secs; i++) {
+		hashtable_remove(ramdisk->inprogress, &start);
+		start++;
+	}
+	free(treq.buf);
+
+	if (!ramdisk->inflight && ramdisk->prev)
+		ramdisk_flush(ramdisk);
+}
+
+static int
+create_write_request(ramdisk_t *ramdisk, td_sector_t sec, int secs, char *buf)
+{
+	td_request_t treq;
+	td_vbd_request_t *vreq;
+	td_vbd_t *vbd = ramdisk->image->private;
+
+	treq.op      = TD_OP_WRITE;
+	treq.buf     = buf;
+	treq.sec     = sec;
+	treq.secs    = secs;
+	treq.image   = ramdisk->image;
+	treq.cb      = replicated_write_callback;
+	treq.cb_data = ramdisk;
+	treq.id      = 0;
+	treq.sidx    = 0;
+
+	vreq         = calloc(1, sizeof(td_vbd_request_t));
+	treq.private = vreq;
+
+	if (!vreq)
+		return -1;
+
+	vreq->submitting = 1;
+	INIT_LIST_HEAD(&vreq->next);
+	tapdisk_vbd_move_request(treq.private, &vbd->pending_requests);
+
+	td_forward_request(treq);
+
+	vreq->submitting--;
+
+	return 0;
+}
+
+/* http://www.concentric.net/~Ttwang/tech/inthash.htm */
+static unsigned int uint64_hash(void *k)
+{
+	uint64_t key = *(uint64_t*)k;
+
+	key = (~key) + (key << 18);
+	key = key ^ (key >> 31);
+	key = key * 21;
+	key = key ^ (key >> 11);
+	key = key + (key << 6);
+	key = key ^ (key >> 22);
+
+	return (unsigned int)key;
+}
+
+static int rd_hash_equal(void *k1, void *k2)
+{
+	uint64_t key1, key2;
+
+	key1 = *(uint64_t*)k1;
+	key2 = *(uint64_t*)k2;
+
+	return key1 == key2;
+}
+
+static int uint64_compare(const void *k1, const void *k2)
+{
+	uint64_t u1 = *(uint64_t*)k1;
+	uint64_t u2 = *(uint64_t*)k2;
+
+	/* u1 - u2 is unsigned */
+	return u1 < u2 ? -1 : u1 > u2 ? 1 : 0;
+}
+
+/*
+ * set psectors to an array of the sector numbers in the hash, returning
+ * the number of entries (or -1 on error)
+ */
+static int ramdisk_get_sectors(struct hashtable *h, uint64_t **psectors,
+			       const char *log_prefix)
+{
+	struct hashtable_itr* itr;
+	uint64_t* sectors;
+	int count;
+
+	if (!(count = hashtable_count(h)))
+		return 0;
+
+	if (!(*psectors = malloc(count * sizeof(uint64_t)))) {
+		DPRINTF("ramdisk_get_sectors: error allocating sector map\n");
+		return -1;
+	}
+	sectors = *psectors;
+
+	itr = hashtable_iterator(h);
+	count = 0;
+	do {
+		sectors[count++] = *(uint64_t*)hashtable_iterator_key(itr);
+	} while (hashtable_iterator_advance(itr));
+	free(itr);
+
+	return count;
+}
+
+static int ramdisk_write_hash(struct hashtable *h, uint64_t sector, char *buf,
+			      size_t len, const char *log_prefix)
+{
+	char *v;
+	uint64_t *key;
+
+	if ((v = hashtable_search(h, &sector))) {
+		memcpy(v, buf, len);
+		return 0;
+	}
+
+	if (!(v = malloc(len))) {
+		DPRINTF("ramdisk_write_hash: malloc failed\n");
+		return -1;
+	}
+	memcpy(v, buf, len);
+	if (!(key = malloc(sizeof(*key)))) {
+		DPRINTF("ramdisk_write_hash: error allocating key\n");
+		free(v);
+		return -1;
+	}
+	*key = sector;
+	if (!hashtable_insert(h, key, v)) {
+		DPRINTF("ramdisk_write_hash failed on sector %" PRIu64 "\n", sector);
+		free(key);
+		free(v);
+		return -1;
+	}
+
+	return 0;
+}
+
+/*
+ * return -1 for OOM
+ * return -2 for merge lookup failure (should not happen)
+ * return -3 for WAW race
+ * return 0 on success.
+ */
+static int merge_requests(struct ramdisk *ramdisk, uint64_t start,
+			  size_t count, char **mergedbuf)
+{
+	char* buf;
+	char* sector;
+	int i;
+	uint64_t *key;
+	int rc = 0;
+	const char *log_prefix = ramdisk->log_prefix;
+
+	if (!(buf = valloc(count * ramdisk->sector_size))) {
+		DPRINTF("merge_request: allocation failed\n");
+		return -1;
+	}
+
+	for (i = 0; i < count; i++) {
+		if (!(sector = hashtable_search(ramdisk->prev, &start))) {
+			EPRINTF("merge_request: lookup failed on %"PRIu64"\n",
+				start);
+			free(buf);
+			rc = -2;
+			goto fail;
+		}
+
+		/* Check inprogress requests to avoid waw non-determinism */
+		if (hashtable_search(ramdisk->inprogress, &start)) {
+			DPRINTF("merge_request: WAW race on %"PRIu64"\n",
+				start);
+			free(buf);
+			rc = -3;
+			goto fail;
+		}
+
+		/*
+		 * Insert req into inprogress (brief period of duplication of
+		 * hash entries until they are removed from prev. Read tracking
+		 * would not be reading wrong entries)
+		 */
+		if (!(key = malloc(sizeof(*key)))) {
+			EPRINTF("%s: error allocating key\n", __FUNCTION__);
+			free(buf);
+			rc = -1;
+			goto fail;
+		}
+		*key = start;
+		if (!hashtable_insert(ramdisk->inprogress, key, NULL)) {
+			EPRINTF("%s failed to insert sector %" PRIu64 " into inprogress hash\n",
+				__FUNCTION__, start);
+			free(key);
+			free(buf);
+			rc = -1;
+			goto fail;
+		}
+
+		memcpy(buf + i * ramdisk->sector_size, sector, ramdisk->sector_size);
+		start++;
+	}
+
+	*mergedbuf = buf;
+	return 0;
+fail:
+	for (start--; i > 0; i--, start--)
+		hashtable_remove(ramdisk->inprogress, &start);
+	return rc;
+}
+
+int ramdisk_flush(ramdisk_t *ramdisk)
+{
+	uint64_t *sectors;
+	char *buf = NULL;
+	uint64_t base, batchlen;
+	int i, j, count = 0;
+	const char *log_prefix = ramdisk->log_prefix;
+
+	/* everything is in flight */
+	if (!ramdisk->prev)
+		return 0;
+
+	count = ramdisk_get_sectors(ramdisk->prev, &sectors, log_prefix);
+	if (count <= 0)
+		/* should not happen */
+		return count;
+
+	/* Create the inprogress table if empty */
+	if (!ramdisk->inprogress)
+		ramdisk->inprogress = ramdisk_new_hashtable();
+
+	/* sort and merge sectors to improve disk performance */
+	qsort(sectors, count, sizeof(*sectors), uint64_compare);
+
+	for (i = 0; i < count;) {
+		base = sectors[i++];
+		while (i < count && sectors[i] == sectors[i-1] + 1)
+			i++;
+		batchlen = sectors[i-1] - base + 1;
+
+		j = merge_requests(ramdisk, base, batchlen, &buf);
+		if (j) {
+			EPRINTF("ramdisk_flush: merge_requests failed:%s\n",
+				j == -1 ? "OOM" :
+					(j == -2 ? "missing sector" :
+						 "WAW race"));
+			if (j == -3)
+				continue;
+			free(sectors);
+			return -1;
+		}
+
+		/*
+		 * NOTE: create_write_request() creates a treq AND forwards
+		 * it down the driver chain
+		 *
+		 * TODO: handle create_write_request()'s error.
+		 */
+		create_write_request(ramdisk, base, batchlen, buf);
+
+		ramdisk->inflight++;
+
+		for (j = 0; j < batchlen; j++) {
+			buf = hashtable_search(ramdisk->prev, &base);
+			free(buf);
+			hashtable_remove(ramdisk->prev, &base);
+			base++;
+		}
+	}
+
+	if (!hashtable_count(ramdisk->prev)) {
+		/* everything is in flight */
+		hashtable_destroy(ramdisk->prev, 0);
+		ramdisk->prev = NULL;
+	}
+
+	free(sectors);
+	return 0;
+}
+
+int ramdisk_start_flush(ramdisk_t *ramdisk, struct hashtable **new)
+{
+	uint64_t *key;
+	char *buf;
+	int rc = 0;
+	int i, j, count, batchlen;
+	uint64_t *sectors;
+	const char *log_prefix = ramdisk->log_prefix;
+
+	if (!hashtable_count(*new))
+		return 0;
+
+	if (ramdisk->prev) {
+		/*
+		 * a flush request issued while a previous flush is still in
+		 * progress will merge with the previous request. If you want
+		 * the previous request to be consistent, wait for it to
+		 * complete.
+		 */
+		count = ramdisk_get_sectors(*new, &sectors, log_prefix);
+		if (count < 0)
+			return count;
+
+		for (i = 0; i < count; i++) {
+			buf = hashtable_search(*new, sectors + i);
+			ramdisk_write_hash(ramdisk->prev, sectors[i], buf,
+					   ramdisk->sector_size, log_prefix);
+		}
+		free(sectors);
+
+		hashtable_destroy(*new, 1);
+	} else
+		ramdisk->prev = *new;
+
+	/*
+	 * We create a new hashtable so that new writes can be performed before
+	 * the old hashtable is completely drained.
+	 */
+	*new = ramdisk_new_hashtable();
+
+	return ramdisk_flush(ramdisk);
+}
+
+void ramdisk_init(ramdisk_t *ramdisk)
+{
+	ramdisk->inflight = 0;
+	ramdisk->prev = NULL;
+	ramdisk->inprogress = NULL;
+}
+
+void ramdisk_destroy(ramdisk_t *ramdisk)
+{
+	const char *log_prefix = ramdisk->log_prefix;
+
+	/*
+	 * ramdisk_destroy() is called only when the tapdisk image is closed.
+	 * In this case, there are no pending requests in vbd.
+	 *
+	 * If ramdisk->inflight is not 0, it means that the requests created by
+	 * us are still in vbd->pending_requests.
+	 */
+	if (ramdisk->inflight) {
+		/* should not happen */
+		EPRINTF("cannot destroy ramdisk\n");
+		return;
+	}
+
+	if (ramdisk->inprogress) {
+		hashtable_destroy(ramdisk->inprogress, 0);
+		ramdisk->inprogress = NULL;
+	}
+
+	if (ramdisk->prev) {
+		hashtable_destroy(ramdisk->prev, 1);
+		ramdisk->prev = NULL;
+	}
+}
+
+int ramdisk_writes_inflight(ramdisk_t *ramdisk)
+{
+	if (!ramdisk->inflight && !ramdisk->prev)
+		return 0;
+
+	return 1;
+}
+
+int ramdisk_read(struct ramdisk *ramdisk, uint64_t sector,
+		 int nb_sectors, char *buf)
+{
+	int i;
+	char *v;
+	uint64_t key;
+
+	for (i = 0; i < nb_sectors; i++) {
+		key = sector + i;
+		/* check whether it is queued in a previous flush request */
+		if (!(ramdisk->prev &&
+		    (v = hashtable_search(ramdisk->prev, &key)))) {
+			/* check whether it is an ongoing flush */
+			if (!(ramdisk->inprogress &&
+			    (v = hashtable_search(ramdisk->inprogress, &key))))
+				return -1;
+		}
+		memcpy(buf + i * ramdisk->sector_size, v, ramdisk->sector_size);
+	}
+
+	return 0;
+}
+
+struct hashtable *ramdisk_new_hashtable(void)
+{
+	return create_hashtable(RAMDISK_HASHSIZE, uint64_hash, rd_hash_equal);
+}
+
+int ramdisk_write_to_hashtable(struct hashtable *h, uint64_t sector,
+			       int nb_sectors, size_t sector_size, char* buf,
+			       const char *log_prefix)
+{
+	int i, rc;
+
+	for (i = 0; i < nb_sectors; i++) {
+		rc = ramdisk_write_hash(h, sector + i,
+					buf + i * sector_size,
+					sector_size, log_prefix);
+		if (rc)
+			return rc;
+	}
+
+	return 0;
+}
+
+void ramdisk_destroy_hashtable(struct hashtable *h)
+{
+	if (!h)
+		return;
+
+	hashtable_destroy(h, 1);
+}
diff --git a/tools/blktap2/drivers/block-replication.h b/tools/blktap2/drivers/block-replication.h
index 0bd6e71..fdc216e 100644
--- a/tools/blktap2/drivers/block-replication.h
+++ b/tools/blktap2/drivers/block-replication.h
@@ -110,4 +110,52 @@ int td_replication_server_restart(td_replication_connect_t *t);
  */
 int td_replication_client_start(td_replication_connect_t *t);
 
+/* I/O replication */
+typedef struct ramdisk ramdisk_t;
+struct ramdisk {
+	size_t sector_size;
+	const char *log_prefix;
+	td_image_t *image;
+
+	/* private */
+	/* count of outstanding requests to the base driver */
+	size_t inflight;
+	/* prev holds the requests to be flushed, while inprogress holds
+	 * requests being flushed. When requests complete, they are removed
+	 * from inprogress.
+	 * Whenever a new flush is merged with an ongoing flush (i.e., prev),
+	 * we have to make sure that none of the new requests overlap with
+	 * ones in "inprogress". If one does, keep it in prev and don't issue
+	 * the I/O until the current one finishes. If we allowed it to proceed,
+	 * we might end up with two overlapping requests in the disk's queue, and
+	 * the disk may not guarantee which one is written first.
+	 * IOW, make sure we don't create a write-after-write ordering hazard.
+	 */
+	struct hashtable *prev;
+	struct hashtable *inprogress;
+};
+
+void ramdisk_init(ramdisk_t *ramdisk);
+void ramdisk_destroy(ramdisk_t *ramdisk);
+
+/* flush pending contents to disk */
+int ramdisk_flush(ramdisk_t *ramdisk);
+/* flush new contents to disk */
+int ramdisk_start_flush(ramdisk_t *ramdisk, struct hashtable **new);
+int ramdisk_writes_inflight(ramdisk_t *ramdisk);
+
+/*
+ * try to read from ramdisk. Return -1 if some sectors are not in
+ * ramdisk. Otherwise, return 0.
+ */
+int ramdisk_read(struct ramdisk *ramdisk, uint64_t sector,
+		 int nb_sectors, char *buf);
+
+/* create a new hashtable that can be used by ramdisk */
+struct hashtable *ramdisk_new_hashtable(void);
+int ramdisk_write_to_hashtable(struct hashtable *h, uint64_t sector,
+			       int nb_sectors, size_t sector_size, char* buf,
+			       const char *log_prefix);
+void ramdisk_destroy_hashtable(struct hashtable *h);
+
 #endif
-- 
1.9.3
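
Note on the read path above: ramdisk_read() looks each sector up in
"prev" first (the newest data waiting to be flushed) and only then in
"inprogress" (data already being written out), and fails the whole read
if any sector is in neither table. A minimal stand-alone sketch of that
lookup order follows. It is not the blktap2 code: the demo_* names, the
4-byte sector size and the fixed-size table that stands in for the real
hashtable are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DEMO_SECTOR_SIZE 4	/* hypothetical, real sectors are larger */
#define DEMO_SLOTS 8		/* hypothetical table capacity */

/* Tiny stand-in for the sector -> buffer hashtable. */
struct demo_table {
	int used[DEMO_SLOTS];
	uint64_t key[DEMO_SLOTS];
	char data[DEMO_SLOTS][DEMO_SECTOR_SIZE];
};

/* Insert or overwrite one sector. Returns -1 if the table is full. */
static int demo_put(struct demo_table *t, uint64_t sector, const char *buf)
{
	int i, slot = -1;

	for (i = 0; i < DEMO_SLOTS; i++) {
		if (t->used[i] && t->key[i] == sector) {
			slot = i;
			break;
		}
		if (!t->used[i] && slot < 0)
			slot = i;
	}
	if (slot < 0)
		return -1;

	t->used[slot] = 1;
	t->key[slot] = sector;
	memcpy(t->data[slot], buf, DEMO_SECTOR_SIZE);
	return 0;
}

static const char *demo_get(const struct demo_table *t, uint64_t sector)
{
	int i;

	for (i = 0; i < DEMO_SLOTS; i++)
		if (t->used[i] && t->key[i] == sector)
			return t->data[i];
	return NULL;
}

/*
 * Same lookup order as ramdisk_read(): prev wins over inprogress, and a
 * sector found in neither table fails the read with -1. Either table
 * pointer may be NULL.
 */
static int demo_read(const struct demo_table *prev,
		     const struct demo_table *inprogress,
		     uint64_t sector, int nb_sectors, char *buf)
{
	int i;

	assert(nb_sectors >= 0);
	for (i = 0; i < nb_sectors; i++) {
		const char *v = NULL;

		if (prev)
			v = demo_get(prev, sector + i);
		if (!v && inprogress)
			v = demo_get(inprogress, sector + i);
		if (!v)
			return -1;
		memcpy(buf + i * DEMO_SECTOR_SIZE, v, DEMO_SECTOR_SIZE);
	}
	return 0;
}
```

As in the original, a failed read may leave the caller's buffer partly
filled, so the caller has to treat -1 as "go to the disk for the whole
request".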


* [RFC Patch v2 39/45] block-colo: implement colo disk replication
  2014-08-08  7:00 [RFC Patch v2 00/45] COarse-grain LOck-stepping Virtual Machines for Non-stop Service Wen Congyang
                   ` (37 preceding siblings ...)
  2014-08-08  7:01 ` [RFC Patch v2 38/45] blktap2: move ramdisk " Wen Congyang
@ 2014-08-08  7:01 ` Wen Congyang
  2014-08-08  7:01 ` [RFC Patch v2 40/45] pass correct file to qemu if we use blktap2 Wen Congyang
                   ` (7 subsequent siblings)
  46 siblings, 0 replies; 64+ messages in thread
From: Wen Congyang @ 2014-08-08  7:01 UTC (permalink / raw)
  To: xen devel
  Cc: Ian Campbell, Wen Congyang, Ian Jackson, Jiang Yunhong,
	Dong Eddie, Yang Hongyang, Lai Jiangshan

 TODO:
 update block-remus to use async I/O instead of mread/mwrite.

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
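
Note on the replication channel: as far as the code below shows, the
primary forwards each write as a 4-byte tag ("wreq"), then a 12-byte
header holding a uint32_t sector count followed by a uint64_t start
sector (both in host byte order, since the header is filled through
plain pointer casts), then the sector data. The backup refuses any
request whose payload exceeds its receive buffer, which is one page. The
sketch below illustrates that header layout and size check. The colo_*
helper names are hypothetical and do not exist in the patch, and it uses
memcpy() rather than the patch's pointer casts so that the 64-bit field
at offset 4 is never accessed unaligned.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Header that follows the "wreq" tag: sector count, then start sector. */
#define COLO_HDR_LEN (sizeof(uint32_t) + sizeof(uint64_t))

/* Hypothetical helper: fill the 12-byte header in host byte order. */
static void colo_pack_header(char *hdr, uint32_t secs, uint64_t sec)
{
	memcpy(hdr, &secs, sizeof(secs));
	memcpy(hdr + sizeof(secs), &sec, sizeof(sec));
}

/* Hypothetical helper: read the two fields back out of the header. */
static void colo_unpack_header(const char *hdr, uint32_t *secs,
			       uint64_t *sec)
{
	memcpy(secs, hdr, sizeof(*secs));
	memcpy(sec, hdr + sizeof(*secs), sizeof(*sec));
}

/*
 * Payload length in bytes, or -1 if it does not fit in a receive buffer
 * of bsize bytes. The multiplication is done in 64 bits so that a large
 * sector count cannot wrap around and slip past the check.
 */
static int64_t colo_payload_len(uint32_t secs, uint32_t sector_size,
				uint32_t bsize)
{
	uint64_t len;

	assert(sector_size > 0);
	len = (uint64_t)secs * sector_size;
	return len > bsize ? -1 : (int64_t)len;
}
```

With 512-byte sectors and a 4096-byte page, the largest write the backup
would accept under this check is 8 sectors.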
 tools/blktap2/drivers/Makefile            |    3 +
 tools/blktap2/drivers/block-colo.c        | 1151 +++++++++++++++++++++++++++++
 tools/blktap2/drivers/block-replication.c |  196 +++++
 tools/blktap2/drivers/block-replication.h |   56 ++
 tools/blktap2/drivers/tapdisk-disktype.c  |    9 +
 tools/blktap2/drivers/tapdisk-disktype.h  |    3 +-
 6 files changed, 1417 insertions(+), 1 deletion(-)
 create mode 100644 tools/blktap2/drivers/block-colo.c
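
Note on the request queue: block-colo.c keeps pending I/O and the commit
marker in one FIFO ring, so every write queued before a checkpoint
reaches the backup before "creq" does. The ring deliberately wastes one
slot so that "empty" (cons == prod) and "full" (next(prod) == cons) can
be told apart without a separate counter. A minimal stand-alone sketch
of that scheme follows, with hypothetical demo_* names and a 4-slot ring
of ints instead of queued_io_t entries. Unlike the patch, which calls
exit(1) when the ring overflows, this sketch reports a full ring to the
caller.

```c
#include <assert.h>

#define DEMO_RING_SLOTS 4	/* hypothetical, holds at most 3 items */

struct demo_ring {
	int item[DEMO_RING_SLOTS];
	unsigned int prod;	/* next slot to fill */
	unsigned int cons;	/* next slot to drain */
};

static unsigned int demo_ring_next(unsigned int pos)
{
	assert(pos < DEMO_RING_SLOTS);
	return (pos + 1) % DEMO_RING_SLOTS;
}

static int demo_ring_isempty(const struct demo_ring *r)
{
	return r->cons == r->prod;
}

/* Full when advancing prod would make it collide with cons. */
static int demo_ring_isfull(const struct demo_ring *r)
{
	return demo_ring_next(r->prod) == r->cons;
}

/* Queue one item. Returns -1 instead of overwriting when full. */
static int demo_ring_put(struct demo_ring *r, int v)
{
	if (demo_ring_isfull(r))
		return -1;

	r->item[r->prod] = v;
	r->prod = demo_ring_next(r->prod);
	return 0;
}

/* Dequeue the oldest item. Returns -1 when the ring is empty. */
static int demo_ring_get(struct demo_ring *r, int *v)
{
	if (demo_ring_isempty(r))
		return -1;

	*v = r->item[r->cons];
	r->cons = demo_ring_next(r->cons);
	return 0;
}
```

This is why the patch sizes the array as MAX_COLO_REQUEST + 1: the extra
slot is the one that always stays unused.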

diff --git a/tools/blktap2/drivers/Makefile b/tools/blktap2/drivers/Makefile
index c7a2ca4..0f91ccf 100644
--- a/tools/blktap2/drivers/Makefile
+++ b/tools/blktap2/drivers/Makefile
@@ -28,6 +28,8 @@ REMUS-OBJS  += hashtable.o
 REMUS-OBJS  += hashtable_itr.o
 REMUS-OBJS  += hashtable_utility.o
 
+COLO-OBJS += block-colo.o
+
 tapdisk2 tapdisk-stream tapdisk-diff $(QCOW_UTIL): AIOLIBS := -laio
 
 MEMSHRLIBS :=
@@ -74,6 +76,7 @@ BLK-OBJS-y  += aes.o
 BLK-OBJS-y  += md5.o
 BLK-OBJS-y  += $(PORTABLE-OBJS-y)
 BLK-OBJS-y  += $(REMUS-OBJS)
+BLK-OBJS-y  += $(COLO-OBJS)
 
 all: $(IBIN) lock-util qcow-util
 
diff --git a/tools/blktap2/drivers/block-colo.c b/tools/blktap2/drivers/block-colo.c
new file mode 100644
index 0000000..565a386
--- /dev/null
+++ b/tools/blktap2/drivers/block-colo.c
@@ -0,0 +1,1151 @@
+/*
+ * Copyright (C) 2014 FUJITSU LIMITED
+ * Author: Wen Congyang <wency@cn.fujitsu.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU Lesser General Public License as published
+ * by the Free Software Foundation; version 2.1 only. with the special
+ * exception on linking described in file LICENSE.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU Lesser General Public License for more details.
+ */
+
+#include "tapdisk.h"
+#include "tapdisk-server.h"
+#include "tapdisk-driver.h"
+#include "tapdisk-interface.h"
+#include "block-replication.h"
+
+#include <errno.h>
+#include <stdlib.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <sys/un.h>
+#include <unistd.h>
+
+/* connect retry timeout (seconds) */
+#define COLO_CONNRETRY_TIMEOUT  1
+
+/* timeout for reads and writes, in seconds */
+#define HEARTBEAT_S 1
+
+/* TAPDISK_DATA_REQUESTS I/O requests + commit flag */
+#define MAX_COLO_REQUEST        (TAPDISK_DATA_REQUESTS + 1)
+
+#undef DPRINTF
+#undef EPRINTF
+#define DPRINTF(_f, _a...) syslog (LOG_DEBUG, "COLO: " _f, ## _a)
+#define EPRINTF(_f, _a...) syslog (LOG_ERR, "COLO: " _f, ## _a)
+
+#define TDCOLO_WRITE "wreq"
+#define TDCOLO_COMMIT "creq"
+#define TDCOLO_DONE "done"
+#define TDCOLO_FAIL "fail"
+
+enum tdcolo_mode {
+	mode_invalid = 0,
+	mode_unprotected,
+	mode_primary,
+	mode_backup,
+
+	/*
+	 * If we find some internal error in backup mode, we cannot
+	 * switch to unprotected mode.
+	 */
+	mode_failed,
+};
+
+enum {
+	colo_io,
+	colo_commit,
+};
+
+typedef struct queued_io {
+	int type;
+	union {
+		td_request_t treq;
+		char *buff; /* TDCOLO_COMMIT */
+	};
+} queued_io_t;
+
+struct queued_io_ring {
+	/* waste one slot to distinguish between empty and full */
+	queued_io_t qio[MAX_COLO_REQUEST + 1];
+	unsigned int prod;
+	unsigned int cons;
+};
+
+typedef struct colo_control {
+	/*
+	 * socket file, the user writes "flush" to this socket, and then
+	 * we write the result to it.
+	 */
+	char *path;
+	int listen_fd;
+	event_id_t listen_id;
+
+	int io_fd;
+	event_id_t io_id;
+} colo_control_t;
+
+struct tdcolo_state {
+	colo_control_t ctl;
+
+	/* async connection */
+	td_replication_connect_t t;
+	/* replication channel */
+	td_async_io_t rio, wio;
+
+	/*
+	 * queue I/O requests, and they will be forwarded to backup
+	 * asynchronously.
+	 */
+	struct queued_io_ring qio_ring;
+
+	/* ramdisk data */
+	struct ramdisk ramdisk;
+	/*
+	 * The primary write request is queued in this
+	 * hashtable, and will be flushed to ramdisk when
+	 * the checkpoint finishes.
+	 */
+	struct hashtable *h;
+	/*
+	 * The secondary vm write request is queued in this
+	 * hashtable, and will be dropped when the checkpoint
+	 * finishes or flushed to ramdisk after failover.
+	 */
+	struct hashtable *local;
+
+	/* mode methods */
+	enum tdcolo_mode mode;
+	/* It will be called when switching mode */
+	int (*queue_flush)(struct tdcolo_state *c);
+
+	char request[5];
+	char header[sizeof(uint32_t) + sizeof(uint64_t)];
+	int commit;
+	void *buff;
+	int bsize;
+	int sector_size;
+};
+
+struct tap_disk tapdisk_colo;
+
+static void colo_control_respond(colo_control_t *ctl, const char *response);
+static int switch_mode(struct tdcolo_state *c, enum tdcolo_mode mode);
+
+/* ======== common functions ======== */
+static int check_read_result(td_async_io_t *rio, int realsize,
+			     const char *target)
+{
+	if (realsize < 0) {
+		/* internal error */
+		EPRINTF("error reading from %s\n", target);
+		return ERROR_INTERNAL;
+	} else if (realsize < rio->size) {
+		/* timeout or I/O error */
+		EPRINTF("error reading from %s\n", target);
+		return ERROR_IO;
+	}
+
+	return 0;
+}
+
+static int check_write_result(td_async_io_t *wio, int realsize,
+			      const char * target)
+{
+	if (realsize < 0) {
+		/* internal error */
+		EPRINTF("error writing to %s\n", target);
+		return ERROR_INTERNAL;
+	} else if (realsize == 0) {
+		/* timeout or I/O error */
+		EPRINTF("error writing to %s\n", target);
+		return ERROR_IO;
+	}
+
+	return 0;
+}
+
+/* ======= ring functions ======== */
+static inline unsigned int ring_next(unsigned int pos)
+{
+	if (++pos > MAX_COLO_REQUEST)
+		return 0;
+
+	return pos;
+}
+
+static inline int ring_isempty(struct queued_io_ring* ring)
+{
+	return ring->cons == ring->prod;
+}
+
+static inline int ring_isfull(struct queued_io_ring* ring)
+{
+	return ring_next(ring->prod) == ring->cons;
+}
+
+static void ring_add_request(struct queued_io_ring *ring,
+			     const td_request_t *treq)
+{
+	/* If ring is full, it means that tapdisk2 has some bug */
+	if (ring_isfull(ring)) {
+		EPRINTF("OOPS, ring is full\n");
+		exit(1);
+	}
+
+	ring->qio[ring->prod].type = colo_io;
+	ring->qio[ring->prod].treq = *treq;
+	ring->prod = ring_next(ring->prod);
+}
+
+static void ring_add_commit_flag(struct queued_io_ring *ring)
+{
+	/* If ring is full, it means that tapdisk2 has some bug */
+	if (ring_isfull(ring)) {
+		EPRINTF("OOPS, ring is full\n");
+		exit(1);
+	}
+
+	ring->qio[ring->prod].type = colo_commit;
+	ring->qio[ring->prod].buff = TDCOLO_COMMIT;
+	ring->prod = ring_next(ring->prod);
+}
+
+/* return the first queued I/O request */
+static queued_io_t *ring_peek(struct queued_io_ring *ring)
+{
+	queued_io_t *qio;
+
+	if (ring_isempty(ring))
+		return NULL;
+
+	qio = &ring->qio[ring->cons];
+	return qio;
+}
+
+/* consume the first queued I/O request, and return it */
+static queued_io_t *ring_get(struct queued_io_ring *ring)
+{
+	queued_io_t *qio;
+
+	if (ring_isempty(ring))
+		return NULL;
+
+	qio = &ring->qio[ring->cons];
+	ring->cons = ring_next(ring->cons);
+	return qio;
+}
+
+/* ======== primary read/write functions ======== */
+static void primary_write_header(td_async_io_t *wio, int realsize, int errnoval);
+static void primary_write_data(td_async_io_t *wio, int realsize, int errnoval);
+static void primary_forward_done(td_async_io_t *wio, int realsize, int errnoval);
+static void primary_read_done(td_async_io_t *rio, int realsize, int errnoval);
+
+/*
+ * Called when we cannot connect to the backup, or when an I/O error occurs
+ * while reading or writing.
+ */
+static void primary_failed(struct tdcolo_state *c, int rc)
+{
+	td_replication_connect_kill(&c->t);
+	td_async_io_kill(&c->rio);
+	td_async_io_kill(&c->wio);
+	if (rc == ERROR_INTERNAL)
+		EPRINTF("switch to unprotected mode due to internal error");
+	if (rc == ERROR_CLOSE)
+		DPRINTF("switch to unprotected mode before closing");
+	switch_mode(c, mode_unprotected);
+}
+
+static void primary_waio(struct tdcolo_state *c, void *buff, size_t size,
+			 taio_callback *callback)
+{
+	td_async_io_t *wio = &c->wio;
+
+	wio->fd = c->t.fd;
+	wio->timeout_s = HEARTBEAT_S;
+	wio->mode = td_async_write;
+	wio->buff = buff;
+	wio->size = size;
+	wio->callback = callback;
+
+	if (td_async_io_start(wio))
+		primary_failed(c, ERROR_INTERNAL);
+}
+
+static void primary_raio(struct tdcolo_state *c)
+{
+	td_async_io_t *rio = &c->rio;
+
+	if (c->t.fd < 0)
+		return;
+
+	rio->fd = c->t.fd;
+	rio->timeout_s = 0;
+	rio->mode = td_async_read;
+	rio->buff = c->request;
+	rio->size = sizeof(c->request) - 1;
+	rio->callback = primary_read_done;
+
+	if (td_async_io_start(rio))
+		primary_failed(c, ERROR_INTERNAL);
+}
+
+static void primary_handle_queued_io(struct tdcolo_state *c)
+{
+	struct queued_io_ring *qring = &c->qio_ring;
+	unsigned int cons;
+	queued_io_t *qio;
+	int rc;
+
+	while (!ring_isempty(qring)) {
+		qio = ring_peek(qring);
+		if (qio->type == colo_commit) {
+			primary_waio(c, qio->buff, strlen(qio->buff),
+				     primary_forward_done);
+			return;
+		}
+
+		if (qio->treq.op == TD_OP_WRITE) {
+			primary_waio(c, TDCOLO_WRITE, strlen(TDCOLO_WRITE),
+				     primary_write_header);
+			return;
+		}
+
+		td_forward_request(qio->treq);
+		ring_get(qring);
+	}
+}
+
+/* wait for "done" message to commit checkpoint */
+static void primary_read_done(td_async_io_t *rio, int realsize, int errnoval)
+{
+	struct tdcolo_state *c = CONTAINER_OF(rio, *c, rio);
+	char *req = c->request;
+	int rc;
+
+	rc = check_read_result(rio, realsize, "backup");
+	if (rc)
+		goto err;
+
+	rc = ERROR_INTERNAL;
+	req[4] = '\0';
+
+	if (c->commit != 1) {
+		EPRINTF("received unexpected message: %s\n", req);
+		goto err;
+	}
+
+	c->commit--;
+
+	if (strcmp(req, TDCOLO_DONE)) {
+		EPRINTF("received unknown message: %s\n", req);
+		goto err;
+	}
+
+	/* checkpoint committed, inform the control socket */
+	colo_control_respond(&c->ctl, TDCOLO_DONE);
+	primary_raio(c);
+
+	return;
+err:
+	colo_control_respond(&c->ctl, TDCOLO_FAIL);
+	primary_failed(c, rc);
+}
+
+static void primary_write_header(td_async_io_t *wio, int realsize, int errnoval)
+{
+	struct tdcolo_state *c = CONTAINER_OF(wio, *c, wio);
+	queued_io_t *qio = ring_peek(&c->qio_ring);
+	uint32_t *sectors = (uint32_t *)c->header;
+	uint64_t *sector = (uint64_t *)(c->header + sizeof(uint32_t));
+	int rc;
+
+	rc = check_write_result(wio, realsize, "backup");
+	if (rc) {
+		primary_failed(c, rc);
+		return;
+	}
+
+	*sectors = qio->treq.secs;
+	*sector = qio->treq.sec;
+
+	primary_waio(c, c->header, sizeof(c->header), primary_write_data);
+}
+
+static void primary_write_data(td_async_io_t *wio, int realsize, int errnoval)
+{
+	struct tdcolo_state *c = CONTAINER_OF(wio, *c, wio);
+	queued_io_t *qio = ring_peek(&c->qio_ring);
+	int rc;
+
+	rc = check_write_result(wio, realsize, "backup");
+	if (rc) {
+		primary_failed(c, rc);
+		return;
+	}
+
+	primary_waio(c, qio->treq.buf, qio->treq.secs * c->sector_size,
+		     primary_forward_done);
+}
+
+static void primary_forward_done(td_async_io_t *wio, int realsize, int errnoval)
+{
+	struct tdcolo_state *c = CONTAINER_OF(wio, *c, wio);
+	queued_io_t *qio;
+	struct td_request_t *treq;
+	int rc;
+
+	rc = check_write_result(wio, realsize, "backup");
+	if (rc) {
+		primary_failed(c, rc);
+		return;
+	}
+
+	qio = ring_get(&c->qio_ring);
+	if (qio->type == colo_io)
+		td_forward_request(qio->treq);
+	else
+		c->commit--;
+
+	primary_handle_queued_io(c);
+}
+
+static void primary_queue_read(td_driver_t *driver, td_request_t treq)
+{
+	struct tdcolo_state *c = driver->data;
+	struct queued_io_ring *ring = &c->qio_ring;
+
+	if (ring_isempty(ring)) {
+		/* just pass read through */
+		td_forward_request(treq);
+		return;
+	}
+
+	ring_add_request(ring, &treq);
+	if (td_replication_connect_status(&c->t) != 1)
+		return;
+
+	if (!td_async_io_is_running(&c->wio))
+		primary_handle_queued_io(c);
+}
+
+static void primary_queue_write(td_driver_t *driver, td_request_t treq)
+{
+	struct tdcolo_state *c = driver->data;
+	struct queued_io_ring *ring = &c->qio_ring;
+
+	ring_add_request(ring, &treq);
+	if (td_replication_connect_status(&c->t) != 1)
+		return;
+
+	if (!td_async_io_is_running(&c->wio))
+		primary_handle_queued_io(c);
+}
+
+/* Called when the user writes "flush" to the control socket. */
+static int client_flush(struct tdcolo_state *c)
+{
+	if (td_replication_connect_status(&c->t) != 1)
+		return 0;
+
+	if (c->commit > 0) {
+		EPRINTF("the last commit is not finished\n");
+		colo_control_respond(&c->ctl, TDCOLO_FAIL);
+		primary_failed(c, ERROR_INTERNAL);
+		return -1;
+	}
+
+	ring_add_commit_flag(&c->qio_ring);
+	c->commit = 2;
+	if (!td_async_io_is_running(&c->wio))
+		primary_handle_queued_io(c);
+
+	return 0;
+}
+
+/* It is called when switching the mode from primary to unprotected */
+static int primary_flush(struct tdcolo_state *c)
+{
+	struct queued_io_ring *qring = &c->qio_ring;
+	queued_io_t *qio;
+
+	if (ring_isempty(qring))
+		return 0;
+
+	while (!ring_isempty(qring)) {
+		qio = ring_get(qring);
+
+		if (qio->type == colo_commit) {
+			colo_control_respond(&c->ctl, TDCOLO_FAIL);
+			c->commit = 0;
+			continue;
+		}
+
+		td_forward_request(qio->treq);
+	}
+
+	return 0;
+}
+
+static void colo_client_established(td_replication_connect_t *t, int rc)
+{
+	struct tdcolo_state *c = CONTAINER_OF(t, *c, t);
+
+	if (rc) {
+		primary_failed(c, rc);
+		return;
+	}
+
+	/* the connection succeeded; handle the queued requests */
+	primary_handle_queued_io(c);
+
+	primary_raio(c);
+}
+
+static int primary_start(struct tdcolo_state *c)
+{
+	DPRINTF("activating client mode\n");
+
+	tapdisk_colo.td_queue_read = primary_queue_read;
+	tapdisk_colo.td_queue_write = primary_queue_write;
+	c->queue_flush = primary_flush;
+
+	c->t.callback = colo_client_established;
+	return td_replication_client_start(&c->t);
+}
+
+/* ======== backup read/write functions ======== */
+static void backup_read_header_done(td_async_io_t *rio, int realsize,
+				    int errnoval);
+static void backup_read_data_done(td_async_io_t *rio, int realsize,
+				  int errnoval);
+static void backup_write_done(td_async_io_t *wio, int realsize, int errnoval);
+
+static void backup_failed(struct tdcolo_state *c, int rc)
+{
+	td_replication_connect_kill(&c->t);
+	td_async_io_kill(&c->rio);
+	td_async_io_kill(&c->wio);
+
+	if (rc == ERROR_INTERNAL) {
+		EPRINTF("switch to failed mode due to internal error");
+		switch_mode(c, mode_failed);
+		return;
+	}
+
+	if (rc == ERROR_CLOSE)
+		DPRINTF("switch to unprotected mode before closing");
+
+	switch_mode(c, mode_unprotected);
+}
+
+static void backup_raio(struct tdcolo_state *c, void *buff, int size,
+			int timeout_s, taio_callback *callback)
+{
+	td_async_io_t *rio = &c->rio;
+
+	rio->fd = c->t.fd;
+	rio->timeout_s = timeout_s;
+	rio->mode = td_async_read;
+	rio->buff = buff;
+	rio->size = size;
+	rio->callback = callback;
+
+	if (td_async_io_start(rio)) {
+		EPRINTF("cannot start read aio\n");
+		backup_failed(c, ERROR_INTERNAL);
+	}
+}
+
+static void backup_waio(struct tdcolo_state *c)
+{
+	td_async_io_t *wio = &c->wio;
+
+	wio->fd = c->t.fd;
+	wio->timeout_s = HEARTBEAT_S;
+	wio->mode = td_async_write;
+	wio->buff = TDCOLO_DONE;
+	wio->size = strlen(TDCOLO_DONE);
+	wio->callback = backup_write_done;
+
+	if (td_async_io_start(wio)) {
+		EPRINTF("cannot start write aio\n");
+		backup_failed(c, ERROR_INTERNAL);
+	}
+}
+
+static void backup_read_req_done(td_async_io_t *rio, int realsize,
+				 int errnoval)
+{
+	struct tdcolo_state *c = CONTAINER_OF(rio, *c, rio);
+	char *req = c->request;
+	int rc;
+
+	rc = check_read_result(rio, realsize, "primary");
+	if (rc)
+		goto err;
+
+	rc = ERROR_INTERNAL;
+	req[4] = '\0';
+
+	if (!strcmp(req, TDCOLO_WRITE)) {
+		backup_raio(c, c->header, sizeof(c->header), HEARTBEAT_S,
+			    backup_read_header_done);
+		return;
+	} else if (!strcmp(req, TDCOLO_COMMIT)) {
+		ramdisk_destroy_hashtable(c->local);
+		c->local = ramdisk_new_hashtable();
+		if (!c->local) {
+			EPRINTF("error creating local hashtable\n");
+			goto err;
+		}
+		rc = ramdisk_start_flush(&c->ramdisk, &c->h);
+		if (rc) {
+			EPRINTF("error flushing queued I/O\n");
+			goto err;
+		}
+
+		backup_waio(c);
+	} else {
+		EPRINTF("unsupported request: %s\n", req);
+		goto err;
+	}
+
+	return;
+
+err:
+	backup_failed(c, ERROR_INTERNAL);
+	return;
+}
+
+static void backup_read_header_done(td_async_io_t *rio, int realsize,
+				    int errnoval)
+{
+	struct tdcolo_state *c = CONTAINER_OF(rio, *c, rio);
+	uint32_t *sectors = (uint32_t *)c->header;
+	int rc;
+
+	rc = check_read_result(rio, realsize, "primary");
+	if (rc)
+		goto err;
+
+	rc = ERROR_INTERNAL;
+	if (*sectors * c->sector_size > c->bsize) {
+		EPRINTF("write request is too large: %d/%d\n",
+			*sectors * c->sector_size, c->bsize);
+		goto err;
+	}
+
+	backup_raio(c, c->buff, *sectors * c->sector_size, HEARTBEAT_S,
+		    backup_read_data_done);
+
+	return;
+err:
+	backup_failed(c, rc);
+}
+
+static void backup_read_data_done(td_async_io_t *rio, int realsize,
+				  int errnoval)
+{
+	struct tdcolo_state *c = CONTAINER_OF(rio, *c, rio);
+	uint32_t *sectors = (uint32_t *)c->header;
+	uint64_t *sector = (uint64_t *)(c->header + sizeof(uint32_t));
+	int rc;
+
+	rc = check_read_result(rio, realsize, "primary");
+	if (rc)
+		goto err;
+
+	rc = ramdisk_write_to_hashtable(c->h, *sector, *sectors,
+					c->sector_size, c->buff, "COLO");
+	if (rc) {
+		EPRINTF("cannot write primary data to hashtable\n");
+		rc = ERROR_INTERNAL;
+		goto err;
+	}
+
+	backup_raio(c, c->request, sizeof(c->request) - 1, 0,
+		    backup_read_req_done);
+
+	return;
+err:
+	backup_failed(c, rc);
+}
+
+static void backup_write_done(td_async_io_t *wio, int realsize, int errnoval)
+{
+	struct tdcolo_state *c = CONTAINER_OF(wio, *c, wio);
+	int rc;
+
+	rc = check_write_result(wio, realsize, "primary");
+	if (rc) {
+		backup_failed(c, rc);
+		return;
+	}
+
+	backup_raio(c, c->request, sizeof(c->request) - 1, 0,
+		    backup_read_req_done);
+}
+
+static void colo_server_established(td_replication_connect_t *t, int rc)
+{
+	struct tdcolo_state *c = CONTAINER_OF(t, *c, t);
+
+	if (rc) {
+		backup_failed(c, rc);
+		return;
+	}
+
+	backup_raio(c, c->request, sizeof(c->request) - 1, 0,
+		    backup_read_req_done);
+}
+
+/* It is called when switching the mode from backup to unprotected */
+static int backup_flush(struct tdcolo_state *c)
+{
+	int rc;
+
+	rc = ramdisk_start_flush(&c->ramdisk, &c->local);
+	if (rc)
+		EPRINTF("error flushing local queued I/O\n");
+
+	return 0;
+}
+
+static void backup_queue_read(td_driver_t *driver, td_request_t treq)
+{
+	struct tdcolo_state *c = driver->data;
+
+	if (ramdisk_read_from_hashtable(c->local, treq.sec, treq.secs,
+					c->sector_size, treq.buf))
+		/* FIXME */
+		td_forward_request(treq);
+	else
+		/* complete the request */
+		td_complete_request(treq, 0);
+}
+
+static void backup_queue_write(td_driver_t *driver, td_request_t treq)
+{
+	struct tdcolo_state *c = driver->data;
+	int rc;
+
+	rc = ramdisk_write_to_hashtable(c->local, treq.sec, treq.secs,
+					c->sector_size, treq.buf,
+					"COLO");
+	if (rc)
+		td_complete_request(treq, -EBUSY);
+	else
+		td_complete_request(treq, 0);
+}
+
+static int backup_start(struct tdcolo_state *c)
+{
+	tapdisk_colo.td_queue_read = backup_queue_read;
+	tapdisk_colo.td_queue_write = backup_queue_write;
+	c->queue_flush = backup_flush;
+
+	c->h = ramdisk_new_hashtable();
+	c->local = ramdisk_new_hashtable();
+	if (!c->h || !c->local)
+		return -1;
+
+	c->bsize = sysconf(_SC_PAGESIZE);
+	c->buff = malloc(c->bsize);
+	if (!c->buff)
+		return -1;
+
+	return 0;
+}
+
+/* ======== unprotected read/write functions ======== */
+void unprotected_queue_io(td_driver_t *driver, td_request_t treq)
+{
+	struct tdcolo_state *c = driver->data;
+
+	/* wait for previous ramdisk to flush before servicing I/O */
+	if (ramdisk_writes_inflight(&c->ramdisk)) {
+		ramdisk_flush(&c->ramdisk);
+		td_complete_request(treq, -EBUSY);
+	} else {
+		/* here we just pass I/O through */
+		td_forward_request(treq);
+	}
+}
+
+static int unprotected_start(struct tdcolo_state *c)
+{
+	DPRINTF("failure detected, activating passthrough\n");
+
+	/* install the unprotected read/write handlers */
+	tapdisk_colo.td_queue_read = unprotected_queue_io;
+	tapdisk_colo.td_queue_write = unprotected_queue_io;
+	c->queue_flush = NULL;
+
+	return 0;
+}
+
+/* ======== failed read/write functions ======== */
+static void failed_queue_io(td_driver_t *driver, td_request_t treq)
+{
+	td_complete_request(treq, -EIO);
+}
+
+static int failed_start(struct tdcolo_state *c)
+{
+	tapdisk_colo.td_queue_read = failed_queue_io;
+	tapdisk_colo.td_queue_write = failed_queue_io;
+	c->queue_flush = NULL;
+
+	return 0;
+}
+
+/* ======== control ======== */
+static void colo_control_accept(event_id_t id, char mode, void *private);
+static void colo_control_handle_request(event_id_t id, char mode,
+					void *private);
+static void colo_control_close(colo_control_t *ctl);
+
+static void colo_control_init(colo_control_t *ctl)
+{
+	ctl->listen_fd = -1;
+	ctl->listen_id = -1;
+	ctl->io_fd = -1;
+	ctl->io_id = -1;
+}
+
+static int colo_create_control_socket(colo_control_t *ctl, const char *name)
+{
+	int i, l;
+	struct sockaddr_un saddr;
+	event_id_t id;
+	int rc;
+
+	/* first we must ensure that BLKTAP_CTRL_DIR exists */
+	if (mkdir(BLKTAP_CTRL_DIR, 0755) && errno != EEXIST) {
+		rc = -errno;
+		EPRINTF("error creating directory %s: %d\n",
+			BLKTAP_CTRL_DIR, errno);
+		goto fail;
+	}
+
+	/* use the device name to create the control socket path */
+	if (asprintf(&ctl->path, BLKTAP_CTRL_DIR "/colo_%s", name) < 0) {
+		rc = -errno;
+		goto fail;
+	}
+
+	/* scrub socket pathname */
+	l = strlen(ctl->path);
+	for (i = strlen(BLKTAP_CTRL_DIR) + 1; i < l; i++) {
+		if (strchr(":/", ctl->path[i]))
+			ctl->path[i] = '_';
+	}
+
+	if (unlink(ctl->path) && errno != ENOENT) {
+		rc = -errno;
+		EPRINTF("failed to unlink %s: %d\n", ctl->path, errno);
+		goto fail;
+	}
+
+	ctl->listen_fd = socket(AF_UNIX, SOCK_STREAM, 0);
+	if (ctl->listen_fd == -1) {
+		rc = -errno;
+		EPRINTF("failed to create control socket: %d\n", errno);
+		goto fail;
+	}
+
+	memset(&saddr, 0, sizeof(saddr));
+	strncpy(saddr.sun_path, ctl->path, sizeof(saddr.sun_path) - 1);
+	saddr.sun_family = AF_UNIX;
+
+	rc = bind(ctl->listen_fd, (const struct sockaddr *)&saddr,
+		  sizeof(saddr));
+	if (rc == -1) {
+		rc = -errno;
+		EPRINTF("failed to bind to %s: %d\n", saddr.sun_path, errno);
+		goto fail;
+	}
+
+	rc = listen(ctl->listen_fd, 10);
+	if (rc == -1) {
+		rc = -errno;
+		EPRINTF("failed to listen: %d\n", errno);
+		goto fail;
+	}
+
+	id = tapdisk_server_register_event(SCHEDULER_POLL_READ_FD,
+					   ctl->listen_fd, 0,
+					   colo_control_accept, ctl);
+	if (id < 0) {
+		EPRINTF("failed to add watch: %d\n", id);
+		rc = id;
+		goto fail;
+	}
+
+	ctl->listen_id = id;
+	return 0;
+
+fail:
+	colo_control_close(ctl);
+	return rc;
+}
+
+static void colo_control_accept(event_id_t id, char mode, void *private)
+{
+	colo_control_t *ctl = private;
+	int fd;
+
+	fd = accept(ctl->listen_fd, NULL, NULL);
+	if (fd == -1) {
+		EPRINTF("failed to accept new control connection: %d\n", errno);
+		return;
+	}
+
+	if (ctl->io_fd >= 0) {
+		EPRINTF("cannot accept two control connections\n");
+		close(fd);
+		return;
+	}
+
+	id = tapdisk_server_register_event(SCHEDULER_POLL_READ_FD,
+					   fd, 0,
+					   colo_control_handle_request,
+					   ctl);
+	if (id < 0) {
+		close(fd);
+		EPRINTF("failed to register new control event: %d\n", id);
+		return;
+	}
+
+	ctl->io_fd = fd;
+	ctl->io_id = id;
+}
+
+static void colo_control_handle_request(event_id_t id, char mode, void *private)
+{
+	colo_control_t *ctl = private;
+	struct tdcolo_state *c = CONTAINER_OF(ctl, *c, ctl);
+	char req[6];
+	int rc;
+
+	rc = read(ctl->io_fd, req, sizeof(req) - 1);
+	if (!rc) {
+		EPRINTF("0-byte read received, close control socket\n");
+		goto err;
+	}
+
+	if (rc < 0) {
+		EPRINTF("error reading from control socket: %d\n", errno);
+		goto err;
+	}
+
+	req[rc] = '\0';
+	if (strncmp(req, "flush", 5)) {
+		EPRINTF("unknown command: %s\n", req);
+		colo_control_respond(ctl, TDCOLO_FAIL);
+		return;
+	}
+
+	if (c->mode != mode_primary) {
+		EPRINTF("invalid mode: %d\n", c->mode);
+		colo_control_respond(ctl, TDCOLO_FAIL);
+		return;
+	}
+
+	client_flush(c);
+	return;
+
+err:
+	UNREGISTER_EVENT(ctl->io_id);
+	CLOSE_FD(ctl->io_fd);
+	return;
+}
+
+static void colo_control_respond(colo_control_t *ctl, const char *response)
+{
+	int rc;
+
+	if (ctl->io_fd < 0)
+		return;
+
+	rc = write(ctl->io_fd, response, strlen(response));
+	if (rc < 0) {
+		EPRINTF("error writing notification: %d\n", errno);
+		CLOSE_FD(ctl->io_fd);
+	}
+}
+
+static void colo_control_close(colo_control_t *ctl)
+{
+	UNREGISTER_EVENT(ctl->listen_id);
+	UNREGISTER_EVENT(ctl->io_id);
+	CLOSE_FD(ctl->listen_fd);
+	CLOSE_FD(ctl->io_fd);
+
+	if (ctl->path) {
+		unlink(ctl->path);
+		free(ctl->path);
+		ctl->path = NULL;
+	}
+}
+
+/* ======== interface ======== */
+static int tdcolo_close(td_driver_t *driver);
+
+static int switch_mode(struct tdcolo_state *c, enum tdcolo_mode mode)
+{
+	int rc;
+
+	if (mode == c->mode)
+		return 0;
+
+	if (c->queue_flush)
+		if ((rc = c->queue_flush(c)) < 0) {
+			/* fall back to unprotected mode on error */
+			EPRINTF("switch_mode: error flushing queue (old: %d, new: %d)",
+				c->mode, mode);
+			mode = mode_unprotected;
+		}
+
+	if (mode == mode_unprotected)
+		rc = unprotected_start(c);
+	else if (mode == mode_primary)
+		rc = primary_start(c);
+	else if (mode == mode_backup)
+		rc = backup_start(c);
+	else if (mode == mode_failed)
+		rc = failed_start(c);
+	else {
+		EPRINTF("unknown mode requested: %d\n", mode);
+		rc = -1;
+	}
+
+	if (!rc)
+		c->mode = mode;
+
+	return rc;
+}
+
+static int tdcolo_open(td_driver_t *driver, td_image_t *image, td_uuid_t uuid)
+{
+	struct tdcolo_state *c = driver->data;
+	td_replication_connect_t *t = &c->t;
+	colo_control_t *ctl = &c->ctl;
+	ramdisk_t *ramdisk = &c->ramdisk;
+	int rc;
+	const char *name = image->name;
+	td_flag_t flags = image->flags;
+
+	DPRINTF("opening %s\n", name);
+
+	memset(c, 0, sizeof(*c));
+
+	/* init ramdisk */
+	ramdisk->log_prefix = "COLO";
+	ramdisk->sector_size = driver->info.sector_size;
+	ramdisk->image = image;
+	ramdisk_init(&c->ramdisk);
+
+	/* init async I/O */
+	td_async_io_init(&c->rio);
+	td_async_io_init(&c->wio);
+
+	c->sector_size = driver->info.sector_size;
+
+	/* init control socket */
+	colo_control_init(ctl);
+	rc = colo_create_control_socket(ctl, name);
+	if (rc)
+		return rc;
+
+	/* init async connection */
+	t->log_prefix = "COLO";
+	t->retry_timeout_s = COLO_CONNRETRY_TIMEOUT;
+	t->max_connections = 1;
+	t->callback = colo_server_established;
+	rc = td_replication_connect_init(t, name);
+	if (rc) {
+		colo_control_close(ctl);
+		return rc;
+	}
+
+	rc = td_replication_server_start(t);
+	if (!rc)
+		rc = switch_mode(c, mode_backup);
+	else if (rc == -2)
+		rc = switch_mode(c, mode_primary);
+
+	if (!rc)
+		return 0;
+
+	tdcolo_close(driver);
+	return -EIO;
+}
+
+static int tdcolo_pre_close(td_driver_t *driver)
+{
+	struct tdcolo_state *c = driver->data;
+
+	if (c->mode != mode_primary)
+		return 0;
+
+	if (td_replication_connect_status(&c->t))
+		return 0;
+
+	/*
+	 * The connection is in progress, and we may queue some
+	 * I/O requests.
+	 */
+	primary_failed(c, ERROR_CLOSE);
+	return 0;
+}
+
+static int tdcolo_close(td_driver_t *driver)
+{
+	struct tdcolo_state *c = driver->data;
+
+	DPRINTF("closing\n");
+	ramdisk_destroy(&c->ramdisk);
+	ramdisk_destroy_hashtable(c->h);
+	ramdisk_destroy_hashtable(c->local);
+	td_replication_connect_kill(&c->t);
+	td_async_io_kill(&c->rio);
+	td_async_io_kill(&c->wio);
+	colo_control_close(&c->ctl);
+	free(c->buff);
+
+	return 0;
+}
+
+static int tdcolo_get_parent_id(td_driver_t *driver, td_disk_id_t *id)
+{
+	/* we shouldn't have a parent... for now */
+	return -EINVAL;
+}
+
+static int tdcolo_validate_parent(td_driver_t *driver,
+				  td_driver_t *pdriver, td_flag_t flags)
+{
+	return 0;
+}
+
+struct tap_disk tapdisk_colo = {
+	.disk_type          = "tapdisk_colo",
+	.private_data_size  = sizeof(struct tdcolo_state),
+	.td_open            = tdcolo_open,
+	.td_queue_read      = unprotected_queue_io,
+	.td_queue_write     = unprotected_queue_io,
+	.td_pre_close       = tdcolo_pre_close,
+	.td_close           = tdcolo_close,
+	.td_get_parent_id   = tdcolo_get_parent_id,
+	.td_validate_parent = tdcolo_validate_parent,
+};
diff --git a/tools/blktap2/drivers/block-replication.c b/tools/blktap2/drivers/block-replication.c
index 30eba8f..0992d19 100644
--- a/tools/blktap2/drivers/block-replication.c
+++ b/tools/blktap2/drivers/block-replication.c
@@ -911,6 +911,25 @@ int ramdisk_write_to_hashtable(struct hashtable *h, uint64_t sector,
 	return 0;
 }
 
+int ramdisk_read_from_hashtable(struct hashtable *h, uint64_t sector,
+				int nb_sectors, int sector_size,
+				char *buf)
+{
+	int i;
+	uint64_t key;
+	char *v;
+
+	for (i = 0; i < nb_sectors; i++) {
+		key = sector + i;
+		v = hashtable_search(h, &key);
+		if (!v)
+			return -1;
+		memcpy(buf + i * sector_size, v, sector_size);
+	}
+
+	return 0;
+}
+
 void ramdisk_destroy_hashtable(struct hashtable *h)
 {
 	if (!h)
@@ -918,3 +937,180 @@ void ramdisk_destroy_hashtable(struct hashtable *h)
 
 	hashtable_destroy(h, 1);
 }
+
+/* async I/O */
+static void td_async_io_readable(event_id_t id, char mode, void *private);
+static void td_async_io_writeable(event_id_t id, char mode, void *private);
+static void td_async_io_timeout(event_id_t id, char mode, void *private);
+
+void td_async_io_init(td_async_io_t *taio)
+{
+	memset(taio, 0, sizeof(*taio));
+	taio->fd = -1;
+	taio->timeout_id = -1;
+	taio->io_id = -1;
+}
+
+int td_async_io_start(td_async_io_t *taio)
+{
+	event_id_t id;
+
+	if (taio->running)
+		return -1;
+
+	if (taio->size <= 0 || taio->fd < 0)
+		return -1;
+
+	taio->running = 1;
+
+	if (taio->mode == td_async_read)
+		id = tapdisk_server_register_event(SCHEDULER_POLL_READ_FD,
+						   taio->fd, 0,
+						   td_async_io_readable,
+						   taio);
+	else if (taio->mode == td_async_write)
+		id = tapdisk_server_register_event(SCHEDULER_POLL_WRITE_FD,
+						   taio->fd, 0,
+						   td_async_io_writeable,
+						   taio);
+	else
+		id = -1;
+	if (id < 0)
+		goto err;
+	taio->io_id = id;
+
+	if (taio->timeout_s) {
+		id = tapdisk_server_register_event(SCHEDULER_POLL_TIMEOUT,
+						   -1, taio->timeout_s,
+						   td_async_io_timeout, taio);
+		if (id < 0)
+			goto err;
+		taio->timeout_id = id;
+	}
+
+	taio->used = 0;
+	return 0;
+
+err:
+	td_async_io_kill(taio);
+	return -1;
+}
+
+static void td_async_io_callback(td_async_io_t *taio, int realsize,
+				 int errnoval)
+{
+	td_async_io_kill(taio);
+	taio->callback(taio, realsize, errnoval);
+}
+
+static void td_async_io_update_timeout(td_async_io_t *taio)
+{
+	event_id_t id;
+
+	if (!taio->timeout_s)
+		return;
+
+	tapdisk_server_unregister_event(taio->timeout_id);
+	taio->timeout_id = -1;
+
+	id = tapdisk_server_register_event(SCHEDULER_POLL_TIMEOUT,
+					   -1, taio->timeout_s,
+					   td_async_io_timeout, taio);
+	if (id < 0)
+		td_async_io_callback(taio, -1, id);
+	else
+		taio->timeout_id = id;
+}
+
+static void td_async_io_readable(event_id_t id, char mode, void *private)
+{
+	td_async_io_t *taio = private;
+	int rc;
+
+	while (1) {
+		rc = read(taio->fd, taio->buff + taio->used,
+			  taio->size - taio->used);
+		if (rc < 0) {
+			if (errno == EINTR)
+				continue;
+			if (errno == EWOULDBLOCK || errno == EAGAIN)
+				break;
+
+			td_async_io_callback(taio, 0, errno);
+			return;
+		}
+
+		if (rc == 0) {
+			td_async_io_callback(taio, taio->used, 0);
+			return;
+		}
+
+		taio->used += rc;
+		if (taio->used == taio->size) {
+			td_async_io_callback(taio, taio->used, 0);
+			return;
+		}
+	}
+
+	td_async_io_update_timeout(taio);
+}
+
+static void td_async_io_writeable(event_id_t id, char mode, void *private)
+{
+	td_async_io_t *taio = private;
+	int rc;
+
+	while (1) {
+		rc = write(taio->fd, taio->buff + taio->used,
+			   taio->size - taio->used);
+
+		if (rc < 0) {
+			if (errno == EINTR)
+				continue;
+			if (errno == EWOULDBLOCK || errno == EAGAIN)
+				break;
+
+			td_async_io_callback(taio, 0, errno);
+			return;
+		}
+
+		taio->used += rc;
+		if (taio->used == taio->size) {
+			td_async_io_callback(taio, taio->used, 0);
+			return;
+		}
+	}
+
+	td_async_io_update_timeout(taio);
+}
+
+static void td_async_io_timeout(event_id_t id, char mode, void *private)
+{
+	td_async_io_t *taio = private;
+
+	td_async_io_kill(taio);
+	taio->callback(taio, 0, ETIME);
+}
+
+int td_async_io_is_running(td_async_io_t *taio)
+{
+	return taio->running;
+}
+
+void td_async_io_kill(td_async_io_t *taio)
+{
+	if (!taio->running)
+		return;
+
+	if (taio->timeout_id >= 0) {
+		tapdisk_server_unregister_event(taio->timeout_id);
+		taio->timeout_id = -1;
+	}
+
+	if (taio->io_id >= 0) {
+		tapdisk_server_unregister_event(taio->io_id);
+		taio->io_id = -1;
+	}
+
+	taio->running = 0;
+}
diff --git a/tools/blktap2/drivers/block-replication.h b/tools/blktap2/drivers/block-replication.h
index fdc216e..d39c530 100644
--- a/tools/blktap2/drivers/block-replication.h
+++ b/tools/blktap2/drivers/block-replication.h
@@ -156,6 +156,62 @@ struct hashtable *ramdisk_new_hashtable(void);
 int ramdisk_write_to_hashtable(struct hashtable *h, uint64_t sector,
 			       int nb_sectors, size_t sector_size, char* buf,
 			       const char *log_prefix);
+int ramdisk_read_from_hashtable(struct hashtable *h, uint64_t sector,
+				int nb_sectors, int sector_size,
+				char *buf);
 void ramdisk_destroy_hashtable(struct hashtable *h);
 
+/* async I/O, don't support read/write at the same time */
+typedef struct td_async_io td_async_io_t;
+enum {
+	td_async_read,
+	td_async_write,
+};
+
+/*
+ * realsize >= 1 means all data was read/written
+ * realsize == 0 means failure happened when reading/writing, and
+ * errnoval is valid
+ * realsize == -1 means some other internal failure happened, and
+ * errnoval is also valid
+ * In all cases async_io is killed before calling this callback
+ *
+ * If we don't read/write any more data in timeout_s seconds, realsize is
+ * 0, and errnoval is ETIME
+ *
+ * If timeout_s is 0, timeout will be disabled.
+ *
+ * NOTE: realsize is less than taio->size, if we read EOF.
+ */
+typedef void taio_callback(td_async_io_t *taio, int realsize,
+			   int errnoval);
+
+struct td_async_io {
+	/* caller must fill these in, and they must all remain valid */
+	int fd;
+	int timeout_s;
+	int mode;
+	/*
+	 * read: store the data to buff
+	 * write: point to the data to be written
+	 */
+	void *buff;
+	int size;
+	taio_callback *callback;
+
+	/* private */
+	event_id_t timeout_id, io_id;
+	int used;
+	int running;
+};
+
+/* Don't call it when td_async_io is running */
+void td_async_io_init(td_async_io_t *taio);
+/* return -1 if we find some error. Otherwise, return 0 */
+int td_async_io_start(td_async_io_t *taio);
+/* return 1 if td_async_io is running, otherwise return 0 */
+int td_async_io_is_running(td_async_io_t *taio);
+/* The callback will not be called */
+void td_async_io_kill(td_async_io_t *taio);
+
 #endif
diff --git a/tools/blktap2/drivers/tapdisk-disktype.c b/tools/blktap2/drivers/tapdisk-disktype.c
index 8d1383b..aa2afab 100644
--- a/tools/blktap2/drivers/tapdisk-disktype.c
+++ b/tools/blktap2/drivers/tapdisk-disktype.c
@@ -94,6 +94,12 @@ static const disk_info_t remus_disk = {
        0,
 };
 
+static const disk_info_t colo_disk = {
+	"colo",
+	"colo disk replicator (COLO)",
+	0,
+};
+
 const disk_info_t *tapdisk_disk_types[] = {
 	[DISK_TYPE_AIO]	= &aio_disk,
 	[DISK_TYPE_SYNC]	= &sync_disk,
@@ -105,6 +111,7 @@ const disk_info_t *tapdisk_disk_types[] = {
 	[DISK_TYPE_BLOCK_CACHE] = &block_cache_disk,
 	[DISK_TYPE_LOG]	= &log_disk,
 	[DISK_TYPE_REMUS]	= &remus_disk,
+	[DISK_TYPE_COLO]	= &colo_disk,
 	[DISK_TYPE_MAX]		= NULL,
 };
 
@@ -119,6 +126,7 @@ extern struct tap_disk tapdisk_block_cache;
 extern struct tap_disk tapdisk_vhd_index;
 extern struct tap_disk tapdisk_log;
 extern struct tap_disk tapdisk_remus;
+extern struct tap_disk tapdisk_colo;
 
 const struct tap_disk *tapdisk_disk_drivers[] = {
 	[DISK_TYPE_AIO]         = &tapdisk_aio,
@@ -132,6 +140,7 @@ const struct tap_disk *tapdisk_disk_drivers[] = {
 	[DISK_TYPE_BLOCK_CACHE] = &tapdisk_block_cache,
 	[DISK_TYPE_LOG]         = &tapdisk_log,
 	[DISK_TYPE_REMUS]       = &tapdisk_remus,
+	[DISK_TYPE_COLO]        = &tapdisk_colo,
 	[DISK_TYPE_MAX]         = NULL,
 };
 
diff --git a/tools/blktap2/drivers/tapdisk-disktype.h b/tools/blktap2/drivers/tapdisk-disktype.h
index c574990..ee8cb02 100644
--- a/tools/blktap2/drivers/tapdisk-disktype.h
+++ b/tools/blktap2/drivers/tapdisk-disktype.h
@@ -39,7 +39,8 @@
 #define DISK_TYPE_BLOCK_CACHE 7
 #define DISK_TYPE_LOG         8
 #define DISK_TYPE_REMUS       9
-#define DISK_TYPE_MAX         10
+#define DISK_TYPE_COLO        10
+#define DISK_TYPE_MAX         11
 
 #define DISK_TYPE_NAME_MAX    32
 
-- 
1.9.3


* [RFC Patch v2 40/45] pass correct file to qemu if we use blktap2
  2014-08-08  7:00 [RFC Patch v2 00/45] COarse-grain LOck-stepping Virtual Machines for Non-stop Service Wen Congyang
                   ` (38 preceding siblings ...)
  2014-08-08  7:01 ` [RFC Patch v2 39/45] block-colo: implement colo disk replication Wen Congyang
@ 2014-08-08  7:01 ` Wen Congyang
  2014-08-08  7:01 ` [RFC Patch v2 41/45] support blktap remus in xl Wen Congyang
                   ` (6 subsequent siblings)
  46 siblings, 0 replies; 64+ messages in thread
From: Wen Congyang @ 2014-08-08  7:01 UTC (permalink / raw)
  To: xen devel
  Cc: Ian Campbell, Wen Congyang, Ian Jackson, Jiang Yunhong,
	Dong Eddie, Yang Hongyang, Lai Jiangshan

If the disk uses the blktap2 backend, qemu must be given the blktap
device path rather than the disk's pdev_path.
---
 tools/libxl/libxl_dm.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/tools/libxl/libxl_dm.c b/tools/libxl/libxl_dm.c
index addacdb..d86ce15 100644
--- a/tools/libxl/libxl_dm.c
+++ b/tools/libxl/libxl_dm.c
@@ -680,6 +680,7 @@ static char ** libxl__build_device_model_args_new(libxl__gc *gc,
                 libxl__device_disk_dev_number(disks[i].vdev, &disk, &part);
             const char *format = qemu_disk_format_string(disks[i].format);
             char *drive;
+            const char *pdev_path;
 
             if (dev_number == -1) {
                 LIBXL__LOG(ctx, LIBXL__LOG_WARNING, "unable to determine"
@@ -709,6 +710,12 @@ static char ** libxl__build_device_model_args_new(libxl__gc *gc,
                     continue;
                 }
 
+                if (disks[i].backend == LIBXL_DISK_BACKEND_TAP)
+                    pdev_path = libxl__blktap_devpath(gc, disks[i].pdev_path,
+                                                      disks[i].format);
+                else
+                    pdev_path = disks[i].pdev_path;
+
                 /*
                  * Explicit sd disks are passed through as is.
                  *
@@ -718,11 +725,11 @@ static char ** libxl__build_device_model_args_new(libxl__gc *gc,
                 if (strncmp(disks[i].vdev, "sd", 2) == 0)
                     drive = libxl__sprintf
                         (gc, "file=%s,if=scsi,bus=0,unit=%d,format=%s,cache=writeback",
-                         disks[i].pdev_path, disk, format);
+                         pdev_path, disk, format);
                 else if (disk < 4)
                     drive = libxl__sprintf
                         (gc, "file=%s,if=ide,index=%d,media=disk,format=%s,cache=writeback",
-                         disks[i].pdev_path, disk, format);
+                         pdev_path, disk, format);
                 else
                     continue; /* Do not emulate this disk */
             }
-- 
1.9.3


* [RFC Patch v2 41/45] support blktap remus in xl
  2014-08-08  7:00 [RFC Patch v2 00/45] COarse-grain LOck-stepping Virtual Machines for Non-stop Service Wen Congyang
                   ` (39 preceding siblings ...)
  2014-08-08  7:01 ` [RFC Patch v2 40/45] pass correct file to qemu if we use blktap2 Wen Congyang
@ 2014-08-08  7:01 ` Wen Congyang
  2014-08-08  7:01 ` [RFC Patch v2 42/45] support blktap colo in xl: Wen Congyang
                   ` (5 subsequent siblings)
  46 siblings, 0 replies; 64+ messages in thread
From: Wen Congyang @ 2014-08-08  7:01 UTC (permalink / raw)
  To: xen devel
  Cc: Ian Campbell, Wen Congyang, Ian Jackson, Jiang Yunhong,
	Dong Eddie, Yang Hongyang, Lai Jiangshan

With this patch, we can use blktap remus like this:
disk = [ 'format=remus,devtype=disk,access=w,vdev=hda,backendtype=tap,target=192.168.3.1:9000|aio:filename' ]
---
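For reviewers, a minimal standalone sketch of how the target string above is interpreted: the part after '|' carries the real image type ("aio:" or "vhd:"), which is what qemu needs to know. This mirrors the logic of libxl__blktap_get_real_format() in the diff below, but the enum and function names here are hypothetical stand-ins, not libxl identifiers.

```c
#include <string.h>

/* Hypothetical stand-ins for libxl_disk_format values; not the real enum. */
enum sketch_format {
    SKETCH_FORMAT_ERROR = -1,
    SKETCH_FORMAT_RAW,
    SKETCH_FORMAT_VHD,
};

/*
 * target has the form "ip:port|type:file". Return the format implied
 * by the "type:" prefix that follows the '|', or SKETCH_FORMAT_ERROR.
 */
static enum sketch_format sketch_real_format(const char *target)
{
    const char *type = strchr(target, '|');

    if (!type)
        return SKETCH_FORMAT_ERROR;  /* no '|' separator */
    type++;                          /* skip the '|' itself */

    if (!strncmp(type, "aio:", 4))
        return SKETCH_FORMAT_RAW;
    if (!strncmp(type, "vhd:", 4))
        return SKETCH_FORMAT_VHD;

    return SKETCH_FORMAT_ERROR;      /* unsupported image type */
}
```

With the example configuration above, the target "192.168.3.1:9000|aio:filename" maps to the raw format.
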
 tools/libxl/libxl_blktap2.c   | 33 +++++++++++++++++++++++++++++++++
 tools/libxl/libxl_device.c    |  4 +++-
 tools/libxl/libxl_dm.c        |  8 ++++++++
 tools/libxl/libxl_internal.h  |  4 ++++
 tools/libxl/libxl_noblktap2.c |  6 ++++++
 tools/libxl/libxl_types.idl   |  1 +
 tools/libxl/libxlu_disk_l.l   |  1 +
 7 files changed, 56 insertions(+), 1 deletion(-)

diff --git a/tools/libxl/libxl_blktap2.c b/tools/libxl/libxl_blktap2.c
index 2053403..7bbdfc8 100644
--- a/tools/libxl/libxl_blktap2.c
+++ b/tools/libxl/libxl_blktap2.c
@@ -32,6 +32,10 @@ char *libxl__blktap_devpath(libxl__gc *gc,
     tap_list_t tap;
     int err;
 
+    if (format == LIBXL_DISK_FORMAT_REMUS)
+        if (libxl__blktap_get_real_format(gc, disk, format) < 0)
+            return NULL;
+
     type = libxl__device_disk_string_of_format(format);
     err = tap_ctl_find(type, disk, &tap);
     if (err == 0) {
@@ -84,6 +88,35 @@ int libxl__device_destroy_tapdisk(libxl__gc *gc, const char *params)
     return 0;
 }
 
+libxl_disk_format libxl__blktap_get_real_format(libxl__gc *gc,
+                                                const char *disk,
+                                                libxl_disk_format format)
+{
+    const char *type;
+
+    if (format != LIBXL_DISK_FORMAT_REMUS)
+        return format;
+
+    /* The format of disk: ip:port|xxx:file */
+    type = strchr(disk, '|');
+    if (!type) {
+        LOG(ERROR, "Unable to parse params %s", disk);
+        return ERROR_FAIL;
+    }
+
+    type++;
+
+    /* libxl only supports aio/vhd(see disk_try_backend()) */
+    if (!strncmp(type, "aio:", 4))
+        return LIBXL_DISK_FORMAT_RAW;
+    else if (!strncmp(type, "vhd:", 4))
+        return LIBXL_DISK_FORMAT_VHD;
+
+    LOG(ERROR, "Unsupported format: %s", type);
+
+    return ERROR_FAIL;
+}
+
 /*
  * Local variables:
  * mode: C
diff --git a/tools/libxl/libxl_device.c b/tools/libxl/libxl_device.c
index 89dc824..a460d33 100644
--- a/tools/libxl/libxl_device.c
+++ b/tools/libxl/libxl_device.c
@@ -211,7 +211,8 @@ static int disk_try_backend(disk_try_backend_args *a,
             return 0;
         }
         if (!(a->disk->format == LIBXL_DISK_FORMAT_RAW ||
-              a->disk->format == LIBXL_DISK_FORMAT_VHD)) {
+              a->disk->format == LIBXL_DISK_FORMAT_VHD ||
+              a->disk->format == LIBXL_DISK_FORMAT_REMUS)) {
             goto bad_format;
         }
         return backend;
@@ -295,6 +296,7 @@ char *libxl__device_disk_string_of_format(libxl_disk_format format)
         case LIBXL_DISK_FORMAT_VHD: return "vhd";
         case LIBXL_DISK_FORMAT_RAW:
         case LIBXL_DISK_FORMAT_EMPTY: return "aio";
+        case LIBXL_DISK_FORMAT_REMUS: return "remus";
         default: return NULL;
     }
 }
diff --git a/tools/libxl/libxl_dm.c b/tools/libxl/libxl_dm.c
index d86ce15..a7ce6d2 100644
--- a/tools/libxl/libxl_dm.c
+++ b/tools/libxl/libxl_dm.c
@@ -681,6 +681,7 @@ static char ** libxl__build_device_model_args_new(libxl__gc *gc,
             const char *format = qemu_disk_format_string(disks[i].format);
             char *drive;
             const char *pdev_path;
+            libxl_disk_format real_format;
 
             if (dev_number == -1) {
                 LIBXL__LOG(ctx, LIBXL__LOG_WARNING, "unable to determine"
@@ -688,6 +689,13 @@ static char ** libxl__build_device_model_args_new(libxl__gc *gc,
                 continue;
             }
 
+            if (disks[i].format == LIBXL_DISK_FORMAT_REMUS) {
+                real_format = libxl__blktap_get_real_format(gc,
+                                                            disks[i].pdev_path,
+                                                            disks[i].format);
+                format = qemu_disk_format_string(real_format);
+            }
+
             if (disks[i].is_cdrom) {
                 if (disks[i].format == LIBXL_DISK_FORMAT_EMPTY)
                     drive = libxl__sprintf
diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
index 6fc26c9..4478c70 100644
--- a/tools/libxl/libxl_internal.h
+++ b/tools/libxl/libxl_internal.h
@@ -1535,6 +1535,10 @@ _hidden char *libxl__blktap_devpath(libxl__gc *gc,
  */
 _hidden int libxl__device_destroy_tapdisk(libxl__gc *gc, const char *params);
 
+_hidden libxl_disk_format libxl__blktap_get_real_format(libxl__gc *gc,
+                                                        const char *disk,
+                                                        libxl_disk_format format);
+
 _hidden int libxl__device_from_disk(libxl__gc *gc, uint32_t domid,
                                    libxl_device_disk *disk,
                                    libxl__device *device);
diff --git a/tools/libxl/libxl_noblktap2.c b/tools/libxl/libxl_noblktap2.c
index 5a86ed1..38696ec 100644
--- a/tools/libxl/libxl_noblktap2.c
+++ b/tools/libxl/libxl_noblktap2.c
@@ -33,6 +33,12 @@ int libxl__device_destroy_tapdisk(libxl__gc *gc, const char *params)
     return 0;
 }
 
+libxl_disk_format libxl__blktap_get_real_format(libxl__gc *gc, const char *disk,
+                                                libxl_disk_format format)
+{
+    return format;
+}
+
 /*
  * Local variables:
  * mode: C
diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl
index 599f137..6bcb8b6 100644
--- a/tools/libxl/libxl_types.idl
+++ b/tools/libxl/libxl_types.idl
@@ -87,6 +87,7 @@ libxl_disk_format = Enumeration("disk_format", [
     (3, "VHD"),
     (4, "RAW"),
     (5, "EMPTY"),
+    (6, "REMUS"),
     ])
 
 libxl_disk_backend = Enumeration("disk_backend", [
diff --git a/tools/libxl/libxlu_disk_l.l b/tools/libxl/libxlu_disk_l.l
index 1a5deb5..d9ff8a1 100644
--- a/tools/libxl/libxlu_disk_l.l
+++ b/tools/libxl/libxlu_disk_l.l
@@ -102,6 +102,7 @@ static void setformat(DiskParseContext *dpc, const char *str) {
     else if (!strcmp(str,"qcow2"))  DSET(dpc,format,FORMAT,str,QCOW2);
     else if (!strcmp(str,"vhd"))    DSET(dpc,format,FORMAT,str,VHD);
     else if (!strcmp(str,"empty"))  DSET(dpc,format,FORMAT,str,EMPTY);
+    else if (!strcmp(str,"remus"))  DSET(dpc,format,FORMAT,str,REMUS);
     else xlu__disk_err(dpc,str,"unknown value for format");
 }
 
-- 
1.9.3


* [RFC Patch v2 42/45] support blktap colo in xl:
  2014-08-08  7:00 [RFC Patch v2 00/45] COarse-grain LOck-stepping Virtual Machines for Non-stop Service Wen Congyang
                   ` (40 preceding siblings ...)
  2014-08-08  7:01 ` [RFC Patch v2 41/45] support blktap remus in xl Wen Congyang
@ 2014-08-08  7:01 ` Wen Congyang
  2014-08-08  7:01 ` [RFC Patch v2 43/45] update libxl__device_disk_from_xs_be() to support blktap device Wen Congyang
                   ` (4 subsequent siblings)
  46 siblings, 0 replies; 64+ messages in thread
From: Wen Congyang @ 2014-08-08  7:01 UTC (permalink / raw)
  To: xen devel
  Cc: Ian Campbell, Wen Congyang, Ian Jackson, Jiang Yunhong,
	Dong Eddie, Yang Hongyang, Lai Jiangshan

With this patch, we can use blktap colo like this:
disk = [ 'format=colo,devtype=disk,access=w,vdev=hda,backendtype=tap,target=192.168.3.1:9000|aio:filename' ]

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
 tools/libxl/libxl_blktap2.c | 6 ++++--
 tools/libxl/libxl_device.c  | 4 +++-
 tools/libxl/libxl_dm.c      | 3 ++-
 tools/libxl/libxl_types.idl | 1 +
 tools/libxl/libxlu_disk_l.l | 1 +
 5 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/tools/libxl/libxl_blktap2.c b/tools/libxl/libxl_blktap2.c
index 7bbdfc8..cde0dee 100644
--- a/tools/libxl/libxl_blktap2.c
+++ b/tools/libxl/libxl_blktap2.c
@@ -32,7 +32,8 @@ char *libxl__blktap_devpath(libxl__gc *gc,
     tap_list_t tap;
     int err;
 
-    if (format == LIBXL_DISK_FORMAT_REMUS)
+    if (format == LIBXL_DISK_FORMAT_REMUS ||
+        format == LIBXL_DISK_FORMAT_COLO)
         if (libxl__blktap_get_real_format(gc, disk, format) < 0)
             return NULL;
 
@@ -94,7 +95,8 @@ libxl_disk_format libxl__blktap_get_real_format(libxl__gc *gc,
 {
     const char *type;
 
-    if (format != LIBXL_DISK_FORMAT_REMUS)
+    if (format != LIBXL_DISK_FORMAT_REMUS &&
+        format != LIBXL_DISK_FORMAT_COLO)
         return format;
 
     /* The format of disk: ip:port|xxx:file */
diff --git a/tools/libxl/libxl_device.c b/tools/libxl/libxl_device.c
index a460d33..6e23858 100644
--- a/tools/libxl/libxl_device.c
+++ b/tools/libxl/libxl_device.c
@@ -212,7 +212,8 @@ static int disk_try_backend(disk_try_backend_args *a,
         }
         if (!(a->disk->format == LIBXL_DISK_FORMAT_RAW ||
               a->disk->format == LIBXL_DISK_FORMAT_VHD ||
-              a->disk->format == LIBXL_DISK_FORMAT_REMUS)) {
+              a->disk->format == LIBXL_DISK_FORMAT_REMUS ||
+              a->disk->format == LIBXL_DISK_FORMAT_COLO)) {
             goto bad_format;
         }
         return backend;
@@ -297,6 +298,7 @@ char *libxl__device_disk_string_of_format(libxl_disk_format format)
         case LIBXL_DISK_FORMAT_RAW:
         case LIBXL_DISK_FORMAT_EMPTY: return "aio";
         case LIBXL_DISK_FORMAT_REMUS: return "remus";
+        case LIBXL_DISK_FORMAT_COLO:  return "colo";
         default: return NULL;
     }
 }
diff --git a/tools/libxl/libxl_dm.c b/tools/libxl/libxl_dm.c
index a7ce6d2..a3cc768 100644
--- a/tools/libxl/libxl_dm.c
+++ b/tools/libxl/libxl_dm.c
@@ -689,7 +689,8 @@ static char ** libxl__build_device_model_args_new(libxl__gc *gc,
                 continue;
             }
 
-            if (disks[i].format == LIBXL_DISK_FORMAT_REMUS) {
+            if (disks[i].format == LIBXL_DISK_FORMAT_REMUS ||
+                disks[i].format == LIBXL_DISK_FORMAT_COLO) {
                 real_format = libxl__blktap_get_real_format(gc,
                                                             disks[i].pdev_path,
                                                             disks[i].format);
diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl
index 6bcb8b6..3fe0812 100644
--- a/tools/libxl/libxl_types.idl
+++ b/tools/libxl/libxl_types.idl
@@ -88,6 +88,7 @@ libxl_disk_format = Enumeration("disk_format", [
     (4, "RAW"),
     (5, "EMPTY"),
     (6, "REMUS"),
+    (7, "COLO"),
     ])
 
 libxl_disk_backend = Enumeration("disk_backend", [
diff --git a/tools/libxl/libxlu_disk_l.l b/tools/libxl/libxlu_disk_l.l
index d9ff8a1..a6028b7 100644
--- a/tools/libxl/libxlu_disk_l.l
+++ b/tools/libxl/libxlu_disk_l.l
@@ -103,6 +103,7 @@ static void setformat(DiskParseContext *dpc, const char *str) {
     else if (!strcmp(str,"vhd"))    DSET(dpc,format,FORMAT,str,VHD);
     else if (!strcmp(str,"empty"))  DSET(dpc,format,FORMAT,str,EMPTY);
     else if (!strcmp(str,"remus"))  DSET(dpc,format,FORMAT,str,REMUS);
+    else if (!strcmp(str,"colo"))   DSET(dpc,format,FORMAT,str,COLO);
     else xlu__disk_err(dpc,str,"unknown value for format");
 }
 
-- 
1.9.3


* [RFC Patch v2 43/45] update libxl__device_disk_from_xs_be() to support blktap device
  2014-08-08  7:00 [RFC Patch v2 00/45] COarse-grain LOck-stepping Virtual Machines for Non-stop Service Wen Congyang
                   ` (41 preceding siblings ...)
  2014-08-08  7:01 ` [RFC Patch v2 42/45] support blktap colo in xl: Wen Congyang
@ 2014-08-08  7:01 ` Wen Congyang
  2014-08-08  7:01 ` [RFC Patch v2 44/45] libxl/colo: setup and control disk replication for blktap2 backends Wen Congyang
                   ` (3 subsequent siblings)
  46 siblings, 0 replies; 64+ messages in thread
From: Wen Congyang @ 2014-08-08  7:01 UTC (permalink / raw)
  To: xen devel
  Cc: Ian Campbell, Wen Congyang, Ian Jackson, Jiang Yunhong,
	Dong Eddie, Yang Hongyang, Lai Jiangshan

If the disk backend is a blktap device, xenstore holds "format:pdev_path"
in tapdisk-params and "phy" in type. So fill in libxl_device_disk from
tapdisk-params instead of from params and type.

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
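For reviewers, a minimal standalone sketch of the split described above: the string is cut at the first ':' only, so a pdev_path that itself contains ':' (as a remus/colo target does) stays intact. The function name is hypothetical, and the sample value in the note below is an assumption derived from the disk syntax used earlier in this series, not a value captured from xenstore.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Split "format:pdev_path" at the first ':'.
 * On success return 0 and store two malloc()ed strings that the caller
 * must free(); on failure return -1 and store nothing.
 */
static int sketch_split_tapdisk_params(const char *params,
                                       char **format, char **pdev_path)
{
    const char *sep = strchr(params, ':');
    size_t flen;
    char *f, *p;

    if (!sep)
        return -1;                   /* corrupted: no separator */

    flen = (size_t)(sep - params);
    f = malloc(flen + 1);
    p = malloc(strlen(sep + 1) + 1);
    if (!f || !p) {
        free(f);
        free(p);
        return -1;
    }

    memcpy(f, params, flen);         /* text before the first ':' */
    f[flen] = '\0';
    strcpy(p, sep + 1);              /* everything after it */

    *format = f;
    *pdev_path = p;
    return 0;
}
```

For an assumed value of "colo:192.168.3.1:9000|aio:filename" this yields the format "colo" and the pdev_path "192.168.3.1:9000|aio:filename".
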
 tools/libxl/libxl.c       | 44 ++++++++++++++++++++++++++++++++++++++++++--
 tools/libxl/libxl_utils.c | 23 +++++++++++++++++++++++
 tools/libxl/libxl_utils.h |  1 +
 3 files changed, 66 insertions(+), 2 deletions(-)

diff --git a/tools/libxl/libxl.c b/tools/libxl/libxl.c
index e0817e8..dd69f02 100644
--- a/tools/libxl/libxl.c
+++ b/tools/libxl/libxl.c
@@ -2373,6 +2373,47 @@ static int libxl__device_disk_from_xs_be(libxl__gc *gc,
         goto cleanup;
     }
 
+    disk->format = LIBXL_DISK_FORMAT_UNKNOWN;
+
+    /* "tapdisk-params" is only for tapdisk */
+    tmp = xs_read(ctx->xsh, XBT_NULL,
+                  libxl__sprintf(gc, "%s/tapdisk-params", be_path), &len);
+    if (tmp) {
+        char *pdev_path;
+        /* tmp is "format:pdev_path" */
+        pdev_path = strchr(tmp, ':');
+        if (!pdev_path) {
+            LOG(ERROR, "corrupted tapdisk-params: \"%s\"\n", tmp);
+            free(tmp);
+            goto cleanup;
+        }
+        disk->pdev_path = strdup(pdev_path + 1);
+        *pdev_path = '\0';
+        rc = libxl_string_to_format(ctx, tmp, &disk->format);
+        if (rc) {
+            LOG(ERROR, "unknown disk format: %s\n", tmp);
+            free(tmp);
+            goto cleanup;
+        }
+        if (disk->format != LIBXL_DISK_FORMAT_VHD &&
+            disk->format != LIBXL_DISK_FORMAT_RAW &&
+            disk->format != LIBXL_DISK_FORMAT_REMUS &&
+            disk->format != LIBXL_DISK_FORMAT_COLO) {
+            LOG(ERROR, "unsupported tapdisk format: %s\n", tmp);
+            free(tmp);
+            goto cleanup;
+        }
+        free(tmp);
+
+        /*
+         * The backend is tapdisk, so we store tapdev in params, and
+         * phy in type(see device_disk_add())
+         */
+        disk->backend = LIBXL_DISK_BACKEND_TAP;
+
+        goto skip_type;
+    }
+
     /* "params" may not be present; but everything else must be. */
     tmp = xs_read(ctx->xsh, XBT_NULL,
                   libxl__sprintf(gc, "%s/params", be_path), &len);
@@ -2392,6 +2433,7 @@ static int libxl__device_disk_from_xs_be(libxl__gc *gc,
     }
     libxl_string_to_backend(ctx, tmp, &(disk->backend));
 
+skip_type:
     disk->vdev = xs_read(ctx->xsh, XBT_NULL,
                          libxl__sprintf(gc, "%s/dev", be_path), &len);
     if (!disk->vdev) {
@@ -2425,8 +2467,6 @@ static int libxl__device_disk_from_xs_be(libxl__gc *gc,
     }
     disk->is_cdrom = !strcmp(tmp, "cdrom");
 
-    disk->format = LIBXL_DISK_FORMAT_UNKNOWN;
-
     return 0;
 cleanup:
     libxl_device_disk_dispose(disk);
diff --git a/tools/libxl/libxl_utils.c b/tools/libxl/libxl_utils.c
index 58df4f3..6c35ba8 100644
--- a/tools/libxl/libxl_utils.c
+++ b/tools/libxl/libxl_utils.c
@@ -319,6 +319,29 @@ out:
     return rc;
 }
 
+int libxl_string_to_format(libxl_ctx *ctx, char *s, libxl_disk_format *format)
+{
+    int rc = 0;
+    if (!strcmp(s, "aio")) {
+        *format = LIBXL_DISK_FORMAT_RAW;
+    } else if (!strcmp(s, "qcow")) {
+        *format = LIBXL_DISK_FORMAT_QCOW;
+    } else if (!strcmp(s, "qcow2")) {
+        *format = LIBXL_DISK_FORMAT_QCOW2;
+    } else if (!strcmp(s, "vhd")) {
+        *format = LIBXL_DISK_FORMAT_VHD;
+    } else if (!strcmp(s, "remus")) {
+        *format = LIBXL_DISK_FORMAT_REMUS;
+    } else if (!strcmp(s, "colo")) {
+        *format = LIBXL_DISK_FORMAT_COLO;
+    } else {
+        *format = LIBXL_DISK_FORMAT_UNKNOWN;
+        rc = ERROR_FAIL;
+    }
+
+    return rc;
+}
+
 int libxl_read_file_contents(libxl_ctx *ctx, const char *filename,
                              void **data_r, int *datalen_r) {
     GC_INIT(ctx);
diff --git a/tools/libxl/libxl_utils.h b/tools/libxl/libxl_utils.h
index 117b229..9178836 100644
--- a/tools/libxl/libxl_utils.h
+++ b/tools/libxl/libxl_utils.h
@@ -33,6 +33,7 @@ int libxl_get_stubdom_id(libxl_ctx *ctx, int guest_domid);
 int libxl_is_stubdom(libxl_ctx *ctx, uint32_t domid, uint32_t *target_domid);
 int libxl_create_logfile(libxl_ctx *ctx, const char *name, char **full_name);
 int libxl_string_to_backend(libxl_ctx *ctx, char *s, libxl_disk_backend *backend);
+int libxl_string_to_format(libxl_ctx *ctx, char *s, libxl_disk_format *format);
 
 int libxl_read_file_contents(libxl_ctx *ctx, const char *filename,
                              void **data_r, int *datalen_r);
-- 
1.9.3


* [RFC Patch v2 44/45] libxl/colo: setup and control disk replication for blktap2 backends
  2014-08-08  7:00 [RFC Patch v2 00/45] COarse-grain LOck-stepping Virtual Machines for Non-stop Service Wen Congyang
                   ` (42 preceding siblings ...)
  2014-08-08  7:01 ` [RFC Patch v2 43/45] update libxl__device_disk_from_xs_be() to support blktap device Wen Congyang
@ 2014-08-08  7:01 ` Wen Congyang
  2014-08-08  7:01 ` [RFC Patch v2 45/45] x86/hvm: Always set pending event injection when loading VMC[BS] state Wen Congyang
                   ` (2 subsequent siblings)
  46 siblings, 0 replies; 64+ messages in thread
From: Wen Congyang @ 2014-08-08  7:01 UTC (permalink / raw)
  To: xen devel
  Cc: Ian Campbell, Wen Congyang, Ian Jackson, Jiang Yunhong,
	Dong Eddie, Yang Hongyang, Lai Jiangshan

This patch adds the machinery required to protect a guest's disk state
when the guest disk uses a blktap2 disk backend.
1. COLO blktap2 disk device: implements the interfaces required by the
   checkpoint abstract device layer. Notes on the implementation:
   a) setup() is called for each disk attached to the guest.
      During setup():
      i) perform a sanity check: the backend type must be
         LIBXL_DISK_BACKEND_TAP and the format must be
         LIBXL_DISK_FORMAT_COLO.
      ii) connect to the control socket /var/run/tap/colo_xxx, where xxx
          is "host:port" with each ':' and '/' replaced by '_'.
   b) The postsuspend() callback writes "flush" to this socket.
   c) The commit() callback waits for and reads "done" from this socket.

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
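For reviewers, a minimal standalone sketch of the control socket naming rule from the description above. The helper name and the SKETCH_CTL_PREFIX macro are hypothetical; only the /var/run/tap/colo_ prefix and the replacement of ':' and '/' by '_' come from this patch's description.

```c
#include <stdio.h>
#include <string.h>

#define SKETCH_CTL_PREFIX "/var/run/tap/colo_"

/*
 * Build the control socket path for a "host:port" name into out.
 * Every ':' and '/' in the name part is replaced by '_'.
 * Return 0 on success, -1 if out is too small.
 */
static int sketch_control_path(const char *hostport, char *out, size_t outlen)
{
    int n = snprintf(out, outlen, "%s%s", SKETCH_CTL_PREFIX, hostport);
    char *c;

    if (n < 0 || (size_t)n >= outlen)
        return -1;                   /* error or truncated */

    /* Only rewrite the name part, not the directory prefix. */
    for (c = out + strlen(SKETCH_CTL_PREFIX); *c; c++)
        if (*c == ':' || *c == '/')
            *c = '_';

    return 0;
}
```

Under this rule, "192.168.3.1:9000" gives /var/run/tap/colo_192.168.3.1_9000.
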
 tools/libxl/Makefile                       |   2 +-
 tools/libxl/libxl_colo_save.c              |   5 +-
 tools/libxl/libxl_colo_save_disk_blktap2.c | 216 +++++++++++++++++++++++++++++
 tools/libxl/libxl_create.c                 |   7 +
 tools/libxl/libxl_noblktap2.c              |  29 ++++
 5 files changed, 257 insertions(+), 2 deletions(-)
 create mode 100644 tools/libxl/libxl_colo_save_disk_blktap2.c

diff --git a/tools/libxl/Makefile b/tools/libxl/Makefile
index 1c32ae2..b4755c8 100644
--- a/tools/libxl/Makefile
+++ b/tools/libxl/Makefile
@@ -45,7 +45,7 @@ LIBXLU_LIBS =
 
 LIBXL_OBJS-y = osdeps.o libxl_paths.o libxl_bootloader.o flexarray.o
 ifeq ($(LIBXL_BLKTAP),y)
-LIBXL_OBJS-y += libxl_blktap2.o
+LIBXL_OBJS-y += libxl_blktap2.o libxl_colo_save_disk_blktap2.o
 else
 LIBXL_OBJS-y += libxl_noblktap2.o
 endif
diff --git a/tools/libxl/libxl_colo_save.c b/tools/libxl/libxl_colo_save.c
index 75d83c8..b7070b2 100644
--- a/tools/libxl/libxl_colo_save.c
+++ b/tools/libxl/libxl_colo_save.c
@@ -18,7 +18,10 @@
 #include "libxl_internal.h"
 #include "libxl_colo.h"
 
+extern const libxl__checkpoint_device_subkind_ops colo_save_device_blktap2_disk;
+
 static const libxl__checkpoint_device_subkind_ops *colo_ops[] = {
+    &colo_save_device_blktap2_disk,
     NULL,
 };
 
@@ -49,7 +52,7 @@ void libxl__colo_save_setup(libxl__egc *egc, libxl__colo_save_state *css)
     css->svm_running = false;
 
     /* TODO: disk/nic support */
-    cds->device_kind_flags = 0;
+    cds->device_kind_flags = LIBXL__CHECKPOINT_DEVICE_DISK;
     cds->ops = colo_ops;
     cds->callback = colo_save_setup_done;
     cds->ao = ao;
diff --git a/tools/libxl/libxl_colo_save_disk_blktap2.c b/tools/libxl/libxl_colo_save_disk_blktap2.c
new file mode 100644
index 0000000..1c35971
--- /dev/null
+++ b/tools/libxl/libxl_colo_save_disk_blktap2.c
@@ -0,0 +1,216 @@
+/*
+ * Copyright (C) 2014 FUJITSU LIMITED
+ * Author: Wen Congyang <wency@cn.fujitsu.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU Lesser General Public License as published
+ * by the Free Software Foundation; version 2.1 only. with the special
+ * exception on linking described in file LICENSE.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU Lesser General Public License for more details.
+ */
+
+#include "libxl_osdeps.h" /* must come before any other headers */
+
+#include "libxl_internal.h"
+
+#include <string.h>
+#include <sys/un.h>
+
+#define     BLKTAP2_REQUEST     "flush"
+#define     BLKTAP2_RESPONSE    "done"
+#define     BLKTAP_CTRL_DIR     "/var/run/tap"
+
+typedef struct libxl__colo_blktap2_disk {
+    char *name;
+    char *ctl_socket_path;
+    int fd;
+    libxl__ev_fd ev;
+    libxl__checkpoint_device *dev;
+} libxl__colo_blktap2_disk;
+
+/* ========== init() and cleanup() ========== */
+static int blktap2_colo_init(libxl__checkpoint_devices_state *cds)
+{
+    return 0;
+}
+
+static void blktap2_colo_cleanup(libxl__checkpoint_devices_state *cds)
+{
+}
+
+/* ========== setup() and teardown() ========== */
+static int blktap2_control_connect(libxl__gc *gc,
+                                   libxl__colo_blktap2_disk *blktap2_disk)
+{
+    struct sockaddr_un saddr;
+    int fd, err;
+
+    fd = socket(AF_UNIX, SOCK_STREAM, 0);
+    if (fd < 0) {
+        LOG(ERROR, "cannot create socket fd");
+        return ERROR_FAIL;
+    }
+
+    memset(&saddr, 0, sizeof(saddr));
+    saddr.sun_family = AF_UNIX;
+    strncpy(saddr.sun_path, blktap2_disk->ctl_socket_path,
+            sizeof(saddr.sun_path) - 1); /* zeroed, so terminated */
+    err = connect(fd, (const struct sockaddr *)&saddr, sizeof(saddr));
+    if (err) {
+        LOG(ERROR, "cannot connect to %s", blktap2_disk->ctl_socket_path);
+        close(fd);
+        return ERROR_FAIL;
+    }
+
+    blktap2_disk->fd = fd;
+    return 0;
+}
+
+static void blktap2_colo_setup(libxl__checkpoint_device *dev)
+{
+    const libxl_device_disk *disk = dev->backend_dev;
+    libxl__colo_blktap2_disk *blktap2_disk;
+    int rc;
+    char *type;
+    int i, l;
+
+    STATE_AO_GC(dev->cds->ao);
+
+    if (disk->backend != LIBXL_DISK_BACKEND_TAP ||
+        disk->format != LIBXL_DISK_FORMAT_COLO) {
+        rc = ERROR_CHECKPOINT_DEVOPS_DOES_NOT_MATCH;
+        goto out;
+    }
+
+    dev->set_up = 1;
+    GCNEW(blktap2_disk);
+    dev->concrete_data = blktap2_disk;
+    blktap2_disk->fd = -1;
+    blktap2_disk->dev = dev;
+
+    type = strchr(disk->pdev_path, '|');
+    if (!type) {
+        LOG(ERROR, "unexpected pdev_path: %s", disk->pdev_path);
+        rc = ERROR_FAIL;
+        goto out;
+    }
+    blktap2_disk->name = libxl__strndup(gc, disk->pdev_path,
+                                        type - disk->pdev_path);
+    blktap2_disk->ctl_socket_path = libxl__sprintf(gc, "%s/colo_%s",
+                                                   BLKTAP_CTRL_DIR,
+                                                   blktap2_disk->name);
+    /* scrub socket pathname */
+    l = strlen(blktap2_disk->ctl_socket_path);
+    for (i = strlen(BLKTAP_CTRL_DIR) + 1; i < l; i++) {
+        if (strchr(":/", blktap2_disk->ctl_socket_path[i]))
+            blktap2_disk->ctl_socket_path[i] = '_';
+    }
+
+    libxl__ev_fd_init(&blktap2_disk->ev);
+
+    rc = blktap2_control_connect(gc, blktap2_disk);
+
+out:
+    dev->aodev.rc = rc;
+    dev->aodev.callback(dev->cds->egc, &dev->aodev);
+}
+
+static void blktap2_colo_teardown(libxl__checkpoint_device *dev)
+{
+    libxl__colo_blktap2_disk *blktap2_disk = dev->concrete_data;
+
+    if (blktap2_disk->fd >= 0) {
+        close(blktap2_disk->fd);
+        blktap2_disk->fd = -1;
+    }
+
+    dev->aodev.rc = 0;
+    dev->aodev.callback(dev->cds->egc, &dev->aodev);
+}
+
+/* ========== checkpointing APIs ========== */
+static void blktap2_control_readable(libxl__egc *egc, libxl__ev_fd *ev,
+                                     int fd, short events, short revents);
+
+static void blktap2_colo_postsuspend(libxl__checkpoint_device *dev)
+{
+    int ret;
+    libxl__colo_blktap2_disk *blktap2_disk = dev->concrete_data;
+    int rc = 0;
+
+    /* unix socket fd, so the write will not block */
+    ret = write(blktap2_disk->fd, BLKTAP2_REQUEST, strlen(BLKTAP2_REQUEST));
+    if (ret < 0 || (size_t)ret < strlen(BLKTAP2_REQUEST))
+        rc = ERROR_FAIL;
+
+    dev->aodev.rc = rc;
+    dev->aodev.callback(dev->cds->egc, &dev->aodev);
+}
+
+static void blktap2_colo_commit(libxl__checkpoint_device *dev)
+{
+    libxl__colo_blktap2_disk *blktap2_disk = dev->concrete_data;
+    int rc;
+
+    /* Convenience aliases */
+    const int fd = blktap2_disk->fd;
+    libxl__ev_fd *const ev = &blktap2_disk->ev;
+
+    STATE_AO_GC(dev->cds->ao);
+
+    rc = libxl__ev_fd_register(gc, ev, blktap2_control_readable, fd, POLLIN);
+    if (rc) {
+        dev->aodev.rc = rc;
+        dev->aodev.callback(dev->cds->egc, &dev->aodev);
+    }
+}
+
+static void blktap2_control_readable(libxl__egc *egc, libxl__ev_fd *ev,
+                                     int fd, short events, short revents)
+{
+    libxl__colo_blktap2_disk *blktap2_disk =
+                CONTAINER_OF(ev, *blktap2_disk, ev);
+    int rc = 0, ret;
+    char response[5];
+
+    /* Convenience aliases */
+    libxl__checkpoint_device *const dev = blktap2_disk->dev;
+
+    EGC_GC;
+
+    libxl__ev_fd_deregister(gc, ev);
+
+    if (revents & ~POLLIN) {
+        LOG(ERROR, "unexpected poll event 0x%x (should be POLLIN)", revents);
+        rc = ERROR_FAIL;
+        goto out;
+    }
+
+    ret = read(blktap2_disk->fd, response, sizeof(response) - 1);
+    if (ret < 0 || (size_t)ret < sizeof(response) - 1) {
+        rc = ERROR_FAIL;
+        goto out;
+    }
+
+    response[4] = '\0';
+    if (strcmp(response, BLKTAP2_RESPONSE))
+        rc = ERROR_FAIL;
+
+out:
+    dev->aodev.rc = rc;
+    dev->aodev.callback(dev->cds->egc, &dev->aodev);
+}
+
+const libxl__checkpoint_device_subkind_ops colo_save_device_blktap2_disk = {
+    .kind = LIBXL__CHECKPOINT_DEVICE_DISK,
+    .init = blktap2_colo_init,
+    .cleanup = blktap2_colo_cleanup,
+    .setup = blktap2_colo_setup,
+    .teardown = blktap2_colo_teardown,
+    .postsuspend = blktap2_colo_postsuspend,
+    .commit = blktap2_colo_commit,
+};
diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c
index 46bd02d..d1facef 100644
--- a/tools/libxl/libxl_create.c
+++ b/tools/libxl/libxl_create.c
@@ -831,6 +831,13 @@ static void initiate_domain_create(libxl__egc *egc,
     for (i = 0; i < d_config->num_disks; i++) {
         ret = libxl__device_disk_setdefault(gc, &d_config->disks[i]);
         if (ret) goto error_out;
+
+        /* TODO: clean this up when destroying the domain */
+        if (d_config->disks[i].backend == LIBXL_DISK_BACKEND_TAP &&
+            (d_config->disks[i].format == LIBXL_DISK_FORMAT_REMUS ||
+             d_config->disks[i].format == LIBXL_DISK_FORMAT_COLO))
+            libxl__blktap_devpath(gc, d_config->disks[i].pdev_path,
+                                  d_config->disks[i].format);
     }
 
     dcs->bl.ao = ao;
diff --git a/tools/libxl/libxl_noblktap2.c b/tools/libxl/libxl_noblktap2.c
index 38696ec..46207b3 100644
--- a/tools/libxl/libxl_noblktap2.c
+++ b/tools/libxl/libxl_noblktap2.c
@@ -39,6 +39,35 @@ libxl_disk_format libxl__blktap_get_real_format(const char *disk,
     return format;
 }
 
+static int blktap2_colo_init(libxl__checkpoint_devices_state *cds)
+{
+    return 0;
+}
+
+static void blktap2_colo_cleanup(libxl__checkpoint_devices_state *cds)
+{
+    return;
+}
+
+static void blktap2_colo_setup(libxl__checkpoint_device *dev)
+{
+    dev->aodev.rc = ERROR_FAIL;
+    dev->aodev.callback(dev->cds->egc, &dev->aodev);
+}
+
+static void blktap2_colo_teardown(libxl__checkpoint_device *dev)
+{
+    return;
+}
+
+const libxl__checkpoint_device_subkind_ops colo_save_device_blktap2_disk = {
+    .kind = LIBXL__CHECKPOINT_DEVICE_DISK,
+    .init = blktap2_colo_init,
+    .cleanup = blktap2_colo_cleanup,
+    .setup = blktap2_colo_setup,
+    .teardown = blktap2_colo_teardown,
+};
+
 /*
  * Local variables:
  * mode: C
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [RFC Patch v2 45/45] x86/hvm: Always set pending event injection when loading VMC[BS] state.
  2014-08-08  7:00 [RFC Patch v2 00/45] COarse-grain LOck-stepping Virtual Machines for Non-stop Service Wen Congyang
                   ` (43 preceding siblings ...)
  2014-08-08  7:01 ` [RFC Patch v2 44/45] libxl/colo: setup and control disk replication for blktap2 backends Wen Congyang
@ 2014-08-08  7:01 ` Wen Congyang
  2014-08-08  7:24   ` Jan Beulich
                     ` (2 more replies)
  2014-08-08  7:01 ` [RFC Patch v2 46/45] Introduce "xen-load-devices-state" Wen Congyang
  2014-08-08  7:19 ` [RFC Patch v2 00/45] COarse-grain LOck-stepping Virtual Machines for Non-stop Service Jan Beulich
  46 siblings, 3 replies; 64+ messages in thread
From: Wen Congyang @ 2014-08-08  7:01 UTC (permalink / raw)
  To: xen devel
  Cc: Ian Campbell, Wen Congyang, Ian Jackson, Jiang Yunhong,
	Dong Eddie, Tim Deegan, Yang Hongyang, Lai Jiangshan

In COLO mode the secondary VM is running, so VM_ENTRY_INTR_INFO may
already be valid before the VMCS is restored. If there is no pending
event after restoring the VM, we should clear it.
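
To illustrate the difference, here is a minimal standalone model of the
restore decision. It is not hypervisor code: restore_old() and
restore_new() are hypothetical names, the event-injection field is
modelled as a plain 32-bit value, and the result of
hvm_event_needs_reinjection() is passed in as a flag.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Model of the behaviour before this patch: the injection field is only
 * written when an event has to be re-injected, so whatever was in the
 * field before the restore survives otherwise.
 */
static uint32_t restore_old(uint32_t current, bool pending_valid,
                            bool needs_reinjection, uint32_t pending_event)
{
    if (pending_valid && needs_reinjection)
        return pending_event;
    return current; /* stale value from the running secondary VM */
}

/*
 * Model of the behaviour after this patch: the field is always written,
 * either with the event to re-inject or with zero.
 */
static uint32_t restore_new(uint32_t current, bool pending_valid,
                            bool needs_reinjection, uint32_t pending_event)
{
    (void)current; /* the previous contents no longer matter */
    if (pending_valid && needs_reinjection)
        return pending_event;
    return 0;
}
```

With a non-zero previous value and no pending event in the restored
state, restore_old() keeps the previous value while restore_new()
returns 0, which is the case the patch is meant to cover.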

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>

Also clear pending software exceptions.
Copy the fix to SVM as well.

Signed-off-by: Tim Deegan <tim@xen.org>
---
 xen/arch/x86/hvm/svm/svm.c | 16 +++++++++-------
 xen/arch/x86/hvm/vmx/vmx.c | 25 ++++++++++++-------------
 2 files changed, 21 insertions(+), 20 deletions(-)

diff --git a/xen/arch/x86/hvm/svm/svm.c b/xen/arch/x86/hvm/svm/svm.c
index 71b8a6a..f7a0cb8 100644
--- a/xen/arch/x86/hvm/svm/svm.c
+++ b/xen/arch/x86/hvm/svm/svm.c
@@ -321,16 +321,18 @@ static int svm_vmcb_restore(struct vcpu *v, struct hvm_hw_cpu *c)
         vmcb_set_h_cr3(vmcb, pagetable_get_paddr(p2m_get_pagetable(p2m)));
     }
 
-    if ( c->pending_valid ) 
+    if ( c->pending_valid
+         && hvm_event_needs_reinjection(c->pending_type, c->pending_vector) )
     {
         gdprintk(XENLOG_INFO, "Re-injecting %#"PRIx32", %#"PRIx32"\n",
                  c->pending_event, c->error_code);
-
-        if ( hvm_event_needs_reinjection(c->pending_type, c->pending_vector) )
-        {
-            vmcb->eventinj.bytes = c->pending_event;
-            vmcb->eventinj.fields.errorcode = c->error_code;
-        }
+        vmcb->eventinj.bytes = c->pending_event;
+        vmcb->eventinj.fields.errorcode = c->error_code;
+    }
+    else
+    {
+        vmcb->eventinj.bytes = 0;
+        vmcb->eventinj.fields.errorcode = 0;
     }
 
     vmcb->cleanbits.bytes = 0;
diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index fb65c7d..5f143c0 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -509,23 +509,22 @@ static int vmx_vmcs_restore(struct vcpu *v, struct hvm_hw_cpu *c)
 
     __vmwrite(GUEST_DR7, c->dr7);
 
-    vmx_vmcs_exit(v);
-
-    paging_update_paging_modes(v);
-
-    if ( c->pending_valid )
+    if ( c->pending_valid
+         && hvm_event_needs_reinjection(c->pending_type, c->pending_vector) )
     {
         gdprintk(XENLOG_INFO, "Re-injecting %#"PRIx32", %#"PRIx32"\n",
                  c->pending_event, c->error_code);
-
-        if ( hvm_event_needs_reinjection(c->pending_type, c->pending_vector) )
-        {
-            vmx_vmcs_enter(v);
-            __vmwrite(VM_ENTRY_INTR_INFO, c->pending_event);
-            __vmwrite(VM_ENTRY_EXCEPTION_ERROR_CODE, c->error_code);
-            vmx_vmcs_exit(v);
-        }
+        __vmwrite(VM_ENTRY_INTR_INFO, c->pending_event);
+        __vmwrite(VM_ENTRY_EXCEPTION_ERROR_CODE, c->error_code);
     }
+    else
+    {
+        __vmwrite(VM_ENTRY_INTR_INFO, 0);
+        __vmwrite(VM_ENTRY_EXCEPTION_ERROR_CODE, 0);
+    }
+    vmx_vmcs_exit(v);
+
+    paging_update_paging_modes(v);
 
     return 0;
 }
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [RFC Patch v2 46/45] Introduce "xen-load-devices-state"
  2014-08-08  7:00 [RFC Patch v2 00/45] COarse-grain LOck-stepping Virtual Machines for Non-stop Service Wen Congyang
                   ` (44 preceding siblings ...)
  2014-08-08  7:01 ` [RFC Patch v2 45/45] x86/hvm: Always set pending event injection when loading VMC[BS] state Wen Congyang
@ 2014-08-08  7:01 ` Wen Congyang
  2014-08-08  7:19 ` [RFC Patch v2 00/45] COarse-grain LOck-stepping Virtual Machines for Non-stop Service Jan Beulich
  46 siblings, 0 replies; 64+ messages in thread
From: Wen Congyang @ 2014-08-08  7:01 UTC (permalink / raw)
  To: xen devel
  Cc: Ian Campbell, Wen Congyang, Ian Jackson, Jiang Yunhong,
	Dong Eddie, Yang Hongyang, Lai Jiangshan

Introduce a "xen-load-devices-state" QAPI command that loads the state
of all devices, but not the RAM or the block devices of the VM.
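
As a rough sketch of what the loader reads for each full section (after
the section type byte): a big-endian 32-bit section id, a one-byte
length followed by the id string, then big-endian 32-bit instance and
version ids. The standalone code below parses that header from a memory
buffer; struct section_header and parse_section_header() are
hypothetical names used only for illustration, and the real code reads
from a QEMUFile instead.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical container for the fields read at the start of a section. */
struct section_header {
    uint32_t section_id;
    char idstr[257];      /* up to 255 bytes plus the terminating NUL */
    uint32_t instance_id;
    uint32_t version_id;
};

/* Read a big-endian 32-bit value from p. */
static uint32_t get_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Parse the header that follows the section type byte.
 * Returns the number of bytes consumed, or -1 if buf is too short.
 */
static int parse_section_header(const uint8_t *buf, size_t len,
                                struct section_header *h)
{
    size_t idlen;

    if (len < 5)
        return -1;
    h->section_id = get_be32(buf);
    idlen = buf[4];

    /* id string plus instance_id and version_id must be present */
    if (len < 5 + idlen + 8)
        return -1;
    memcpy(h->idstr, buf + 5, idlen);
    h->idstr[idlen] = '\0';
    h->instance_id = get_be32(buf + 5 + idlen);
    h->version_id = get_be32(buf + 5 + idlen + 4);

    return (int)(5 + idlen + 8);
}
```

Unlike this sketch, the loader in the patch then looks the section up by
id string and instance id, and rejects unknown sections, newer versions
and RAM sections before loading the device state.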

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
 qapi-schema.json |  18 ++++++++
 qmp-commands.hx  |  27 ++++++++++++
 savevm.c         | 126 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 171 insertions(+)

diff --git a/qapi-schema.json b/qapi-schema.json
index 391356f..c569856 100644
--- a/qapi-schema.json
+++ b/qapi-schema.json
@@ -4689,3 +4689,21 @@
               'btn'     : 'InputBtnEvent',
               'rel'     : 'InputMoveEvent',
               'abs'     : 'InputMoveEvent' } }
+
+##
+# @xen-load-devices-state:
+#
+# Load the state of all devices from file. The RAM and the block devices
+# of the VM are not loaded by this command.
+#
+# @filename: the file to load the state of the devices from as binary
+# data. See xen-save-devices-state.txt for a description of the binary
+# format.
+#
+# Returns: Nothing on success
+#          If @filename cannot be opened, OpenFileFailed
+#          If an I/O error occurs while reading the file, IOError
+#
+# Since: 2.0
+##
+{ 'command': 'xen-load-devices-state', 'data': {'filename': 'str'} }
diff --git a/qmp-commands.hx b/qmp-commands.hx
index ed3ab92..b796be5 100644
--- a/qmp-commands.hx
+++ b/qmp-commands.hx
@@ -586,6 +586,33 @@ Example:
 EQMP
 
     {
+        .name       = "xen-load-devices-state",
+        .args_type  = "filename:F",
+        .mhandler.cmd_new = qmp_marshal_input_xen_load_devices_state,
+    },
+
+SQMP
+xen-load-devices-state
+----------------------
+
+Load the state of all devices from file. The RAM and the block devices
+of the VM are not loaded by this command.
+
+Arguments:
+
+- "filename": the file to load the state of the devices from as binary
+data. See xen-save-devices-state.txt for a description of the binary
+format.
+
+Example:
+
+-> { "execute": "xen-load-devices-state",
+     "arguments": { "filename": "/tmp/resume" } }
+<- { "return": {} }
+
+EQMP
+
+    {
         .name       = "xen-set-global-dirty-log",
         .args_type  = "enable:b",
         .mhandler.cmd_new = qmp_marshal_input_xen_set_global_dirty_log,
diff --git a/savevm.c b/savevm.c
index 22123be..c6aa502 100644
--- a/savevm.c
+++ b/savevm.c
@@ -863,6 +863,105 @@ out:
     return ret;
 }
 
+static int qemu_load_devices_state(QEMUFile *f)
+{
+    uint8_t section_type;
+    unsigned int v;
+    int ret;
+
+    if (qemu_savevm_state_blocked(NULL)) {
+        return -EINVAL;
+    }
+
+    v = qemu_get_be32(f);
+    if (v != QEMU_VM_FILE_MAGIC) {
+        return -EINVAL;
+    }
+
+    v = qemu_get_be32(f);
+    if (v == QEMU_VM_FILE_VERSION_COMPAT) {
+        fprintf(stderr, "SaveVM v2 format is obsolete and doesn't work anymore\n");
+        return -ENOTSUP;
+    }
+    if (v != QEMU_VM_FILE_VERSION) {
+        return -ENOTSUP;
+    }
+
+    while ((section_type = qemu_get_byte(f)) != QEMU_VM_EOF) {
+        uint32_t instance_id, version_id, section_id;
+        SaveStateEntry *se;
+        char idstr[257];
+        int len;
+
+        switch (section_type) {
+        case QEMU_VM_SECTION_FULL:
+            /* Read section start */
+            section_id = qemu_get_be32(f);
+            len = qemu_get_byte(f);
+            qemu_get_buffer(f, (uint8_t *)idstr, len);
+            idstr[len] = 0;
+            instance_id = qemu_get_be32(f);
+            version_id = qemu_get_be32(f);
+
+            /* Find savevm section */
+            se = find_se(idstr, instance_id);
+            if (se == NULL) {
+                fprintf(stderr, "Unknown savevm section or instance '%s' %d\n",
+                        idstr, instance_id);
+                ret = -EINVAL;
+                goto out;
+            }
+
+            /* Validate version */
+            if (version_id > se->version_id) {
+                fprintf(stderr, "loadvm: unsupported version %d for '%s' v%d\n",
+                        version_id, idstr, se->version_id);
+                ret = -EINVAL;
+                goto out;
+            }
+
+            /* Validate if it is a device's state */
+            if (se->is_ram) {
+                fprintf(stderr, "loadvm: %s is not device state\n", idstr);
+                ret = -EINVAL;
+                goto out;
+            }
+
+            ret = vmstate_load(f, se, version_id);
+            if (ret < 0) {
+                fprintf(stderr, "qemu: warning: error while loading state for instance 0x%x of device '%s'\n",
+                        instance_id, idstr);
+                goto out;
+            }
+            break;
+        case QEMU_VM_SECTION_START:
+        case QEMU_VM_SECTION_PART:
+        case QEMU_VM_SECTION_END:
+            /*
+             * The file is saved by the command xen-save-devices-state,
+             * So it should not contain section start/part/end.
+             */
+        default:
+            fprintf(stderr, "Unknown savevm section type %d\n", section_type);
+            ret = -EINVAL;
+            goto out;
+        }
+    }
+
+    cpu_synchronize_all_post_init();
+
+    ret = 0;
+
+out:
+    if (ret == 0) {
+        if (qemu_file_get_error(f)) {
+            ret = -EIO;
+        }
+    }
+
+    return ret;
+}
+
 static BlockDriverState *find_vmstate_bs(void)
 {
     BlockDriverState *bs = NULL;
@@ -1027,6 +1126,33 @@ void qmp_xen_save_devices_state(const char *filename, Error **errp)
     }
 }
 
+void qmp_xen_load_devices_state(const char *filename, Error **errp)
+{
+    QEMUFile *f;
+    int saved_vm_running;
+    int ret;
+
+    saved_vm_running = runstate_is_running();
+    vm_stop(RUN_STATE_RESTORE_VM);
+
+    f = qemu_fopen(filename, "rb");
+    if (!f) {
+        error_setg_file_open(errp, errno, filename);
+        goto out;
+    }
+
+    ret = qemu_load_devices_state(f);
+    qemu_fclose(f);
+    if (ret < 0) {
+        error_set(errp, QERR_IO_ERROR);
+    }
+
+out:
+    if (saved_vm_running) {
+        vm_start();
+    }
+}
+
 int load_vmstate(const char *name)
 {
     BlockDriverState *bs, *bs_vm_state;
-- 
1.9.0

^ permalink raw reply related	[flat|nested] 64+ messages in thread

* Re: [RFC Patch v2 00/45] COarse-grain LOck-stepping Virtual Machines for Non-stop Service
  2014-08-08  7:00 [RFC Patch v2 00/45] COarse-grain LOck-stepping Virtual Machines for Non-stop Service Wen Congyang
                   ` (45 preceding siblings ...)
  2014-08-08  7:01 ` [RFC Patch v2 46/45] Introduce "xen-load-devices-state" Wen Congyang
@ 2014-08-08  7:19 ` Jan Beulich
  2014-08-08  7:39   ` Wen Congyang
  2014-08-08  8:21   ` Wen Congyang
  46 siblings, 2 replies; 64+ messages in thread
From: Jan Beulich @ 2014-08-08  7:19 UTC (permalink / raw)
  To: Wen Congyang
  Cc: Ian Campbell, Ian Jackson, Jiang Yunhong, Dong Eddie, xen devel,
	Yang Hongyang, Lai Jiangshan

>>> On 08.08.14 at 09:00, <wency@cn.fujitsu.com> wrote:
> Patch 1-3  : bugfix

Such should imo be submitted as a separate prereq series.

> Patch 4-11 : update some APIs which will be used by colo
> Patch 12-15: temporarily update remus to reuse remus device codes
> Patch 16-23: COLO framework related codes
> Patch 24   : Hack patch, just for test
> Patch 25-34: bugfix for blktap2
> Patch 35-38: move some block-remus's codes to block-replication.c. These codes will
>              be reused by COLO.
> Patch 39   : implement block-colo
> Patch 40-43: update libxl to support blktap2
> Patch 44   : implement disk replication
> Patch 45   : hypervisor bugfix. We find this bug before rebasing colo to newest xen.
>              But we don't trigger this bug now.
> Patch 46   : A patch for qemu-xen

And this probably goes for the whole series: Splitting this up by
component would make it much clearer who of the maintainers
needs to take a look at which pieces.

Jan

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC Patch v2 45/45] x86/hvm: Always set pending event injection when loading VMC[BS] state.
  2014-08-08  7:01 ` [RFC Patch v2 45/45] x86/hvm: Always set pending event injection when loading VMC[BS] state Wen Congyang
@ 2014-08-08  7:24   ` Jan Beulich
  2014-08-08  7:29     ` Wen Congyang
  2014-08-26 16:02   ` Jan Beulich
  2014-08-27 15:02   ` Andrew Cooper
  2 siblings, 1 reply; 64+ messages in thread
From: Jan Beulich @ 2014-08-08  7:24 UTC (permalink / raw)
  To: Wen Congyang
  Cc: Tim Deegan, Ian Campbell, Dong Eddie, Jiang Yunhong, Ian Jackson,
	xen devel, Aravind Gopalakrishnan, suravee.suthikulpanit,
	Boris Ostrovsky, Yang Hongyang, Lai Jiangshan

>>> On 08.08.14 at 09:01, <wency@cn.fujitsu.com> wrote:
> In colo mode, secondary vm is running, so VM_ENTRY_INTR_INFO may
> valid before restoring vmcs. If there is no pending event after
> restoring vm, we should clear it.
> 
> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> 
> Also clear pending software exceptions.
> Copy the fix to SVM as well.

And with this you should also have Cc-ed the SVM maintainers, which
I now did.


> Signed-off-by: Tim Deegan <tim@xen.org>
> ---
>  xen/arch/x86/hvm/svm/svm.c | 16 +++++++++-------
>  xen/arch/x86/hvm/vmx/vmx.c | 25 ++++++++++++-------------
>  2 files changed, 21 insertions(+), 20 deletions(-)
> 
> diff --git a/xen/arch/x86/hvm/svm/svm.c b/xen/arch/x86/hvm/svm/svm.c
> index 71b8a6a..f7a0cb8 100644
> --- a/xen/arch/x86/hvm/svm/svm.c
> +++ b/xen/arch/x86/hvm/svm/svm.c
> @@ -321,16 +321,18 @@ static int svm_vmcb_restore(struct vcpu *v, struct 
> hvm_hw_cpu *c)
>          vmcb_set_h_cr3(vmcb, pagetable_get_paddr(p2m_get_pagetable(p2m)));
>      }
>  
> -    if ( c->pending_valid ) 
> +    if ( c->pending_valid
> +         && hvm_event_needs_reinjection(c->pending_type, c->pending_vector) )

Conventionally the operator goes at the end of the first line, not the
beginning of the second.

Jan

>      {
>          gdprintk(XENLOG_INFO, "Re-injecting %#"PRIx32", %#"PRIx32"\n",
>                   c->pending_event, c->error_code);
> -
> -        if ( hvm_event_needs_reinjection(c->pending_type, c->pending_vector) )
> -        {
> -            vmcb->eventinj.bytes = c->pending_event;
> -            vmcb->eventinj.fields.errorcode = c->error_code;
> -        }
> +        vmcb->eventinj.bytes = c->pending_event;
> +        vmcb->eventinj.fields.errorcode = c->error_code;
> +    }
> +    else
> +    {
> +        vmcb->eventinj.bytes = 0;
> +        vmcb->eventinj.fields.errorcode = 0;
>      }
>  
>      vmcb->cleanbits.bytes = 0;
> diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
> index fb65c7d..5f143c0 100644
> --- a/xen/arch/x86/hvm/vmx/vmx.c
> +++ b/xen/arch/x86/hvm/vmx/vmx.c
> @@ -509,23 +509,22 @@ static int vmx_vmcs_restore(struct vcpu *v, struct 
> hvm_hw_cpu *c)
>  
>      __vmwrite(GUEST_DR7, c->dr7);
>  
> -    vmx_vmcs_exit(v);
> -
> -    paging_update_paging_modes(v);
> -
> -    if ( c->pending_valid )
> +    if ( c->pending_valid
> +         && hvm_event_needs_reinjection(c->pending_type, c->pending_vector) )
>      {
>          gdprintk(XENLOG_INFO, "Re-injecting %#"PRIx32", %#"PRIx32"\n",
>                   c->pending_event, c->error_code);
> -
> -        if ( hvm_event_needs_reinjection(c->pending_type, c->pending_vector) )
> -        {
> -            vmx_vmcs_enter(v);
> -            __vmwrite(VM_ENTRY_INTR_INFO, c->pending_event);
> -            __vmwrite(VM_ENTRY_EXCEPTION_ERROR_CODE, c->error_code);
> -            vmx_vmcs_exit(v);
> -        }
> +        __vmwrite(VM_ENTRY_INTR_INFO, c->pending_event);
> +        __vmwrite(VM_ENTRY_EXCEPTION_ERROR_CODE, c->error_code);
>      }
> +    else
> +    {
> +        __vmwrite(VM_ENTRY_INTR_INFO, 0);
> +        __vmwrite(VM_ENTRY_EXCEPTION_ERROR_CODE, 0);
> +    }
> +    vmx_vmcs_exit(v);
> +
> +    paging_update_paging_modes(v);
>  
>      return 0;
>  }
> -- 
> 1.9.3

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC Patch v2 45/45] x86/hvm: Always set pending event injection when loading VMC[BS] state.
  2014-08-08  7:24   ` Jan Beulich
@ 2014-08-08  7:29     ` Wen Congyang
  0 siblings, 0 replies; 64+ messages in thread
From: Wen Congyang @ 2014-08-08  7:29 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Tim Deegan, Ian Campbell, Dong Eddie, Jiang Yunhong, Ian Jackson,
	xen devel, Aravind Gopalakrishnan, suravee.suthikulpanit,
	Boris Ostrovsky, Yang Hongyang, Lai Jiangshan

At 08/08/2014 03:24 PM, Jan Beulich wrote:
>>>> On 08.08.14 at 09:01, <wency@cn.fujitsu.com> wrote:
>> In colo mode, secondary vm is running, so VM_ENTRY_INTR_INFO may
>> valid before restoring vmcs. If there is no pending event after
>> restoring vm, we should clear it.
>>
>> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
>>
>> Also clear pending software exceptions.
>> Copy the fix to SVM as well.
> 
> And with this you should also have Cc-ed the SVM maintainers, which
> I now did.
> 
> 
>> Signed-off-by: Tim Deegan <tim@xen.org>
>> ---
>>  xen/arch/x86/hvm/svm/svm.c | 16 +++++++++-------
>>  xen/arch/x86/hvm/vmx/vmx.c | 25 ++++++++++++-------------
>>  2 files changed, 21 insertions(+), 20 deletions(-)
>>
>> diff --git a/xen/arch/x86/hvm/svm/svm.c b/xen/arch/x86/hvm/svm/svm.c
>> index 71b8a6a..f7a0cb8 100644
>> --- a/xen/arch/x86/hvm/svm/svm.c
>> +++ b/xen/arch/x86/hvm/svm/svm.c
>> @@ -321,16 +321,18 @@ static int svm_vmcb_restore(struct vcpu *v, struct 
>> hvm_hw_cpu *c)
>>          vmcb_set_h_cr3(vmcb, pagetable_get_paddr(p2m_get_pagetable(p2m)));
>>      }
>>  
>> -    if ( c->pending_valid ) 
>> +    if ( c->pending_valid
>> +         && hvm_event_needs_reinjection(c->pending_type, c->pending_vector) )
> 
> Conventionally the operator goes at the end of the first line, not the
> beginning of the second.

I will fix it.

Thanks
Wen Congyang

> 
> Jan
> 
>>      {
>>          gdprintk(XENLOG_INFO, "Re-injecting %#"PRIx32", %#"PRIx32"\n",
>>                   c->pending_event, c->error_code);
>> -
>> -        if ( hvm_event_needs_reinjection(c->pending_type, c->pending_vector) )
>> -        {
>> -            vmcb->eventinj.bytes = c->pending_event;
>> -            vmcb->eventinj.fields.errorcode = c->error_code;
>> -        }
>> +        vmcb->eventinj.bytes = c->pending_event;
>> +        vmcb->eventinj.fields.errorcode = c->error_code;
>> +    }
>> +    else
>> +    {
>> +        vmcb->eventinj.bytes = 0;
>> +        vmcb->eventinj.fields.errorcode = 0;
>>      }
>>  
>>      vmcb->cleanbits.bytes = 0;
>> diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
>> index fb65c7d..5f143c0 100644
>> --- a/xen/arch/x86/hvm/vmx/vmx.c
>> +++ b/xen/arch/x86/hvm/vmx/vmx.c
>> @@ -509,23 +509,22 @@ static int vmx_vmcs_restore(struct vcpu *v, struct 
>> hvm_hw_cpu *c)
>>  
>>      __vmwrite(GUEST_DR7, c->dr7);
>>  
>> -    vmx_vmcs_exit(v);
>> -
>> -    paging_update_paging_modes(v);
>> -
>> -    if ( c->pending_valid )
>> +    if ( c->pending_valid
>> +         && hvm_event_needs_reinjection(c->pending_type, c->pending_vector) )
>>      {
>>          gdprintk(XENLOG_INFO, "Re-injecting %#"PRIx32", %#"PRIx32"\n",
>>                   c->pending_event, c->error_code);
>> -
>> -        if ( hvm_event_needs_reinjection(c->pending_type, c->pending_vector) )
>> -        {
>> -            vmx_vmcs_enter(v);
>> -            __vmwrite(VM_ENTRY_INTR_INFO, c->pending_event);
>> -            __vmwrite(VM_ENTRY_EXCEPTION_ERROR_CODE, c->error_code);
>> -            vmx_vmcs_exit(v);
>> -        }
>> +        __vmwrite(VM_ENTRY_INTR_INFO, c->pending_event);
>> +        __vmwrite(VM_ENTRY_EXCEPTION_ERROR_CODE, c->error_code);
>>      }
>> +    else
>> +    {
>> +        __vmwrite(VM_ENTRY_INTR_INFO, 0);
>> +        __vmwrite(VM_ENTRY_EXCEPTION_ERROR_CODE, 0);
>> +    }
>> +    vmx_vmcs_exit(v);
>> +
>> +    paging_update_paging_modes(v);
>>  
>>      return 0;
>>  }
>> -- 
>> 1.9.3

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC Patch v2 00/45] COarse-grain LOck-stepping Virtual Machines for Non-stop Service
  2014-08-08  7:19 ` [RFC Patch v2 00/45] COarse-grain LOck-stepping Virtual Machines for Non-stop Service Jan Beulich
@ 2014-08-08  7:39   ` Wen Congyang
  2014-08-08  8:21   ` Wen Congyang
  1 sibling, 0 replies; 64+ messages in thread
From: Wen Congyang @ 2014-08-08  7:39 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Ian Campbell, Ian Jackson, Jiang Yunhong, Dong Eddie, xen devel,
	Yang Hongyang, Lai Jiangshan

At 08/08/2014 03:19 PM, Jan Beulich wrote:
>>>> On 08.08.14 at 09:00, <wency@cn.fujitsu.com> wrote:
>> Patch 1-3  : bugfix
> 
> Such should imo be submitted as a separate prereq series.

OK

> 
>> Patch 4-11 : update some APIs which will be used by colo
>> Patch 12-15: temporarily update remus to reuse remus device codes
>> Patch 16-23: COLO framework related codes
>> Patch 24   : Hack patch, just for test
>> Patch 25-34: bugfix for blktap2
>> Patch 35-38: move some block-remus's codes to block-replication.c. These codes will
>>              be reused by COLO.
>> Patch 39   : implement block-colo
>> Patch 40-43: update libxl to support blktap2
>> Patch 44   : implement disk replication
>> Patch 45   : hypervisor bugfix. We find this bug before rebasing colo to newest xen.
>>              But we don't trigger this bug now.
>> Patch 46   : A patch for qemu-xen
> 
> And this probably goes for the whole series: Splitting this up by
> component would make it much clearer who of the maintainers
> needs to take a look at which pieces.

OK.

Thanks
Wen Congyang

> 
> Jan
> 
> .
> 


* Re: [RFC Patch v2 00/45] COarse-grain LOck-stepping Virtual Machines for Non-stop Service
  2014-08-08  7:19 ` [RFC Patch v2 00/45] COarse-grain LOck-stepping Virtual Machines for Non-stop Service Jan Beulich
  2014-08-08  7:39   ` Wen Congyang
@ 2014-08-08  8:21   ` Wen Congyang
  2014-08-08  9:02     ` Jan Beulich
  1 sibling, 1 reply; 64+ messages in thread
From: Wen Congyang @ 2014-08-08  8:21 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Ian Campbell, Ian Jackson, Jiang Yunhong, Dong Eddie, xen devel,
	Yang Hongyang, Lai Jiangshan

At 08/08/2014 03:19 PM, Jan Beulich wrote:
>>>> On 08.08.14 at 09:00, <wency@cn.fujitsu.com> wrote:
>> Patch 1-3  : bugfix
> 
> Such should imo be submitted as a separate prereq series.

I have reposted patches 1-3, 25-34, 40, 41 and 43 in another series:
http://lists.xenproject.org/archives/html/xen-devel/2014-08/msg00971.html

Thanks
Wen Congyang

> 
>> Patch 4-11 : update some APIs which will be used by colo
>> Patch 12-15: temporarily update remus to reuse remus device codes
>> Patch 16-23: COLO framework related codes
>> Patch 24   : Hack patch, just for test
>> Patch 25-34: bugfix for blktap2
>> Patch 35-38: move some block-remus's codes to block-replication.c. These codes will
>>              be reused by COLO.
>> Patch 39   : implement block-colo
>> Patch 40-43: update libxl to support blktap2
>> Patch 44   : implement disk replication
>> Patch 45   : hypervisor bugfix. We find this bug before rebasing colo to newest xen.
>>              But we don't trigger this bug now.
>> Patch 46   : A patch for qemu-xen
> 
> And this probably goes for the whole series: Splitting this up by
> component would make it much clearer who of the maintainers
> needs to take a look at which pieces.
> 
> Jan
> 
> .
> 


* Re: [RFC Patch v2 00/45] COarse-grain LOck-stepping Virtual Machines for Non-stop Service
  2014-08-08  8:21   ` Wen Congyang
@ 2014-08-08  9:02     ` Jan Beulich
  0 siblings, 0 replies; 64+ messages in thread
From: Jan Beulich @ 2014-08-08  9:02 UTC (permalink / raw)
  To: Wen Congyang
  Cc: Ian Campbell, Ian Jackson, Jiang Yunhong, Dong Eddie, xen devel,
	Yang Hongyang, Lai Jiangshan

>>> On 08.08.14 at 10:21, <wency@cn.fujitsu.com> wrote:
> At 08/08/2014 03:19 PM, Jan Beulich Write:
>>>>> On 08.08.14 at 09:00, <wency@cn.fujitsu.com> wrote:
>>> Patch 1-3  : bugfix
>> 
>> Such should imo be submitted as a separate prereq series.
> 
> I have repost patch 1-3, 25-34, 40, 41, 43 in another series:
> http://lists.xenproject.org/archives/html/xen-devel/2014-08/msg00971.html 

Which then you said we should ignore.

Jan


* Re: [RFC Patch v2 45/45] x86/hvm: Always set pending event injection when loading VMC[BS] state.
  2014-08-08  7:01 ` [RFC Patch v2 45/45] x86/hvm: Always set pending event injection when loading VMC[BS] state Wen Congyang
  2014-08-08  7:24   ` Jan Beulich
@ 2014-08-26 16:02   ` Jan Beulich
  2014-08-27  0:46     ` Wen Congyang
  2014-08-27 23:24     ` Tian, Kevin
  2014-08-27 15:02   ` Andrew Cooper
  2 siblings, 2 replies; 64+ messages in thread
From: Jan Beulich @ 2014-08-26 16:02 UTC (permalink / raw)
  To: Wen Congyang
  Cc: Tim Deegan, Kevin Tian, Ian Campbell, Jun Nakajima, Dong Eddie,
	Ian Jackson, xen devel, Aravind Gopalakrishnan,
	suravee.suthikulpanit, Boris Ostrovsky, Yang Hongyang,
	Lai Jiangshan

>>> On 08.08.14 at 09:01, <wency@cn.fujitsu.com> wrote:
> In colo mode, secondary vm is running, so VM_ENTRY_INTR_INFO may
> valid before restoring vmcs. If there is no pending event after
> restoring vm, we should clear it.
> 
> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> 
> Also clear pending software exceptions.
> Copy the fix to SVM as well.
> 
> Signed-off-by: Tim Deegan <tim@xen.org>

I only now realized that it's no surprise we're not getting acks from
the VMX maintainers on this - the majority of them weren't Cc-ed.
Now done, but please take care to do so yourself in the future.

As to the SVM maintainers - Ping (I Cc-ed you on an earlier reply)?

Jan

> ---
>  xen/arch/x86/hvm/svm/svm.c | 16 +++++++++-------
>  xen/arch/x86/hvm/vmx/vmx.c | 25 ++++++++++++-------------
>  2 files changed, 21 insertions(+), 20 deletions(-)
> 
> diff --git a/xen/arch/x86/hvm/svm/svm.c b/xen/arch/x86/hvm/svm/svm.c
> index 71b8a6a..f7a0cb8 100644
> --- a/xen/arch/x86/hvm/svm/svm.c
> +++ b/xen/arch/x86/hvm/svm/svm.c
> @@ -321,16 +321,18 @@ static int svm_vmcb_restore(struct vcpu *v, struct 
> hvm_hw_cpu *c)
>          vmcb_set_h_cr3(vmcb, pagetable_get_paddr(p2m_get_pagetable(p2m)));
>      }
>  
> -    if ( c->pending_valid ) 
> +    if ( c->pending_valid
> +         && hvm_event_needs_reinjection(c->pending_type, c->pending_vector) )
>      {
>          gdprintk(XENLOG_INFO, "Re-injecting %#"PRIx32", %#"PRIx32"\n",
>                   c->pending_event, c->error_code);
> -
> -        if ( hvm_event_needs_reinjection(c->pending_type, c->pending_vector) )
> -        {
> -            vmcb->eventinj.bytes = c->pending_event;
> -            vmcb->eventinj.fields.errorcode = c->error_code;
> -        }
> +        vmcb->eventinj.bytes = c->pending_event;
> +        vmcb->eventinj.fields.errorcode = c->error_code;
> +    }
> +    else
> +    {
> +        vmcb->eventinj.bytes = 0;
> +        vmcb->eventinj.fields.errorcode = 0;
>      }
>  
>      vmcb->cleanbits.bytes = 0;
> diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
> index fb65c7d..5f143c0 100644
> --- a/xen/arch/x86/hvm/vmx/vmx.c
> +++ b/xen/arch/x86/hvm/vmx/vmx.c
> @@ -509,23 +509,22 @@ static int vmx_vmcs_restore(struct vcpu *v, struct 
> hvm_hw_cpu *c)
>  
>      __vmwrite(GUEST_DR7, c->dr7);
>  
> -    vmx_vmcs_exit(v);
> -
> -    paging_update_paging_modes(v);
> -
> -    if ( c->pending_valid )
> +    if ( c->pending_valid
> +         && hvm_event_needs_reinjection(c->pending_type, c->pending_vector) )
>      {
>          gdprintk(XENLOG_INFO, "Re-injecting %#"PRIx32", %#"PRIx32"\n",
>                   c->pending_event, c->error_code);
> -
> -        if ( hvm_event_needs_reinjection(c->pending_type, c->pending_vector) )
> -        {
> -            vmx_vmcs_enter(v);
> -            __vmwrite(VM_ENTRY_INTR_INFO, c->pending_event);
> -            __vmwrite(VM_ENTRY_EXCEPTION_ERROR_CODE, c->error_code);
> -            vmx_vmcs_exit(v);
> -        }
> +        __vmwrite(VM_ENTRY_INTR_INFO, c->pending_event);
> +        __vmwrite(VM_ENTRY_EXCEPTION_ERROR_CODE, c->error_code);
>      }
> +    else
> +    {
> +        __vmwrite(VM_ENTRY_INTR_INFO, 0);
> +        __vmwrite(VM_ENTRY_EXCEPTION_ERROR_CODE, 0);
> +    }
> +    vmx_vmcs_exit(v);
> +
> +    paging_update_paging_modes(v);
>  
>      return 0;
>  }
> -- 
> 1.9.3
> 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org 
> http://lists.xen.org/xen-devel 


* Re: [RFC Patch v2 45/45] x86/hvm: Always set pending event injection when loading VMC[BS] state.
  2014-08-26 16:02   ` Jan Beulich
@ 2014-08-27  0:46     ` Wen Congyang
  2014-08-27 14:58       ` Aravind Gopalakrishnan
  2014-08-27 23:24     ` Tian, Kevin
  1 sibling, 1 reply; 64+ messages in thread
From: Wen Congyang @ 2014-08-27  0:46 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Tim Deegan, Kevin Tian, Ian Campbell, Jun Nakajima, Dong Eddie,
	Ian Jackson, xen devel, Aravind Gopalakrishnan,
	suravee.suthikulpanit, Boris Ostrovsky, Yang Hongyang,
	Lai Jiangshan

At 08/27/2014 12:02 AM, Jan Beulich wrote:
>>>> On 08.08.14 at 09:01, <wency@cn.fujitsu.com> wrote:
>> In colo mode, secondary vm is running, so VM_ENTRY_INTR_INFO may
>> valid before restoring vmcs. If there is no pending event after
>> restoring vm, we should clear it.
>>
>> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
>>
>> Also clear pending software exceptions.
>> Copy the fix to SVM as well.
>>
>> Signed-off-by: Tim Deegan <tim@xen.org>
> 
> I only now realized that it's no surprise we're not getting acks from
> the VMX maintainers on this - the majority of them wasn't Cc-ed.
> Now done, but please take care to do so yourself in the future.
> 
> As to the SVM maintainers - Ping (I Cc-ed you on an earlier reply)?

Thanks for doing this.
I have reposted it in the bugfix patchset and Cc-ed the VMX and SVM maintainers.

Wen Congyang

> 
> Jan
> 
>> ---
>>  xen/arch/x86/hvm/svm/svm.c | 16 +++++++++-------
>>  xen/arch/x86/hvm/vmx/vmx.c | 25 ++++++++++++-------------
>>  2 files changed, 21 insertions(+), 20 deletions(-)
>>
>> diff --git a/xen/arch/x86/hvm/svm/svm.c b/xen/arch/x86/hvm/svm/svm.c
>> index 71b8a6a..f7a0cb8 100644
>> --- a/xen/arch/x86/hvm/svm/svm.c
>> +++ b/xen/arch/x86/hvm/svm/svm.c
>> @@ -321,16 +321,18 @@ static int svm_vmcb_restore(struct vcpu *v, struct 
>> hvm_hw_cpu *c)
>>          vmcb_set_h_cr3(vmcb, pagetable_get_paddr(p2m_get_pagetable(p2m)));
>>      }
>>  
>> -    if ( c->pending_valid ) 
>> +    if ( c->pending_valid
>> +         && hvm_event_needs_reinjection(c->pending_type, c->pending_vector) )
>>      {
>>          gdprintk(XENLOG_INFO, "Re-injecting %#"PRIx32", %#"PRIx32"\n",
>>                   c->pending_event, c->error_code);
>> -
>> -        if ( hvm_event_needs_reinjection(c->pending_type, c->pending_vector) )
>> -        {
>> -            vmcb->eventinj.bytes = c->pending_event;
>> -            vmcb->eventinj.fields.errorcode = c->error_code;
>> -        }
>> +        vmcb->eventinj.bytes = c->pending_event;
>> +        vmcb->eventinj.fields.errorcode = c->error_code;
>> +    }
>> +    else
>> +    {
>> +        vmcb->eventinj.bytes = 0;
>> +        vmcb->eventinj.fields.errorcode = 0;
>>      }
>>  
>>      vmcb->cleanbits.bytes = 0;
>> diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
>> index fb65c7d..5f143c0 100644
>> --- a/xen/arch/x86/hvm/vmx/vmx.c
>> +++ b/xen/arch/x86/hvm/vmx/vmx.c
>> @@ -509,23 +509,22 @@ static int vmx_vmcs_restore(struct vcpu *v, struct 
>> hvm_hw_cpu *c)
>>  
>>      __vmwrite(GUEST_DR7, c->dr7);
>>  
>> -    vmx_vmcs_exit(v);
>> -
>> -    paging_update_paging_modes(v);
>> -
>> -    if ( c->pending_valid )
>> +    if ( c->pending_valid
>> +         && hvm_event_needs_reinjection(c->pending_type, c->pending_vector) )
>>      {
>>          gdprintk(XENLOG_INFO, "Re-injecting %#"PRIx32", %#"PRIx32"\n",
>>                   c->pending_event, c->error_code);
>> -
>> -        if ( hvm_event_needs_reinjection(c->pending_type, c->pending_vector) )
>> -        {
>> -            vmx_vmcs_enter(v);
>> -            __vmwrite(VM_ENTRY_INTR_INFO, c->pending_event);
>> -            __vmwrite(VM_ENTRY_EXCEPTION_ERROR_CODE, c->error_code);
>> -            vmx_vmcs_exit(v);
>> -        }
>> +        __vmwrite(VM_ENTRY_INTR_INFO, c->pending_event);
>> +        __vmwrite(VM_ENTRY_EXCEPTION_ERROR_CODE, c->error_code);
>>      }
>> +    else
>> +    {
>> +        __vmwrite(VM_ENTRY_INTR_INFO, 0);
>> +        __vmwrite(VM_ENTRY_EXCEPTION_ERROR_CODE, 0);
>> +    }
>> +    vmx_vmcs_exit(v);
>> +
>> +    paging_update_paging_modes(v);
>>  
>>      return 0;
>>  }
>> -- 
>> 1.9.3
>>
>>
>> _______________________________________________
>> Xen-devel mailing list
>> Xen-devel@lists.xen.org 
>> http://lists.xen.org/xen-devel 
> 
> 
> 
> .
> 


* Re: [RFC Patch v2 45/45] x86/hvm: Always set pending event injection when loading VMC[BS] state.
  2014-08-27  0:46     ` Wen Congyang
@ 2014-08-27 14:58       ` Aravind Gopalakrishnan
  2014-08-28  1:04         ` Wen Congyang
  2014-08-28  9:53         ` Tim Deegan
  0 siblings, 2 replies; 64+ messages in thread
From: Aravind Gopalakrishnan @ 2014-08-27 14:58 UTC (permalink / raw)
  To: Wen Congyang, Jan Beulich
  Cc: Tim Deegan, Kevin Tian, Ian Campbell, Jun Nakajima, Ian Jackson,
	Dong Eddie, xen devel, suravee.suthikulpanit, Boris Ostrovsky,
	Yang Hongyang, Lai Jiangshan

On 8/26/2014 7:46 PM, Wen Congyang wrote:
> At 08/27/2014 12:02 AM, Jan Beulich Write:
>>>>> On 08.08.14 at 09:01, <wency@cn.fujitsu.com> wrote:
>>> In colo mode, secondary vm is running, so VM_ENTRY_INTR_INFO may
>>> valid before restoring vmcs. If there is no pending event after
>>> restoring vm, we should clear it.
>>>
>>> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
>>>
>>> Also clear pending software exceptions.
>>> Copy the fix to SVM as well.
>>>
>>> Signed-off-by: Tim Deegan <tim@xen.org>
>> I only now realized that it's no surprise we're not getting acks from
>> the VMX maintainers on this - the majority of them wasn't Cc-ed.
>> Now done, but please take care to do so yourself in the future.
>>
>> As to the SVM maintainers - Ping (I Cc-ed you on an earlier reply)?
> Thanks for doing this.
> I have repost it in the bugfix patchset, and cc vmx and svm maintainers
>

Hi,
Apologies for the delay.

As for the SVM changes, the patch seems fairly straightforward and harmless.
However, I am not familiar with 'colo mode', so I'm not sure I understand
the problem.

Is this a 'fix' we need so that we don't duplicate a pending software
interrupt on the secondary VM?
Is there a way to test this?

Thanks,
-Aravind.

>>> ---
>>>   xen/arch/x86/hvm/svm/svm.c | 16 +++++++++-------
>>>   xen/arch/x86/hvm/vmx/vmx.c | 25 ++++++++++++-------------
>>>   2 files changed, 21 insertions(+), 20 deletions(-)
>>>
>>> diff --git a/xen/arch/x86/hvm/svm/svm.c b/xen/arch/x86/hvm/svm/svm.c
>>> index 71b8a6a..f7a0cb8 100644
>>> --- a/xen/arch/x86/hvm/svm/svm.c
>>> +++ b/xen/arch/x86/hvm/svm/svm.c
>>> @@ -321,16 +321,18 @@ static int svm_vmcb_restore(struct vcpu *v, struct
>>> hvm_hw_cpu *c)
>>>           vmcb_set_h_cr3(vmcb, pagetable_get_paddr(p2m_get_pagetable(p2m)));
>>>       }
>>>   
>>> -    if ( c->pending_valid )
>>> +    if ( c->pending_valid
>>> +         && hvm_event_needs_reinjection(c->pending_type, c->pending_vector) )
>>>       {
>>>           gdprintk(XENLOG_INFO, "Re-injecting %#"PRIx32", %#"PRIx32"\n",
>>>                    c->pending_event, c->error_code);
>>> -
>>> -        if ( hvm_event_needs_reinjection(c->pending_type, c->pending_vector) )
>>> -        {
>>> -            vmcb->eventinj.bytes = c->pending_event;
>>> -            vmcb->eventinj.fields.errorcode = c->error_code;
>>> -        }
>>> +        vmcb->eventinj.bytes = c->pending_event;
>>> +        vmcb->eventinj.fields.errorcode = c->error_code;
>>> +    }
>>> +    else
>>> +    {
>>> +        vmcb->eventinj.bytes = 0;
>>> +        vmcb->eventinj.fields.errorcode = 0;
>>>       }
>>>   
>>>       vmcb->cleanbits.bytes = 0;
>>> diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
>>> index fb65c7d..5f143c0 100644
>>> --- a/xen/arch/x86/hvm/vmx/vmx.c
>>> +++ b/xen/arch/x86/hvm/vmx/vmx.c
>>> @@ -509,23 +509,22 @@ static int vmx_vmcs_restore(struct vcpu *v, struct
>>> hvm_hw_cpu *c)
>>>   
>>>       __vmwrite(GUEST_DR7, c->dr7);
>>>   
>>> -    vmx_vmcs_exit(v);
>>> -
>>> -    paging_update_paging_modes(v);
>>> -
>>> -    if ( c->pending_valid )
>>> +    if ( c->pending_valid
>>> +         && hvm_event_needs_reinjection(c->pending_type, c->pending_vector) )
>>>       {
>>>           gdprintk(XENLOG_INFO, "Re-injecting %#"PRIx32", %#"PRIx32"\n",
>>>                    c->pending_event, c->error_code);
>>> -
>>> -        if ( hvm_event_needs_reinjection(c->pending_type, c->pending_vector) )
>>> -        {
>>> -            vmx_vmcs_enter(v);
>>> -            __vmwrite(VM_ENTRY_INTR_INFO, c->pending_event);
>>> -            __vmwrite(VM_ENTRY_EXCEPTION_ERROR_CODE, c->error_code);
>>> -            vmx_vmcs_exit(v);
>>> -        }
>>> +        __vmwrite(VM_ENTRY_INTR_INFO, c->pending_event);
>>> +        __vmwrite(VM_ENTRY_EXCEPTION_ERROR_CODE, c->error_code);
>>>       }
>>> +    else
>>> +    {
>>> +        __vmwrite(VM_ENTRY_INTR_INFO, 0);
>>> +        __vmwrite(VM_ENTRY_EXCEPTION_ERROR_CODE, 0);
>>> +    }
>>> +    vmx_vmcs_exit(v);
>>> +
>>> +    paging_update_paging_modes(v);
>>>   
>>>       return 0;
>>>   }
>>> -- 
>>> 1.9.3
>>>
>>>
>>> _______________________________________________
>>> Xen-devel mailing list
>>> Xen-devel@lists.xen.org
>>> http://lists.xen.org/xen-devel
>>
>>
>> .
>>


* Re: [RFC Patch v2 45/45] x86/hvm: Always set pending event injection when loading VMC[BS] state.
  2014-08-08  7:01 ` [RFC Patch v2 45/45] x86/hvm: Always set pending event injection when loading VMC[BS] state Wen Congyang
  2014-08-08  7:24   ` Jan Beulich
  2014-08-26 16:02   ` Jan Beulich
@ 2014-08-27 15:02   ` Andrew Cooper
  2 siblings, 0 replies; 64+ messages in thread
From: Andrew Cooper @ 2014-08-27 15:02 UTC (permalink / raw)
  To: Wen Congyang, xen devel
  Cc: Ian Campbell, Ian Jackson, Jiang Yunhong, Dong Eddie, Tim Deegan,
	Yang Hongyang, Lai Jiangshan

On 08/08/14 08:01, Wen Congyang wrote:
> In colo mode, secondary vm is running, so VM_ENTRY_INTR_INFO may
> valid before restoring vmcs. If there is no pending event after
> restoring vm, we should clear it.
>
> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
>
> Also clear pending software exceptions.
> Copy the fix to SVM as well.
>
> Signed-off-by: Tim Deegan <tim@xen.org>
> ---
>  xen/arch/x86/hvm/svm/svm.c | 16 +++++++++-------
>  xen/arch/x86/hvm/vmx/vmx.c | 25 ++++++++++++-------------
>  2 files changed, 21 insertions(+), 20 deletions(-)
>
> diff --git a/xen/arch/x86/hvm/svm/svm.c b/xen/arch/x86/hvm/svm/svm.c
> index 71b8a6a..f7a0cb8 100644
> --- a/xen/arch/x86/hvm/svm/svm.c
> +++ b/xen/arch/x86/hvm/svm/svm.c
> @@ -321,16 +321,18 @@ static int svm_vmcb_restore(struct vcpu *v, struct hvm_hw_cpu *c)
>          vmcb_set_h_cr3(vmcb, pagetable_get_paddr(p2m_get_pagetable(p2m)));
>      }
>  
> -    if ( c->pending_valid ) 
> +    if ( c->pending_valid
> +         && hvm_event_needs_reinjection(c->pending_type, c->pending_vector) )
>      {
>          gdprintk(XENLOG_INFO, "Re-injecting %#"PRIx32", %#"PRIx32"\n",
>                   c->pending_event, c->error_code);
> -
> -        if ( hvm_event_needs_reinjection(c->pending_type, c->pending_vector) )
> -        {
> -            vmcb->eventinj.bytes = c->pending_event;
> -            vmcb->eventinj.fields.errorcode = c->error_code;
> -        }
> +        vmcb->eventinj.bytes = c->pending_event;
> +        vmcb->eventinj.fields.errorcode = c->error_code;
> +    }
> +    else
> +    {
> +        vmcb->eventinj.bytes = 0;
> +        vmcb->eventinj.fields.errorcode = 0;
>      }

vmcb->eventinj.bytes is part of a union which fully covers .fields. 
Explicitly setting errorcode=0 is redundant.
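
To illustrate the point about the union, here is a minimal sketch. The type
demo_eventinj_t and the helper demo_clear_event() are hypothetical stand-ins
written for this example; the field names and widths are assumptions, not
copied from the Xen headers. It relies on 64-bit bit-fields, which GCC and
Clang accept as an extension.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the VMCB event-injection field: one 64-bit word
 * overlaid with bit-fields. Names and widths are assumptions for this sketch. */
typedef union {
    uint64_t bytes;
    struct {
        uint64_t vector    : 8;
        uint64_t type      : 3;
        uint64_t ev        : 1;   /* error code valid */
        uint64_t reserved  : 19;
        uint64_t v         : 1;   /* event valid */
        uint64_t errorcode : 32;
    } fields;
} demo_eventinj_t;

/* The bit-fields add up to 64 bits, so 'bytes' overlays all of them. */
static_assert(sizeof(demo_eventinj_t) == sizeof(uint64_t),
              "bytes must cover every bit-field");

/* Clearing through 'bytes' alone also clears 'errorcode' and 'v'. */
static void demo_clear_event(demo_eventinj_t *e)
{
    e->bytes = 0;
}
```

Under these assumptions a single store to .bytes is enough in the else
branch; the separate errorcode assignment writes bits that are already zero.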

~Andrew

>  
>      vmcb->cleanbits.bytes = 0;
> diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
> index fb65c7d..5f143c0 100644
> --- a/xen/arch/x86/hvm/vmx/vmx.c
> +++ b/xen/arch/x86/hvm/vmx/vmx.c
> @@ -509,23 +509,22 @@ static int vmx_vmcs_restore(struct vcpu *v, struct hvm_hw_cpu *c)
>  
>      __vmwrite(GUEST_DR7, c->dr7);
>  
> -    vmx_vmcs_exit(v);
> -
> -    paging_update_paging_modes(v);
> -
> -    if ( c->pending_valid )
> +    if ( c->pending_valid
> +         && hvm_event_needs_reinjection(c->pending_type, c->pending_vector) )
>      {
>          gdprintk(XENLOG_INFO, "Re-injecting %#"PRIx32", %#"PRIx32"\n",
>                   c->pending_event, c->error_code);
> -
> -        if ( hvm_event_needs_reinjection(c->pending_type, c->pending_vector) )
> -        {
> -            vmx_vmcs_enter(v);
> -            __vmwrite(VM_ENTRY_INTR_INFO, c->pending_event);
> -            __vmwrite(VM_ENTRY_EXCEPTION_ERROR_CODE, c->error_code);
> -            vmx_vmcs_exit(v);
> -        }
> +        __vmwrite(VM_ENTRY_INTR_INFO, c->pending_event);
> +        __vmwrite(VM_ENTRY_EXCEPTION_ERROR_CODE, c->error_code);
>      }
> +    else
> +    {
> +        __vmwrite(VM_ENTRY_INTR_INFO, 0);
> +        __vmwrite(VM_ENTRY_EXCEPTION_ERROR_CODE, 0);
> +    }
> +    vmx_vmcs_exit(v);
> +
> +    paging_update_paging_modes(v);
>  
>      return 0;
>  }


* Re: [RFC Patch v2 45/45] x86/hvm: Always set pending event injection when loading VMC[BS] state.
  2014-08-26 16:02   ` Jan Beulich
  2014-08-27  0:46     ` Wen Congyang
@ 2014-08-27 23:24     ` Tian, Kevin
  1 sibling, 0 replies; 64+ messages in thread
From: Tian, Kevin @ 2014-08-27 23:24 UTC (permalink / raw)
  To: Jan Beulich, Wen Congyang
  Cc: Tim Deegan, Ian Campbell, Nakajima, Jun, Dong, Eddie, Ian Jackson,
	xen devel, Aravind Gopalakrishnan, suravee.suthikulpanit@amd.com,
	Boris Ostrovsky, Yang Hongyang, Lai Jiangshan

> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Tuesday, August 26, 2014 9:02 AM
> 
> >>> On 08.08.14 at 09:01, <wency@cn.fujitsu.com> wrote:
> > In colo mode, secondary vm is running, so VM_ENTRY_INTR_INFO may
> > valid before restoring vmcs. If there is no pending event after
> > restoring vm, we should clear it.
> >
> > Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> >
> > Also clear pending software exceptions.
> > Copy the fix to SVM as well.
> >
> > Signed-off-by: Tim Deegan <tim@xen.org>
> 
> I only now realized that it's no surprise we're not getting acks from
> the VMX maintainers on this - the majority of them wasn't Cc-ed.
> Now done, but please take care to do so yourself in the future.

Thanks for forwarding. The VMX part looks good to me.

Acked-by: Kevin Tian <kevin.tian@intel.com>

> 
> As to the SVM maintainers - Ping (I Cc-ed you on an earlier reply)?
> 
> Jan
> 
> > ---
> >  xen/arch/x86/hvm/svm/svm.c | 16 +++++++++-------
> >  xen/arch/x86/hvm/vmx/vmx.c | 25 ++++++++++++-------------
> >  2 files changed, 21 insertions(+), 20 deletions(-)
> >
> > diff --git a/xen/arch/x86/hvm/svm/svm.c b/xen/arch/x86/hvm/svm/svm.c
> > index 71b8a6a..f7a0cb8 100644
> > --- a/xen/arch/x86/hvm/svm/svm.c
> > +++ b/xen/arch/x86/hvm/svm/svm.c
> > @@ -321,16 +321,18 @@ static int svm_vmcb_restore(struct vcpu *v, struct
> > hvm_hw_cpu *c)
> >          vmcb_set_h_cr3(vmcb,
> pagetable_get_paddr(p2m_get_pagetable(p2m)));
> >      }
> >
> > -    if ( c->pending_valid )
> > +    if ( c->pending_valid
> > +         && hvm_event_needs_reinjection(c->pending_type,
> c->pending_vector) )
> >      {
> >          gdprintk(XENLOG_INFO,
> "Re-injecting %#"PRIx32", %#"PRIx32"\n",
> >                   c->pending_event, c->error_code);
> > -
> > -        if ( hvm_event_needs_reinjection(c->pending_type,
> c->pending_vector) )
> > -        {
> > -            vmcb->eventinj.bytes = c->pending_event;
> > -            vmcb->eventinj.fields.errorcode = c->error_code;
> > -        }
> > +        vmcb->eventinj.bytes = c->pending_event;
> > +        vmcb->eventinj.fields.errorcode = c->error_code;
> > +    }
> > +    else
> > +    {
> > +        vmcb->eventinj.bytes = 0;
> > +        vmcb->eventinj.fields.errorcode = 0;
> >      }
> >
> >      vmcb->cleanbits.bytes = 0;
> > diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
> > index fb65c7d..5f143c0 100644
> > --- a/xen/arch/x86/hvm/vmx/vmx.c
> > +++ b/xen/arch/x86/hvm/vmx/vmx.c
> > @@ -509,23 +509,22 @@ static int vmx_vmcs_restore(struct vcpu *v, struct
> > hvm_hw_cpu *c)
> >
> >      __vmwrite(GUEST_DR7, c->dr7);
> >
> > -    vmx_vmcs_exit(v);
> > -
> > -    paging_update_paging_modes(v);
> > -
> > -    if ( c->pending_valid )
> > +    if ( c->pending_valid
> > +         && hvm_event_needs_reinjection(c->pending_type,
> c->pending_vector) )
> >      {
> >          gdprintk(XENLOG_INFO,
> "Re-injecting %#"PRIx32", %#"PRIx32"\n",
> >                   c->pending_event, c->error_code);
> > -
> > -        if ( hvm_event_needs_reinjection(c->pending_type,
> c->pending_vector) )
> > -        {
> > -            vmx_vmcs_enter(v);
> > -            __vmwrite(VM_ENTRY_INTR_INFO, c->pending_event);
> > -            __vmwrite(VM_ENTRY_EXCEPTION_ERROR_CODE,
> c->error_code);
> > -            vmx_vmcs_exit(v);
> > -        }
> > +        __vmwrite(VM_ENTRY_INTR_INFO, c->pending_event);
> > +        __vmwrite(VM_ENTRY_EXCEPTION_ERROR_CODE,
> c->error_code);
> >      }
> > +    else
> > +    {
> > +        __vmwrite(VM_ENTRY_INTR_INFO, 0);
> > +        __vmwrite(VM_ENTRY_EXCEPTION_ERROR_CODE, 0);
> > +    }
> > +    vmx_vmcs_exit(v);
> > +
> > +    paging_update_paging_modes(v);
> >
> >      return 0;
> >  }
> > --
> > 1.9.3
> >
> >
> > _______________________________________________
> > Xen-devel mailing list
> > Xen-devel@lists.xen.org
> > http://lists.xen.org/xen-devel
> 
> 


* Re: [RFC Patch v2 45/45] x86/hvm: Always set pending event injection when loading VMC[BS] state.
  2014-08-27 14:58       ` Aravind Gopalakrishnan
@ 2014-08-28  1:04         ` Wen Congyang
  2014-08-28  8:54           ` Andrew Cooper
  2014-08-28  9:53         ` Tim Deegan
  1 sibling, 1 reply; 64+ messages in thread
From: Wen Congyang @ 2014-08-28  1:04 UTC (permalink / raw)
  To: Aravind Gopalakrishnan, Jan Beulich
  Cc: Tim Deegan, Kevin Tian, Ian Campbell, Jun Nakajima, Ian Jackson,
	Dong Eddie, xen devel, suravee.suthikulpanit, Boris Ostrovsky,
	Yang Hongyang, Lai Jiangshan

At 08/27/2014 10:58 PM, Aravind Gopalakrishnan wrote:
> On 8/26/2014 7:46 PM, Wen Congyang wrote:
>> At 08/27/2014 12:02 AM, Jan Beulich Write:
>>>>>> On 08.08.14 at 09:01, <wency@cn.fujitsu.com> wrote:
>>>> In colo mode, secondary vm is running, so VM_ENTRY_INTR_INFO may
>>>> valid before restoring vmcs. If there is no pending event after
>>>> restoring vm, we should clear it.
>>>>
>>>> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
>>>>
>>>> Also clear pending software exceptions.
>>>> Copy the fix to SVM as well.
>>>>
>>>> Signed-off-by: Tim Deegan <tim@xen.org>
>>> I only now realized that it's no surprise we're not getting acks from
>>> the VMX maintainers on this - the majority of them wasn't Cc-ed.
>>> Now done, but please take care to do so yourself in the future.
>>>
>>> As to the SVM maintainers - Ping (I Cc-ed you on an earlier reply)?
>> Thanks for doing this.
>> I have repost it in the bugfix patchset, and cc vmx and svm maintainers
>>
> 
> Hi,
> Apologies for the delay.
> 
> As for the svm changes, the patch seems fairly straightforward and harmless.
> However, I am not familiar with 'colo mode', so I'm not sure I understand the problem..

In colo mode, the secondary VM runs in a loop like this:
1. suspend
2. update the VM's state (all state is transferred from the primary)
3. resume

Just before it is resumed, the secondary VM is in the same state as the primary VM.
Because the secondary VM was running before step 1, VM_ENTRY_INTR_INFO may still be
valid when the secondary VM is resumed (step 3). If the restored state has no pending
event and we don't clear the field, vmentry will fail (I have only tested VMX). I don't
know what the behavior is on SVM; that part of the patch was written by Tim Deegan.
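
The difference between the old and the new restore logic can be sketched
outside the hypervisor as follows. This is only a model under stated
assumptions: struct mock_vmcs, struct mock_saved_cpu and both restore
functions are hypothetical names made up for this example, and the
needs_reinjection flag stands in for the combined
pending_valid/hvm_event_needs_reinjection() check in the patch. The only
hardware fact used is that bit 31 of the VM-entry interruption-information
field is its valid bit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit 31 of the VM-entry interruption-information field marks it valid. */
#define INTR_INFO_VALID (UINT32_C(1) << 31)
static_assert(INTR_INFO_VALID == 0x80000000u, "valid flag is bit 31");

/* Hypothetical model of the two VMCS fields touched by the patch. */
struct mock_vmcs {
    uint32_t entry_intr_info;
    uint32_t entry_error_code;
};

/* Hypothetical model of the saved CPU record sent by the primary. */
struct mock_saved_cpu {
    uint32_t pending_event;
    uint32_t error_code;
    bool needs_reinjection;  /* stands in for the check done in the patch */
};

/* Old behaviour: the fields are written only when there is an event to
 * re-inject, so a value left over from the running secondary survives. */
static void restore_before_patch(struct mock_vmcs *vmcs,
                                 const struct mock_saved_cpu *c)
{
    if (c->needs_reinjection) {
        vmcs->entry_intr_info = c->pending_event;
        vmcs->entry_error_code = c->error_code;
    }
}

/* New behaviour: the fields are always written, and cleared when the
 * restored state carries no event. */
static void restore_after_patch(struct mock_vmcs *vmcs,
                                const struct mock_saved_cpu *c)
{
    if (c->needs_reinjection) {
        vmcs->entry_intr_info = c->pending_event;
        vmcs->entry_error_code = c->error_code;
    } else {
        vmcs->entry_intr_info = 0;
        vmcs->entry_error_code = 0;
    }
}

/* True if the mock VMCS would still inject an event on the next entry. */
static bool has_pending_event(const struct mock_vmcs *vmcs)
{
    return (vmcs->entry_intr_info & INTR_INFO_VALID) != 0;
}
```

In this model, starting from a VMCS that still holds a valid event and
restoring a record with no pending event leaves the stale event in place
with the old logic and clears it with the new one.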

> 
> Is this a 'fix' we need so that we don't duplicate a pending software interrupt on the secondary VM?
> Is there a way to test this?

I don't know of any way to test it other than with colo.

Thanks
Wen Congyang

> 
> Thanks,
> -Aravind.
> 
>>>> ---
>>>>   xen/arch/x86/hvm/svm/svm.c | 16 +++++++++-------
>>>>   xen/arch/x86/hvm/vmx/vmx.c | 25 ++++++++++++-------------
>>>>   2 files changed, 21 insertions(+), 20 deletions(-)
>>>>
>>>> diff --git a/xen/arch/x86/hvm/svm/svm.c b/xen/arch/x86/hvm/svm/svm.c
>>>> index 71b8a6a..f7a0cb8 100644
>>>> --- a/xen/arch/x86/hvm/svm/svm.c
>>>> +++ b/xen/arch/x86/hvm/svm/svm.c
>>>> @@ -321,16 +321,18 @@ static int svm_vmcb_restore(struct vcpu *v, struct
>>>> hvm_hw_cpu *c)
>>>>           vmcb_set_h_cr3(vmcb, pagetable_get_paddr(p2m_get_pagetable(p2m)));
>>>>       }
>>>>   -    if ( c->pending_valid )
>>>> +    if ( c->pending_valid
>>>> +         && hvm_event_needs_reinjection(c->pending_type, c->pending_vector) )
>>>>       {
>>>>           gdprintk(XENLOG_INFO, "Re-injecting %#"PRIx32", %#"PRIx32"\n",
>>>>                    c->pending_event, c->error_code);
>>>> -
>>>> -        if ( hvm_event_needs_reinjection(c->pending_type, c->pending_vector) )
>>>> -        {
>>>> -            vmcb->eventinj.bytes = c->pending_event;
>>>> -            vmcb->eventinj.fields.errorcode = c->error_code;
>>>> -        }
>>>> +        vmcb->eventinj.bytes = c->pending_event;
>>>> +        vmcb->eventinj.fields.errorcode = c->error_code;
>>>> +    }
>>>> +    else
>>>> +    {
>>>> +        vmcb->eventinj.bytes = 0;
>>>> +        vmcb->eventinj.fields.errorcode = 0;
>>>>       }
>>>>         vmcb->cleanbits.bytes = 0;
>>>> diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
>>>> index fb65c7d..5f143c0 100644
>>>> --- a/xen/arch/x86/hvm/vmx/vmx.c
>>>> +++ b/xen/arch/x86/hvm/vmx/vmx.c
>>>> @@ -509,23 +509,22 @@ static int vmx_vmcs_restore(struct vcpu *v, struct
>>>> hvm_hw_cpu *c)
>>>>         __vmwrite(GUEST_DR7, c->dr7);
>>>>   -    vmx_vmcs_exit(v);
>>>> -
>>>> -    paging_update_paging_modes(v);
>>>> -
>>>> -    if ( c->pending_valid )
>>>> +    if ( c->pending_valid
>>>> +         && hvm_event_needs_reinjection(c->pending_type, c->pending_vector) )
>>>>       {
>>>>           gdprintk(XENLOG_INFO, "Re-injecting %#"PRIx32", %#"PRIx32"\n",
>>>>                    c->pending_event, c->error_code);
>>>> -
>>>> -        if ( hvm_event_needs_reinjection(c->pending_type, c->pending_vector) )
>>>> -        {
>>>> -            vmx_vmcs_enter(v);
>>>> -            __vmwrite(VM_ENTRY_INTR_INFO, c->pending_event);
>>>> -            __vmwrite(VM_ENTRY_EXCEPTION_ERROR_CODE, c->error_code);
>>>> -            vmx_vmcs_exit(v);
>>>> -        }
>>>> +        __vmwrite(VM_ENTRY_INTR_INFO, c->pending_event);
>>>> +        __vmwrite(VM_ENTRY_EXCEPTION_ERROR_CODE, c->error_code);
>>>>       }
>>>> +    else
>>>> +    {
>>>> +        __vmwrite(VM_ENTRY_INTR_INFO, 0);
>>>> +        __vmwrite(VM_ENTRY_EXCEPTION_ERROR_CODE, 0);
>>>> +    }
>>>> +    vmx_vmcs_exit(v);
>>>> +
>>>> +    paging_update_paging_modes(v);
>>>>         return 0;
>>>>   }
>>>> -- 
>>>> 1.9.3
>>>>
>>>>
>>>> _______________________________________________
>>>> Xen-devel mailing list
>>>> Xen-devel@lists.xen.org
>>>> http://lists.xen.org/xen-devel
>>>
>>>
>>> .
>>>
> 
> .
> 


* Re: [RFC Patch v2 45/45] x86/hvm: Always set pending event injection when loading VMC[BS] state.
  2014-08-28  1:04         ` Wen Congyang
@ 2014-08-28  8:54           ` Andrew Cooper
  2014-08-28 11:17             ` Wen Congyang
  0 siblings, 1 reply; 64+ messages in thread
From: Andrew Cooper @ 2014-08-28  8:54 UTC (permalink / raw)
  To: Wen Congyang, Aravind Gopalakrishnan, Jan Beulich
  Cc: Kevin Tian, Yang Hongyang, Ian Campbell, Dong Eddie, Ian Jackson,
	Tim Deegan, Jun Nakajima, Boris Ostrovsky, xen devel,
	suravee.suthikulpanit, Lai Jiangshan

On 28/08/14 02:04, Wen Congyang wrote:
> At 08/27/2014 10:58 PM, Aravind Gopalakrishnan Write:
>> On 8/26/2014 7:46 PM, Wen Congyang wrote:
>>> At 08/27/2014 12:02 AM, Jan Beulich Write:
>>>>>>> On 08.08.14 at 09:01, <wency@cn.fujitsu.com> wrote:
>>>>> In colo mode, secondary vm is running, so VM_ENTRY_INTR_INFO may
>>>>> valid before restoring vmcs. If there is no pending event after
>>>>> restoring vm, we should clear it.
>>>>>
>>>>> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
>>>>>
>>>>> Also clear pending software exceptions.
>>>>> Copy the fix to SVM as well.
>>>>>
>>>>> Signed-off-by: Tim Deegan <tim@xen.org>
>>>> I only now realized that it's no surprise we're not getting acks from
>>>> the VMX maintainers on this - the majority of them wasn't Cc-ed.
>>>> Now done, but please take care to do so yourself in the future.
>>>>
>>>> As to the SVM maintainers - Ping (I Cc-ed you on an earlier reply)?
>>> Thanks for doing this.
>>> I have repost it in the bugfix patchset, and cc vmx and svm maintainers
>>>
>> Hi,
>> Apologies for the delay.
>>
>> As for the svm changes, the patch seems fairly straightforward and harmless.
>> However, I am not familiar with 'colo mode', so I'm not sure I understand the problem..
> In colo mode, secondary vm runs like this:
> 1. suspend
> 2. update the vm's state(All state is transfered from primary)
> 3. resume

Is this accurate?  From previous review, I seem to remember that you are
pausing the vm, not suspending it, which is where all of these event
issues derive from.

~Andrew


* Re: [RFC Patch v2 45/45] x86/hvm: Always set pending event injection when loading VMC[BS] state.
  2014-08-27 14:58       ` Aravind Gopalakrishnan
  2014-08-28  1:04         ` Wen Congyang
@ 2014-08-28  9:53         ` Tim Deegan
  1 sibling, 0 replies; 64+ messages in thread
From: Tim Deegan @ 2014-08-28  9:53 UTC (permalink / raw)
  To: Aravind Gopalakrishnan
  Cc: Kevin Tian, Lai Jiangshan, Wen Congyang, Jun Nakajima, Dong Eddie,
	Ian Jackson, xen devel, Jan Beulich, suravee.suthikulpanit,
	Boris Ostrovsky, Yang Hongyang, Ian Campbell

At 09:58 -0500 on 27 Aug (1409129888), Aravind Gopalakrishnan wrote:
> On 8/26/2014 7:46 PM, Wen Congyang wrote:
> > At 08/27/2014 12:02 AM, Jan Beulich Write:
> >>>>> On 08.08.14 at 09:01, <wency@cn.fujitsu.com> wrote:
> >>> In colo mode, secondary vm is running, so VM_ENTRY_INTR_INFO may
> >>> valid before restoring vmcs. If there is no pending event after
> >>> restoring vm, we should clear it.
> >>>
> >>> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> >>>
> >>> Also clear pending software exceptions.
> >>> Copy the fix to SVM as well.
> >>>
> >>> Signed-off-by: Tim Deegan <tim@xen.org>
> >> I only now realized that it's no surprise we're not getting acks from
> >> the VMX maintainers on this - the majority of them wasn't Cc-ed.
> >> Now done, but please take care to do so yourself in the future.
> >>
> >> As to the SVM maintainers - Ping (I Cc-ed you on an earlier reply)?
> > Thanks for doing this.
> > I have repost it in the bugfix patchset, and cc vmx and svm maintainers
> >
> 
> Hi,
> Apologies for the delay.
> 
> As for the svm changes, the patch seems fairly straightforward and harmless.
> However, I am not familiar with 'colo mode', so I'm not sure I 
> understand the problem..
> 
> Is this a 'fix' we need so that we don't duplicate a pending software 
> interrupt on the secondary VM?

Yes.  The fix is needed for any code that uses the HVM state
restore to overwrite the state of a live vcpu.  If the vcpu had a pending
injection before the restore op it will still be pending afterwards, even
though all the rest of the vcpu state has been reset.

We didn't see it before because we only ever load HVM state into a
freshly created domain.
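
To make the failure mode concrete, here is a toy C model of it. This is
only a sketch, not the Xen code: the struct layouts, the toy_* names and
the use of bit 31 as the "valid" bit are all assumptions made for
illustration. It contrasts a restore that writes the injection field only
when the saved record carries an event with one that always writes it.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical "valid" bit of the injection field in this toy model. */
#define TOY_INTR_INFO_VALID (1u << 31)

/* Toy stand-in for the live per-vcpu hardware state (not Xen's layout). */
struct toy_vcpu {
    uint64_t rip;
    uint32_t entry_intr_info;   /* models the pending event injection */
};

/* Toy stand-in for one saved HVM CPU record (not Xen's layout). */
struct toy_saved_cpu {
    uint64_t rip;
    bool     pending_valid;
    uint32_t pending_event;
};

/* True if the vcpu still has an event queued for injection. */
static bool toy_has_pending(const struct toy_vcpu *v)
{
    return (v->entry_intr_info & TOY_INTR_INFO_VALID) != 0;
}

/*
 * Restore that only touches the injection field when the record has an
 * event.  Fine for a freshly created vcpu (the field starts out clear),
 * but a stale event survives when restoring over a live vcpu.
 */
static void toy_restore_buggy(struct toy_vcpu *v,
                              const struct toy_saved_cpu *c)
{
    v->rip = c->rip;
    if (c->pending_valid)
        v->entry_intr_info = c->pending_event | TOY_INTR_INFO_VALID;
}

/*
 * Restore that always sets the injection field: either to the saved
 * event, or to zero when the record has none.
 */
static void toy_restore_fixed(struct toy_vcpu *v,
                              const struct toy_saved_cpu *c)
{
    v->rip = c->rip;
    v->entry_intr_info = c->pending_valid
                         ? (c->pending_event | TOY_INTR_INFO_VALID)
                         : 0;
}
```

In this model, loading a record with no pending event over a vcpu whose
injection field is still valid leaves the stale event in place with
toy_restore_buggy(), while toy_restore_fixed() clears it along with the
rest of the state.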

> Is there a way to test this?

No canned test.  You could run a guest that takes a lot of faults and
then restore another guest over it and look for spurious faults in the
guest.

Tim.


* Re: [RFC Patch v2 45/45] x86/hvm: Always set pending event injection when loading VMC[BS] state.
  2014-08-28  8:54           ` Andrew Cooper
@ 2014-08-28 11:17             ` Wen Congyang
  2014-08-28 11:31               ` Paul Durrant
  0 siblings, 1 reply; 64+ messages in thread
From: Wen Congyang @ 2014-08-28 11:17 UTC (permalink / raw)
  To: Andrew Cooper, Aravind Gopalakrishnan, Jan Beulich
  Cc: Kevin Tian, Yang Hongyang, Ian Campbell, Dong Eddie, Ian Jackson,
	Tim Deegan, Jun Nakajima, Boris Ostrovsky, xen devel,
	suravee.suthikulpanit, Lai Jiangshan

At 08/28/2014 04:54 PM, Andrew Cooper Write:
> On 28/08/14 02:04, Wen Congyang wrote:
>> At 08/27/2014 10:58 PM, Aravind Gopalakrishnan Write:
>>> On 8/26/2014 7:46 PM, Wen Congyang wrote:
>>>> At 08/27/2014 12:02 AM, Jan Beulich Write:
>>>>>>>> On 08.08.14 at 09:01, <wency@cn.fujitsu.com> wrote:
>>>>>> In colo mode, secondary vm is running, so VM_ENTRY_INTR_INFO may
>>>>>> valid before restoring vmcs. If there is no pending event after
>>>>>> restoring vm, we should clear it.
>>>>>>
>>>>>> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
>>>>>>
>>>>>> Also clear pending software exceptions.
>>>>>> Copy the fix to SVM as well.
>>>>>>
>>>>>> Signed-off-by: Tim Deegan <tim@xen.org>
>>>>> I only now realized that it's no surprise we're not getting acks from
>>>>> the VMX maintainers on this - the majority of them wasn't Cc-ed.
>>>>> Now done, but please take care to do so yourself in the future.
>>>>>
>>>>> As to the SVM maintainers - Ping (I Cc-ed you on an earlier reply)?
>>>> Thanks for doing this.
>>>> I have repost it in the bugfix patchset, and cc vmx and svm maintainers
>>>>
>>> Hi,
>>> Apologies for the delay.
>>>
>>> As for the svm changes, the patch seems fairly straightforward and harmless.
>>> However, I am not familiar with 'colo mode', so I'm not sure I understand the problem..
>> In colo mode, secondary vm runs like this:
>> 1. suspend
>> 2. update the vm's state(All state is transfered from primary)
>> 3. resume
> 
> Is this accurate?  From previous review, I seem to remember that you are
> pausing the vm, not suspending it, which is where all of these event
> issues derive from.

Not pause. We suspend the guest (without saving its state). Pausing the vm
means that the vm is not running, but its state cannot be updated. For
example, if the vm uses PV drivers (not supported now), the backend and
frontend share some information, and we only update the frontend (backend
state is not transferred from the primary dom0)...

Thanks
Wen Congyang

> 
> ~Andrew
> .
> 


* Re: [RFC Patch v2 45/45] x86/hvm: Always set pending event injection when loading VMC[BS] state.
  2014-08-28 11:17             ` Wen Congyang
@ 2014-08-28 11:31               ` Paul Durrant
  2014-08-29  5:59                 ` Wen Congyang
  0 siblings, 1 reply; 64+ messages in thread
From: Paul Durrant @ 2014-08-28 11:31 UTC (permalink / raw)
  To: Wen Congyang, Andrew Cooper, Aravind Gopalakrishnan, Jan Beulich
  Cc: Kevin Tian, Ian Campbell, Tim (Xen.org), Eddie Dong, xen devel,
	Jun Nakajima, Ian Jackson, Boris Ostrovsky, Yang Hongyang,
	suravee.suthikulpanit@amd.com, Lai Jiangshan

> -----Original Message-----
> From: xen-devel-bounces@lists.xen.org [mailto:xen-devel-
> bounces@lists.xen.org] On Behalf Of Wen Congyang
> Sent: 28 August 2014 12:18
> To: Andrew Cooper; Aravind Gopalakrishnan; Jan Beulich
> Cc: Kevin Tian; Yang Hongyang; Ian Campbell; Eddie Dong; Ian Jackson; Tim
> (Xen.org); Jun Nakajima; Boris Ostrovsky; xen devel;
> suravee.suthikulpanit@amd.com; Lai Jiangshan
> Subject: Re: [Xen-devel] [RFC Patch v2 45/45] x86/hvm: Always set pending
> event injection when loading VMC[BS] state.
> 
> At 08/28/2014 04:54 PM, Andrew Cooper Write:
> > On 28/08/14 02:04, Wen Congyang wrote:
> >> At 08/27/2014 10:58 PM, Aravind Gopalakrishnan Write:
> >>> On 8/26/2014 7:46 PM, Wen Congyang wrote:
> >>>> At 08/27/2014 12:02 AM, Jan Beulich Write:
> >>>>>>>> On 08.08.14 at 09:01, <wency@cn.fujitsu.com> wrote:
> >>>>>> In colo mode, secondary vm is running, so VM_ENTRY_INTR_INFO
> may
> >>>>>> valid before restoring vmcs. If there is no pending event after
> >>>>>> restoring vm, we should clear it.
> >>>>>>
> >>>>>> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> >>>>>>
> >>>>>> Also clear pending software exceptions.
> >>>>>> Copy the fix to SVM as well.
> >>>>>>
> >>>>>> Signed-off-by: Tim Deegan <tim@xen.org>
> >>>>> I only now realized that it's no surprise we're not getting acks from
> >>>>> the VMX maintainers on this - the majority of them wasn't Cc-ed.
> >>>>> Now done, but please take care to do so yourself in the future.
> >>>>>
> >>>>> As to the SVM maintainers - Ping (I Cc-ed you on an earlier reply)?
> >>>> Thanks for doing this.
> >>>> I have repost it in the bugfix patchset, and cc vmx and svm maintainers
> >>>>
> >>> Hi,
> >>> Apologies for the delay.
> >>>
> >>> As for the svm changes, the patch seems fairly straightforward and
> harmless.
> >>> However, I am not familiar with 'colo mode', so I'm not sure I understand
> the problem..
> >> In colo mode, secondary vm runs like this:
> >> 1. suspend
> >> 2. update the vm's state(All state is transfered from primary)
> >> 3. resume
> >
> > Is this accurate?  From previous review, I seem to remember that you are
> > pausing the vm, not suspending it, which is where all of these event
> > issues derive from.
> 
> Not pause. We suspend the guest(not save the state). Pausing vm meant
> that
> the vm is not running, but the state cannot be updated. For example, if the
> vm uses pvdriver(not supported now), the backend and frontend share
> some
> information, and we only update frontend(backend state is not transfered
> from primary dom0)...
> 

If you're doing suspend/resume then PV drivers should re-attach to backends anyway, so any state you did transfer would be somewhat pointless. Because of the re-attach though, resume is a pretty heavyweight operation. Is that really what you are doing?

  Paul


> Thanks
> Wen Congyang
> 
> >
> > ~Andrew
> > .
> >
> 
> 


* Re: [RFC Patch v2 45/45] x86/hvm: Always set pending event injection when loading VMC[BS] state.
  2014-08-28 11:31               ` Paul Durrant
@ 2014-08-29  5:59                 ` Wen Congyang
  0 siblings, 0 replies; 64+ messages in thread
From: Wen Congyang @ 2014-08-29  5:59 UTC (permalink / raw)
  To: Paul Durrant, Andrew Cooper, Aravind Gopalakrishnan, Jan Beulich
  Cc: Kevin Tian, Ian Campbell, Tim (Xen.org), Eddie Dong, xen devel,
	Jun Nakajima, Ian Jackson, Boris Ostrovsky, Yang Hongyang,
	suravee.suthikulpanit@amd.com, Lai Jiangshan

At 08/28/2014 07:31 PM, Paul Durrant Write:
>> -----Original Message-----
>> From: xen-devel-bounces@lists.xen.org [mailto:xen-devel-
>> bounces@lists.xen.org] On Behalf Of Wen Congyang
>> Sent: 28 August 2014 12:18
>> To: Andrew Cooper; Aravind Gopalakrishnan; Jan Beulich
>> Cc: Kevin Tian; Yang Hongyang; Ian Campbell; Eddie Dong; Ian Jackson; Tim
>> (Xen.org); Jun Nakajima; Boris Ostrovsky; xen devel;
>> suravee.suthikulpanit@amd.com; Lai Jiangshan
>> Subject: Re: [Xen-devel] [RFC Patch v2 45/45] x86/hvm: Always set pending
>> event injection when loading VMC[BS] state.
>>
>> At 08/28/2014 04:54 PM, Andrew Cooper Write:
>>> On 28/08/14 02:04, Wen Congyang wrote:
>>>> At 08/27/2014 10:58 PM, Aravind Gopalakrishnan Write:
>>>>> On 8/26/2014 7:46 PM, Wen Congyang wrote:
>>>>>> At 08/27/2014 12:02 AM, Jan Beulich Write:
>>>>>>>>>> On 08.08.14 at 09:01, <wency@cn.fujitsu.com> wrote:
>>>>>>>> In colo mode, secondary vm is running, so VM_ENTRY_INTR_INFO
>> may
>>>>>>>> valid before restoring vmcs. If there is no pending event after
>>>>>>>> restoring vm, we should clear it.
>>>>>>>>
>>>>>>>> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
>>>>>>>>
>>>>>>>> Also clear pending software exceptions.
>>>>>>>> Copy the fix to SVM as well.
>>>>>>>>
>>>>>>>> Signed-off-by: Tim Deegan <tim@xen.org>
>>>>>>> I only now realized that it's no surprise we're not getting acks from
>>>>>>> the VMX maintainers on this - the majority of them wasn't Cc-ed.
>>>>>>> Now done, but please take care to do so yourself in the future.
>>>>>>>
>>>>>>> As to the SVM maintainers - Ping (I Cc-ed you on an earlier reply)?
>>>>>> Thanks for doing this.
>>>>>> I have repost it in the bugfix patchset, and cc vmx and svm maintainers
>>>>>>
>>>>> Hi,
>>>>> Apologies for the delay.
>>>>>
>>>>> As for the svm changes, the patch seems fairly straightforward and
>> harmless.
>>>>> However, I am not familiar with 'colo mode', so I'm not sure I understand
>> the problem..
>>>> In colo mode, secondary vm runs like this:
>>>> 1. suspend
>>>> 2. update the vm's state(All state is transfered from primary)
>>>> 3. resume
>>>
>>> Is this accurate?  From previous review, I seem to remember that you are
>>> pausing the vm, not suspending it, which is where all of these event
>>> issues derive from.
>>
>> Not pause. We suspend the guest(not save the state). Pausing vm meant
>> that
>> the vm is not running, but the state cannot be updated. For example, if the
>> vm uses pvdriver(not supported now), the backend and frontend share
>> some
>> information, and we only update frontend(backend state is not transfered
>> from primary dom0)...
>>
> 
> If you're doing suspend/resume then PV drivers should re-attached to backends anyway so any state you did transfer would be somewhat pointless. Because of the re-attach though, resume is a pretty heavyweight operation. Is that really what you are doing?

Yes, so I need to do more things to support pvm and pvhvm.

Thanks
Wen Congyang

> 
>   Paul
> 
> 
>> Thanks
>> Wen Congyang
>>
>>>
>>> ~Andrew
>>> .
>>>
>>
>>
> .
> 



Thread overview: 64+ messages
2014-08-08  7:00 [RFC Patch v2 00/45] COarse-grain LOck-stepping Virtual Machines for Non-stop Service Wen Congyang
2014-08-08  7:01 ` [RFC Patch v2 01/45] copy the correct page to memory Wen Congyang
2014-08-08  7:01 ` [RFC Patch v2 02/45] csum the correct page Wen Congyang
2014-08-08  7:01 ` [RFC Patch v2 03/45] don't zero out ioreq page Wen Congyang
2014-08-08  7:01 ` [RFC Patch v2 04/45] Refactor domain_suspend_callback_common() Wen Congyang
2014-08-08  7:01 ` [RFC Patch v2 05/45] Update libxl__domain_resume() for colo Wen Congyang
2014-08-08  7:01 ` [RFC Patch v2 06/45] Update libxl__domain_suspend_common_switch_qemu_logdirty() " Wen Congyang
2014-08-08  7:01 ` [RFC Patch v2 07/45] Introduce a new internal API libxl__domain_unpause() Wen Congyang
2014-08-08  7:01 ` [RFC Patch v2 08/45] Update libxl__domain_unpause() to support qemu-xen Wen Congyang
2014-08-08  7:01 ` [RFC Patch v2 09/45] support to resume uncooperative HVM guests Wen Congyang
2014-08-08  7:01 ` [RFC Patch v2 10/45] update datecopier to support sending data only Wen Congyang
2014-08-08  7:01 ` [RFC Patch v2 11/45] introduce a new API to aync read data from fd Wen Congyang
2014-08-08  7:01 ` [RFC Patch v2 12/45] move remus related codes to libxl_remus.c Wen Congyang
2014-08-08  7:01 ` [RFC Patch v2 13/45] rename remus device to checkpoint device Wen Congyang
2014-08-08  7:01 ` [RFC Patch v2 14/45] adjust the indentation Wen Congyang
2014-08-08  7:01 ` [RFC Patch v2 15/45] don't touch remus in checkpoint_device Wen Congyang
2014-08-08  7:01 ` [RFC Patch v2 16/45] Update libxl_save_msgs_gen.pl to support return data from xl to xc Wen Congyang
2014-08-08  7:01 ` [RFC Patch v2 17/45] Allow slave sends data to master Wen Congyang
2014-08-08  7:01 ` [RFC Patch v2 18/45] secondary vm suspend/resume/checkpoint code Wen Congyang
2014-08-08  7:01 ` [RFC Patch v2 19/45] primary vm suspend/get_dirty_pfn/resume/checkpoint code Wen Congyang
2014-08-08  7:01 ` [RFC Patch v2 20/45] xc_domain_save: flush cache before calling callbacks->postcopy() in colo mode Wen Congyang
2014-08-08  7:01 ` [RFC Patch v2 21/45] COLO: xc related codes Wen Congyang
2014-08-08  7:01 ` [RFC Patch v2 22/45] send store mfn and console mfn to xl before resuming secondary vm Wen Congyang
2014-08-08  7:01 ` [RFC Patch v2 23/45] implement the cmdline for COLO Wen Congyang
2014-08-08  7:01 ` [RFC Patch v2 24/45] HACK: do checkpoint per 20ms Wen Congyang
2014-08-08  7:01 ` [RFC Patch v2 25/45] colo: dynamic allocate aio_requests to avoid -EBUSY error Wen Congyang
2014-08-08  7:01 ` [RFC Patch v2 26/45] fix memory leak in block-remus Wen Congyang
2014-08-08  7:01 ` [RFC Patch v2 27/45] pass uuid to the callback td_open Wen Congyang
2014-08-08  7:01 ` [RFC Patch v2 28/45] return the correct dev path Wen Congyang
2014-08-08  7:01 ` [RFC Patch v2 29/45] blktap2: use correct way to get remus_image Wen Congyang
2014-08-08  7:01 ` [RFC Patch v2 30/45] don't call client_flush() when switching to unprotected mode Wen Congyang
2014-08-08  7:01 ` [RFC Patch v2 31/45] remus: fix bug in tdremus_close() Wen Congyang
2014-08-08  7:01 ` [RFC Patch v2 32/45] blktap2: use correct way to get free event id Wen Congyang
2014-08-08  7:01 ` [RFC Patch v2 33/45] blktap2: don't return negative " Wen Congyang
2014-08-08  7:01 ` [RFC Patch v2 34/45] blktap2: use correct way to define array Wen Congyang
2014-08-08  7:01 ` [RFC Patch v2 35/45] blktap2: connect to backup asynchronously Wen Congyang
2014-08-08  7:01 ` [RFC Patch v2 36/45] switch to unprotected mode before closing Wen Congyang
2014-08-08  7:01 ` [RFC Patch v2 37/45] blktap2: move async connect related codes to block-replication.c Wen Congyang
2014-08-08  7:01 ` [RFC Patch v2 38/45] blktap2: move ramdisk " Wen Congyang
2014-08-08  7:01 ` [RFC Patch v2 39/45] block-colo: implement colo disk replication Wen Congyang
2014-08-08  7:01 ` [RFC Patch v2 40/45] pass correct file to qemu if we use blktap2 Wen Congyang
2014-08-08  7:01 ` [RFC Patch v2 41/45] support blktap remus in xl Wen Congyang
2014-08-08  7:01 ` [RFC Patch v2 42/45] support blktap colo in xl: Wen Congyang
2014-08-08  7:01 ` [RFC Patch v2 43/45] update libxl__device_disk_from_xs_be() to support blktap device Wen Congyang
2014-08-08  7:01 ` [RFC Patch v2 44/45] libxl/colo: setup and control disk replication for blktap2 backends Wen Congyang
2014-08-08  7:01 ` [RFC Patch v2 45/45] x86/hvm: Always set pending event injection when loading VMC[BS] state Wen Congyang
2014-08-08  7:24   ` Jan Beulich
2014-08-08  7:29     ` Wen Congyang
2014-08-26 16:02   ` Jan Beulich
2014-08-27  0:46     ` Wen Congyang
2014-08-27 14:58       ` Aravind Gopalakrishnan
2014-08-28  1:04         ` Wen Congyang
2014-08-28  8:54           ` Andrew Cooper
2014-08-28 11:17             ` Wen Congyang
2014-08-28 11:31               ` Paul Durrant
2014-08-29  5:59                 ` Wen Congyang
2014-08-28  9:53         ` Tim Deegan
2014-08-27 23:24     ` Tian, Kevin
2014-08-27 15:02   ` Andrew Cooper
2014-08-08  7:01 ` [RFC Patch v2 46/45] Introduce "xen-load-devices-state" Wen Congyang
2014-08-08  7:19 ` [RFC Patch v2 00/45] COarse-grain LOck-stepping Virtual Machines for Non-stop Service Jan Beulich
2014-08-08  7:39   ` Wen Congyang
2014-08-08  8:21   ` Wen Congyang
2014-08-08  9:02     ` Jan Beulich
