[PATCH v16 0/7] Remus/Libxl: Remus network buffering and drbd disk

xen-devel.lists.xenproject.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v16 0/7] Remus/Libxl: Remus network buffering and drbd disk
@ 2014-07-15  6:46 Yang Hongyang
  2014-07-15  6:46 ` [PATCH v16 1/7] remus: add libnl3 dependency for network buffering support Yang Hongyang
                   ` (6 more replies)
  0 siblings, 7 replies; 19+ messages in thread
From: Yang Hongyang @ 2014-07-15  6:46 UTC (permalink / raw)
  To: xen-devel
  Cc: ian.campbell, wency, andrew.cooper3, yunhong.jiang, ian.jackson,
	eddie.dong, rshriram, laijs

This patch series adds support for network buffering and drbd disk
in the Remus codebase in libxl.

the code is also hosted on github:

url: https://github.com/laijs/xen
branch: remus-v16

Changes in v16:
  Merge libxl__remus_state and libxl__remus_device_state.
  Pass the ops to device abstract layer instead of defined it in the layer.
  Optimized subkind ops APIs.
  Addressed Ian J's comments.
  Rebased.

Changes in v15:
  The first patch in v14 has been taken, so remove it from the patchset.
  Add a patch to Update maintained files of REMUS.
  Rebased.

Changes in v14:
  Addressed IanJ's comments.
  Rebased.

Changes in v13:
  Addressed Konrad's comments.
  Rebased.

Changes in v12:
  Add disk buffering cmdline switch.

Changes in v11:
  Addressed comments from Ian J and Shriram.
  Add drbd disk implement into this patch series.

Changes in V10:
  Restructured the whole patch series.
  Introduce the remus device abstract layer.
  Make remus checkpoint asynchronous.

Changes in V9:
  Use async exec script api to exec scripts.

Changes in V8:
  Applied some comments(by IanJ).
  Merge some struct definitions to it's implementation.
  (2/3/5 in V7 => 3 in V8)

Changes in V7:
  Applied missing comments(by IanJ).
  Applied Shriram comments.

  merge netbufering tangled setup/teardown code into one patch.
  (2/6/8 in V6 => 5 in V7. 9/10 in V6 => 7 in V7)

Changes in V6:
  Applied Ian Jackson's comments of V5 series.
  the [PATCH 2/4 V5] is split by small functionalities.

  [PATCH 4/4 V5] --> [PATCH 13/13] netbuffer is default enabled.

Changes in V5:

Merge hotplug script patch (2/5) and hotplug script setup/teardown
patch (3/5) into a single patch.

Changes in V4:

[1/5] Remove check for libnl command line utils in autoconf checks

[2/5] minor nits

[3/5] define LIBXL_HAVE_REMUS_NETBUF in libxl.h

[4/5] clean ups. Make the usleep in checkpoint callback asynchronous

[5/5] minor nits

Changes in V3:
[1/5] Fix redundant checks in configure scripts
      (based on Ian Campbell's suggestions)

[2/5] Introduce locking in the script, during IFB setup.
      Add xenstore paths used by netbuf scripts
      to xenstore-paths.markdown

[3/5] Hotplug scripts setup/teardown invocations are now asynchronous
      following IanJ's feedback.  However, the invocations are still
      sequential. 

[5/5] Allow per-domain specification of netbuffer scripts in xl remus
      commmand.

And minor nits throughout the series based on feedback from
the last version

Changes in V2:
[1/5] Configure script will automatically enable/disable network
      buffer support depending on the availability of the appropriate
      libnl3 version. [If libnl3 is unavailable, a warning message will be
      printed to let the user know that the feature has been disabled.]

      use macros from pkg.m4 instead of pkg-config commands
      removed redundant checks for libnl3 libraries.

[3,4/5] - Minor nits.

Version 1:

[1/5] Changes to autoconf scripts to check for libnl3. Add linker flags
      to libxl Makefile.

[2/5] External script to setup/teardown network buffering using libnl3's
      CLI. This script will be invoked by libxl before starting Remus.
      The script's main job is to bring up an IFB device with plug qdisc
      attached to it.  It then re-routes egress traffic from the guest's
      vif to the IFB device.

[3/5] Libxl code to invoke the external setup script, followed by netlink
      related setup to obtain a handle on the output buffers attached
      to each vif.

[4/5] Libxl interaction with network buffer module in the kernel via
      libnl3 API.

[5/5] xl cmdline switch to explicitly enable network buffering when
      starting remus.


  Few things to note(by shriram): 

    a) Based on previous email discussions, the setup/teardown task has
    been moved to a hotplug style shell script which can be customized as
    desired, instead of implementing it as C code inside libxl.

    b) Libnl3 is not available on NetBSD. Nor is it available on CentOS
   (Linux).  So I have made network buffering support an optional feature
   so that it can be disabled if desired.

   c) NetBSD does not have libnl3. So I have put the setup script under
   tools/hotplug/Linux folder.

thanks

Legend:
  A - acked
  D - previous acked, but new change introduced so acked-by dropped
  M - Modified
  S - the same version as last round
  No marker - new patch

Yang Hongyang (7):
  A remus: add libnl3 dependency for network buffering support
  M remus: introduce remus device
  M remus netbuffer: implement remus network buffering for nic devices
  M remus drbd: Implement remus drbd replicated disk
  S libxl: network buffering cmdline switch
  S libxl: disk buffering cmdline switch
  S MAINTAINERS: Update maintained files of REMUS

 MAINTAINERS                            |   5 +
 README                                 |   4 +
 config/Tools.mk.in                     |   4 +
 docs/README.remus                      |   6 +
 docs/man/xl.conf.pod.5                 |   6 +
 docs/man/xl.pod.1                      |  15 +-
 docs/misc/xenstore-paths.markdown      |   4 +
 tools/configure.ac                     |  16 +
 tools/hotplug/Linux/Makefile           |   2 +
 tools/hotplug/Linux/block-drbd-probe   |  85 ++++++
 tools/hotplug/Linux/remus-netbuf-setup | 206 +++++++++++++
 tools/libxl/Makefile                   |  15 +
 tools/libxl/libxl.c                    |  72 ++++-
 tools/libxl/libxl.h                    |   6 +
 tools/libxl/libxl_dom.c                | 155 +++++++++-
 tools/libxl/libxl_internal.h           | 185 ++++++++++++
 tools/libxl/libxl_netbuffer.c          | 521 +++++++++++++++++++++++++++++++++
 tools/libxl/libxl_nonetbuffer.c        |  55 ++++
 tools/libxl/libxl_remus_device.c       | 367 +++++++++++++++++++++++
 tools/libxl/libxl_remus_disk_drbd.c    | 240 +++++++++++++++
 tools/libxl/libxl_types.idl            |   5 +
 tools/libxl/xl.c                       |   4 +
 tools/libxl/xl.h                       |   1 +
 tools/libxl/xl_cmdimpl.c               |  32 +-
 tools/libxl/xl_cmdtable.c              |   4 +
 25 files changed, 1987 insertions(+), 28 deletions(-)
 create mode 100755 tools/hotplug/Linux/block-drbd-probe
 create mode 100644 tools/hotplug/Linux/remus-netbuf-setup
 create mode 100644 tools/libxl/libxl_netbuffer.c
 create mode 100644 tools/libxl/libxl_nonetbuffer.c
 create mode 100644 tools/libxl/libxl_remus_device.c
 create mode 100644 tools/libxl/libxl_remus_disk_drbd.c

-- 
1.9.1

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH v16 1/7] remus: add libnl3 dependency for network buffering support
  2014-07-15  6:46 [PATCH v16 0/7] Remus/Libxl: Remus network buffering and drbd disk Yang Hongyang
@ 2014-07-15  6:46 ` Yang Hongyang
  2014-07-15  6:46 ` [PATCH v16 2/7] remus: introduce remus device Yang Hongyang
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 19+ messages in thread
From: Yang Hongyang @ 2014-07-15  6:46 UTC (permalink / raw)
  To: xen-devel
  Cc: ian.campbell, wency, andrew.cooper3, yunhong.jiang, ian.jackson,
	eddie.dong, rshriram, laijs

Libnl3 is required for controlling Remus network buffering.
This patch adds dependency on libnl3 (>= 3.2.8) to autoconf scripts.
Also provide ability to configure tools without libnl3 support, that
is without network buffering support.

when there's no network buffering support,libxl__netbuffer_enabled()
returns 0, otherwise returns 1. The callers of this api will be
introduced in the rest of the series.

NOTE: This patch changes tools/configure.ac, please rerun
      autogen.sh while apply the patch.
Signed-off-by: Shriram Rajagopalan <rshriram@cs.ubc.ca>
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Yang Hongyang <yanghy@cn.fujitsu.com>
Reviewed-by: Wen Congyang <wency@cn.fujitsu.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
---
 README                          |  4 ++++
 config/Tools.mk.in              |  4 ++++
 docs/README.remus               |  6 ++++++
 tools/configure.ac              | 16 ++++++++++++++++
 tools/libxl/Makefile            | 13 +++++++++++++
 tools/libxl/libxl_internal.h    |  1 +
 tools/libxl/libxl_netbuffer.c   | 31 +++++++++++++++++++++++++++++++
 tools/libxl/libxl_nonetbuffer.c | 31 +++++++++++++++++++++++++++++++
 8 files changed, 106 insertions(+)
 create mode 100644 tools/libxl/libxl_netbuffer.c
 create mode 100644 tools/libxl/libxl_nonetbuffer.c

diff --git a/README b/README
index 9bbe734..e770932 100644
--- a/README
+++ b/README
@@ -72,6 +72,10 @@ disabled at compile time:
     * cmake (if building vtpm stub domains)
     * markdown
     * figlet (for generating the traditional Xen start of day banner)
+    * Development install of libnl3 (e.g., libnl-3-200,
+      libnl-3-dev, etc).  Required if network buffering is desired
+      when using Remus with libxl.  See tools/remus/README for detailed
+      information.
 
 Second, you need to acquire a suitable kernel for use in domain 0. If
 possible you should use a kernel provided by your OS distributor. If
diff --git a/config/Tools.mk.in b/config/Tools.mk.in
index 748cc69..c47eafa 100644
--- a/config/Tools.mk.in
+++ b/config/Tools.mk.in
@@ -43,6 +43,9 @@ PTHREAD_LIBS        := @PTHREAD_LIBS@
 
 PTYFUNCS_LIBS       := @PTYFUNCS_LIBS@
 
+LIBNL3_LIBS         := @LIBNL3_LIBS@
+LIBNL3_CFLAGS       := @LIBNL3_CFLAGS@
+
 # Download GIT repositories via HTTP or GIT's own protocol?
 # GIT's protocol is faster and more robust, when it works at all (firewalls
 # may block it). We make it the default, but if your GIT repository downloads
@@ -62,6 +65,7 @@ CONFIG_BLKTAP1      := @blktap1@
 CONFIG_BLKTAP2      := @blktap2@
 CONFIG_VTPM         := @vtpm@
 CONFIG_QEMUU_EXTRA_ARGS:= @EXTRA_QEMUU_CONFIGURE_ARGS@
+CONFIG_REMUS_NETBUF := @remus_netbuf@
 
 #System options
 ZLIB                := @zlib@
diff --git a/docs/README.remus b/docs/README.remus
index 9fa00fe..ddf5b55 100644
--- a/docs/README.remus
+++ b/docs/README.remus
@@ -2,3 +2,9 @@ Remus provides fault tolerance for virtual machines by sending continuous
 checkpoints to a backup, which will activate if the target VM fails.
 
 See the website at http://wiki.xen.org/wiki/Remus for details.
+
+Using Remus with libxl on Xen 4.5 and higher:
+ To enable network buffering, you need libnl 3.2.8
+ or higher along with the development headers and command line utilities.
+ If your distro does not have the appropriate libnl3 version, you can find
+ the latest source tarball of libnl3 at http://www.carisma.slowglass.com/~tgr/libnl/
diff --git a/tools/configure.ac b/tools/configure.ac
index 629d6a0..b10dac3 100644
--- a/tools/configure.ac
+++ b/tools/configure.ac
@@ -295,6 +295,22 @@ esac
 # Checks for header files.
 AC_CHECK_HEADERS([yajl/yajl_version.h sys/eventfd.h valgrind/memcheck.h utmp.h])
 
+# Check for libnl3 >=3.2.8. If present enable remus network buffering.
+PKG_CHECK_MODULES(LIBNL3, [libnl-3.0 >= 3.2.8 libnl-route-3.0 >= 3.2.8],
+    [libnl3_lib="y"], [libnl3_lib="n"])
+
+AS_IF([test "x$libnl3_lib" = "xn" ], [
+    AC_MSG_WARN([Disabling support for Remus network buffering.
+    Please install libnl3 libraries, command line tools and devel
+    headers - version 3.2.8 or higher])
+    AC_SUBST(remus_netbuf, [n])
+    ],[
+    AC_SUBST(remus_netbuf, [y])
+])
+
+AC_SUBST(LIBNL3_LIBS)
+AC_SUBST(LIBNL3_CFLAGS)
+
 fi # ! $rump
 
 AC_OUTPUT()
diff --git a/tools/libxl/Makefile b/tools/libxl/Makefile
index bd0db3b..eb63510 100644
--- a/tools/libxl/Makefile
+++ b/tools/libxl/Makefile
@@ -21,11 +21,17 @@ endif
 
 LIBXL_LIBS =
 LIBXL_LIBS = $(LDLIBS_libxenctrl) $(LDLIBS_libxenguest) $(LDLIBS_libxenstore) $(LDLIBS_libblktapctl) $(PTYFUNCS_LIBS) $(LIBUUID_LIBS)
+ifeq ($(CONFIG_REMUS_NETBUF),y)
+LIBXL_LIBS += $(LIBNL3_LIBS)
+endif
 
 CFLAGS_LIBXL += $(CFLAGS_libxenctrl)
 CFLAGS_LIBXL += $(CFLAGS_libxenguest)
 CFLAGS_LIBXL += $(CFLAGS_libxenstore)
 CFLAGS_LIBXL += $(CFLAGS_libblktapctl) 
+ifeq ($(CONFIG_REMUS_NETBUF),y)
+CFLAGS_LIBXL += $(LIBNL3_CFLAGS)
+endif
 CFLAGS_LIBXL += -Wshadow
 
 LIBXL_LIBS-$(CONFIG_ARM) += -lfdt
@@ -43,6 +49,13 @@ LIBXL_OBJS-y += libxl_blktap2.o
 else
 LIBXL_OBJS-y += libxl_noblktap2.o
 endif
+
+ifeq ($(CONFIG_REMUS_NETBUF),y)
+LIBXL_OBJS-y += libxl_netbuffer.o
+else
+LIBXL_OBJS-y += libxl_nonetbuffer.o
+endif
+
 LIBXL_OBJS-$(CONFIG_X86) += libxl_cpuid.o libxl_x86.o
 LIBXL_OBJS-$(CONFIG_ARM) += libxl_nocpuid.o libxl_arm.o
 
diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
index beb052e..db10efb 100644
--- a/tools/libxl/libxl_internal.h
+++ b/tools/libxl/libxl_internal.h
@@ -2470,6 +2470,7 @@ typedef struct libxl__save_helper_state {
                       * marshalling and xc callback functions */
 } libxl__save_helper_state;
 
+_hidden int libxl__netbuffer_enabled(libxl__gc *gc);
 
 /*----- Domain suspend (save) state structure -----*/
 
diff --git a/tools/libxl/libxl_netbuffer.c b/tools/libxl/libxl_netbuffer.c
new file mode 100644
index 0000000..52d593c
--- /dev/null
+++ b/tools/libxl/libxl_netbuffer.c
@@ -0,0 +1,31 @@
+/*
+ * Copyright (C) 2014
+ * Author Shriram Rajagopalan <rshriram@cs.ubc.ca>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU Lesser General Public License as published
+ * by the Free Software Foundation; version 2.1 only. with the special
+ * exception on linking described in file LICENSE.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU Lesser General Public License for more details.
+ */
+
+#include "libxl_osdeps.h" /* must come before any other headers */
+
+#include "libxl_internal.h"
+
+int libxl__netbuffer_enabled(libxl__gc *gc)
+{
+    return 1;
+}
+
+/*
+ * Local variables:
+ * mode: C
+ * c-basic-offset: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
diff --git a/tools/libxl/libxl_nonetbuffer.c b/tools/libxl/libxl_nonetbuffer.c
new file mode 100644
index 0000000..1c72a7f
--- /dev/null
+++ b/tools/libxl/libxl_nonetbuffer.c
@@ -0,0 +1,31 @@
+/*
+ * Copyright (C) 2014
+ * Author Shriram Rajagopalan <rshriram@cs.ubc.ca>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU Lesser General Public License as published
+ * by the Free Software Foundation; version 2.1 only. with the special
+ * exception on linking described in file LICENSE.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU Lesser General Public License for more details.
+ */
+
+#include "libxl_osdeps.h" /* must come before any other headers */
+
+#include "libxl_internal.h"
+
+int libxl__netbuffer_enabled(libxl__gc *gc)
+{
+    return 0;
+}
+
+/*
+ * Local variables:
+ * mode: C
+ * c-basic-offset: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v16 2/7] remus: introduce remus device
  2014-07-15  6:46 [PATCH v16 0/7] Remus/Libxl: Remus network buffering and drbd disk Yang Hongyang
  2014-07-15  6:46 ` [PATCH v16 1/7] remus: add libnl3 dependency for network buffering support Yang Hongyang
@ 2014-07-15  6:46 ` Yang Hongyang
  2014-07-15 18:28   ` Ian Jackson
  2014-07-15  6:46 ` [PATCH v16 3/7] remus netbuffer: implement remus network buffering for nic devices Yang Hongyang
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 19+ messages in thread
From: Yang Hongyang @ 2014-07-15  6:46 UTC (permalink / raw)
  To: xen-devel
  Cc: ian.campbell, wency, andrew.cooper3, yunhong.jiang, ian.jackson,
	eddie.dong, rshriram, laijs

introduce remus device, an abstract layer of remus devices(nic, disk,
etc).It provides the following APIs for libxl:
  >libxl__remus_devices_setup
    setup remus devices, like attach qdisc, enable disk buffering, etc
  >libxl__remus_devices_teardown
    teardown devices
  >libxl__remus_devices_postsuspend
  >libxl__remus_devices_preresume
  >libxl__remus_devices_commit
    above three are for checkpoint.
through remus device layer, the remus execution flow will be like
this:
  xl remus -> remus device setup
                |-> remus checkpoint(postsuspend, preresume, commit)
                      ...
                       |-> remus device teardown, failover or abort
the remus device layer provides an interface
  libxl__remus_device_subkind_ops
which a remus device must implement. the whole remus structure:
                            |remus|
                               |
                        |remus device|
                               |
                |nic| |drbd disks| |qemu disks| ...
a device(nic, drbd disks, qemu disks, etc) must implement
libxl__remus_device_subkind_ops to support remus.

Signed-off-by: Yang Hongyang <yanghy@cn.fujitsu.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
---
 tools/libxl/Makefile             |   2 +
 tools/libxl/libxl.c              |  52 +++++-
 tools/libxl/libxl.h              |   6 +
 tools/libxl/libxl_dom.c          | 155 ++++++++++++++++--
 tools/libxl/libxl_internal.h     | 171 ++++++++++++++++++++
 tools/libxl/libxl_remus_device.c | 339 +++++++++++++++++++++++++++++++++++++++
 tools/libxl/libxl_types.idl      |   2 +
 7 files changed, 706 insertions(+), 21 deletions(-)
 create mode 100644 tools/libxl/libxl_remus_device.c

diff --git a/tools/libxl/Makefile b/tools/libxl/Makefile
index eb63510..202f1bb 100644
--- a/tools/libxl/Makefile
+++ b/tools/libxl/Makefile
@@ -56,6 +56,8 @@ else
 LIBXL_OBJS-y += libxl_nonetbuffer.o
 endif
 
+LIBXL_OBJS-y += libxl_remus_device.o
+
 LIBXL_OBJS-$(CONFIG_X86) += libxl_cpuid.o libxl_x86.o
 LIBXL_OBJS-$(CONFIG_ARM) += libxl_nocpuid.o libxl_arm.o
 
diff --git a/tools/libxl/libxl.c b/tools/libxl/libxl.c
index a9205d1..95d9953 100644
--- a/tools/libxl/libxl.c
+++ b/tools/libxl/libxl.c
@@ -781,9 +781,17 @@ out:
     return ptr;
 }
 
+static void libxl__remus_setup_done(libxl__egc *egc,
+                                    libxl__remus_device_state *rds, int rc);
+static void libxl__remus_setup_failed(libxl__egc *egc,
+                                      libxl__remus_device_state *rds, int rc);
 static void remus_failover_cb(libxl__egc *egc,
                               libxl__domain_suspend_state *dss, int rc);
 
+static const libxl__remus_device_subkind_ops *remus_ops[] = {
+    NULL,
+};
+
 /* TODO: Explicit Checkpoint acknowledgements via recv_fd. */
 int libxl_domain_remus_start(libxl_ctx *ctx, libxl_domain_remus_info *info,
                              uint32_t domid, int send_fd, int recv_fd,
@@ -812,16 +820,52 @@ int libxl_domain_remus_start(libxl_ctx *ctx, libxl_domain_remus_info *info,
 
     assert(info);
 
-    /* TBD: Remus setup - i.e. attach qdisc, enable disk buffering, etc */
+    /* Convenience aliases */
+    libxl__remus_device_state *const rds = &dss->rds;
+    rds->ao = ao;
+    rds->egc = egc;
+    rds->domid = domid;
+    rds->callback = libxl__remus_setup_done;
+    rds->ops = remus_ops;
 
     /* Point of no return */
-    libxl__domain_suspend(egc, dss);
+    libxl__remus_devices_setup(egc, rds);
     return AO_INPROGRESS;
 
  out:
     return AO_ABORT(rc);
 }
 
+static void libxl__remus_setup_done(libxl__egc *egc,
+                                    libxl__remus_device_state *rds, int rc)
+{
+    libxl__domain_suspend_state *dss = CONTAINER_OF(rds, *dss, rds);
+    STATE_AO_GC(dss->ao);
+
+    if (!rc) {
+        libxl__domain_suspend(egc, dss);
+        return;
+    }
+
+    LOG(ERROR, "Remus: failed to setup device for guest with domid %u, rc %d",
+        dss->domid, rc);
+    rds->callback = libxl__remus_setup_failed;
+    libxl__remus_devices_teardown(egc, rds);
+}
+
+static void libxl__remus_setup_failed(libxl__egc *egc,
+                                      libxl__remus_device_state *rds, int rc)
+{
+    libxl__domain_suspend_state *dss = CONTAINER_OF(rds, *dss, rds);
+    STATE_AO_GC(dss->ao);
+
+    if (rc)
+        LOG(ERROR, "Remus: failed to teardown device after setup failed"
+            " for guest with domid %u, rc %d", dss->domid, rc);
+
+    dss->callback(egc, dss, rc);
+}
+
 static void remus_failover_cb(libxl__egc *egc,
                               libxl__domain_suspend_state *dss, int rc)
 {
@@ -831,10 +875,6 @@ static void remus_failover_cb(libxl__egc *egc,
      * backup died or some network error occurred preventing us
      * from sending checkpoints.
      */
-
-    /* TBD: Remus cleanup - i.e. detach qdisc, release other
-     * resources.
-     */
     libxl__ao_complete(egc, ao, rc);
 }
 
diff --git a/tools/libxl/libxl.h b/tools/libxl/libxl.h
index 5ae6532..81905b3 100644
--- a/tools/libxl/libxl.h
+++ b/tools/libxl/libxl.h
@@ -579,6 +579,12 @@ typedef struct libxl__ctx libxl_ctx;
  */
 #define LIBXL_HAVE_CPUPOOL_NAME 1
 
+/*
+ * LIBXL_HAVE_REMUS
+ * If this is defined, then libxl supports remus.
+ */
+#define LIBXL_HAVE_REMUS 1
+
 typedef uint8_t libxl_mac[6];
 #define LIBXL_MAC_FMT "%02hhx:%02hhx:%02hhx:%02hhx:%02hhx:%02hhx"
 #define LIBXL_MAC_FMTLEN ((2*6)+5) /* 6 hex bytes plus 5 colons */
diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
index 83eb29a..91f0bf1 100644
--- a/tools/libxl/libxl_dom.c
+++ b/tools/libxl/libxl_dom.c
@@ -798,8 +798,6 @@ static void domain_suspend_done(libxl__egc *egc,
                         libxl__domain_suspend_state *dss, int rc);
 static void domain_suspend_callback_common_done(libxl__egc *egc,
                                 libxl__domain_suspend_state *dss, int ok);
-static void remus_domain_suspend_callback_common_done(libxl__egc *egc,
-                                libxl__domain_suspend_state *dss, int ok);
 
 /*----- complicated callback, called by xc_domain_save -----*/
 
@@ -1461,6 +1459,14 @@ static void domain_suspend_callback_common_done(libxl__egc *egc,
 }
 
 /*----- remus callbacks -----*/
+static void remus_domain_suspend_callback_common_done(libxl__egc *egc,
+                                libxl__domain_suspend_state *dss, int ok);
+static void remus_device_postsuspend_cb(libxl__egc *egc,
+                                        libxl__remus_device_state *rds,
+                                        int rc);
+static void remus_device_preresume_cb(libxl__egc *egc,
+                                      libxl__remus_device_state *rds,
+                                      int rc);
 
 static void libxl__remus_domain_suspend_callback(void *data)
 {
@@ -1475,32 +1481,67 @@ static void libxl__remus_domain_suspend_callback(void *data)
 static void remus_domain_suspend_callback_common_done(libxl__egc *egc,
                                 libxl__domain_suspend_state *dss, int ok)
 {
-    /* REMUS TODO: Issue disk and network checkpoint reqs. */
+    if (!ok)
+        goto out;
+
+    libxl__remus_device_state *const rds = &dss->rds;
+    rds->callback = remus_device_postsuspend_cb;
+    libxl__remus_devices_postsuspend(egc, rds);
+    return;
+
+out:
     libxl__xc_domain_saverestore_async_callback_done(egc, &dss->shs, ok);
 }
 
-static void libxl__remus_domain_resume_callback(void *data)
+static void remus_device_postsuspend_cb(libxl__egc *egc,
+                                        libxl__remus_device_state *rds,
+                                        int rc)
 {
     int ok = 0;
+    libxl__domain_suspend_state *dss = CONTAINER_OF(rds, *dss, rds);
+
+    if (!rc)
+        ok = 1;
+    libxl__xc_domain_saverestore_async_callback_done(egc, &dss->shs, ok);
+}
+
+static void libxl__remus_domain_resume_callback(void *data)
+{
     libxl__save_helper_state *shs = data;
     libxl__egc *egc = shs->egc;
     libxl__domain_suspend_state *dss = CONTAINER_OF(shs, *dss, shs);
     STATE_AO_GC(dss->ao);
 
-    /* Resumes the domain and the device model */
-    if (libxl__domain_resume(gc, dss->domid, /* Fast Suspend */1))
-        goto out;
+    libxl__remus_device_state *const rds = &dss->rds;
+    rds->callback = remus_device_preresume_cb;
+    libxl__remus_devices_preresume(egc, rds);
+}
 
-    /* REMUS TODO: Deal with disk. Start a new network output buffer */
-    ok = 1;
-out:
-    libxl__xc_domain_saverestore_async_callback_done(egc, shs, ok);
+static void remus_device_preresume_cb(libxl__egc *egc,
+                                      libxl__remus_device_state *rds,
+                                      int rc)
+{
+    int ok = 0;
+    libxl__domain_suspend_state *dss = CONTAINER_OF(rds, *dss, rds);
+    STATE_AO_GC(dss->ao);
+
+    if (!rc) {
+        /* Resumes the domain and the device model */
+        if (!libxl__domain_resume(gc, dss->domid, /* Fast Suspend */1))
+            ok = 1;
+    }
+    libxl__xc_domain_saverestore_async_callback_done(egc, &dss->shs, ok);
 }
 
 /*----- remus asynchronous checkpoint callback -----*/
 
 static void remus_checkpoint_dm_saved(libxl__egc *egc,
                                       libxl__domain_suspend_state *dss, int rc);
+static void remus_device_commit_cb(libxl__egc *egc,
+                                   libxl__remus_device_state *rds,
+                                   int rc);
+static void remus_next_checkpoint(libxl__egc *egc, libxl__ev_time *ev,
+                                  const struct timeval *requested_abs);
 
 static void libxl__remus_domain_checkpoint_callback(void *data)
 {
@@ -1520,10 +1561,63 @@ static void libxl__remus_domain_checkpoint_callback(void *data)
 static void remus_checkpoint_dm_saved(libxl__egc *egc,
                                       libxl__domain_suspend_state *dss, int rc)
 {
-    /* REMUS TODO: Wait for disk and memory ack, release network buffer */
-    /* REMUS TODO: make this asynchronous */
-    assert(!rc); /* REMUS TODO handle this error properly */
-    usleep(dss->interval * 1000);
+    /* Convenience aliases */
+    libxl__remus_device_state *const rds = &dss->rds;
+
+    STATE_AO_GC(dss->ao);
+
+    if (rc) {
+        LOG(ERROR, "Failed to save device model. Terminating Remus..");
+        goto out;
+    }
+
+    rds->callback = remus_device_commit_cb;
+    libxl__remus_devices_commit(egc, rds);
+
+    return;
+
+out:
+    libxl__xc_domain_saverestore_async_callback_done(egc, &dss->shs, 0);
+}
+
+static void remus_device_commit_cb(libxl__egc *egc,
+                                   libxl__remus_device_state *rds,
+                                   int rc)
+{
+    libxl__domain_suspend_state *dss = CONTAINER_OF(rds, *dss, rds);
+
+    STATE_AO_GC(dss->ao);
+
+    if (rc) {
+        LOG(ERROR, "Failed to do device commit op."
+            " Terminating Remus..");
+        goto out;
+    } else {
+        /* Set checkpoint interval timeout */
+        rc = libxl__ev_time_register_rel(gc, &dss->checkpoint_timeout,
+                                         remus_next_checkpoint,
+                                         dss->interval);
+        if (rc) {
+            LOG(ERROR, "unable to register timeout for next epoch."
+                " Terminating Remus..");
+            goto out;
+        }
+    }
+    return;
+
+out:
+    libxl__xc_domain_saverestore_async_callback_done(egc, &dss->shs, 0);
+}
+
+static void remus_next_checkpoint(libxl__egc *egc, libxl__ev_time *ev,
+                                  const struct timeval *requested_abs)
+{
+    libxl__domain_suspend_state *dss =
+                            CONTAINER_OF(ev, *dss, checkpoint_timeout);
+
+    STATE_AO_GC(dss->ao);
+
+    libxl__ev_time_deregister(gc, &dss->checkpoint_timeout);
     libxl__xc_domain_saverestore_async_callback_done(egc, &dss->shs, 1);
 }
 
@@ -1738,6 +1832,9 @@ static void save_device_model_datacopier_done(libxl__egc *egc,
     dss->save_dm_callback(egc, dss, our_rc);
 }
 
+static void libxl__remus_teardown_done(libxl__egc *egc,
+                                       libxl__remus_device_state *rds,
+                                       int rc);
 static void domain_suspend_done(libxl__egc *egc,
                         libxl__domain_suspend_state *dss, int rc)
 {
@@ -1752,6 +1849,34 @@ static void domain_suspend_done(libxl__egc *egc,
         xc_suspend_evtchn_release(CTX->xch, CTX->xce, domid,
                            dss->guest_evtchn.port, &dss->guest_evtchn_lockfd);
 
+    if (dss->remus) {
+        /*
+         * With Remus, if we reach this point, it means either
+         * backup died or some network error occurred preventing us
+         * from sending checkpoints. Teardown the network buffers and
+         * release netlink resources.  This is an async op.
+         */
+        LOGE(WARN, "Remus: Domain suspend terminated with rc %d,"
+             " teardown Remus devices...", rc);
+        dss->rds.callback = libxl__remus_teardown_done;
+        libxl__remus_devices_teardown(egc, &dss->rds);
+        return;
+    }
+
+    dss->callback(egc, dss, rc);
+}
+
+static void libxl__remus_teardown_done(libxl__egc *egc,
+                                       libxl__remus_device_state *rds,
+                                       int rc)
+{
+    libxl__domain_suspend_state *dss = CONTAINER_OF(rds, *dss, rds);
+    STATE_AO_GC(dss->ao);
+
+    if (rc)
+        LOG(ERROR, "Remus: failed to teardown device for guest with domid %u,"
+            " rc %d", dss->domid, rc);
+
     dss->callback(egc, dss, rc);
 }
 
diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
index db10efb..1eba7af 100644
--- a/tools/libxl/libxl_internal.h
+++ b/tools/libxl/libxl_internal.h
@@ -2470,6 +2470,175 @@ typedef struct libxl__save_helper_state {
                       * marshalling and xc callback functions */
 } libxl__save_helper_state;
 
+/*----- remus device related state structure -----*/
+/* remus device is an abstract layer of remus devices(nic, disk,
+ * etc).It provides the following APIs for libxl:
+ *   >libxl__remus_devices_setup
+ *     setup remus devices, like attach qdisc, enable disk buffering, etc
+ *   >libxl__remus_devices_teardown
+ *     teardown devices
+ *   >libxl__remus_devices_postsuspend
+ *   >libxl__remus_devices_preresume
+ *   >libxl__remus_devices_commit
+ *     above three are for checkpoint.
+ * through remus device layer, the remus execution flow will be like
+ * this:
+ * xl remus -> remus device setup
+ *               |-> remus checkpoint(postsuspend, preresume, commit)
+ *                     ...
+ *                      |-> remus device teardown, failover or abort
+ * the remus device layer provides an interface
+ *   libxl__remus_device_subkind_ops
+ * which a remus device must implement. the whole remus structure:
+ *                           |remus|
+ *                              |
+ *                       |remus device|
+ *                              |
+ *               |nic| |drbd disks| |qemu disks| ...
+ * a device(nic, drbd disks, qemu disks, etc) must implement
+ * libxl__remus_device_subkind_ops to support remus.
+ */
+
+typedef enum libxl__remus_device_kind {
+    LIBXL__REMUS_DEVICE_NIC,
+    LIBXL__REMUS_DEVICE_DISK,
+} libxl__remus_device_kind;
+
+typedef struct libxl__remus_device libxl__remus_device;
+typedef struct libxl__remus_device_state libxl__remus_device_state;
+typedef struct libxl__remus_device_subkind_ops libxl__remus_device_subkind_ops;
+
+struct libxl__remus_device_subkind_ops {
+    /* the device kind this ops belongs to... */
+    libxl__remus_device_kind kind;
+
+    /*
+     * init() and destroy() APIs are produced by a device subkind and
+     * consumed by the main remus code, a device subkind must implement
+     * these two APIs.
+     * the APIs init/destroy device subkind's private data which stored
+     * in CTX. must implement.
+     */
+    int (*init)(libxl__remus_device_state *rds);
+    void (*destroy)(libxl__remus_device_state *rds);
+
+    /*
+     * checkpoint callbacks, these are async ops, call dev->callback
+     * when done. These function pointers may be NULL, means the op is
+     * not implemented, and it will do nothing when checkpoint.
+     * The callers of these APIs must check the function pointer first.
+     * These callbacks can be implemented synchronously, call
+     * dev->callback at last directly.
+     */
+    void (*postsuspend)(libxl__remus_device *dev);
+    void (*preresume)(libxl__remus_device *dev);
+    void (*commit)(libxl__remus_device *dev);
+
+    /*
+     * This API determines whether the subkind matchs the specific device. In
+     * the implementation, we first init all device subkind, for example, NIC,
+     * DRBD disk... Then we will find out the libxl devices, and match the
+     * device with the subkind, if the device is a drbd disk, then it will be
+     * matched with DRBD subkind, and the further ops(such as checkpoint etc.)
+     * of this device will using DRBD subkind ops. This API is mainly for
+     * disks, because we must use an external script to determine whether a
+     * libxl_disk is a DRBD disk.
+     * This function pointer may be NULL. That means this *kind* of
+     * device's ops is always matched with the *kind* of device.
+     * It's an async op and must be implemented asynchronously,
+     * call dev->callback when done.
+     */
+    void (*match)(libxl__remus_device *dev);
+
+    /*
+     * setup() and teardown() are refer to the actual remus device,
+     * a device subkind must implement these two APIs. They are async
+     * ops, and call dev->callback when done.
+     * These callbacks can be implemented synchronously, call
+     * dev->callback at last directly.
+     */
+    void (*setup)(libxl__remus_device *dev);
+    void (*teardown)(libxl__remus_device *dev);
+};
+
+typedef void libxl__remus_callback(libxl__egc *,
+                                   libxl__remus_device_state *, int rc);
+
+/*
+ * This structure is for remus device layer, it records remus devices
+ * that have been set up.
+ */
+struct libxl__remus_device_state {
+    /* must set by caller of libxl__remus_device_(setup|teardown) */
+    libxl__ao *ao;
+    libxl__egc *egc;
+    uint32_t domid;
+    libxl__remus_callback *callback;
+    /* the last ops must be NULL */
+    const libxl__remus_device_subkind_ops **ops;
+
+    /* private */
+    /* devices that have been set up */
+    int saved_rc;
+    libxl__remus_device **dev;
+
+    int num_nics;
+    int num_disks;
+
+    /* for counting devices that have been handled */
+    int num_devices;
+    /* for counting devices that have been set up */
+    int num_set_up;
+};
+
+typedef void libxl__remus_device_callback(libxl__egc *,
+                                          libxl__remus_device *,
+                                          int rc);
+/*
+ * This structure is init and setup by remus device abstruct layer,
+ * and pass to remus device ops
+ */
+struct libxl__remus_device {
+    /*----- shared between abstract and concrete layers -----*/
+    /* set by remus device abstruct layer */
+    /* libxl__device_* which this remus device related to */
+    const void *backend_dev;
+    libxl__remus_device_kind kind;
+
+    /*----- private for abstract layer only -----*/
+    /*
+     * we must go through all device ops until we find a matched ops
+     * for the device.
+     */
+    int ops_index;
+    const libxl__remus_device_subkind_ops *ops;
+    libxl__remus_device_callback *callback;
+    libxl__remus_device_state *rds;
+
+    /*----- private for concrete (device-specific) layer -----*/
+    /* *kind* of device's private data */
+    void *data;
+    /* for calling scripts, eg. setup|teardown|match scripts */
+    libxl__async_exec_state aes;
+    /*
+     * for async func calls, in the implementation of device ops, we
+     * may use fork to do async ops. this is owned by device-specific
+     * ops methods
+     */
+    libxl__ev_child child;
+};
+
+/* the following 5 APIs are async ops, call rds->callback when done */
+_hidden void libxl__remus_devices_setup(libxl__egc *egc,
+                                        libxl__remus_device_state *rds);
+_hidden void libxl__remus_devices_teardown(libxl__egc *egc,
+                                           libxl__remus_device_state *rds);
+_hidden void libxl__remus_devices_postsuspend(libxl__egc *egc,
+                                              libxl__remus_device_state *rds);
+_hidden void libxl__remus_devices_preresume(libxl__egc *egc,
+                                            libxl__remus_device_state *rds);
+_hidden void libxl__remus_devices_commit(libxl__egc *egc,
+                                         libxl__remus_device_state *rds);
 _hidden int libxl__netbuffer_enabled(libxl__gc *gc);
 
 /*----- Domain suspend (save) state structure -----*/
@@ -2510,6 +2679,8 @@ struct libxl__domain_suspend_state {
     libxl__ev_xswatch guest_watch;
     libxl__ev_time guest_timeout;
     const char *dm_savefile;
+    libxl__remus_device_state rds;
+    libxl__ev_time checkpoint_timeout; /* used for Remus checkpoint */
     int interval; /* checkpoint interval (for Remus) */
     libxl__save_helper_state shs;
     libxl__logdirty_switch logdirty;
diff --git a/tools/libxl/libxl_remus_device.c b/tools/libxl/libxl_remus_device.c
new file mode 100644
index 0000000..14cca79
--- /dev/null
+++ b/tools/libxl/libxl_remus_device.c
@@ -0,0 +1,339 @@
+/*
+ * Copyright (C) 2014
+ * Author: Yang Hongyang <yanghy@cn.fujitsu.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU Lesser General Public License as published
+ * by the Free Software Foundation; version 2.1 only. with the special
+ * exception on linking described in file LICENSE.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU Lesser General Public License for more details.
+ */
+
+#include "libxl_osdeps.h" /* must come before any other headers */
+
+#include "libxl_internal.h"
+
+/*----- helper functions -----*/
+
+static int init_device_subkind(libxl__remus_device_state *rds)
+{
+    int rc;
+    const libxl__remus_device_subkind_ops **ops;
+
+    for (ops = rds->ops; *ops; ops++) {
+        rc = (*ops)->init(rds);
+        if (rc) {
+            goto out;
+        }
+    }
+
+    rc = 0;
+out:
+    return rc;
+
+}
+
+static void destroy_device_subkind(libxl__remus_device_state *rds)
+{
+    const libxl__remus_device_subkind_ops **ops;
+
+    for (ops = rds->ops; *ops; ops++)
+        (*ops)->destroy(rds);
+}
+
+static bool all_devices_handled(libxl__remus_device_state *rds)
+{
+    return rds->num_devices == (rds->num_nics + rds->num_disks);
+}
+
+/*----- setup() and teardown() -----*/
+
+/* callbacks */
+
+static void device_match_cb(libxl__egc *egc,
+                            libxl__remus_device *dev,
+                            int rc);
+static void device_setup_cb(libxl__egc *egc,
+                            libxl__remus_device *dev,
+                            int rc);
+static void device_teardown_cb(libxl__egc *egc,
+                               libxl__remus_device *dev,
+                               int rc);
+
+/* remus device setup and teardown */
+
+static __attribute__((unused)) void libxl__remus_device_init(libxl__egc *egc,
+                                     libxl__remus_device_state *rds,
+                                     libxl__remus_device_kind kind,
+                                     void *libxl_dev);
+void libxl__remus_devices_setup(libxl__egc *egc, libxl__remus_device_state *rds)
+{
+    STATE_AO_GC(rds->ao);
+
+    if (!rds->ops[0])
+        goto out;
+
+    rds->saved_rc = init_device_subkind(rds);
+    if (rds->saved_rc)
+        goto out;
+
+    rds->num_devices = 0;
+    rds->num_nics = 0;
+    rds->num_disks = 0;
+
+    /* TBD: Remus setup - i.e. attach qdisc, enable disk buffering, etc */
+
+    if (rds->num_nics == 0 && rds->num_disks == 0)
+        goto out;
+
+    GCNEW_ARRAY(rds->dev, rds->num_nics + rds->num_disks);
+
+    /* TBD: CALL libxl__remus_device_init to init remus devices */
+
+    return;
+
+out:
+    rds->callback(egc, rds, rds->saved_rc);
+    return;
+}
+
+static __attribute__((unused)) void libxl__remus_device_init(libxl__egc *egc,
+                                     libxl__remus_device_state *rds,
+                                     libxl__remus_device_kind kind,
+                                     void *libxl_dev)
+{
+    libxl__remus_device *dev = NULL;
+
+    STATE_AO_GC(rds->ao);
+    GCNEW(dev);
+    dev->backend_dev = libxl_dev;
+    dev->kind = kind;
+    dev->rds = rds;
+
+    libxl__async_exec_init(&dev->aes);
+    libxl__ev_child_init(&dev->child);
+
+    /* match the ops begin */
+    dev->ops_index = 0;
+    dev->ops = rds->ops[dev->ops_index];
+    for (; dev->ops; dev->ops = rds->ops[++dev->ops_index]) {
+        if (dev->ops->kind == dev->kind) {
+            if (dev->ops->match) {
+                dev->callback = device_match_cb;
+                dev->ops->match(dev);
+            } else {
+                /*
+                 * This devops do not have match() implementation.
+                 * That means this *kind* of device's ops is always
+                 * matched with the *kind* of device.
+                 */
+                dev->callback = device_setup_cb;
+                dev->ops->setup(dev);
+            }
+            break;
+        }
+    }
+
+    if (!dev->ops) {
+        rds->num_devices++;
+        rds->saved_rc = ERROR_REMUS_DEVICE_NOT_SUPPORTED;
+        if (all_devices_handled(rds))
+            rds->callback(egc, rds, rds->saved_rc);
+    }
+}
+
+static void device_match_cb(libxl__egc *egc,
+                            libxl__remus_device *dev,
+                            int rc)
+{
+    libxl__remus_device_state *const rds = dev->rds;
+
+    STATE_AO_GC(rds->ao);
+
+    if (rds->saved_rc) {
+        /* there's already an error happened, we do not need to continue */
+        rds->num_devices++;
+        if (all_devices_handled(rds))
+            rds->callback(egc, rds, rds->saved_rc);
+        return;
+    }
+
+    if (rc) {
+        /* the ops does not match, try next ops */
+        dev->ops = rds->ops[++dev->ops_index];
+        if (!dev->ops || rc != ERROR_REMUS_DEVOPS_NOT_MATCH) {
+            /* the device can not be matched */
+            rds->num_devices++;
+            rds->saved_rc = ERROR_REMUS_DEVICE_NOT_SUPPORTED;
+            if (all_devices_handled(rds))
+                rds->callback(egc, rds, rds->saved_rc);
+            return;
+        }
+        for ( ; dev->ops; dev->ops = rds->ops[++dev->ops_index]) {
+            if (dev->ops->kind == dev->kind) {
+                /*
+                 * we have entered match process, that means this *kind* of
+                 * device's ops must have a match() implementation.
+                 */
+                assert(dev->ops->match);
+                dev->ops->match(dev);
+                break;
+            }
+        }
+    } else {
+        /* the ops matched, setup the device */
+        dev->callback = device_setup_cb;
+        dev->ops->setup(dev);
+    }
+}
+
+static void device_setup_cb(libxl__egc *egc,
+                            libxl__remus_device *dev,
+                            int rc)
+{
+    /* Convenience aliases */
+    libxl__remus_device_state *const rds = dev->rds;
+
+    STATE_AO_GC(rds->ao);
+
+    rds->num_devices++;
+    /*
+     * the netbuf script was designed as below:
+     * 1. when setup failed, the script won't teardown the device itself.
+     * 2. the teardown op is ok to be executed many times.
+     *
+     * we add devices that have been set up to the array no matter
+     * the setup process succeed or failed because we need to ensure
+     * the device been teardown while setup failed. If any of the
+     * device setup failed, we will quit remus, but before we exit,
+     * we will teardown the devices that have been added to **dev
+     */
+    rds->dev[rds->num_set_up++] = dev;
+    /* we preserve the first error that happened */
+    if (rc && !rds->saved_rc)
+        rds->saved_rc = rc;
+
+    if (all_devices_handled(rds))
+        rds->callback(egc, rds, rds->saved_rc);
+}
+
+void libxl__remus_devices_teardown(libxl__egc *egc, libxl__remus_device_state *rds)
+{
+    int i, num_set_up;
+    libxl__remus_device *dev;
+
+    STATE_AO_GC(rds->ao);
+
+    rds->saved_rc = 0;
+
+    if (rds->num_set_up == 0) {
+        destroy_device_subkind(rds);
+        goto out;
+    }
+
+    /* we will decrease rds->num_set_up in the teardown callback */
+    num_set_up = rds->num_set_up;
+    for (i = 0; i < num_set_up; i++) {
+        dev = rds->dev[i];
+        dev->callback = device_teardown_cb;
+        dev->ops->teardown(dev);
+    }
+
+    return;
+
+out:
+    rds->callback(egc, rds, rds->saved_rc);
+    return;
+}
+
+static void device_teardown_cb(libxl__egc *egc,
+                               libxl__remus_device *dev,
+                               int rc)
+{
+    libxl__remus_device_state *const rds = dev->rds;
+
+    STATE_AO_GC(rds->ao);
+
+    /* we preserve the first error that happened */
+    if (rc && !rds->saved_rc)
+        rds->saved_rc = rc;
+
+    /* ignore teardown errors to teardown as many devs as possible*/
+    rds->num_set_up--;
+
+    if (rds->num_set_up == 0) {
+        destroy_device_subkind(rds);
+        rds->callback(egc, rds, rds->saved_rc);
+    }
+}
+
+/*----- checkpointing APIs -----*/
+
+/* callbacks */
+
+static void device_checkpoint_cb(libxl__egc *egc,
+                                 libxl__remus_device *dev,
+                                 int rc);
+
+/* API implementations */
+
+#define define_remus_device_checkpoint_api(api)                             \
+void libxl__remus_devices_##api(libxl__egc *egc,                            \
+                                libxl__remus_device_state *rds)             \
+{                                                                           \
+    int i;                                                                  \
+    libxl__remus_device *dev;                                               \
+                                                                            \
+    STATE_AO_GC(rds->ao);                                                   \
+                                                                            \
+    rds->num_devices = 0;                                                   \
+    rds->saved_rc = 0;                                                      \
+                                                                            \
+    if (rds->num_set_up == 0)                                               \
+        goto out;                                                           \
+                                                                            \
+    for (i = 0; i < rds->num_set_up; i++) {                                 \
+        dev = rds->dev[i];                                                  \
+        dev->callback = device_checkpoint_cb;                               \
+        if (dev->ops->api) {                                                \
+            dev->ops->api(dev);                                             \
+        } else {                                                            \
+            rds->num_devices++;                                             \
+            if (rds->num_devices == rds->num_set_up)                        \
+                rds->callback(egc, rds, rds->saved_rc);                     \
+        }                                                                   \
+    }                                                                       \
+                                                                            \
+    return;                                                                 \
+                                                                            \
+out:                                                                        \
+    rds->callback(egc, rds, rds->saved_rc);                                 \
+}
+
+define_remus_device_checkpoint_api(postsuspend);
+
+define_remus_device_checkpoint_api(preresume);
+
+define_remus_device_checkpoint_api(commit);
+
+static void device_checkpoint_cb(libxl__egc *egc,
+                                 libxl__remus_device *dev,
+                                 int rc)
+{
+    /* Convenience aliases */
+    libxl__remus_device_state *const rds = dev->rds;
+
+    STATE_AO_GC(rds->ao);
+
+    rds->num_devices++;
+
+    if (rc)
+        rds->saved_rc = ERROR_FAIL;
+
+    if (rds->num_devices == rds->num_set_up)
+        rds->callback(egc, rds, rds->saved_rc);
+}
diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl
index a412f9c..61d31c1 100644
--- a/tools/libxl/libxl_types.idl
+++ b/tools/libxl/libxl_types.idl
@@ -58,6 +58,8 @@ libxl_error = Enumeration("error", [
     (-12, "OSEVENT_REG_FAIL"),
     (-13, "BUFFERFULL"),
     (-14, "UNKNOWN_CHILD"),
+    (-15, "REMUS_DEVOPS_NOT_MATCH"),
+    (-16, "REMUS_DEVICE_NOT_SUPPORTED"),
     ], value_namespace = "")
 
 libxl_domain_type = Enumeration("domain_type", [
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v16 3/7] remus netbuffer: implement remus network buffering for nic devices
  2014-07-15  6:46 [PATCH v16 0/7] Remus/Libxl: Remus network buffering and drbd disk Yang Hongyang
  2014-07-15  6:46 ` [PATCH v16 1/7] remus: add libnl3 dependency for network buffering support Yang Hongyang
  2014-07-15  6:46 ` [PATCH v16 2/7] remus: introduce remus device Yang Hongyang
@ 2014-07-15  6:46 ` Yang Hongyang
  2014-07-15  6:46 ` [PATCH v16 4/7] remus drbd: Implement remus drbd replicated disk Yang Hongyang
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 19+ messages in thread
From: Yang Hongyang @ 2014-07-15  6:46 UTC (permalink / raw)
  To: xen-devel
  Cc: ian.campbell, wency, andrew.cooper3, yunhong.jiang, ian.jackson,
	eddie.dong, rshriram, laijs

1.Add two members in libxl_domain_remus_info:
    netbuf: whether netbuf is enabled
    netbufscript: the path of the script which will be run to setup
                  and tear down the guest's interface.
2.Introduce remus-netbuf-setup hotplug script responsible for
  setting up and tearing down the necessary infrastructure required for
  network output buffering in Remus.  This script is intended to be invoked
  by libxl for each guest interface, when starting or stopping Remus.

  Apart from returning success/failure indication via the usual hotplug
  entries in xenstore, this script also writes to xenstore, the name of
  the REMUS_IFB device to be used to control the vif's network output.

  The script relies on libnl3 command line utilities to perform various
  setup/teardown functions. The script is confined to Linux platforms only
  since NetBSD does not seem to have libnl3.

  The following steps are taken during init:
    a) establish a dedicated remus context containing libnl related
       state (netlink sockets)

  The following steps are taken for each vif during setup:
    a) call the hotplug script to setup its network buffer and
       init qdisc caches

    b) Obtain handles to plug qdiscs installed on the REMUS_IFB devices
       chosen by the hotplug scripts.

  And during teardown, the netlink resources are released, followed by
  invocation of hotplug scripts to remove the ifb devices.
3.Implement the remus nic device interface. setup, teardown, etc. The
  checkpoint callbacks for netbuffer are implemented as sync op because
  net ops are quick enough and will not block.

Signed-off-by: Shriram Rajagopalan <rshriram@cs.ubc.ca>
Signed-off-by: Yang Hongyang <yanghy@cn.fujitsu.com>
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
---
 docs/misc/xenstore-paths.markdown      |   4 +
 tools/hotplug/Linux/Makefile           |   1 +
 tools/hotplug/Linux/remus-netbuf-setup | 206 ++++++++++++++
 tools/libxl/libxl.c                    |  18 ++
 tools/libxl/libxl_internal.h           |   8 +
 tools/libxl/libxl_netbuffer.c          | 490 +++++++++++++++++++++++++++++++++
 tools/libxl/libxl_nonetbuffer.c        |  24 ++
 tools/libxl/libxl_remus_device.c       |  21 +-
 tools/libxl/libxl_types.idl            |   2 +
 9 files changed, 771 insertions(+), 3 deletions(-)
 create mode 100644 tools/hotplug/Linux/remus-netbuf-setup

diff --git a/docs/misc/xenstore-paths.markdown b/docs/misc/xenstore-paths.markdown
index ea67536..d94ea9d 100644
--- a/docs/misc/xenstore-paths.markdown
+++ b/docs/misc/xenstore-paths.markdown
@@ -393,6 +393,10 @@ The guest's virtual time offset from UTC in seconds.
 
 The device model version for a domain.
 
+#### /libxl/$DOMID/remus/netbuf/$DEVID/ifb = STRING [n,INTERNAL]
+
+ifb device used by Remus to buffer network output from the associated vif.
+
 [BLKIF]: http://xenbits.xen.org/docs/unstable/hypercall/include,public,io,blkif.h.html
 [FBIF]: http://xenbits.xen.org/docs/unstable/hypercall/include,public,io,fbif.h.html
 [HVMPARAMS]: http://xenbits.xen.org/docs/unstable/hypercall/include,public,hvm,params.h.html
diff --git a/tools/hotplug/Linux/Makefile b/tools/hotplug/Linux/Makefile
index d5de9e6..721f8c0 100644
--- a/tools/hotplug/Linux/Makefile
+++ b/tools/hotplug/Linux/Makefile
@@ -16,6 +16,7 @@ XEN_SCRIPTS += vif-nat
 XEN_SCRIPTS += vif-openvswitch
 XEN_SCRIPTS += vif2
 XEN_SCRIPTS += vif-setup
+XEN_SCRIPTS-$(CONFIG_REMUS_NETBUF) += remus-netbuf-setup
 XEN_SCRIPTS += block
 XEN_SCRIPTS += block-enbd block-nbd
 XEN_SCRIPTS-$(CONFIG_BLKTAP1) += blktap
diff --git a/tools/hotplug/Linux/remus-netbuf-setup b/tools/hotplug/Linux/remus-netbuf-setup
new file mode 100644
index 0000000..b857b02
--- /dev/null
+++ b/tools/hotplug/Linux/remus-netbuf-setup
@@ -0,0 +1,206 @@
+#!/bin/bash
+#============================================================================
+# ${XEN_SCRIPT_DIR}/remus-netbuf-setup
+#
+# Script for attaching a network buffer to the specified vif (in any mode).
+# The hotplugging system will call this script when starting remus via libxl
+# API, libxl_domain_remus_start.
+#
+# Usage:
+# remus-netbuf-setup (setup|teardown)
+#
+# Environment vars:
+# vifname     vif interface name (required).
+# XENBUS_PATH path in Xenstore, where the REMUS_IFB device details will be
+#             stored or read from (required).
+#             (libxl passes /libxl/<domid>/remus/netbuf/<devid>)
+# REMUS_IFB   ifb interface to be cleaned up (required). [for teardown op only]
+
+# Written to the store: (setup operation)
+# XENBUS_PATH/ifb=<ifbdevName> the REMUS_IFB device serving
+#  as the intermediate buffer through which the interface's network output
+#  can be controlled.
+#
+# To install a network buffer on a guest vif (vif1.0) using ifb (ifb0)
+# we need to do the following
+#
+#  ip link set dev ifb0 up
+#  tc qdisc add dev vif1.0 ingress
+#  tc filter add dev vif1.0 parent ffff: proto ip \
+#    prio 10 u32 match u32 0 0 action mirred egress redirect dev ifb0
+#  nl-qdisc-add --dev=ifb0 --parent root plug
+#  nl-qdisc-add --dev=ifb0 --parent root --update plug --limit=10000000
+#                                                (10MB limit on buffer)
+#
+# So order of operations when installing a network buffer on vif1.0
+# 1. find a free ifb and bring up the device
+# 2. redirect traffic from vif1.0 to ifb:
+#   2.1 add ingress qdisc to vif1.0 (to capture outgoing packets from guest)
+#   2.2 use tc filter command with actions mirred egress + redirect
+# 3. install plug_qdisc on ifb device, with which we can buffer/release
+#    guest's network output from vif1.0
+#
+# Please note:
+# 1. when setup failed, the script won't teardown the device itself.
+# 2. the teardown op is ok to be executed many times.
+#
+
+#============================================================================
+
+# Unlike other vif scripts, vif-common is not needed here as it executes vif
+#specific setup code such as renaming.
+dir=$(dirname "$0")
+. "$dir/xen-hotplug-common.sh"
+
+findCommand "$@"
+
+if [ "$command" != "setup" -a  "$command" != "teardown" ]
+then
+  echo "Invalid command: $command"
+  log err "Invalid command: $command"
+  exit 1
+fi
+
+evalVariables "$@"
+
+: ${vifname:?}
+: ${XENBUS_PATH:?}
+
+check_libnl_tools() {
+    if ! command -v nl-qdisc-list > /dev/null 2>&1; then
+        fatal "Unable to find nl-qdisc-list tool"
+    fi
+    if ! command -v nl-qdisc-add > /dev/null 2>&1; then
+        fatal "Unable to find nl-qdisc-add tool"
+    fi
+    if ! command -v nl-qdisc-delete > /dev/null 2>&1; then
+        fatal "Unable to find nl-qdisc-delete tool"
+    fi
+}
+
+# We only check for modules. We don't load them.
+# User/Admin is supposed to load ifb during boot time,
+# ensuring that there are enough free ifbs in the system.
+# Other modules will be loaded automatically by tc commands.
+check_modules() {
+    for m in ifb sch_plug sch_ingress act_mirred cls_u32
+    do
+        if ! modinfo $m > /dev/null 2>&1; then
+            fatal "Unable to find $m kernel module"
+        fi
+    done
+}
+
+xs_write_failed() {
+    local vif=$1
+    local ifb=$2
+    teardown_netbuf "$vifname" "$REMUS_IFB"
+    fatal "failed to write ifb name to xenstore"
+}
+
+#return 0 if the ifb is free
+check_ifb() {
+    local installed=`nl-qdisc-list -d $1`
+    [ -n "$installed" ] && return 1
+
+    for domid in `xenstore-list "/local/domain" 2>/dev/null || true`
+    do
+        [ $domid -eq 0 ] && continue
+        xenstore-exists "/libxl/$domid/remus/netbuf" || continue
+        for devid in `xenstore-list "/libxl/$domid/remus/netbuf" 2>/dev/null || true`
+        do
+            local path="/libxl/$domid/remus/netbuf/$devid/ifb"
+            xenstore-exists $path || continue
+            local ifb=`xenstore-read "$path" 2>/dev/null || true`
+            [ "$ifb" = "$1" ] && return 1
+        done
+    done
+
+    return 0
+}
+
+setup_ifb() {
+
+    for ifb in `ifconfig -a -s|egrep ^ifb|cut -d ' ' -f1`
+    do
+        check_ifb "$ifb" || continue
+        REMUS_IFB="$ifb"
+        break
+    done
+
+    if [ -z "$REMUS_IFB" ]
+    then
+        fatal "Unable to find a free ifb device for $vifname"
+    fi
+
+    #not using xenstore_write that automatically exits on error
+    #because we need to cleanup
+    _xenstore_write "$XENBUS_PATH/ifb" "$REMUS_IFB" || xs_write_failed "$vifname" "$REMUS_IFB"
+    do_or_die ip link set dev "$REMUS_IFB" up
+}
+
+redirect_vif_traffic() {
+    local vif=$1
+    local ifb=$2
+
+    do_or_die tc qdisc add dev "$vif" ingress
+
+    tc filter add dev "$vif" parent ffff: proto ip prio 10 \
+        u32 match u32 0 0 action mirred egress redirect dev "$ifb" >/dev/null 2>&1
+
+    if [ $? -ne 0 ]
+    then
+        do_without_error tc qdisc del dev "$vif" ingress
+        fatal "Failed to redirect traffic from $vif to $ifb"
+    fi
+}
+
+add_plug_qdisc() {
+    local vif=$1
+    local ifb=$2
+
+    nl-qdisc-add --dev="$ifb" --parent root plug >/dev/null 2>&1
+    if [ $? -ne 0 ]
+    then
+        do_without_error tc qdisc del dev "$vif" ingress
+        fatal "Failed to add plug qdisc to $ifb"
+    fi
+
+    #set ifb buffering limit in bytes. Its okay if this command fails
+    nl-qdisc-add --dev="$ifb" --parent root \
+        --update plug --limit=10000000 >/dev/null 2>&1 || true
+}
+
+teardown_netbuf() {
+    local vif=$1
+    local ifb=$2
+
+    if [ "$ifb" ]; then
+        do_without_error ip link set dev "$ifb" down
+        do_without_error nl-qdisc-delete --dev="$ifb" --parent root plug >/dev/null 2>&1
+        xenstore-rm -t "$XENBUS_PATH/ifb" 2>/dev/null || true
+    fi
+    do_without_error tc qdisc del dev "$vif" ingress
+    xenstore-rm -t "$XENBUS_PATH/hotplug-status" 2>/dev/null || true
+    xenstore-rm -t "$XENBUS_PATH/hotplug-error" 2>/dev/null || true
+}
+
+case "$command" in
+    setup)
+        check_libnl_tools
+        check_modules
+
+        claim_lock "pickifb"
+        setup_ifb
+        redirect_vif_traffic "$vifname" "$REMUS_IFB"
+        add_plug_qdisc "$vifname" "$REMUS_IFB"
+        release_lock "pickifb"
+
+        success
+        ;;
+    teardown)
+        teardown_netbuf "$vifname" "$REMUS_IFB"
+        ;;
+esac
+
+log debug "Successful remus-netbuf-setup $command for $vifname, ifb $REMUS_IFB."
diff --git a/tools/libxl/libxl.c b/tools/libxl/libxl.c
index 95d9953..3c484de 100644
--- a/tools/libxl/libxl.c
+++ b/tools/libxl/libxl.c
@@ -789,6 +789,7 @@ static void remus_failover_cb(libxl__egc *egc,
                               libxl__domain_suspend_state *dss, int rc);
 
 static const libxl__remus_device_subkind_ops *remus_ops[] = {
+    &remus_device_nic,
     NULL,
 };
 
@@ -822,6 +823,23 @@ int libxl_domain_remus_start(libxl_ctx *ctx, libxl_domain_remus_info *info,
 
     /* Convenience aliases */
     libxl__remus_device_state *const rds = &dss->rds;
+
+    if (info->netbuf) {
+        if (!libxl__netbuffer_enabled(gc)) {
+            LOG(ERROR, "Remus: No support for network buffering");
+            goto out;
+        }
+
+        if (info->netbufscript) {
+            rds->netbufscript =
+                libxl__strdup(gc, info->netbufscript);
+        } else {
+            rds->netbufscript =
+                GCSPRINTF("%s/remus-netbuf-setup",
+                libxl__xen_script_dir_path());
+        }
+    }
+
     rds->ao = ao;
     rds->egc = egc;
     rds->domid = domid;
diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
index 1eba7af..06dfb73 100644
--- a/tools/libxl/libxl_internal.h
+++ b/tools/libxl/libxl_internal.h
@@ -139,6 +139,7 @@ typedef struct libxl__ao libxl__ao;
 typedef struct libxl__aop_occurred libxl__aop_occurred;
 typedef struct libxl__osevent_hook_nexus libxl__osevent_hook_nexus;
 typedef struct libxl__osevent_hook_nexi libxl__osevent_hook_nexi;
+typedef struct libxl__remus_netbuf_state libxl__remus_netbuf_state;
 
 _hidden void libxl__alloc_failed(libxl_ctx *, const char *func,
                          size_t nmemb, size_t size) __attribute__((noreturn));
@@ -374,6 +375,8 @@ struct libxl__ctx {
     LIBXL_LIST_ENTRY(libxl_ctx) sigchld_users_entry;
 
     libxl_version_info version_info;
+
+    libxl__remus_netbuf_state *rns;
 };
 
 typedef struct {
@@ -2576,12 +2579,14 @@ struct libxl__remus_device_state {
     libxl__remus_callback *callback;
     /* the last ops must be NULL */
     const libxl__remus_device_subkind_ops **ops;
+    const char *netbufscript;
 
     /* private */
     /* devices that have been set up */
     int saved_rc;
     libxl__remus_device **dev;
 
+    libxl_device_nic *nics;
     int num_nics;
     int num_disks;
 
@@ -2639,6 +2644,9 @@ _hidden void libxl__remus_devices_preresume(libxl__egc *egc,
                                             libxl__remus_device_state *rds);
 _hidden void libxl__remus_devices_commit(libxl__egc *egc,
                                          libxl__remus_device_state *rds);
+
+extern const libxl__remus_device_subkind_ops remus_device_nic;
+
 _hidden int libxl__netbuffer_enabled(libxl__gc *gc);
 
 /*----- Domain suspend (save) state structure -----*/
diff --git a/tools/libxl/libxl_netbuffer.c b/tools/libxl/libxl_netbuffer.c
index 52d593c..39025f4 100644
--- a/tools/libxl/libxl_netbuffer.c
+++ b/tools/libxl/libxl_netbuffer.c
@@ -17,11 +17,501 @@
 
 #include "libxl_internal.h"
 
+#include <netlink/cache.h>
+#include <netlink/socket.h>
+#include <netlink/attr.h>
+#include <netlink/route/link.h>
+#include <netlink/route/route.h>
+#include <netlink/route/qdisc.h>
+#include <netlink/route/qdisc/plug.h>
+
+struct libxl__remus_netbuf_state {
+    libxl__ao *ao;
+    uint32_t domid;
+
+    struct nl_sock *nlsock;
+    struct nl_cache *qdisc_cache;
+};
+
+typedef struct libxl__remus_device_nic {
+    int devid;
+    const char *vif;
+    const char *ifb;
+    struct rtnl_qdisc *qdisc;
+} libxl__remus_device_nic;
+
 int libxl__netbuffer_enabled(libxl__gc *gc)
 {
     return 1;
 }
 
+/*----- init() and destroy() -----*/
+
+static int nic_init(libxl__remus_device_state *rds)
+{
+    int rc, ret;
+    libxl__remus_netbuf_state *ns;
+
+    STATE_AO_GC(rds->ao);
+
+    GCNEW(ns);
+    CTX->rns = ns;
+
+    ns->nlsock = nl_socket_alloc();
+    if (!ns->nlsock) {
+        LOG(ERROR, "cannot allocate nl socket");
+        rc = ERROR_FAIL;
+        goto out;
+    }
+
+    ret = nl_connect(ns->nlsock, NETLINK_ROUTE);
+    if (ret) {
+        LOG(ERROR, "failed to open netlink socket: %s",
+            nl_geterror(ret));
+        rc = ERROR_FAIL;
+        goto out;
+    }
+
+    /* get list of all qdiscs installed on network devs. */
+    ret = rtnl_qdisc_alloc_cache(ns->nlsock, &ns->qdisc_cache);
+    if (ret) {
+        LOG(ERROR, "failed to allocate qdisc cache: %s",
+            nl_geterror(ret));
+        rc = ERROR_FAIL;
+        goto out;
+    }
+
+    ns->ao = rds->ao;
+    ns->domid = rds->domid;
+
+    rc = 0;
+
+out:
+    return rc;
+}
+
+static void nic_destroy(libxl__remus_device_state *rds)
+{
+    STATE_AO_GC(rds->ao);
+    libxl__remus_netbuf_state *ns = CTX->rns;
+
+    if (!ns)
+        return;
+
+    /* free qdisc cache */
+    if (ns->qdisc_cache) {
+        nl_cache_clear(ns->qdisc_cache);
+        nl_cache_free(ns->qdisc_cache);
+        ns->qdisc_cache = NULL;
+    }
+
+    /* close & free nlsock */
+    if (ns->nlsock) {
+        nl_close(ns->nlsock);
+        nl_socket_free(ns->nlsock);
+        ns->nlsock = NULL;
+    }
+}
+
+/*----- setup() and teardown() -----*/
+
+/* helper functions */
+
+/*
+ * If the device has a vifname, then use that instead of
+ * the vifX.Y format.
+ * it must ONLY be used for remus because if driver domains
+ * were in use it would constitute a security vulnerability.
+ */
+static const char *get_vifname(libxl__remus_device *dev,
+                               const libxl_device_nic *nic)
+{
+    const char *vifname = NULL;
+    const char *path;
+    int rc;
+
+    STATE_AO_GC(dev->rds->ao);
+
+    /* Convenience aliases */
+    libxl__remus_netbuf_state *netbuf_state = CTX->rns;
+    const uint32_t domid = netbuf_state->domid;
+
+    path = GCSPRINTF("%s/backend/vif/%d/%d/vifname",
+                     libxl__xs_get_dompath(gc, 0), domid, nic->devid);
+    rc = libxl__xs_read_checked(gc, XBT_NULL, path, &vifname);
+    if (!rc && !vifname) {
+        vifname = libxl__device_nic_devname(gc, domid,
+                                            nic->devid,
+                                            nic->nictype);
+    }
+
+    return vifname;
+}
+
+static void free_qdisc(libxl__remus_device_nic *remus_nic)
+{
+    if (remus_nic->qdisc == NULL)
+        return;
+
+    nl_object_put((struct nl_object *)(remus_nic->qdisc));
+    remus_nic->qdisc = NULL;
+}
+
+static int init_qdisc(libxl__remus_netbuf_state *netbuf_state,
+                      libxl__remus_device_nic *remus_nic)
+{
+    int rc, ret, ifindex;
+    struct rtnl_link *ifb = NULL;
+    struct rtnl_qdisc *qdisc = NULL;
+
+    STATE_AO_GC(netbuf_state->ao);
+
+    /* Now that we have brought up REMUS_IFB device with plug qdisc for
+     * this vif, so we need to refill the qdisc cache.
+     */
+    ret = nl_cache_refill(netbuf_state->nlsock, netbuf_state->qdisc_cache);
+    if (ret) {
+        LOG(ERROR, "cannot refill qdisc cache: %s", nl_geterror(ret));
+        rc = ERROR_FAIL;
+        goto out;
+    }
+
+    /* get a handle to the REMUS_IFB interface */
+    ifb = NULL;
+    ret = rtnl_link_get_kernel(netbuf_state->nlsock, 0,
+                               remus_nic->ifb, &ifb);
+    if (ret) {
+        LOG(ERROR, "cannot obtain handle for %s: %s", remus_nic->ifb,
+            nl_geterror(ret));
+        rc = ERROR_FAIL;
+        goto out;
+    }
+
+    ifindex = rtnl_link_get_ifindex(ifb);
+    if (!ifindex) {
+        LOG(ERROR, "interface %s has no index", remus_nic->ifb);
+        rc = ERROR_FAIL;
+        goto out;
+    }
+
+    /* Get a reference to the root qdisc installed on the REMUS_IFB, by
+     * querying the qdisc list we obtained earlier. The netbufscript
+     * sets up the plug qdisc as the root qdisc, so we don't have to
+     * search the entire qdisc tree on the REMUS_IFB dev.
+
+     * There is no need to explicitly free this qdisc as its just a
+     * reference from the qdisc cache we allocated earlier.
+     */
+    qdisc = rtnl_qdisc_get_by_parent(netbuf_state->qdisc_cache, ifindex,
+                                     TC_H_ROOT);
+
+    if (qdisc) {
+        const char *tc_kind = rtnl_tc_get_kind(TC_CAST(qdisc));
+        /* Sanity check: Ensure that the root qdisc is a plug qdisc. */
+        if (!tc_kind || strcmp(tc_kind, "plug")) {
+            LOG(ERROR, "plug qdisc is not installed on %s", remus_nic->ifb);
+            rc = ERROR_FAIL;
+            goto out;
+        }
+        remus_nic->qdisc = qdisc;
+    } else {
+        LOG(ERROR, "Cannot get qdisc handle from ifb %s", remus_nic->ifb);
+        rc = ERROR_FAIL;
+        goto out;
+    }
+
+    rc = 0;
+
+out:
+    if (ifb)
+        rtnl_link_put(ifb);
+
+    if (rc && qdisc)
+        nl_object_put((struct nl_object *)qdisc);
+
+    return rc;
+}
+
+/* callbacks */
+
+static void netbuf_setup_script_cb(libxl__egc *egc,
+                                   libxl__async_exec_state *aes,
+                                   int status);
+static void netbuf_teardown_script_cb(libxl__egc *egc,
+                                      libxl__async_exec_state *aes,
+                                      int status);
+
+/*
+ * the script needs the following env & args
+ * $vifname
+ * $XENBUS_PATH (/libxl/<domid>/remus/netbuf/<devid>/)
+ * $REMUS_IFB (for teardown)
+ * setup/teardown as command line arg.
+ */
+static void setup_async_exec(libxl__remus_device *dev, char *op)
+{
+    int arraysize, nr = 0;
+    char **env = NULL, **args = NULL;
+    libxl__async_exec_state *aes = &dev->aes;
+    libxl__remus_device_nic *remus_nic = dev->data;
+    STATE_AO_GC(dev->rds->ao);
+
+    /* Convenience aliases */
+    libxl__remus_netbuf_state *ns = CTX->rns;
+    char *const script = libxl__strdup(gc, dev->rds->netbufscript);
+    const uint32_t domid = ns->domid;
+    const int dev_id = remus_nic->devid;
+    const char *const vif = remus_nic->vif;
+    const char *const ifb = remus_nic->ifb;
+
+    arraysize = 7;
+    GCNEW_ARRAY(env, arraysize);
+    env[nr++] = "vifname";
+    env[nr++] = libxl__strdup(gc, vif);
+    env[nr++] = "XENBUS_PATH";
+    env[nr++] = GCSPRINTF("%s/remus/netbuf/%d",
+                          libxl__xs_libxl_path(gc, domid), dev_id);
+    if (!strcmp(op, "teardown") && ifb) {
+        env[nr++] = "REMUS_IFB";
+        env[nr++] = libxl__strdup(gc, ifb);
+    }
+    env[nr++] = NULL;
+    assert(nr <= arraysize);
+
+    arraysize = 3; nr = 0;
+    GCNEW_ARRAY(args, arraysize);
+    args[nr++] = script;
+    args[nr++] = op;
+    args[nr++] = NULL;
+    assert(nr == arraysize);
+
+    aes->ao = dev->rds->ao;
+    aes->what = GCSPRINTF("%s %s", args[0], args[1]);
+    aes->env = env;
+    aes->args = args;
+    aes->timeout_ms = LIBXL_HOTPLUG_TIMEOUT * 1000;
+    aes->stdfds[0] = -1;
+    aes->stdfds[1] = -1;
+    aes->stdfds[2] = -1;
+
+    if (!strcmp(op, "teardown"))
+        aes->callback = netbuf_teardown_script_cb;
+    else
+        aes->callback = netbuf_setup_script_cb;
+}
+
+/* setup() and teardown() */
+
+static void nic_setup(libxl__remus_device *dev)
+{
+    int rc;
+    libxl__remus_device_nic *remus_nic;
+    const libxl_device_nic *nic = dev->backend_dev;
+
+    STATE_AO_GC(dev->rds->ao);
+
+    GCNEW(remus_nic);
+    dev->data = remus_nic;
+    remus_nic->devid = nic->devid;
+    remus_nic->vif = get_vifname(dev, nic);
+    if (!remus_nic->vif) {
+        rc = ERROR_FAIL;
+        goto out;
+    }
+
+    setup_async_exec(dev, "setup");
+    rc = libxl__async_exec_start(gc, &dev->aes);
+    if (rc)
+        goto out;
+
+    return;
+
+out:
+    dev->callback(dev->rds->egc, dev, rc);
+}
+
+/*
+ * In return, the script writes the name of REMUS_IFB device (during setup)
+ * to be used for output buffering into XENBUS_PATH/ifb
+ */
+static void netbuf_setup_script_cb(libxl__egc *egc,
+                                   libxl__async_exec_state *aes,
+                                   int status)
+{
+    libxl__remus_device *dev = CONTAINER_OF(aes, *dev, aes);
+    libxl__remus_device_nic *remus_nic = dev->data;
+    const char *out_path_base, *hotplug_error = NULL;
+    int rc;
+
+    STATE_AO_GC(dev->rds->ao);
+
+    /* Convenience aliases */
+    libxl__remus_netbuf_state *netbuf_state = CTX->rns;
+    const uint32_t domid = netbuf_state->domid;
+    const int devid = remus_nic->devid;
+    const char *const vif = remus_nic->vif;
+    const char **const ifb = &remus_nic->ifb;
+
+    /*
+     * we need to get ifb first because it's needed for teardown
+     */
+    rc = libxl__xs_read_checked(gc, XBT_NULL,
+                                GCSPRINTF("%s/remus/netbuf/%d/ifb",
+                                          libxl__xs_libxl_path(gc, domid),
+                                          devid),
+                                ifb);
+    if (rc)
+        goto out;
+
+    if (!(*ifb)) {
+        LOG(ERROR, "Cannot get ifb dev name for domain %u dev %s",
+            domid, vif);
+        rc = ERROR_FAIL;
+        goto out;
+    }
+
+    out_path_base = GCSPRINTF("%s/remus/netbuf/%d",
+                              libxl__xs_libxl_path(gc, domid), devid);
+
+    rc = libxl__xs_read_checked(gc, XBT_NULL,
+                                GCSPRINTF("%s/hotplug-error", out_path_base),
+                                &hotplug_error);
+    if (rc)
+        goto out;
+
+    if (hotplug_error) {
+        LOG(ERROR, "netbuf script %s setup failed for vif %s: %s",
+            dev->rds->netbufscript, vif, hotplug_error);
+        rc = ERROR_FAIL;
+        goto out;
+    }
+
+    if (status) {
+        rc = ERROR_FAIL;
+        goto out;
+    }
+
+    LOG(DEBUG, "%s will buffer packets from vif %s", *ifb, vif);
+    rc = init_qdisc(netbuf_state, remus_nic);
+
+out:
+    dev->callback(egc, dev, rc);
+}
+
+static void nic_teardown(libxl__remus_device *dev)
+{
+    int rc;
+    STATE_AO_GC(dev->rds->ao);
+
+    setup_async_exec(dev, "teardown");
+
+    rc = libxl__async_exec_start(gc, &dev->aes);
+    if (rc)
+        goto out;
+
+    return;
+
+out:
+    dev->callback(dev->rds->egc, dev, rc);
+}
+
+static void netbuf_teardown_script_cb(libxl__egc *egc,
+                                      libxl__async_exec_state *aes,
+                                      int status)
+{
+    int rc;
+    libxl__remus_device *dev = CONTAINER_OF(aes, *dev, aes);
+    libxl__remus_device_nic *remus_nic = dev->data;
+
+    if (status)
+        rc = ERROR_FAIL;
+    else
+        rc = 0;
+
+    free_qdisc(remus_nic);
+
+    dev->callback(egc, dev, rc);
+}
+
+/*----- checkpointing APIs -----*/
+
+/* The value of buffer_op, not the value passed to kernel */
+enum {
+    tc_buffer_start,
+    tc_buffer_release
+};
+
+/* API implementations */
+
+static int remus_netbuf_op(libxl__remus_device_nic *remus_nic,
+                           libxl__remus_netbuf_state *netbuf_state,
+                           int buffer_op)
+{
+    int rc, ret;
+
+    STATE_AO_GC(netbuf_state->ao);
+
+    if (buffer_op == tc_buffer_start)
+        ret = rtnl_qdisc_plug_buffer(remus_nic->qdisc);
+    else
+        ret = rtnl_qdisc_plug_release_one(remus_nic->qdisc);
+
+    if (ret) {
+        rc = ERROR_FAIL;
+        goto out;
+    }
+
+    ret = rtnl_qdisc_add(netbuf_state->nlsock,
+                         remus_nic->qdisc,
+                         NLM_F_REQUEST);
+    if (ret) {
+        rc = ERROR_FAIL;
+        goto out;
+    }
+
+    rc = 0;
+
+out:
+    if (rc)
+        LOG(ERROR, "Remus: cannot do netbuf op %s on %s:%s",
+            ((buffer_op == tc_buffer_start) ?
+            "start_new_epoch" : "release_prev_epoch"),
+            remus_nic->ifb, nl_geterror(ret));
+    return rc;
+}
+
+static void nic_postsuspend(libxl__remus_device *dev)
+{
+    int rc;
+    libxl__remus_device_nic *remus_nic = dev->data;
+    STATE_AO_GC(dev->rds->ao);
+    libxl__remus_netbuf_state *ns = CTX->rns;
+
+    rc = remus_netbuf_op(remus_nic, ns, tc_buffer_start);
+    dev->callback(dev->rds->egc, dev, rc);
+}
+
+static void nic_commit(libxl__remus_device *dev)
+{
+    int rc;
+    libxl__remus_device_nic *remus_nic = dev->data;
+    STATE_AO_GC(dev->rds->ao);
+    libxl__remus_netbuf_state *ns = CTX->rns;
+
+    rc = remus_netbuf_op(remus_nic, ns, tc_buffer_release);
+    dev->callback(dev->rds->egc, dev, rc);
+}
+
+const libxl__remus_device_subkind_ops remus_device_nic = {
+    .kind = LIBXL__REMUS_DEVICE_NIC,
+    .init = nic_init,
+    .destroy = nic_destroy,
+    .setup = nic_setup,
+    .teardown = nic_teardown,
+    .postsuspend = nic_postsuspend,
+    .commit = nic_commit,
+};
+
 /*
  * Local variables:
  * mode: C
diff --git a/tools/libxl/libxl_nonetbuffer.c b/tools/libxl/libxl_nonetbuffer.c
index 1c72a7f..fc29c36 100644
--- a/tools/libxl/libxl_nonetbuffer.c
+++ b/tools/libxl/libxl_nonetbuffer.c
@@ -22,6 +22,30 @@ int libxl__netbuffer_enabled(libxl__gc *gc)
     return 0;
 }
 
+static void nic_match(libxl__remus_device *dev)
+{
+    STATE_AO_GC(dev->rds->ao);
+
+    dev->callback(dev->rds->egc, dev, ERROR_FAIL);
+}
+
+static int nic_init(libxl__remus_device_state *rds)
+{
+    return 0;
+}
+
+static void nic_destroy(libxl__remus_device_state *rds)
+{
+    return;
+}
+
+const libxl__remus_device_subkind_ops remus_device_nic = {
+    .kind = LIBXL__REMUS_DEVICE_NIC,
+    .init = nic_init,
+    .destroy = nic_destroy,
+    .match = nic_match,
+};
+
 /*
  * Local variables:
  * mode: C
diff --git a/tools/libxl/libxl_remus_device.c b/tools/libxl/libxl_remus_device.c
index 14cca79..7e0a359 100644
--- a/tools/libxl/libxl_remus_device.c
+++ b/tools/libxl/libxl_remus_device.c
@@ -66,12 +66,13 @@ static void device_teardown_cb(libxl__egc *egc,
 
 /* remus device setup and teardown */
 
-static __attribute__((unused)) void libxl__remus_device_init(libxl__egc *egc,
+static void libxl__remus_device_init(libxl__egc *egc,
                                      libxl__remus_device_state *rds,
                                      libxl__remus_device_kind kind,
                                      void *libxl_dev);
 void libxl__remus_devices_setup(libxl__egc *egc, libxl__remus_device_state *rds)
 {
+    int i;
     STATE_AO_GC(rds->ao);
 
     if (!rds->ops[0])
@@ -85,7 +86,9 @@ void libxl__remus_devices_setup(libxl__egc *egc, libxl__remus_device_state *rds)
     rds->num_nics = 0;
     rds->num_disks = 0;
 
-    /* TBD: Remus setup - i.e. attach qdisc, enable disk buffering, etc */
+    /* TBD: Remus setup - i.e. enable disk buffering, etc */
+    if (rds->netbufscript)
+        rds->nics = libxl_device_nic_list(CTX, rds->domid, &rds->num_nics);
 
     if (rds->num_nics == 0 && rds->num_disks == 0)
         goto out;
@@ -93,6 +96,10 @@ void libxl__remus_devices_setup(libxl__egc *egc, libxl__remus_device_state *rds)
     GCNEW_ARRAY(rds->dev, rds->num_nics + rds->num_disks);
 
     /* TBD: CALL libxl__remus_device_init to init remus devices */
+    for (i = 0; i < rds->num_nics; i++) {
+        libxl__remus_device_init(egc, rds,
+                                 LIBXL__REMUS_DEVICE_NIC, &rds->nics[i]);
+    }
 
     return;
 
@@ -101,7 +108,7 @@ out:
     return;
 }
 
-static __attribute__((unused)) void libxl__remus_device_init(libxl__egc *egc,
+static void libxl__remus_device_init(libxl__egc *egc,
                                      libxl__remus_device_state *rds,
                                      libxl__remus_device_kind kind,
                                      void *libxl_dev)
@@ -254,6 +261,7 @@ static void device_teardown_cb(libxl__egc *egc,
                                libxl__remus_device *dev,
                                int rc)
 {
+    int i;
     libxl__remus_device_state *const rds = dev->rds;
 
     STATE_AO_GC(rds->ao);
@@ -266,6 +274,13 @@ static void device_teardown_cb(libxl__egc *egc,
     rds->num_set_up--;
 
     if (rds->num_set_up == 0) {
+        /* clean nic */
+        for (i = 0; i < rds->num_nics; i++)
+            libxl_device_nic_dispose(&rds->nics[i]);
+        free(rds->nics);
+        rds->nics = NULL;
+        rds->num_nics = 0;
+
         destroy_device_subkind(rds);
         rds->callback(egc, rds, rds->saved_rc);
     }
diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl
index 61d31c1..a96771e 100644
--- a/tools/libxl/libxl_types.idl
+++ b/tools/libxl/libxl_types.idl
@@ -590,6 +590,8 @@ libxl_domain_remus_info = Struct("domain_remus_info",[
     ("interval",     integer),
     ("blackhole",    bool),
     ("compression",  bool),
+    ("netbuf",       bool),
+    ("netbufscript", string),
     ])
 
 libxl_event_type = Enumeration("event_type", [
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v16 4/7] remus drbd: Implement remus drbd replicated disk
  2014-07-15  6:46 [PATCH v16 0/7] Remus/Libxl: Remus network buffering and drbd disk Yang Hongyang
                   ` (2 preceding siblings ...)
  2014-07-15  6:46 ` [PATCH v16 3/7] remus netbuffer: implement remus network buffering for nic devices Yang Hongyang
@ 2014-07-15  6:46 ` Yang Hongyang
  2014-07-15  6:46 ` [PATCH v16 5/7] libxl: network buffering cmdline switch Yang Hongyang
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 19+ messages in thread
From: Yang Hongyang @ 2014-07-15  6:46 UTC (permalink / raw)
  To: xen-devel
  Cc: ian.campbell, wency, andrew.cooper3, yunhong.jiang, ian.jackson,
	eddie.dong, rshriram, laijs

Implement remus-drbd-replicated-checkpointing-disk based on
generic remus devices framework.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Yang Hongyang <yanghy@cn.fujitsu.com>
---
 tools/hotplug/Linux/Makefile         |   1 +
 tools/hotplug/Linux/block-drbd-probe |  85 +++++++++++++
 tools/libxl/Makefile                 |   2 +-
 tools/libxl/libxl.c                  |   1 +
 tools/libxl/libxl_internal.h         |   4 +
 tools/libxl/libxl_remus_device.c     |  16 ++-
 tools/libxl/libxl_remus_disk_drbd.c  | 240 +++++++++++++++++++++++++++++++++++
 7 files changed, 346 insertions(+), 3 deletions(-)
 create mode 100755 tools/hotplug/Linux/block-drbd-probe
 create mode 100644 tools/libxl/libxl_remus_disk_drbd.c

diff --git a/tools/hotplug/Linux/Makefile b/tools/hotplug/Linux/Makefile
index 721f8c0..15d1b37 100644
--- a/tools/hotplug/Linux/Makefile
+++ b/tools/hotplug/Linux/Makefile
@@ -24,6 +24,7 @@ XEN_SCRIPTS += xen-hotplug-cleanup
 XEN_SCRIPTS += external-device-migrate
 XEN_SCRIPTS += vscsi
 XEN_SCRIPTS += block-iscsi
+XEN_SCRIPTS += block-drbd-probe
 XEN_SCRIPTS += $(XEN_SCRIPTS-y)
 
 XEN_SCRIPT_DATA = xen-script-common.sh locking.sh logging.sh
diff --git a/tools/hotplug/Linux/block-drbd-probe b/tools/hotplug/Linux/block-drbd-probe
new file mode 100755
index 0000000..3a3d446
--- /dev/null
+++ b/tools/hotplug/Linux/block-drbd-probe
@@ -0,0 +1,85 @@
+#! /bin/bash
+#
+# Copyright (C) 2014 FUJITSU LIMITED
+#
+# This library is free software; you can redistribute it and/or
+# modify it under the terms of version 2.1 of the GNU Lesser General Public
+# License as published by the Free Software Foundation.
+#
+# This library is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with this library; if not, write to the Free Software
+# Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+#
+# Usage:
+#     block-drbd-probe devicename
+#
+# Return value:
+#     0: the device is drbd device
+#     1: the device is not drbd device
+#     2: unkown error
+#     3: the drbd device does not use protocol D
+#     4: the drbd device is not ready
+
+drbd_res=
+
+function get_res_name()
+{
+    local drbd_dev=$1
+    local drbd_dev_list=($(drbdadm sh-dev all))
+    local drbd_res_list=($(drbdadm sh-resource all))
+    local temp_drbd_dev temp_drbd_res
+    local found=0
+
+    for temp_drbd_dev in ${drbd_dev_list[@]}; do
+        if [[ "$temp_drbd_dev" == "$drbd_dev" ]]; then
+            found=1
+            break
+        fi
+    done
+
+    if [[ $found -eq 0 ]]; then
+        return 1
+    fi
+
+    for temp_drbd_res in ${drbd_res_list[@]}; do
+        temp_drbd_dev=$(drbdadm sh-dev $temp_drbd_res)
+        if [[ "$temp_drbd_dev" == "$drbd_dev" ]]; then
+            drbd_res="$temp_drbd_res"
+            return 0
+        fi
+    done
+
+    # OOPS
+    return 2
+}
+
+get_res_name $1
+rc=$?
+if [[ $rc -ne 0 ]]; then
+    exit $rc
+fi
+
+# check protocol
+drbdsetup $1 show | grep -q "protocol D;"
+if [[ $? -ne 0 ]]; then
+    exit 3
+fi
+
+# check connect status
+state=$(drbdadm cstate "$drbd_res")
+if [[ "$state" != "Connected" ]]; then
+    exit 4
+fi
+
+# check role
+role=$(drbdadm role "$drbd_res")
+if [[ "$role" != "Primary/Secondary" ]]; then
+    exit 4
+fi
+
+exit 0
diff --git a/tools/libxl/Makefile b/tools/libxl/Makefile
index 202f1bb..ba10ab7 100644
--- a/tools/libxl/Makefile
+++ b/tools/libxl/Makefile
@@ -56,7 +56,7 @@ else
 LIBXL_OBJS-y += libxl_nonetbuffer.o
 endif
 
-LIBXL_OBJS-y += libxl_remus_device.o
+LIBXL_OBJS-y += libxl_remus_device.o libxl_remus_disk_drbd.o
 
 LIBXL_OBJS-$(CONFIG_X86) += libxl_cpuid.o libxl_x86.o
 LIBXL_OBJS-$(CONFIG_ARM) += libxl_nocpuid.o libxl_arm.o
diff --git a/tools/libxl/libxl.c b/tools/libxl/libxl.c
index 3c484de..00b298a 100644
--- a/tools/libxl/libxl.c
+++ b/tools/libxl/libxl.c
@@ -790,6 +790,7 @@ static void remus_failover_cb(libxl__egc *egc,
 
 static const libxl__remus_device_subkind_ops *remus_ops[] = {
     &remus_device_nic,
+    &remus_device_drbd_disk,
     NULL,
 };
 
diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
index 06dfb73..ef6bd88 100644
--- a/tools/libxl/libxl_internal.h
+++ b/tools/libxl/libxl_internal.h
@@ -140,6 +140,7 @@ typedef struct libxl__aop_occurred libxl__aop_occurred;
 typedef struct libxl__osevent_hook_nexus libxl__osevent_hook_nexus;
 typedef struct libxl__osevent_hook_nexi libxl__osevent_hook_nexi;
 typedef struct libxl__remus_netbuf_state libxl__remus_netbuf_state;
+typedef struct libxl__remus_drbd_state libxl__remus_drbd_state;
 
 _hidden void libxl__alloc_failed(libxl_ctx *, const char *func,
                          size_t nmemb, size_t size) __attribute__((noreturn));
@@ -377,6 +378,7 @@ struct libxl__ctx {
     libxl_version_info version_info;
 
     libxl__remus_netbuf_state *rns;
+    libxl__remus_drbd_state *drbd_state;
 };
 
 typedef struct {
@@ -2588,6 +2590,7 @@ struct libxl__remus_device_state {
 
     libxl_device_nic *nics;
     int num_nics;
+    libxl_device_disk *disks;
     int num_disks;
 
     /* for counting devices that have been handled */
@@ -2646,6 +2649,7 @@ _hidden void libxl__remus_devices_commit(libxl__egc *egc,
                                          libxl__remus_device_state *rds);
 
 extern const libxl__remus_device_subkind_ops remus_device_nic;
+extern const libxl__remus_device_subkind_ops remus_device_drbd_disk;
 
 _hidden int libxl__netbuffer_enabled(libxl__gc *gc);
 
diff --git a/tools/libxl/libxl_remus_device.c b/tools/libxl/libxl_remus_device.c
index 7e0a359..fd2fe02 100644
--- a/tools/libxl/libxl_remus_device.c
+++ b/tools/libxl/libxl_remus_device.c
@@ -86,21 +86,26 @@ void libxl__remus_devices_setup(libxl__egc *egc, libxl__remus_device_state *rds)
     rds->num_nics = 0;
     rds->num_disks = 0;
 
-    /* TBD: Remus setup - i.e. enable disk buffering, etc */
     if (rds->netbufscript)
         rds->nics = libxl_device_nic_list(CTX, rds->domid, &rds->num_nics);
 
+    rds->disks = libxl_device_disk_list(CTX, rds->domid, &rds->num_disks);
+
     if (rds->num_nics == 0 && rds->num_disks == 0)
         goto out;
 
     GCNEW_ARRAY(rds->dev, rds->num_nics + rds->num_disks);
 
-    /* TBD: CALL libxl__remus_device_init to init remus devices */
     for (i = 0; i < rds->num_nics; i++) {
         libxl__remus_device_init(egc, rds,
                                  LIBXL__REMUS_DEVICE_NIC, &rds->nics[i]);
     }
 
+    for (i = 0; i < rds->num_disks; i++) {
+        libxl__remus_device_init(egc, rds,
+                                 LIBXL__REMUS_DEVICE_DISK, &rds->disks[i]);
+    }
+
     return;
 
 out:
@@ -281,6 +286,13 @@ static void device_teardown_cb(libxl__egc *egc,
         rds->nics = NULL;
         rds->num_nics = 0;
 
+        /* clean disk */
+        for (i = 0; i < rds->num_disks; i++)
+            libxl_device_disk_dispose(&rds->disks[i]);
+        free(rds->disks);
+        rds->disks = NULL;
+        rds->num_disks = 0;
+
         destroy_device_subkind(rds);
         rds->callback(egc, rds, rds->saved_rc);
     }
diff --git a/tools/libxl/libxl_remus_disk_drbd.c b/tools/libxl/libxl_remus_disk_drbd.c
new file mode 100644
index 0000000..ae5a9d6
--- /dev/null
+++ b/tools/libxl/libxl_remus_disk_drbd.c
@@ -0,0 +1,240 @@
+/*
+ * Copyright (C) 2014 FUJITSU LIMITED
+ * Author Lai Jiangshan <laijs@cn.fujitsu.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU Lesser General Public License as published
+ * by the Free Software Foundation; version 2.1 only. with the special
+ * exception on linking described in file LICENSE.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU Lesser General Public License for more details.
+ */
+
+#include "libxl_osdeps.h" /* must come before any other headers */
+
+#include "libxl_internal.h"
+
+/*** drbd implementation ***/
+const int DRBD_SEND_CHECKPOINT = 20;
+const int DRBD_WAIT_CHECKPOINT_ACK = 30;
+
+struct libxl__remus_drbd_state {
+    libxl__ao *ao;
+    char *drbd_probe_script;
+};
+
+typedef struct libxl__remus_drbd_disk {
+    int ctl_fd;
+    int ackwait;
+    const char *path;
+} libxl__remus_drbd_disk;
+
+/*----- helper functions, for async calls -----*/
+static void drbd_async_call(libxl__remus_device *dev,
+                            void func(libxl__remus_device *),
+                            libxl__ev_child_callback callback)
+{
+    int pid = -1;
+    STATE_AO_GC(dev->rds->ao);
+
+    /* Fork and call */
+    pid = libxl__ev_child_fork(gc, &dev->child, callback);
+    if (pid == -1) {
+        LOG(ERROR, "unable to fork");
+        goto out;
+    }
+
+    if (!pid) {
+        /* child */
+        func(dev);
+        /* notreached */
+        abort();
+    }
+
+    return;
+
+out:
+    dev->callback(dev->rds->egc, dev, ERROR_FAIL);
+}
+
+/*----- init() and destroy() -----*/
+static int drbd_init(libxl__remus_device_state *rds)
+{
+    libxl__remus_drbd_state *drbd_state;
+
+    STATE_AO_GC(rds->ao);
+
+    GCNEW(drbd_state);
+    CTX->drbd_state = drbd_state;
+    drbd_state->ao = ao;
+    drbd_state->drbd_probe_script = GCSPRINTF("%s/block-drbd-probe",
+                                              libxl__xen_script_dir_path());
+
+    return 0;
+}
+
+static void drbd_destroy(libxl__remus_device_state *rds)
+{
+    return;
+}
+
+/*----- match(), setup() and teardown() -----*/
+
+/* callbacks */
+static void match_async_exec_cb(libxl__egc *egc,
+                                libxl__async_exec_state *aes,
+                                int status);
+
+/* implementations */
+
+static void match_async_exec(libxl__egc *egc, libxl__remus_device *dev)
+{
+    int arraysize, nr = 0;
+    const libxl_device_disk *disk = dev->backend_dev;
+    libxl__async_exec_state *aes = &dev->aes;
+    STATE_AO_GC(dev->rds->ao);
+
+    libxl__remus_drbd_state *drbd_state = CTX->drbd_state;
+    /* setup env & args */
+    arraysize = 1;
+    GCNEW_ARRAY(aes->env, arraysize);
+    aes->env[nr++] = NULL;
+    assert(nr <= arraysize);
+
+    arraysize = 3;
+    nr = 0;
+    GCNEW_ARRAY(aes->args, arraysize);
+    aes->args[nr++] = drbd_state->drbd_probe_script;
+    aes->args[nr++] = disk->pdev_path;
+    aes->args[nr++] = NULL;
+    assert(nr <= arraysize);
+
+    aes->ao = drbd_state->ao;
+    aes->what = GCSPRINTF("%s %s", aes->args[0], aes->args[1]);
+    aes->timeout_ms = LIBXL_HOTPLUG_TIMEOUT * 1000;
+    aes->callback = match_async_exec_cb;
+    aes->stdfds[0] = -1;
+    aes->stdfds[1] = -1;
+    aes->stdfds[2] = -1;
+
+    if (libxl__async_exec_start(gc, aes))
+        goto out;
+
+    return;
+
+out:
+    dev->callback(egc, dev, ERROR_FAIL);
+}
+
+static void drbd_match(libxl__remus_device *dev)
+{
+    match_async_exec(dev->rds->egc, dev);
+}
+
+static void match_async_exec_cb(libxl__egc *egc,
+                                libxl__async_exec_state *aes,
+                                int status)
+{
+    libxl__remus_device *dev = CONTAINER_OF(aes, *dev, aes);
+
+    if (status) {
+        dev->callback(egc, dev, ERROR_REMUS_DEVOPS_NOT_MATCH);
+    } else {
+        dev->callback(egc, dev, 0);
+    }
+}
+
+static void drbd_setup(libxl__remus_device *dev)
+{
+    libxl__remus_drbd_disk *drbd_disk;
+    const libxl_device_disk *disk = dev->backend_dev;
+    STATE_AO_GC(dev->rds->ao);
+
+    GCNEW(drbd_disk);
+    dev->data = drbd_disk;
+    drbd_disk->path = disk->pdev_path;
+    drbd_disk->ackwait = 0;
+    drbd_disk->ctl_fd = open(drbd_disk->path, O_RDONLY);
+    if (drbd_disk->ctl_fd < 0)
+        dev->callback(dev->rds->egc, dev, ERROR_FAIL);
+    else
+        dev->callback(dev->rds->egc, dev, 0);
+}
+
+static void drbd_teardown(libxl__remus_device *dev)
+{
+    libxl__remus_drbd_disk *drbd_disk = dev->data;
+
+    close(drbd_disk->ctl_fd);
+    dev->callback(dev->rds->egc, dev, 0);
+}
+
+/*----- checkpointing APIs -----*/
+
+/* callbacks */
+static void chekpoint_async_call_done(libxl__egc *egc,
+                                      libxl__ev_child *child,
+                                      pid_t pid, int status);
+
+/* API implementations */
+
+/* this op will not wait and block, so implement as sync op */
+static void drbd_postsuspend(libxl__remus_device *dev)
+{
+    libxl__remus_drbd_disk *rdd = dev->data;
+
+    if (!rdd->ackwait) {
+        if (ioctl(rdd->ctl_fd, DRBD_SEND_CHECKPOINT, 0) <= 0)
+            rdd->ackwait = 1;
+    }
+
+    dev->callback(dev->rds->egc, dev, 0);
+}
+
+static void drbd_preresume_async(libxl__remus_device *dev)
+{
+    libxl__remus_drbd_disk *rdd = dev->data;
+    int ackwait = rdd->ackwait;
+
+    if (ackwait) {
+        ioctl(rdd->ctl_fd, DRBD_WAIT_CHECKPOINT_ACK, 0);
+        ackwait = 0;
+    }
+
+    _exit(ackwait);
+}
+
+static void drbd_preresume(libxl__remus_device *dev)
+{
+    drbd_async_call(dev, drbd_preresume_async, chekpoint_async_call_done);
+}
+
+static void chekpoint_async_call_done(libxl__egc *egc,
+                                      libxl__ev_child *child,
+                                      pid_t pid, int status)
+{
+    libxl__remus_device *dev = CONTAINER_OF(child, *dev, child);
+    libxl__remus_drbd_disk *rdd = dev->data;
+    STATE_AO_GC(dev->rds->ao);
+
+    if (WIFEXITED(status)) {
+        rdd->ackwait = WEXITSTATUS(status);
+        dev->callback(egc, dev, 0);
+    } else {
+        dev->callback(egc, dev, ERROR_FAIL);
+    }
+}
+
+const libxl__remus_device_subkind_ops remus_device_drbd_disk = {
+    .kind = LIBXL__REMUS_DEVICE_DISK,
+    .init = drbd_init,
+    .destroy = drbd_destroy,
+    .match = drbd_match,
+    .setup = drbd_setup,
+    .teardown = drbd_teardown,
+    .postsuspend = drbd_postsuspend,
+    .preresume = drbd_preresume,
+};
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v16 5/7] libxl: network buffering cmdline switch
  2014-07-15  6:46 [PATCH v16 0/7] Remus/Libxl: Remus network buffering and drbd disk Yang Hongyang
                   ` (3 preceding siblings ...)
  2014-07-15  6:46 ` [PATCH v16 4/7] remus drbd: Implement remus drbd replicated disk Yang Hongyang
@ 2014-07-15  6:46 ` Yang Hongyang
  2014-07-15  6:46 ` [PATCH v16 6/7] libxl: disk " Yang Hongyang
  2014-07-15  6:46 ` [PATCH v16 7/7] MAINTAINERS: Update maintained files of REMUS Yang Hongyang
  6 siblings, 0 replies; 19+ messages in thread
From: Yang Hongyang @ 2014-07-15  6:46 UTC (permalink / raw)
  To: xen-devel
  Cc: ian.campbell, wency, andrew.cooper3, yunhong.jiang, ian.jackson,
	eddie.dong, rshriram, laijs

Command line switch to 'xl remus' command.
Also update man pages to reflect the addition of a new option to
'xl remus' command.

Note: the network buffering is enabled as default. If you want to
disable it, please use -n option.

Signed-off-by: Shriram Rajagopalan <rshriram@cs.ubc.ca>
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Reviewed-by: Wen Congyang <wency@cn.fujitsu.com>
---
 docs/man/xl.conf.pod.5    |  6 ++++++
 docs/man/xl.pod.1         | 11 ++++++++++-
 tools/libxl/xl.c          |  4 ++++
 tools/libxl/xl.h          |  1 +
 tools/libxl/xl_cmdimpl.c  | 28 ++++++++++++++++++++++------
 tools/libxl/xl_cmdtable.c |  3 +++
 6 files changed, 46 insertions(+), 7 deletions(-)

diff --git a/docs/man/xl.conf.pod.5 b/docs/man/xl.conf.pod.5
index 7c43bde..8ae19bb 100644
--- a/docs/man/xl.conf.pod.5
+++ b/docs/man/xl.conf.pod.5
@@ -105,6 +105,12 @@ Configures the default gateway device to set for virtual network devices.
 
 Default: C<None>
 
+=item B<remus.default.netbufscript="PATH">
+
+Configures the default script used by Remus to setup network buffering.
+
+Default: C</etc/xen/scripts/remus-netbuf-setup>
+
 =item B<output_format="json|sxp">
 
 Configures the default output format used by xl when printing "machine
diff --git a/docs/man/xl.pod.1 b/docs/man/xl.pod.1
index 30bd4bf..e9e4a83 100644
--- a/docs/man/xl.pod.1
+++ b/docs/man/xl.pod.1
@@ -431,7 +431,7 @@ Enable Remus HA for domain. By default B<xl> relies on ssh as a transport
 mechanism between the two hosts.
 
 N.B: Remus support in xl is still in experimental (proof-of-concept) phase.
-     There is no support for network or disk buffering at the moment.
+     Disk buffering support is limited to drbd disks.
 
 B<OPTIONS>
 
@@ -450,6 +450,15 @@ Generally useful for debugging.
 
 Disable memory checkpoint compression.
 
+=item B<-n>
+
+Disable network output buffering.
+
+=item B<-N> I<netbufscript>
+
+Use <netbufscript> to setup network buffering instead of the instead of
+the default (/etc/xen/scripts/remus-netbuf-setup).
+
 =item B<-s> I<sshcommand>
 
 Use <sshcommand> instead of ssh.  String will be passed to sh.
diff --git a/tools/libxl/xl.c b/tools/libxl/xl.c
index 4c5a5ee..f014306 100644
--- a/tools/libxl/xl.c
+++ b/tools/libxl/xl.c
@@ -44,6 +44,7 @@ char *default_vifscript = NULL;
 char *default_bridge = NULL;
 char *default_gatewaydev = NULL;
 char *default_vifbackend = NULL;
+char *default_remus_netbufscript = NULL;
 enum output_format default_output_format = OUTPUT_FORMAT_JSON;
 int claim_mode = 1;
 bool progress_use_cr = 0;
@@ -176,6 +177,9 @@ static void parse_global_config(const char *configfile,
     if (!xlu_cfg_get_long (config, "claim_mode", &l, 0))
         claim_mode = l;
 
+    xlu_cfg_replace_string (config, "remus.default.netbufscript",
+        &default_remus_netbufscript, 0);
+
     xlu_cfg_destroy(config);
 }
 
diff --git a/tools/libxl/xl.h b/tools/libxl/xl.h
index 10a2e66..087eb8c 100644
--- a/tools/libxl/xl.h
+++ b/tools/libxl/xl.h
@@ -170,6 +170,7 @@ extern char *default_vifscript;
 extern char *default_bridge;
 extern char *default_gatewaydev;
 extern char *default_vifbackend;
+extern char *default_remus_netbufscript;
 extern char *blkdev_start;
 
 enum output_format {
diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
index 68df548..b324f34 100644
--- a/tools/libxl/xl_cmdimpl.c
+++ b/tools/libxl/xl_cmdimpl.c
@@ -7175,8 +7175,9 @@ int main_remus(int argc, char **argv)
     r_info.interval = 200;
     r_info.blackhole = 0;
     r_info.compression = 1;
+    r_info.netbuf = 1;
 
-    SWITCH_FOREACH_OPT(opt, "bui:s:e", NULL, "remus", 2) {
+    SWITCH_FOREACH_OPT(opt, "buni:s:N:e", NULL, "remus", 2) {
     case 'i':
         r_info.interval = atoi(optarg);
         break;
@@ -7186,6 +7187,12 @@ int main_remus(int argc, char **argv)
     case 'u':
         r_info.compression = 0;
         break;
+    case 'n':
+        r_info.netbuf = 0;
+        break;
+    case 'N':
+        r_info.netbufscript = optarg;
+        break;
     case 's':
         ssh_command = optarg;
         break;
@@ -7197,6 +7204,9 @@ int main_remus(int argc, char **argv)
     domid = find_domain(argv[optind]);
     host = argv[optind + 1];
 
+    if (!r_info.netbufscript)
+        r_info.netbufscript = default_remus_netbufscript;
+
     if (r_info.blackhole) {
         send_fd = open("/dev/null", O_RDWR, 0644);
         if (send_fd < 0) {
@@ -7234,13 +7244,19 @@ int main_remus(int argc, char **argv)
     /* Point of no return */
     rc = libxl_domain_remus_start(ctx, &r_info, domid, send_fd, recv_fd, 0);
 
-    /* If we are here, it means backup has failed/domain suspend failed.
-     * Try to resume the domain and exit gracefully.
-     * TODO: Split-Brain check.
+    /* check if the domain exists. User may have xl destroyed the
+     * domain to force failover
      */
-    fprintf(stderr, "remus sender: libxl_domain_suspend failed"
-            " (rc=%d)\n", rc);
+    if (libxl_domain_info(ctx, 0, domid)) {
+        fprintf(stderr, "Remus: Primary domain has been destroyed.\n");
+        close(send_fd);
+        return 0;
+    }
 
+    /* If we are here, it means remus setup/domain suspend/backup has
+     * failed. Try to resume the domain and exit gracefully.
+     * TODO: Split-Brain check.
+     */
     if (rc == ERROR_GUEST_TIMEDOUT)
         fprintf(stderr, "Failed to suspend domain at primary.\n");
     else {
diff --git a/tools/libxl/xl_cmdtable.c b/tools/libxl/xl_cmdtable.c
index 4279b9f..3f7520d 100644
--- a/tools/libxl/xl_cmdtable.c
+++ b/tools/libxl/xl_cmdtable.c
@@ -487,6 +487,9 @@ struct cmd_spec cmd_table[] = {
       "-i MS                   Checkpoint domain memory every MS milliseconds (def. 200ms).\n"
       "-b                      Replicate memory checkpoints to /dev/null (blackhole)\n"
       "-u                      Disable memory checkpoint compression.\n"
+      "-n                      Disable network output buffering.\n"
+      "-N <netbufscript>       Use netbufscript to setup network buffering instead of the\n"
+      "                        instead of the default (/etc/xen/scripts/remus-netbuf-setup).\n"
       "-s <sshcommand>         Use <sshcommand> instead of ssh.  String will be passed\n"
       "                        to sh. If empty, run <host> instead of \n"
       "                        ssh <host> xl migrate-receive -r [-e]\n"
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v16 6/7] libxl: disk buffering cmdline switch
  2014-07-15  6:46 [PATCH v16 0/7] Remus/Libxl: Remus network buffering and drbd disk Yang Hongyang
                   ` (4 preceding siblings ...)
  2014-07-15  6:46 ` [PATCH v16 5/7] libxl: network buffering cmdline switch Yang Hongyang
@ 2014-07-15  6:46 ` Yang Hongyang
  2014-07-15  6:46 ` [PATCH v16 7/7] MAINTAINERS: Update maintained files of REMUS Yang Hongyang
  6 siblings, 0 replies; 19+ messages in thread
From: Yang Hongyang @ 2014-07-15  6:46 UTC (permalink / raw)
  To: xen-devel
  Cc: ian.campbell, wency, andrew.cooper3, yunhong.jiang, ian.jackson,
	eddie.dong, rshriram, laijs

disk buffering is enabled as default.
add an option "-d" to disable it.

Signed-off-by: Yang Hongyang <yanghy@cn.fujitsu.com>
---
 docs/man/xl.pod.1                | 4 ++++
 tools/libxl/libxl.c              | 1 +
 tools/libxl/libxl_internal.h     | 1 +
 tools/libxl/libxl_remus_device.c | 3 ++-
 tools/libxl/libxl_types.idl      | 1 +
 tools/libxl/xl_cmdimpl.c         | 6 +++++-
 tools/libxl/xl_cmdtable.c        | 1 +
 7 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/docs/man/xl.pod.1 b/docs/man/xl.pod.1
index e9e4a83..547e8a2 100644
--- a/docs/man/xl.pod.1
+++ b/docs/man/xl.pod.1
@@ -459,6 +459,10 @@ Disable network output buffering.
 Use <netbufscript> to setup network buffering instead of the instead of
 the default (/etc/xen/scripts/remus-netbuf-setup).
 
+=item B<-d>
+
+Disable disk buffering.
+
 =item B<-s> I<sshcommand>
 
 Use <sshcommand> instead of ssh.  String will be passed to sh.
diff --git a/tools/libxl/libxl.c b/tools/libxl/libxl.c
index 00b298a..f0b86fd 100644
--- a/tools/libxl/libxl.c
+++ b/tools/libxl/libxl.c
@@ -844,6 +844,7 @@ int libxl_domain_remus_start(libxl_ctx *ctx, libxl_domain_remus_info *info,
     rds->ao = ao;
     rds->egc = egc;
     rds->domid = domid;
+    rds->diskbuf = info->diskbuf;
     rds->callback = libxl__remus_setup_done;
     rds->ops = remus_ops;
 
diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
index ef6bd88..441a51f 100644
--- a/tools/libxl/libxl_internal.h
+++ b/tools/libxl/libxl_internal.h
@@ -2582,6 +2582,7 @@ struct libxl__remus_device_state {
     /* the last ops must be NULL */
     const libxl__remus_device_subkind_ops **ops;
     const char *netbufscript;
+    bool diskbuf;
 
     /* private */
     /* devices that have been set up */
diff --git a/tools/libxl/libxl_remus_device.c b/tools/libxl/libxl_remus_device.c
index fd2fe02..c9df7f0 100644
--- a/tools/libxl/libxl_remus_device.c
+++ b/tools/libxl/libxl_remus_device.c
@@ -89,7 +89,8 @@ void libxl__remus_devices_setup(libxl__egc *egc, libxl__remus_device_state *rds)
     if (rds->netbufscript)
         rds->nics = libxl_device_nic_list(CTX, rds->domid, &rds->num_nics);
 
-    rds->disks = libxl_device_disk_list(CTX, rds->domid, &rds->num_disks);
+    if (rds->diskbuf)
+        rds->disks = libxl_device_disk_list(CTX, rds->domid, &rds->num_disks);
 
     if (rds->num_nics == 0 && rds->num_disks == 0)
         goto out;
diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl
index a96771e..6551109 100644
--- a/tools/libxl/libxl_types.idl
+++ b/tools/libxl/libxl_types.idl
@@ -592,6 +592,7 @@ libxl_domain_remus_info = Struct("domain_remus_info",[
     ("compression",  bool),
     ("netbuf",       bool),
     ("netbufscript", string),
+    ("diskbuf",      bool),
     ])
 
 libxl_event_type = Enumeration("event_type", [
diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
index b324f34..167b65b 100644
--- a/tools/libxl/xl_cmdimpl.c
+++ b/tools/libxl/xl_cmdimpl.c
@@ -7176,8 +7176,9 @@ int main_remus(int argc, char **argv)
     r_info.blackhole = 0;
     r_info.compression = 1;
     r_info.netbuf = 1;
+    r_info.diskbuf = 1;
 
-    SWITCH_FOREACH_OPT(opt, "buni:s:N:e", NULL, "remus", 2) {
+    SWITCH_FOREACH_OPT(opt, "bundi:s:N:e", NULL, "remus", 2) {
     case 'i':
         r_info.interval = atoi(optarg);
         break;
@@ -7193,6 +7194,9 @@ int main_remus(int argc, char **argv)
     case 'N':
         r_info.netbufscript = optarg;
         break;
+    case 'd':
+        r_info.diskbuf = 0;
+        break;
     case 's':
         ssh_command = optarg;
         break;
diff --git a/tools/libxl/xl_cmdtable.c b/tools/libxl/xl_cmdtable.c
index 3f7520d..246aa11 100644
--- a/tools/libxl/xl_cmdtable.c
+++ b/tools/libxl/xl_cmdtable.c
@@ -490,6 +490,7 @@ struct cmd_spec cmd_table[] = {
       "-n                      Disable network output buffering.\n"
       "-N <netbufscript>       Use netbufscript to setup network buffering instead of the\n"
       "                        instead of the default (/etc/xen/scripts/remus-netbuf-setup).\n"
+      "-d                      Disable disk buffering.\n"
       "-s <sshcommand>         Use <sshcommand> instead of ssh.  String will be passed\n"
       "                        to sh. If empty, run <host> instead of \n"
       "                        ssh <host> xl migrate-receive -r [-e]\n"
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v16 7/7] MAINTAINERS: Update maintained files of REMUS
  2014-07-15  6:46 [PATCH v16 0/7] Remus/Libxl: Remus network buffering and drbd disk Yang Hongyang
                   ` (5 preceding siblings ...)
  2014-07-15  6:46 ` [PATCH v16 6/7] libxl: disk " Yang Hongyang
@ 2014-07-15  6:46 ` Yang Hongyang
  6 siblings, 0 replies; 19+ messages in thread
From: Yang Hongyang @ 2014-07-15  6:46 UTC (permalink / raw)
  To: xen-devel
  Cc: ian.campbell, wency, andrew.cooper3, yunhong.jiang, ian.jackson,
	eddie.dong, rshriram, laijs

Update maintained files of REMUS.

Signed-off-by: Yang Hongyang <yanghy@cn.fujitsu.com>
---
 MAINTAINERS | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 266e47b..9c407dc 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -262,6 +262,11 @@ S:	Maintained
 F:	docs/README.remus
 F:	tools/blktap2/drivers/block-remus.c
 F:	tools/blktap2/drivers/hashtable*
+F:	tools/libxl/libxl_remus_*
+F:	tools/libxl/libxl_netbuffer.c
+F:	tools/libxl/libxl_nonetbuffer.c
+F:	tools/hotplug/Linux/remus-netbuf-setup
+F:	tools/hotplug/Linux/block-drbd-probe
 
 SCHEDULING
 M:	George Dunlap <george.dunlap@eu.citrix.com>
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH v16 2/7] remus: introduce remus device
  2014-07-15  6:46 ` [PATCH v16 2/7] remus: introduce remus device Yang Hongyang
@ 2014-07-15 18:28   ` Ian Jackson
  2014-07-16 10:50     ` Hongyang Yang
  0 siblings, 1 reply; 19+ messages in thread
From: Ian Jackson @ 2014-07-15 18:28 UTC (permalink / raw)
  To: Yang Hongyang
  Cc: laijs, wency, andrew.cooper3, yunhong.jiang, eddie.dong,
	xen-devel, rshriram, ian.campbell

Yang Hongyang writes ("[PATCH v16 2/7] remus: introduce remus device"):
> introduce remus device, an abstract layer of remus devices(nic, disk,
> etc).It provides the following APIs for libxl:
...

Thanks.  I've gone into a lot more detail here.  Many of my changes
are wording changes to comments, since it's probably easier for me to
simply provide you with the better wording.  Please feel free to cut
and paste them.  If they seem inaccurate then I have misunderstood
something :-).

For the comments,
  Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>



> --- a/tools/libxl/libxl.h
> +++ b/tools/libxl/libxl.h
...
> +/*
> + * LIBXL_HAVE_REMUS
> + * If this is defined, then libxl supports remus.
> + */
> +#define LIBXL_HAVE_REMUS 1
> +

It occurs to me to ask: would this be better put in at the end, given
that the code after (only) this patch has gone in is not really
useable because of the lack of proper device handling ?


> +static void remus_device_preresume_cb(libxl__egc *egc,
> +                                      libxl__remus_device_state *rds,
> +                                      int rc)
> +{
> +    int ok = 0;
> +    libxl__domain_suspend_state *dss = CONTAINER_OF(rds, *dss, rds);
> +    STATE_AO_GC(dss->ao);
> +
> +    if (!rc) {
> +        /* Resumes the domain and the device model */
> +        if (!libxl__domain_resume(gc, dss->domid, /* Fast Suspend */1))
> +            ok = 1;
> +    }
> +    libxl__xc_domain_saverestore_async_callback_done(egc, &dss->shs, ok);

Can we please use the standard `goto out' error handling style ?  Both
here and in remus_device_postsuspend_cb.

The right approach is this: At the top of the function, test rc, and
if nonzero log if necessary, and `goto out'.  The success path should
set `ok=1'.

You have it right in remus_checkpoint_dm_saved.

> +static void remus_device_commit_cb(libxl__egc *egc,
> +                                   libxl__remus_device_state *rds,
> +                                   int rc)
> +{
> +    libxl__domain_suspend_state *dss = CONTAINER_OF(rds, *dss, rds);
> +
> +    STATE_AO_GC(dss->ao);
> +
> +    if (rc) {
> +        LOG(ERROR, "Failed to do device commit op."
> +            " Terminating Remus..");
> +        goto out;
> +    } else {

This shouldn't be an else; that results in every-increasing nesting
depth.

> +        /* Set checkpoint interval timeout */
> +        rc = libxl__ev_time_register_rel(gc, &dss->checkpoint_timeout,
> +                                         remus_next_checkpoint,
> +                                         dss->interval);
> +        if (rc) {
> +            LOG(ERROR, "unable to register timeout for next epoch."
> +                " Terminating Remus..");
> +            goto out;

> +static void remus_next_checkpoint(libxl__egc *egc, libxl__ev_time *ev,
> +                                  const struct timeval *requested_abs)
> +{
> +    libxl__domain_suspend_state *dss =
> +                            CONTAINER_OF(ev, *dss, checkpoint_timeout);
> +
> +    STATE_AO_GC(dss->ao);
> +
> +    libxl__ev_time_deregister(gc, &dss->checkpoint_timeout);

This is not necessary.  libxl's timers are one-shot.

>      libxl__xc_domain_saverestore_async_callback_done(egc, &dss->shs, 1);
>  }

But this is a bit mystifying to me.  It seems that when you have
finished committing to the checkpoint, you delay by the
inter-checkpoint delay before returning.  Is that right ?  I think
this could at the very least do with some commentary explaining why...

>  
> @@ -1752,6 +1849,34 @@ static void domain_suspend_done(libxl__egc *egc,
>          xc_suspend_evtchn_release(CTX->xch, CTX->xce, domid,
>                             dss->guest_evtchn.port, &dss->guest_evtchn_lockfd);
>  
> +    if (dss->remus) {

The usual libxl idiom for this kind of conditional processing can be
seen in libxl_create.c.  Search, for example, for
"domcreate_bootloader_done".  You'll see that what we do is
this:

        if (restore_fd >= 0) {
            LOG(DEBUG, "restoring, not running bootloader\n");
                    domcreate_bootloader_done(egc, &dcs->bl, 0);

So I think what you should do here is:

      if (!dss->remus) {
          remus_teardown_done(egc,rds,0;
          return;
      }

and then put the meat of the function outside the conditional.

The effect of this is that there is still only one place where
dss->callback is called.


> +        /*
> +         * With Remus, if we reach this point, it means either
> +         * backup died or some network error occurred preventing us
> +         * from sending checkpoints. Teardown the network buffers and
> +         * release netlink resources.  This is an async op.
> +         */
> +        LOGE(WARN, "Remus: Domain suspend terminated with rc %d,"
> +             " teardown Remus devices...", rc);
> +        dss->rds.callback = libxl__remus_teardown_done;

I don't understand why libxl__remus_teardown_done has an
inter-file-visible name.  Isn't it just a callback chaining function ?
If so I would call it remus_teardown_done or something.


> +typedef enum libxl__remus_device_kind {
> +    LIBXL__REMUS_DEVICE_NIC,
> +    LIBXL__REMUS_DEVICE_DISK,
> +} libxl__remus_device_kind;

AFAICT this enum is used only for filtering which subkind ops vtables
we attempt setup() on.  Maybe it would be better to remove this enum
and replace its uses in libxl__remus_devices_setup with references to
(lists of) vtables ?

Eg

   extern const libxl__remus_device_subkind_ops libxl__remus_subkinds_nic[];
   extern const libxl__remus_device_subkind_ops libxl__remus_subkinds_disk[];
   const libxl__remus_device_subkind_ops *libxl__remus_all_subkinds[] = {
          libxl__remus_subkinds_nic,
          libxl__remus_subkinds_disk,
          0,
   };

and then

    for (i = 0; i < rds->num_nics; i++) {
            libxl__remus_device_init(egc, rds,
                                     libxl__remus_device_subkinds_nic,
                                     &rds->nics[i]);
    }

or something.

> +struct libxl__remus_device_subkind_ops {
> +    /* the device kind this ops belongs to... */
> +    libxl__remus_device_kind kind;
> +
> +    /*
> +     * init() and destroy() APIs are produced by a device subkind and
> +     * consumed by the main remus code, a device subkind must implement

We would normally write, in English, that an API is "provided" and
"called", rather than "produced" and "consumed".  Objects may be
produced and consumed, but not functions.

Also this whole API has the property there that you attribute to these
functions.  So I would say:

        * These operations are provided by a device subkind and
        * called by the main remus code.  Functions must be
        * implemented unless otherwise stated.

at the top of the structure.

I would also move the replicated text about async functions to a
common section:

        * Many of these functions are asynchronous.  They call
        * dev->callback when done.  The actual implementations may be
        * synchronously and call dev->callback directly (as the last
        * thing they do).

> +     * init() and destroy() APIs are produced by a device subkind and
> +     * consumed by the main remus code, a device subkind must implement
> +     * these two APIs.

What you really mean is that these init and destroy a subkind's data.
So I would rename `destroy' to `cleanup' and say:

        * init() and cleanup() relate to the subkind-specific state in
        * the libxl ctx, not to any specific device.

Although, now I have read later in this file, it seems that this
subkind-specific state is stored in the per-remus-operation state
structure, not in the libxl_ctx.

I see from (for example) that libxl_netbuffer.c uses a netlink socket
per remus operation.  Would it be possible to share a netlink socket
between operations ?  I seem to remember this was a static global
variable in a previous version of the series, which suggests it would
be possible.

If it is, why not put it in the libxl_ctx ?


And of course:

        * Synchronous.  cleanup() cannot fail.

> +    /*
> +     * checkpoint callbacks, these are async ops, . These function pointers may be NULL, means the op is
> +     * not implemented, and it will do nothing when checkpoint.
> +     * The callers of these APIs must check the function pointer first.
> +     * 

     * Checkpoint operations.  May be NULL, meaning the op is not
     * implemented and the caller should treat them as a no-op (and do
     * nothing when checkpointing).

And of course:

     * Asynchronous.

> +    void (*postsuspend)(libxl__remus_device *dev);
> +    void (*preresume)(libxl__remus_device *dev);
> +    void (*commit)(libxl__remus_device *dev);

> +    /*
> +     * This API determines whether the subkind matchs the specific
> +     * device. In the implementation, we first init all device
> +     * subkind, for example, NIC, DRBD disk... Then we will find
> +     * out the libxl devices, and match the device with the
> +     * subkind, if the device is a drbd disk, then it will be
> +     * matched with DRBD subkind, and the further ops(such as
> +     * checkpoint etc.)  of this device will using DRBD subkind
> +     * ops. This API is mainly for disks, because we must use an
> +     * external script to determine whether a libxl_disk is a DRBD
> +     * disk.  This function pointer may be NULL. That means this
> +     * *kind* of device's ops is always matched with the *kind* of
> +     * device.  It's an async op and must be implemented
> +     * asynchronously, call dev->callback when done.
> +     */

     * Determines whether the subkind matches the specific device.
     * init() will have been called for the subkind.  If match()
     * succeeds, the device will then be managed with this set of
     * subkind operations.  Asynchronous.  Yields 0 if the device
     * matches, or REMUS_DEVOPS_NOT_MATCH if it does not; any other rc
     * indicates failure.  May be NULL, meaning this subkind matches
     * any device.

> +    void (*match)(libxl__remus_device *dev);

> +    /*
> +     * setup() and teardown() are refer to the actual remus device,
> +     * a device subkind must implement these two APIs. They are async
> +     * ops, and call dev->callback when done.
> +     * These callbacks can be implemented synchronously, call
> +     * dev->callback at last directly.
> +     */

Having dealt with the boilerplatey bits we can simply write:

     * setup() and teardown() are refer to the actual remus device.
     * Asynchronous.

I seem to remember asking you to document that:

     * teardown is called even if setup fails.

> +/*
> + * This structure is for remus device layer, it records remus devices
> + * that have been set up.

I don't think this struct is entirely for the remus device layer,
since only part of it is "private".

    * State associated with a remus invocation, including parameters
    * passed to the remus abstract device layer by the generic
    * (non-remus) save/restore machinery.

> + */
> +struct libxl__remus_device_state {
> +    /* must set by caller of libxl__remus_device_(setup|teardown) */

You mean
      must be set by [etc]

> +    libxl__ao *ao;
> +    libxl__egc *egc;
> +    uint32_t domid;
> +    libxl__remus_callback *callback;
> +    /* the last ops must be NULL */
> +    const libxl__remus_device_subkind_ops **ops;

Surely it shouldn't be the non-remus code's responsibility to provide
the vtable list ?  In fact I don't really understand why this needs to
be a parameter at all.  Isn't it always the same ?

> +    /* private */
> +    /* devices that have been set up */
> +    int saved_rc;
> +    libxl__remus_device **dev;

Please make the sectional comments like "private" and "must be set..."
stand out more.  I seem to remember asking this before (about a
different struct perhaps?)

The 2nd comment there can't be right.  Perhaps it refers to dev.

But if dev is the set of devices that have been set up, what form is
it in ?  I think it's an array and that the counters num_devices and
num_set_up are associated with it.

If so you need to bring dev and these sizes/counts together, and write
much more informative commentary.  To give a specific example:

> +    /* for counting devices that have been set up */
> +    int num_set_up;

The variable is called "num_set_up".  And everything here is devices.
So the variable must be a "num[ber]" of devices which have been "set
up".  That's the same as "counting" the devices which have been set
up.

So the comment tells me nothing that the variable name doesn't.

But together these comments leave many important questions unanswered.
What is the size of the devs array ?  What proportion of the devs
array is initialised at any one time ?  May the devs array contain
null pointers and what do they mean ?  etc.

> +typedef void libxl__remus_device_callback(libxl__egc *,
> +                                          libxl__remus_device *,
> +                                          int rc);
> +/*
> + * This structure is init and setup by remus device abstruct layer,
> + * and pass to remus device ops

Instead,

     Information about a single device being handled by remus.
     Allocated by the remus abstract layer.

It's not all "init" by the abstract layer - your more detailed
comments inside the struct explain that nicely.

> + */
> +struct libxl__remus_device {
> +    /*----- shared between abstract and concrete layers -----*/
> +    /* set by remus device abstruct layer */

I would say
        * all set by remus device abstruct layer */
and add a blank line after it.

That makes it clear that the 2nd comment line applies to the whole
section, not just the single next line.

> +    /* libxl__device_* which this remus device related to */
> +    const void *backend_dev;
> +    libxl__remus_device_kind kind;
> +
> +    /*----- private for abstract layer only -----*/
> +    /*
> +     * we must go through all device ops until we find a matched ops
> +     * for the device.
> +     */

This last comment belongs in the code, if anywhere, but I suspect it
is obvious.  Perhaps what you really mean is

        * Control and state variables for the asynchronous callback
        * based loops which iterate over device subkinds, and over
        * individual devices.

?

> +    int ops_index;
> +    const libxl__remus_device_subkind_ops *ops;
> +    libxl__remus_device_callback *callback;
> +    libxl__remus_device_state *rds;

> +    /*----- private for concrete (device-specific) layer -----*/
> +    /* *kind* of device's private data */
> +    void *data;

Again, the /*----- -----*/ comment would benefit from a following
blank line.

It would be better to change the variable name than write the
information in the comment.  Is there some reason why data shouldn't
be `concrete_data' or something ?

I'm not sure why "*kind*" is emphasized in the comment.
;2~
> +    /* for calling scripts, eg. setup|teardown|match scripts */
> +    libxl__async_exec_state aes;
> +    /*
> +     * for async func calls, in the implementation of device ops, we
> +     * may use fork to do async ops. this is owned by device-specific
> +     * ops methods
> +     */
> +    libxl__ev_child child;

I still think these variables would be better in the subkind private
data structs.  Having them here is a layering violation.  The fact
that each of the subkinds needs these is a coincidence, not part of
the design.  So it is fine to write them out twice for that reason.

And I'm not sure how helpful the comments are.  It's fairly obvious
that a libxl__async_exec_state is for running scripts, for example.
But maybe if these were in the concrete code it would be possible to
make the comments more specific.

> --- /dev/null
> +++ b/tools/libxl/libxl_remus_device.c
> @@ -0,0 +1,339 @@
> +/*
> + * Copyright (C) 2014
> + * Author: Yang Hongyang <yanghy@cn.fujitsu.com>

I think you need to specify who owns the copyright.  Probably
Fujitsu ?  Although this will depend on your contractual situation.

> +/*----- helper functions -----*/
> +
> +static int init_device_subkind(libxl__remus_device_state *rds)
> +{
> +    int rc;
> +    const libxl__remus_device_subkind_ops **ops;
> +
> +    for (ops = rds->ops; *ops; ops++) {
> +        rc = (*ops)->init(rds);

Why not call init immediately before match() ?  That would do away
with this loop entirely.

Furthermore, this seems to imply that init is idempotent.  This isn't
spelled out in your API comments.

If init is idempotent, and is going to be called before match() every
time, why not fold it into match ?

> +        if (rc) {
> +            goto out;
> +        }

If you keep this, there is no need for the braces here.  Indeed you
can put `goto out' on the same line.

> +/* remus device setup and teardown */
> +
> +static __attribute__((unused)) void libxl__remus_device_init(libxl__egc *egc,
> +                                     libxl__remus_device_state *rds,
> +                                     libxl__remus_device_kind kind,
> +                                     void *libxl_dev);
> +void libxl__remus_devices_setup(libxl__egc *egc, libxl__remus_device_state *rds)
> +{

Missing blank line between the two declarations.

If libxl__remus_device_init is static, why does it have a
globally-visible name ?

> +    STATE_AO_GC(rds->ao);
> +
> +    if (!rds->ops[0])
> +        goto out;

Why the special case ?  AFAICT init_device_subkind would do the right
thing if called anyway.

> +static __attribute__((unused)) void libxl__remus_device_init(libxl__egc *egc,
> +                                     libxl__remus_device_state *rds,
> +                                     libxl__remus_device_kind kind,
> +                                     void *libxl_dev)
> +{
> +    libxl__remus_device *dev = NULL;
> +
> +    STATE_AO_GC(rds->ao);
> +    GCNEW(dev);
> +    dev->backend_dev = libxl_dev;
> +    dev->kind = kind;
> +    dev->rds = rds;
> +
> +    libxl__async_exec_init(&dev->aes);
> +    libxl__ev_child_init(&dev->child);
> +
> +    /* match the ops begin */
> +    dev->ops_index = 0;
> +    dev->ops = rds->ops[dev->ops_index];
> +    for (; dev->ops; dev->ops = rds->ops[++dev->ops_index]) {
> +        if (dev->ops->kind == dev->kind) {
> +            if (dev->ops->match) {
> +                dev->callback = device_match_cb;
> +                dev->ops->match(dev);
> +            } else {
> +                /*
> +                 * This devops do not have match() implementation.
> +                 * That means this *kind* of device's ops is always
> +                 * matched with the *kind* of device.
> +                 */
> +                dev->callback = device_setup_cb;
> +                dev->ops->setup(dev);
> +            }

This would be much less confusing if you had this !match case call
device_match_cb with rc=0 instead of directly calling setup.

> +static void device_match_cb(libxl__egc *egc,
> +                            libxl__remus_device *dev,
> +                            int rc)
> +{
> +    libxl__remus_device_state *const rds = dev->rds;
> +
> +    STATE_AO_GC(rds->ao);
> +
> +    if (rds->saved_rc) {
> +        /* there's already an error happened, we do not need to continue */
> +        rds->num_devices++;
> +        if (all_devices_handled(rds))
> +            rds->callback(egc, rds, rds->saved_rc);
> +        return;
> +    }
> +
> +    if (rc) {
> +        /* the ops does not match, try next ops */

But what if rc==ERROR_MY_HEAD_EXPLODED ?

> +        dev->ops = rds->ops[++dev->ops_index];
> +        if (!dev->ops || rc != ERROR_REMUS_DEVOPS_NOT_MATCH) {
> +            /* the device can not be matched */
> +            rds->num_devices++;
> +            rds->saved_rc = ERROR_REMUS_DEVICE_NOT_SUPPORTED;
> +            if (all_devices_handled(rds))
> +                rds->callback(egc, rds, rds->saved_rc);
> +            return;
> +        }
> +        for ( ; dev->ops; dev->ops = rds->ops[++dev->ops_index]) {
> +            if (dev->ops->kind == dev->kind) {

The logic here is tangled and duplicated.  Please arrange to have only
one instance of this loop.

The easiest way is probably to have libxl__remus_device_init set
ops_index to -1 or something and call device_match_cb.  Then
device_match_cb can do all the work.

> +static void device_setup_cb(libxl__egc *egc,
> +                            libxl__remus_device *dev,
> +                            int rc)
> +{
> +    /* Convenience aliases */
> +    libxl__remus_device_state *const rds = dev->rds;
> +
> +    STATE_AO_GC(rds->ao);
> +
> +    rds->num_devices++;
> +    /*
> +     * the netbuf script was designed as below:
> +     * 1. when setup failed, the script won't teardown the device itself.
> +     * 2. the teardown op is ok to be executed many times.

This information needs to be in the API commentary, and can then be
deleted here.

> +     * we add devices that have been set up to the array no matter
> +     * the setup process succeed or failed because we need to ensure
> +     * the device been teardown while setup failed. If any of the
> +     * device setup failed, we will quit remus, but before we exit,
> +     * we will teardown the devices that have been added to **dev
> +     */

This information would be better expressed in the commentary on the
state variables, as invariant(s) on the state.

> +    if (all_devices_handled(rds))
> +        rds->callback(egc, rds, rds->saved_rc);

This little snippet of code occurs many times.  Perhaps
all_devices_handled should include calling the callback and be called
"check_all_devices_handled" or "check_make_callback" or something.

> +void libxl__remus_devices_teardown(libxl__egc *egc, libxl__remus_device_state *rds)
> +{
> +    int i, num_set_up;
> +    libxl__remus_device *dev;
> +
> +    STATE_AO_GC(rds->ao);
> +
> +    rds->saved_rc = 0;
> +
> +    if (rds->num_set_up == 0) {
> +        destroy_device_subkind(rds);
> +        goto out;
> +    }

Again, you need to arrange for this unfolded loop to contain only one
copy of each bit of logic.

> +#define define_remus_device_checkpoint_api(api)                             \
> +void libxl__remus_devices_##api(libxl__egc *egc,                            \
> +                                libxl__remus_device_state *rds)             \

Can you please make the lines shorter ?  When I quote this, it comes
out looking like this:

...
> +    for (i = 0; i < rds->num_set_up; i++) {                        \
    \
> +        dev = rds->dev[i];                                         \
    \
> +        dev->callback = device_checkpoint_cb;                      \
    \
> +        if (dev->ops->api) {                                       \
    \
> +            dev->ops->api(dev);                                    \
    \

which is hard to read.


To comment on a slightly reformatted version:

> +void libxl__remus_devices_##api(libxl__egc *egc,                  \
> +                                libxl__remus_device_state *rds)   \
> +{                                                                 \
> +    int i;                                                        \
> +    libxl__remus_device *dev;                                     \
> +                                                                  \
> +    STATE_AO_GC(rds->ao);                                         \
> +                                                                  \
> +    rds->num_devices = 0;                                         \
> +    rds->saved_rc = 0;                                            \
> +                                                                  \
> +    if (rds->num_set_up == 0)                                     \
> +        goto out;                                                 \

I think if you did as I suggest with check_all_devices_handled, you
would end up calling check_all_devices_handled on exit from this
function and therefore wouldn't need the special case.

> +                                                                  \
> +    for (i = 0; i < rds->num_set_up; i++) {                       \
> +        dev = rds->dev[i];                                        \
> +        dev->callback = device_checkpoint_cb;                     \
> +        if (dev->ops->api) {                                      \
> +            dev->ops->api(dev);                                   \
> +        } else {                                                  \
> +            rds->num_devices++;                                   \
> +            if (rds->num_devices == rds->num_set_up)              \
> +                rds->callback(egc, rds, rds->saved_rc);           \
> +        }                                                         \
> +    }                                                             \
> +                                                                  \
> +    return;                                                       \
> +                                                                  \
> +out:                                                              \
> +    rds->callback(egc, rds, rds->saved_rc);                       \
> +}

This function is strikingly similar to the previous unfolded async
loops.

And it's also similar to libxl__ao_device and libxl__multidev_*.

Can any of this be factored out ?  Can we in fact reuse
libxl__ao_device and libxl__multidev_* ?


Thanks,
Ian.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v16 2/7] remus: introduce remus device
  2014-07-15 18:28   ` Ian Jackson
@ 2014-07-16 10:50     ` Hongyang Yang
  2014-07-16 11:46       ` Ian Jackson
  0 siblings, 1 reply; 19+ messages in thread
From: Hongyang Yang @ 2014-07-16 10:50 UTC (permalink / raw)
  To: Ian Jackson
  Cc: laijs, wency, andrew.cooper3, yunhong.jiang, eddie.dong,
	xen-devel, rshriram, ian.campbell



On 07/16/2014 02:28 AM, Ian Jackson wrote:
> Yang Hongyang writes ("[PATCH v16 2/7] remus: introduce remus device"):
>> introduce remus device, an abstract layer of remus devices(nic, disk,
>> etc).It provides the following APIs for libxl:
> ...
>
> Thanks.  I've gone into a lot more detail here.  Many of my changes
> are wording changes to comments, since it's probably easier for me to
> simply provide you with the better wording.  Please feel free to cut
> and paste them.  If they seem inaccurate then I have misunderstood
> something :-).
>
> For the comments,
>    Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>

Thank you very much!
I've copy/paste your commnets and added your Signed-off-by.

>
>
>
>> --- a/tools/libxl/libxl.h
>> +++ b/tools/libxl/libxl.h
> ...
>> +/*
>> + * LIBXL_HAVE_REMUS
>> + * If this is defined, then libxl supports remus.
>> + */
>> +#define LIBXL_HAVE_REMUS 1
>> +
>
> It occurs to me to ask: would this be better put in at the end, given
> that the code after (only) this patch has gone in is not really
> useable because of the lack of proper device handling ?

I will make a separate patch to update this at the end, thanks!

>
>
>> +static void remus_device_preresume_cb(libxl__egc *egc,
>> +                                      libxl__remus_device_state *rds,
>> +                                      int rc)
>> +{
>> +    int ok = 0;
>> +    libxl__domain_suspend_state *dss = CONTAINER_OF(rds, *dss, rds);
>> +    STATE_AO_GC(dss->ao);
>> +
>> +    if (!rc) {
>> +        /* Resumes the domain and the device model */
>> +        if (!libxl__domain_resume(gc, dss->domid, /* Fast Suspend */1))
>> +            ok = 1;
>> +    }
>> +    libxl__xc_domain_saverestore_async_callback_done(egc, &dss->shs, ok);
>
> Can we please use the standard `goto out' error handling style ?  Both
> here and in remus_device_postsuspend_cb.
>
> The right approach is this: At the top of the function, test rc, and
> if nonzero log if necessary, and `goto out'.  The success path should
> set `ok=1'.

Ok, thanks.

>
> You have it right in remus_checkpoint_dm_saved.
>
>> +static void remus_device_commit_cb(libxl__egc *egc,
>> +                                   libxl__remus_device_state *rds,
>> +                                   int rc)
>> +{
>> +    libxl__domain_suspend_state *dss = CONTAINER_OF(rds, *dss, rds);
>> +
>> +    STATE_AO_GC(dss->ao);
>> +
>> +    if (rc) {
>> +        LOG(ERROR, "Failed to do device commit op."
>> +            " Terminating Remus..");
>> +        goto out;
>> +    } else {
>
> This shouldn't be an else; that results in every-increasing nesting
> depth.

Ok, thanks.

>
>> +        /* Set checkpoint interval timeout */
>> +        rc = libxl__ev_time_register_rel(gc, &dss->checkpoint_timeout,
>> +                                         remus_next_checkpoint,
>> +                                         dss->interval);
>> +        if (rc) {
>> +            LOG(ERROR, "unable to register timeout for next epoch."
>> +                " Terminating Remus..");
>> +            goto out;
>
>> +static void remus_next_checkpoint(libxl__egc *egc, libxl__ev_time *ev,
>> +                                  const struct timeval *requested_abs)
>> +{
>> +    libxl__domain_suspend_state *dss =
>> +                            CONTAINER_OF(ev, *dss, checkpoint_timeout);
>> +
>> +    STATE_AO_GC(dss->ao);
>> +
>> +    libxl__ev_time_deregister(gc, &dss->checkpoint_timeout);
>
> This is not necessary.  libxl's timers are one-shot.
>
>>       libxl__xc_domain_saverestore_async_callback_done(egc, &dss->shs, 1);
>>   }
>
> But this is a bit mystifying to me.  It seems that when you have
> finished committing to the checkpoint, you delay by the
> inter-checkpoint delay before returning.  Is that right ?  I think
> this could at the very least do with some commentary explaining why...

This delay time is the interval between checkpoints. Remus doing checkpoints
periodically.

>
>>
>> @@ -1752,6 +1849,34 @@ static void domain_suspend_done(libxl__egc *egc,
>>           xc_suspend_evtchn_release(CTX->xch, CTX->xce, domid,
>>                              dss->guest_evtchn.port, &dss->guest_evtchn_lockfd);
>>
>> +    if (dss->remus) {
>
> The usual libxl idiom for this kind of conditional processing can be
> seen in libxl_create.c.  Search, for example, for
> "domcreate_bootloader_done".  You'll see that what we do is
> this:
>
>          if (restore_fd >= 0) {
>              LOG(DEBUG, "restoring, not running bootloader\n");
>                      domcreate_bootloader_done(egc, &dcs->bl, 0);
>
> So I think what you should do here is:
>
>        if (!dss->remus) {
>            remus_teardown_done(egc,rds,0;

If it's not a remus suspend, there's no rds...so we can not call
remus_teardown_done.
I may keep the original code the next version.

>            return;
>        }
>
> and then put the meat of the function outside the conditional.
>
> The effect of this is that there is still only one place where
> dss->callback is called.
>
>
>> +        /*
>> +         * With Remus, if we reach this point, it means either
>> +         * backup died or some network error occurred preventing us
>> +         * from sending checkpoints. Teardown the network buffers and
>> +         * release netlink resources.  This is an async op.
>> +         */
>> +        LOGE(WARN, "Remus: Domain suspend terminated with rc %d,"
>> +             " teardown Remus devices...", rc);
>> +        dss->rds.callback = libxl__remus_teardown_done;
>
> I don't understand why libxl__remus_teardown_done has an
> inter-file-visible name.  Isn't it just a callback chaining function ?
> If so I would call it remus_teardown_done or something.

I will rename it to remus_teardown_done, thanks.

>
>
>> +typedef enum libxl__remus_device_kind {
>> +    LIBXL__REMUS_DEVICE_NIC,
>> +    LIBXL__REMUS_DEVICE_DISK,
>> +} libxl__remus_device_kind;
>
> AFAICT this enum is used only for filtering which subkind ops vtables
> we attempt setup() on.  Maybe it would be better to remove this enum
> and replace its uses in libxl__remus_devices_setup with references to
> (lists of) vtables ?
>
> Eg
>
>     extern const libxl__remus_device_subkind_ops libxl__remus_subkinds_nic[];
>     extern const libxl__remus_device_subkind_ops libxl__remus_subkinds_disk[];
>     const libxl__remus_device_subkind_ops *libxl__remus_all_subkinds[] = {
>            libxl__remus_subkinds_nic,
>            libxl__remus_subkinds_disk,
>            0,
>     };
>
> and then
>
>      for (i = 0; i < rds->num_nics; i++) {
>              libxl__remus_device_init(egc, rds,
>                                       libxl__remus_device_subkinds_nic,
>                                       &rds->nics[i]);
>      }
>
> or something.

Thanks for the detailed comments. this enum also used for match() and
other uses, so I will keep it.

>
>> +struct libxl__remus_device_subkind_ops {
>> +    /* the device kind this ops belongs to... */
>> +    libxl__remus_device_kind kind;
>> +
>> +    /*
>> +     * init() and destroy() APIs are produced by a device subkind and
>> +     * consumed by the main remus code, a device subkind must implement
>
> We would normally write, in English, that an API is "provided" and
> "called", rather than "produced" and "consumed".  Objects may be
> produced and consumed, but not functions.
>
> Also this whole API has the property there that you attribute to these
> functions.  So I would say:
>
>          * These operations are provided by a device subkind and
>          * called by the main remus code.  Functions must be
>          * implemented unless otherwise stated.
>
> at the top of the structure.
>
> I would also move the replicated text about async functions to a
> common section:
>
>          * Many of these functions are asynchronous.  They call
>          * dev->callback when done.  The actual implementations may be
>          * synchronously and call dev->callback directly (as the last
>          * thing they do).
>
>> +     * init() and destroy() APIs are produced by a device subkind and
>> +     * consumed by the main remus code, a device subkind must implement
>> +     * these two APIs.
>
> What you really mean is that these init and destroy a subkind's data.
> So I would rename `destroy' to `cleanup' and say:
>
>          * init() and cleanup() relate to the subkind-specific state in
>          * the libxl ctx, not to any specific device.
>
> Although, now I have read later in this file, it seems that this
> subkind-specific state is stored in the per-remus-operation state
> structure, not in the libxl_ctx.
>
> I see from (for example) that libxl_netbuffer.c uses a netlink socket
> per remus operation.  Would it be possible to share a netlink socket
> between operations ?  I seem to remember this was a static global
> variable in a previous version of the series, which suggests it would
> be possible.
>
> If it is, why not put it in the libxl_ctx ?

It is in the libxl_ctx. init() will only be called one time per remus
operation.

>
>
> And of course:
>
>          * Synchronous.  cleanup() cannot fail.
>
>> +    /*
>> +     * checkpoint callbacks, these are async ops, . These function pointers may be NULL, means the op is
>> +     * not implemented, and it will do nothing when checkpoint.
>> +     * The callers of these APIs must check the function pointer first.
>> +     *
>
>       * Checkpoint operations.  May be NULL, meaning the op is not
>       * implemented and the caller should treat them as a no-op (and do
>       * nothing when checkpointing).
>
> And of course:
>
>       * Asynchronous.
>
>> +    void (*postsuspend)(libxl__remus_device *dev);
>> +    void (*preresume)(libxl__remus_device *dev);
>> +    void (*commit)(libxl__remus_device *dev);
>
>> +    /*
>> +     * This API determines whether the subkind matchs the specific
>> +     * device. In the implementation, we first init all device
>> +     * subkind, for example, NIC, DRBD disk... Then we will find
>> +     * out the libxl devices, and match the device with the
>> +     * subkind, if the device is a drbd disk, then it will be
>> +     * matched with DRBD subkind, and the further ops(such as
>> +     * checkpoint etc.)  of this device will using DRBD subkind
>> +     * ops. This API is mainly for disks, because we must use an
>> +     * external script to determine whether a libxl_disk is a DRBD
>> +     * disk.  This function pointer may be NULL. That means this
>> +     * *kind* of device's ops is always matched with the *kind* of
>> +     * device.  It's an async op and must be implemented
>> +     * asynchronously, call dev->callback when done.
>> +     */
>
>       * Determines whether the subkind matches the specific device.
>       * init() will have been called for the subkind.  If match()
>       * succeeds, the device will then be managed with this set of
>       * subkind operations.  Asynchronous.  Yields 0 if the device
>       * matches, or REMUS_DEVOPS_NOT_MATCH if it does not; any other rc
>       * indicates failure.  May be NULL, meaning this subkind matches
>       * any device.
>
>> +    void (*match)(libxl__remus_device *dev);
>
>> +    /*
>> +     * setup() and teardown() are refer to the actual remus device,
>> +     * a device subkind must implement these two APIs. They are async
>> +     * ops, and call dev->callback when done.
>> +     * These callbacks can be implemented synchronously, call
>> +     * dev->callback at last directly.
>> +     */
>
> Having dealt with the boilerplatey bits we can simply write:
>
>       * setup() and teardown() are refer to the actual remus device.
>       * Asynchronous.
>
> I seem to remember asking you to document that:
>
>       * teardown is called even if setup fails.
>
>> +/*
>> + * This structure is for remus device layer, it records remus devices
>> + * that have been set up.
>
> I don't think this struct is entirely for the remus device layer,
> since only part of it is "private".
>
>      * State associated with a remus invocation, including parameters
>      * passed to the remus abstract device layer by the generic
>      * (non-remus) save/restore machinery.
>
>> + */
>> +struct libxl__remus_device_state {
>> +    /* must set by caller of libxl__remus_device_(setup|teardown) */
>
> You mean
>        must be set by [etc]
>
>> +    libxl__ao *ao;
>> +    libxl__egc *egc;
>> +    uint32_t domid;
>> +    libxl__remus_callback *callback;
>> +    /* the last ops must be NULL */
>> +    const libxl__remus_device_subkind_ops **ops;
>
> Surely it shouldn't be the non-remus code's responsibility to provide
> the vtable list ?  In fact I don't really understand why this needs to
> be a parameter at all.  Isn't it always the same ?

We make remus device abstract layer more generic, because COLO will reuse remus
device layer, so it will has diffrent ops.

>
>> +    /* private */
>> +    /* devices that have been set up */
>> +    int saved_rc;
>> +    libxl__remus_device **dev;
>
> Please make the sectional comments like "private" and "must be set..."
> stand out more.  I seem to remember asking this before (about a
> different struct perhaps?)
>
> The 2nd comment there can't be right.  Perhaps it refers to dev.
>
> But if dev is the set of devices that have been set up, what form is
> it in ?  I think it's an array and that the counters num_devices and
> num_set_up are associated with it.
>
> If so you need to bring dev and these sizes/counts together, and write
> much more informative commentary.  To give a specific example:
>
>> +    /* for counting devices that have been set up */
>> +    int num_set_up;
>
> The variable is called "num_set_up".  And everything here is devices.
> So the variable must be a "num[ber]" of devices which have been "set
> up".  That's the same as "counting" the devices which have been set
> up.
>
> So the comment tells me nothing that the variable name doesn't.
>
> But together these comments leave many important questions unanswered.
> What is the size of the devs array ?  What proportion of the devs
> array is initialised at any one time ?  May the devs array contain
> null pointers and what do they mean ?  etc.

I have made it more clear in the next version, please refer to v17.

>
>> +typedef void libxl__remus_device_callback(libxl__egc *,
>> +                                          libxl__remus_device *,
>> +                                          int rc);
>> +/*
>> + * This structure is init and setup by remus device abstruct layer,
>> + * and pass to remus device ops
>
> Instead,
>
>       Information about a single device being handled by remus.
>       Allocated by the remus abstract layer.
>
> It's not all "init" by the abstract layer - your more detailed
> comments inside the struct explain that nicely.
>
>> + */
>> +struct libxl__remus_device {
>> +    /*----- shared between abstract and concrete layers -----*/
>> +    /* set by remus device abstruct layer */
>
> I would say
>          * all set by remus device abstruct layer */
> and add a blank line after it.
>
> That makes it clear that the 2nd comment line applies to the whole
> section, not just the single next line.
>
>> +    /* libxl__device_* which this remus device related to */
>> +    const void *backend_dev;
>> +    libxl__remus_device_kind kind;
>> +
>> +    /*----- private for abstract layer only -----*/
>> +    /*
>> +     * we must go through all device ops until we find a matched ops
>> +     * for the device.
>> +     */
>
> This last comment belongs in the code, if anywhere, but I suspect it
> is obvious.  Perhaps what you really mean is
>
>          * Control and state variables for the asynchronous callback
>          * based loops which iterate over device subkinds, and over
>          * individual devices.
>
> ?

yes.

>
>> +    int ops_index;
>> +    const libxl__remus_device_subkind_ops *ops;
>> +    libxl__remus_device_callback *callback;
>> +    libxl__remus_device_state *rds;
>
>> +    /*----- private for concrete (device-specific) layer -----*/
>> +    /* *kind* of device's private data */
>> +    void *data;
>
> Again, the /*----- -----*/ comment would benefit from a following
> blank line.
>
> It would be better to change the variable name than write the
> information in the comment.  Is there some reason why data shouldn't
> be `concrete_data' or something ?

I will rename it to concrete_data. and remove *kind*

>
> I'm not sure why "*kind*" is emphasized in the comment.
> ;2~
>> +    /* for calling scripts, eg. setup|teardown|match scripts */
>> +    libxl__async_exec_state aes;
>> +    /*
>> +     * for async func calls, in the implementation of device ops, we
>> +     * may use fork to do async ops. this is owned by device-specific
>> +     * ops methods
>> +     */
>> +    libxl__ev_child child;
>
> I still think these variables would be better in the subkind private
> data structs.  Having them here is a layering violation.  The fact
> that each of the subkinds needs these is a coincidence, not part of
> the design.  So it is fine to write them out twice for that reason.
>
> And I'm not sure how helpful the comments are.  It's fairly obvious
> that a libxl__async_exec_state is for running scripts, for example.
> But maybe if these were in the concrete code it would be possible to
> make the comments more specific.

If we move these 2 variables into device specific structs, that is in
concrete_data, there will be some problems:
   concrete_data will be init when setup() is called.
   match() is called before setup(), but in match() we may call hotplug
scripts or doing async func calls, so this two vars may be needed.

>
>> --- /dev/null
>> +++ b/tools/libxl/libxl_remus_device.c
>> @@ -0,0 +1,339 @@
>> +/*
>> + * Copyright (C) 2014
>> + * Author: Yang Hongyang <yanghy@cn.fujitsu.com>
>
> I think you need to specify who owns the copyright.  Probably
> Fujitsu ?  Although this will depend on your contractual situation.

Will update, thanks!

>
>> +/*----- helper functions -----*/
>> +
>> +static int init_device_subkind(libxl__remus_device_state *rds)
>> +{
>> +    int rc;
>> +    const libxl__remus_device_subkind_ops **ops;
>> +
>> +    for (ops = rds->ops; *ops; ops++) {
>> +        rc = (*ops)->init(rds);
>
> Why not call init immediately before match() ?  That would do away
> with this loop entirely.
>
> Furthermore, this seems to imply that init is idempotent.  This isn't
> spelled out in your API comments.
>
> If init is idempotent, and is going to be called before match() every
> time, why not fold it into match ?

a ops will be called many times in match(). but init()/destroy() only
need to be called once. so I think it's better to separate it. and also
if a subkind init failed, we can exit immediately, and not need to wait
for match complete.

>
>> +        if (rc) {
>> +            goto out;
>> +        }
>
> If you keep this, there is no need for the braces here.  Indeed you
> can put `goto out' on the same line.

Ok, thanks.

>
>> +/* remus device setup and teardown */
>> +
>> +static __attribute__((unused)) void libxl__remus_device_init(libxl__egc *egc,
>> +                                     libxl__remus_device_state *rds,
>> +                                     libxl__remus_device_kind kind,
>> +                                     void *libxl_dev);
>> +void libxl__remus_devices_setup(libxl__egc *egc, libxl__remus_device_state *rds)
>> +{
>
> Missing blank line between the two declarations.
>
> If libxl__remus_device_init is static, why does it have a
> globally-visible name ?

Will rename it, thanks!

>
>> +    STATE_AO_GC(rds->ao);
>> +
>> +    if (!rds->ops[0])
>> +        goto out;
>
> Why the special case ?  AFAICT init_device_subkind would do the right
> thing if called anyway.

Ok, will delete this case, thanks.

>
>> +static __attribute__((unused)) void libxl__remus_device_init(libxl__egc *egc,
>> +                                     libxl__remus_device_state *rds,
>> +                                     libxl__remus_device_kind kind,
>> +                                     void *libxl_dev)
>> +{
>> +    libxl__remus_device *dev = NULL;
>> +
>> +    STATE_AO_GC(rds->ao);
>> +    GCNEW(dev);
>> +    dev->backend_dev = libxl_dev;
>> +    dev->kind = kind;
>> +    dev->rds = rds;
>> +
>> +    libxl__async_exec_init(&dev->aes);
>> +    libxl__ev_child_init(&dev->child);
>> +
>> +    /* match the ops begin */
>> +    dev->ops_index = 0;
>> +    dev->ops = rds->ops[dev->ops_index];
>> +    for (; dev->ops; dev->ops = rds->ops[++dev->ops_index]) {
>> +        if (dev->ops->kind == dev->kind) {
>> +            if (dev->ops->match) {
>> +                dev->callback = device_match_cb;
>> +                dev->ops->match(dev);
>> +            } else {
>> +                /*
>> +                 * This devops do not have match() implementation.
>> +                 * That means this *kind* of device's ops is always
>> +                 * matched with the *kind* of device.
>> +                 */
>> +                dev->callback = device_setup_cb;
>> +                dev->ops->setup(dev);
>> +            }
>
> This would be much less confusing if you had this !match case call
> device_match_cb with rc=0 instead of directly calling setup.
>
>> +static void device_match_cb(libxl__egc *egc,
>> +                            libxl__remus_device *dev,
>> +                            int rc)
>> +{
>> +    libxl__remus_device_state *const rds = dev->rds;
>> +
>> +    STATE_AO_GC(rds->ao);
>> +
>> +    if (rds->saved_rc) {
>> +        /* there's already an error happened, we do not need to continue */
>> +        rds->num_devices++;
>> +        if (all_devices_handled(rds))
>> +            rds->callback(egc, rds, rds->saved_rc);
>> +        return;
>> +    }
>> +
>> +    if (rc) {
>> +        /* the ops does not match, try next ops */
>
> But what if rc==ERROR_MY_HEAD_EXPLODED ?

this case is treated as fail, see below:
   if (!dev->ops || rc != ERROR_REMUS_DEVOPS_NOT_MATCH) {
>
>> +        dev->ops = rds->ops[++dev->ops_index];
>> +        if (!dev->ops || rc != ERROR_REMUS_DEVOPS_NOT_MATCH) {
>> +            /* the device can not be matched */
>> +            rds->num_devices++;
>> +            rds->saved_rc = ERROR_REMUS_DEVICE_NOT_SUPPORTED;
>> +            if (all_devices_handled(rds))
>> +                rds->callback(egc, rds, rds->saved_rc);
>> +            return;
>> +        }
>> +        for ( ; dev->ops; dev->ops = rds->ops[++dev->ops_index]) {
>> +            if (dev->ops->kind == dev->kind) {
>
> The logic here is tangled and duplicated.  Please arrange to have only
> one instance of this loop.
>
> The easiest way is probably to have libxl__remus_device_init set
> ops_index to -1 or something and call device_match_cb.  Then
> device_match_cb can do all the work.

Ok, I will refactor this, thanks.

>
>> +static void device_setup_cb(libxl__egc *egc,
>> +                            libxl__remus_device *dev,
>> +                            int rc)
>> +{
>> +    /* Convenience aliases */
>> +    libxl__remus_device_state *const rds = dev->rds;
>> +
>> +    STATE_AO_GC(rds->ao);
>> +
>> +    rds->num_devices++;
>> +    /*
>> +     * the netbuf script was designed as below:
>> +     * 1. when setup failed, the script won't teardown the device itself.
>> +     * 2. the teardown op is ok to be executed many times.
>
> This information needs to be in the API commentary, and can then be
> deleted here.

Ok, thanks.

>
>> +     * we add devices that have been set up to the array no matter
>> +     * the setup process succeed or failed because we need to ensure
>> +     * the device been teardown while setup failed. If any of the
>> +     * device setup failed, we will quit remus, but before we exit,
>> +     * we will teardown the devices that have been added to **dev
>> +     */
>
> This information would be better expressed in the commentary on the
> state variables, as invariant(s) on the state.

Ok, thanks.

>
>> +    if (all_devices_handled(rds))
>> +        rds->callback(egc, rds, rds->saved_rc);
>
> This little snippet of code occurs many times.  Perhaps
> all_devices_handled should include calling the callback and be called
> "check_all_devices_handled" or "check_make_callback" or something.

I will refactor this, thanks!

>
>> +void libxl__remus_devices_teardown(libxl__egc *egc, libxl__remus_device_state *rds)
>> +{
>> +    int i, num_set_up;
>> +    libxl__remus_device *dev;
>> +
>> +    STATE_AO_GC(rds->ao);
>> +
>> +    rds->saved_rc = 0;
>> +
>> +    if (rds->num_set_up == 0) {
>> +        destroy_device_subkind(rds);
>> +        goto out;
>> +    }
>
> Again, you need to arrange for this unfolded loop to contain only one
> copy of each bit of logic.
>
>> +#define define_remus_device_checkpoint_api(api)                             \
>> +void libxl__remus_devices_##api(libxl__egc *egc,                            \
>> +                                libxl__remus_device_state *rds)             \
>
> Can you please make the lines shorter ?  When I quote this, it comes

Ok, thanks.

> out looking like this:
>
> ...
>> +    for (i = 0; i < rds->num_set_up; i++) {                        \
>      \
>> +        dev = rds->dev[i];                                         \
>      \
>> +        dev->callback = device_checkpoint_cb;                      \
>      \
>> +        if (dev->ops->api) {                                       \
>      \
>> +            dev->ops->api(dev);                                    \
>      \
>
> which is hard to read.
>
>
> To comment on a slightly reformatted version:
>
>> +void libxl__remus_devices_##api(libxl__egc *egc,                  \
>> +                                libxl__remus_device_state *rds)   \
>> +{                                                                 \
>> +    int i;                                                        \
>> +    libxl__remus_device *dev;                                     \
>> +                                                                  \
>> +    STATE_AO_GC(rds->ao);                                         \
>> +                                                                  \
>> +    rds->num_devices = 0;                                         \
>> +    rds->saved_rc = 0;                                            \
>> +                                                                  \
>> +    if (rds->num_set_up == 0)                                     \
>> +        goto out;                                                 \
>
> I think if you did as I suggest with check_all_devices_handled, you
> would end up calling check_all_devices_handled on exit from this
> function and therefore wouldn't need the special case.

This special case is needed because if rds->num_set_up == 0, rds->callback
will no be called in below loop.

>
>> +                                                                  \
>> +    for (i = 0; i < rds->num_set_up; i++) {                       \
>> +        dev = rds->dev[i];                                        \
>> +        dev->callback = device_checkpoint_cb;                     \
>> +        if (dev->ops->api) {                                      \
>> +            dev->ops->api(dev);                                   \
>> +        } else {                                                  \
>> +            rds->num_devices++;                                   \
>> +            if (rds->num_devices == rds->num_set_up)              \
>> +                rds->callback(egc, rds, rds->saved_rc);           \
>> +        }                                                         \
>> +    }                                                             \
>> +                                                                  \
>> +    return;                                                       \
>> +                                                                  \
>> +out:                                                              \
>> +    rds->callback(egc, rds, rds->saved_rc);                       \
>> +}
>
> This function is strikingly similar to the previous unfolded async
> loops.
>
> And it's also similar to libxl__ao_device and libxl__multidev_*.
>
> Can any of this be factored out ?  Can we in fact reuse
> libxl__ao_device and libxl__multidev_* ?

I don't have a good solution to reuse libxl__ao_device and libxl__multidev_*
currently, so I will keep the original code for now, but will refactor it.

>
>
> Thanks,
> Ian.
> .
>

-- 
Thanks,
Yang.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v16 2/7] remus: introduce remus device
  2014-07-16 10:50     ` Hongyang Yang
@ 2014-07-16 11:46       ` Ian Jackson
  2014-07-18 10:06         ` Hongyang Yang
  0 siblings, 1 reply; 19+ messages in thread
From: Ian Jackson @ 2014-07-16 11:46 UTC (permalink / raw)
  To: Hongyang Yang
  Cc: laijs, wency, andrew.cooper3, yunhong.jiang, eddie.dong,
	xen-devel, rshriram, ian.campbell

Hongyang Yang writes ("Re: [PATCH v16 2/7] remus: introduce remus device"):
> On 07/16/2014 02:28 AM, Ian Jackson wrote:
> > Yang Hongyang writes ("[PATCH v16 2/7] remus: introduce remus device"):
> >>       libxl__xc_domain_saverestore_async_callback_done(egc, &dss->shs, 1);

Thanks for all your replies.  Everything that seemed good to me I
haven't responded to individually, so what is left is things that I
wanted to comment further on.

I've been trying to respond to these emails quickly, to enable you to
make sure that code changes flowing from all of my comments, and all
the resulting conversations, are incorporated in your series.

FYI I don't intend to review v17 because I want to finish all these
conversations, and resolve the outstanding questions.  When we have
answered the questions and the answers have been incorporated in your
series, I will re-review it.  I hope that makes sense.

> > But this is a bit mystifying to me.  It seems that when you have
> > finished committing to the checkpoint, you delay by the
> > inter-checkpoint delay before returning.  Is that right ?  I think
> > this could at the very least do with some commentary explaining why...
> 
> This delay time is the interval between checkpoints. Remus doing checkpoints
> periodically.

Yes, but why does this inter-checkpoint delay appear as a delay before
returning from the libxc checkpoint callback ?  There may be a good
reason, based on the libxc API semantics or something, but it isn't
evident from the doc comments or the code.

> >> @@ -1752,6 +1849,34 @@ static void domain_suspend_done(libxl__egc *egc,
> >>           xc_suspend_evtchn_release(CTX->xch, CTX->xce, domid,
> >>                              dss->guest_evtchn.port, &dss->guest_evtchn_lockfd);
> >>
> >> +    if (dss->remus) {
> >
> > The usual libxl idiom for this kind of conditional processing can be
> > seen in libxl_create.c.  Search, for example, for
> > "domcreate_bootloader_done".  You'll see that what we do is
> > this:
> >
> >          if (restore_fd >= 0) {
> >              LOG(DEBUG, "restoring, not running bootloader\n");
> >                      domcreate_bootloader_done(egc, &dcs->bl, 0);
> >
> > So I think what you should do here is:
> >
> >        if (!dss->remus) {
> >            remus_teardown_done(egc,rds,0;
> 
> If it's not a remus suspend, there's no rds...so we can not call
> remus_teardown_done.

AFAICT dss has the rds as a member.  So you can pass &dss->rds (and
then use CONTAINER_OF to extract it).  Your existing function
libxl__remus_teardown_done only uses the passed-in rds with
CONTAINER_OF to find the dss.  So the contents of dss->rds don't
matter (although in fact it will be blank rather than undefined
because it will have been allocated with a zeroing allocator).

> >> +typedef enum libxl__remus_device_kind {
> >> +    LIBXL__REMUS_DEVICE_NIC,
> >> +    LIBXL__REMUS_DEVICE_DISK,
> >> +} libxl__remus_device_kind;
> >
> > AFAICT this enum is used only for filtering which subkind ops vtables
> > we attempt setup() on.  Maybe it would be better to remove this enum
> > and replace its uses in libxl__remus_devices_setup with references to
> > (lists of) vtables ?
...
> > Eg
> >      for (i = 0; i < rds->num_nics; i++) {
> >              libxl__remus_device_init(egc, rds,
> >                                       libxl__remus_device_subkinds_nic,
> >                                       &rds->nics[i]);
...
> Thanks for the detailed comments. this enum also used for match() and
> other uses, so I will keep it.

My proposal above would do away with the need for match to look at the
kind at all.  It would only iterate over the subkinds which are
relevant to that kind.

What are the other purposes you mention ?  I didn't notice them.

> > Although, now I have read later in this file, it seems that this
> > subkind-specific state is stored in the per-remus-operation state
> > structure, not in the libxl_ctx.
> >
> > I see from (for example) that libxl_netbuffer.c uses a netlink socket
> > per remus operation.  Would it be possible to share a netlink socket
> > between operations ?  I seem to remember this was a static global
> > variable in a previous version of the series, which suggests it would
> > be possible.
> >
> > If it is, why not put it in the libxl_ctx ?
> 
> It is in the libxl_ctx. init() will only be called one time per remus
> operation.

Oh, yes, I see, it is in the ctx.  But it is also unconditionally
allocated by nic_init.  So the code is wrong: if you run two remus
operations on the same ctx, each of them will allocate the
libxl__remus_netbuf_state.

I haven't looked at the teardown path but I suspect that it will leave
one of the operations going without a useable ns.

If you store this state in the ctx, and therefore share it amongst
multiple concurrent remus operations, you must set it up idempotently,
and only destroy it when there are no remus operations outstanding
(which I think means either refcounting, or - more sensibly - just
leaving it allocated until the ctx is destroyed).

> >> +    libxl__ao *ao;
> >> +    libxl__egc *egc;
> >> +    uint32_t domid;
> >> +    libxl__remus_callback *callback;
> >> +    /* the last ops must be NULL */
> >> +    const libxl__remus_device_subkind_ops **ops;
> >
> > Surely it shouldn't be the non-remus code's responsibility to provide
> > the vtable list ?  In fact I don't really understand why this needs to
> > be a parameter at all.  Isn't it always the same ?
> 
> We make remus device abstract layer more generic, because COLO will
> reuse remus device layer, so it will has diffrent ops.

I hadn't realised that you plan to use different ops for your high
availability system.  I'm tempted to ask all sorts of questions about
your plans the future but this is a bit of a can of worms.

So we have two ways to approach this patch series.  One would be to
can ask all the questions about how this will be done in the future in
your HA stack, to understand the motivations behind making this a
parameter here.  This will involve getting stuck into review of the
design of your HA stack.

The other would be to put that aside and simply review these patches
on their own terms.

I think the former, proactive, approach is going to be too impractical
here.  So I am going to adopt the latter, unless you want to try to
persuade me otherwise.

On that basis I think that for now, at least, this should be fixed.
If it needs to be made general later, that can be addressed at that
time.

> >> +    /* for calling scripts, eg. setup|teardown|match scripts */
> >> +    libxl__async_exec_state aes;
> >> +    /*
> >> +     * for async func calls, in the implementation of device ops, we
> >> +     * may use fork to do async ops. this is owned by device-specific
> >> +     * ops methods
> >> +     */
> >> +    libxl__ev_child child;
> >
> > I still think these variables would be better in the subkind private
> > data structs.  Having them here is a layering violation.  The fact
> > that each of the subkinds needs these is a coincidence, not part of
> > the design.  So it is fine to write them out twice for that reason.
...
> If we move these 2 variables into device specific structs, that is in
> concrete_data, there will be some problems:
>    concrete_data will be init when setup() is called.
>    match() is called before setup(), but in match() we may call hotplug
> scripts or doing async func calls, so this two vars may be needed.

I see.  This is rather unfortunate.

Why not merge match and setup ?  After all if match returns true we
always call setup...

> >> +/*----- helper functions -----*/
> >> +
> >> +static int init_device_subkind(libxl__remus_device_state *rds)
> >> +{
> >> +    int rc;
> >> +    const libxl__remus_device_subkind_ops **ops;
> >> +
> >> +    for (ops = rds->ops; *ops; ops++) {
> >> +        rc = (*ops)->init(rds);
> >
> > Why not call init immediately before match() ?  That would do away
> > with this loop entirely.
> >
> > Furthermore, this seems to imply that init is idempotent.  This isn't
> > spelled out in your API comments.
> >
> > If init is idempotent, and is going to be called before match() every
> > time, why not fold it into match ?
> 
> a ops will be called many times in match(). but init()/destroy() only
> need to be called once. so I think it's better to separate it.

init is in fact called more than once.  Is it supposed to be
idempotent ?  If it is supposed idempotent then it can check at the
start if anything needs to be done.  That would make the lifetime of
the per-subkkind state a matter entirely for the subkind, which would
simplify the API.  And, I think, the code.

> and also if a subkind init failed, we can exit immediately, and not
> need to wait for match complete.

I don't understand why this is important.  Performance isn't important
here, is it ?  And since we have to have the code for dealing with
asynchronous matching anyway, we don't gain any code simplicity from
separating init out.

> >> +static void device_match_cb(libxl__egc *egc,
> >> +                            libxl__remus_device *dev,
> >> +                            int rc)
> >> +{
...
> >> +    if (rc) {
> >> +        /* the ops does not match, try next ops */
> >
> > But what if rc==ERROR_MY_HEAD_EXPLODED ?
> 
> this case is treated as fail, see below:
>    if (!dev->ops || rc != ERROR_REMUS_DEVOPS_NOT_MATCH) {

That contradicts the comment, though.

> >> +void libxl__remus_devices_##api(libxl__egc *egc,                  \
> >> +                                libxl__remus_device_state *rds)   \
> >> +{                                                                 \
> >> +    int i;                                                        \
> >> +    libxl__remus_device *dev;                                     \
> >> +                                                                  \
> >> +    STATE_AO_GC(rds->ao);                                         \
> >> +                                                                  \
> >> +    rds->num_devices = 0;                                         \
> >> +    rds->saved_rc = 0;                                            \
> >> +                                                                  \
> >> +    if (rds->num_set_up == 0)                                     \
> >> +        goto out;                                                 \
> >
> > I think if you did as I suggest with check_all_devices_handled, you
> > would end up calling check_all_devices_handled on exit from this
> > function and therefore wouldn't need the special case.
> 
> This special case is needed because if rds->num_set_up == 0, rds->callback
> will no be called in below loop.

I think if you did as I suggest with check_all_devices_handled, you
would end up calling check_all_devices_handled *on every exit path*
from this function and therefore wouldn't need the special case.

> > This function is strikingly similar to the previous unfolded async
> > loops.
> >
> > And it's also similar to libxl__ao_device and libxl__multidev_*.
> >
> > Can any of this be factored out ?  Can we in fact reuse
> > libxl__ao_device and libxl__multidev_* ?
> 
> I don't have a good solution to reuse libxl__ao_device and libxl__multidev_*
> currently, so I will keep the original code for now, but will refactor it.

I'm sorry, I don't understand what you mean by not having a `good
solution'.  If multidev can be used then it would be much better if
you used it.  There would be less duplication.

Thanks,
Ian.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v16 2/7] remus: introduce remus device
  2014-07-16 11:46       ` Ian Jackson
@ 2014-07-18 10:06         ` Hongyang Yang
  2014-07-18 10:14           ` Ian Campbell
  2014-07-18 14:24           ` Ian Jackson
  0 siblings, 2 replies; 19+ messages in thread
From: Hongyang Yang @ 2014-07-18 10:06 UTC (permalink / raw)
  To: Ian Jackson
  Cc: laijs, wency, andrew.cooper3, yunhong.jiang, eddie.dong,
	xen-devel, rshriram, ian.campbell

Hi Ian,

   There're two comments left that I wanted to comment further on.
Apart from this two comments, I think I have addressed all your
comments on v16, but before I send a v18 out, Shriram and I will
have more detailed review on it, in case I have missed some of
your comments.

My latest working tree that we will review on is hosted at github:
https://github.com/macrosheep/xen/tree/remus-v17.6
Refer to the latest 10 commits.
What I think is the big changes compared to v16:
  - Merge match() and setup() api.
  - Reuse libxl__multidev and libxl__ao_device

NOTE:
   branches which has `draft` in the name may not compile, and you can
simply ignore it.
   branches which has version number like remus-vXX is compilable and
well tested, you can keep on track on those versions, especially the
latest version makes sense.

On 07/16/2014 07:46 PM, Ian Jackson wrote:
>>>> +typedef enum libxl__remus_device_kind {
>>>> +    LIBXL__REMUS_DEVICE_NIC,
>>>> +    LIBXL__REMUS_DEVICE_DISK,
>>>> +} libxl__remus_device_kind;
>>>
>>> AFAICT this enum is used only for filtering which subkind ops vtables
>>> we attempt setup() on.  Maybe it would be better to remove this enum
>>> and replace its uses in libxl__remus_devices_setup with references to
>>> (lists of) vtables ?
> ...
>>> Eg
>>>       for (i = 0; i < rds->num_nics; i++) {
>>>               libxl__remus_device_init(egc, rds,
>>>                                        libxl__remus_device_subkinds_nic,
>>>                                        &rds->nics[i]);
> ...
>> Thanks for the detailed comments. this enum also used for match() and
>> other uses, so I will keep it.
>
> My proposal above would do away with the need for match to look at the
> kind at all.  It would only iterate over the subkinds which are
> relevant to that kind.
>
> What are the other purposes you mention ?  I didn't notice them.

I intend to keep this *kind* because I will use them for filter out
what kind of device the remus abstract layer need to handle.

take the code v17.6(refer to above link) as an example:

typedef enum libxl__remus_device_kind {
     LIBXL__REMUS_DEVICE_NIC  = (1 << 0),
     LIBXL__REMUS_DEVICE_DISK = (1 << 1),
} libxl__remus_device_kind;

there's a flag in remus devices state(which I will call rds below):
struct libxl__remus_devices_state {
...
     int device_kind_flags;
...
}

and this flag will be set by upper layer(remus save/restore) depend
on the comand switch:
int libxl_domain_remus_start(libxl_ctx *ctx, libxl_domain_remus_info *info,
                              uint32_t domid, int send_fd, int recv_fd,
                              const libxl_asyncop_how *ao_how)
{
...
     if (info->netbuf) {
         if (!libxl__netbuffer_enabled(gc)) {
             LOG(ERROR, "Remus: No support for network buffering");
             goto out;
         }
         rds->device_kind_flags |= LIBXL__REMUS_DEVICE_NIC;
     }

     if (info->diskbuf)
         rds->device_kind_flags |= LIBXL__REMUS_DEVICE_DISK;
...
}

so that I can determine what kind of device I need to handle in
remus abstract layer without had to know command switch related things.
I think the fact that the abstract layer need to know what the command
switch is causes layer violation...

void libxl__remus_devices_setup(libxl__egc *egc, libxl__remus_devices_state *rds)
{
...
     if (rds->device_kind_flags & LIBXL__REMUS_DEVICE_NIC)
         rds->nics = libxl_device_nic_list(CTX, rds->domid, &rds->num_nics);

     if (rds->device_kind_flags & LIBXL__REMUS_DEVICE_DISK)
         rds->disks = libxl_device_disk_list(CTX, rds->domid, &rds->num_disks);
...
}

>
>
>>>> +/*----- helper functions -----*/
>>>> +
>>>> +static int init_device_subkind(libxl__remus_device_state *rds)
>>>> +{
>>>> +    int rc;
>>>> +    const libxl__remus_device_subkind_ops **ops;
>>>> +
>>>> +    for (ops = rds->ops; *ops; ops++) {
>>>> +        rc = (*ops)->init(rds);
>>>
>>> Why not call init immediately before match() ?  That would do away
>>> with this loop entirely.
>>>
>>> Furthermore, this seems to imply that init is idempotent.  This isn't
>>> spelled out in your API comments.
>>>
>>> If init is idempotent, and is going to be called before match() every
>>> time, why not fold it into match ?
>>
>> a ops will be called many times in match(). but init()/destroy() only
>> need to be called once. so I think it's better to separate it.
>
> init is in fact called more than once.  Is it supposed to be
> idempotent ?  If it is supposed idempotent then it can check at the
> start if anything needs to be done.  That would make the lifetime of
> the per-subkkind state a matter entirely for the subkind, which would
> simplify the API.  And, I think, the code.
>
>> and also if a subkind init failed, we can exit immediately, and not
>> need to wait for match complete.
>
> I don't understand why this is important.  Performance isn't important
> here, is it ?  And since we have to have the code for dealing with
> asynchronous matching anyway, we don't gain any code simplicity from
> separating init out.

a subkind init()/cleanup() only need to be called once per Remus instance.
there're cases that this subkind is used for multiple devices, it we don't
seperate it out, it will be called many times. I agree that I can make
it capable to be called multiple times, but I think that's not what the
design is. it is not supposed to be idempotent I think. and I think
separate it makes the logic more comprehensive.

Thanks,
Yang

-- 
Thanks,
Yang.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v16 2/7] remus: introduce remus device
  2014-07-18 10:06         ` Hongyang Yang
@ 2014-07-18 10:14           ` Ian Campbell
  2014-07-18 14:24           ` Ian Jackson
  1 sibling, 0 replies; 19+ messages in thread
From: Ian Campbell @ 2014-07-18 10:14 UTC (permalink / raw)
  To: Hongyang Yang
  Cc: laijs, wency, andrew.cooper3, yunhong.jiang, Ian Jackson,
	xen-devel, eddie.dong, rshriram

On Fri, 2014-07-18 at 18:06 +0800, Hongyang Yang wrote:

> > My proposal above would do away with the need for match to look at the
> > kind at all.  It would only iterate over the subkinds which are
> > relevant to that kind.
> >
> > What are the other purposes you mention ?  I didn't notice them.
> 
> I intend to keep this *kind* because I will use them for filter out
> what kind of device the remus abstract layer need to handle.
> 
> take the code v17.6(refer to above link) as an example:
> 
> typedef enum libxl__remus_device_kind {
>      LIBXL__REMUS_DEVICE_NIC  = (1 << 0),
>      LIBXL__REMUS_DEVICE_DISK = (1 << 1),
> } libxl__remus_device_kind;

I've not been following closely so I make no comment on whether you
actually should or shouldn't need a kind in this code (I'll leave that
to Ian) but if you do need it then this is needlessly duplicating
libxl__device_kind AFAICT.

If you really do need a bit field then you can use
1<<LIBXL__DEVICE_KIND_FOO.

Ian.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v16 2/7] remus: introduce remus device
  2014-07-18 10:06         ` Hongyang Yang
  2014-07-18 10:14           ` Ian Campbell
@ 2014-07-18 14:24           ` Ian Jackson
  2014-07-18 15:54             ` Shriram Rajagopalan
  1 sibling, 1 reply; 19+ messages in thread
From: Ian Jackson @ 2014-07-18 14:24 UTC (permalink / raw)
  To: Hongyang Yang
  Cc: ian.campbell, wency, andrew.cooper3, yunhong.jiang, Ian Jackson,
	xen-devel, eddie.dong, rshriram, laijs

Hongyang Yang writes ("Re: [PATCH v16 2/7] remus: introduce remus device"):
>    There're two comments left that I wanted to comment further on.
> Apart from this two comments, I think I have addressed all your
> comments on v16, but before I send a v18 out, Shriram and I will
> have more detailed review on it, in case I have missed some of
> your comments.

Right.  Thanks.

> My latest working tree that we will review on is hosted at github:
> https://github.com/macrosheep/xen/tree/remus-v17.6
> Refer to the latest 10 commits.
> What I think is the big changes compared to v16:
>   - Merge match() and setup() api.
>   - Reuse libxl__multidev and libxl__ao_device

Oh, good.


> >> Thanks for the detailed comments. this enum also used for match() and
> >> other uses, so I will keep it.
> >
> > My proposal above would do away with the need for match to look at the
> > kind at all.  It would only iterate over the subkinds which are
> > relevant to that kind.
> >
> > What are the other purposes you mention ?  I didn't notice them.
> 
> I intend to keep this *kind* because I will use them for filter out
> what kind of device the remus abstract layer need to handle.

Ah, I see, thanks for the explanation.  I'm still not entirely
convinced though:

> and this flag will be set by upper layer(remus save/restore) depend
> on the comand switch:
> int libxl_domain_remus_start(libxl_ctx *ctx, libxl_domain_remus_info *info,
>                               uint32_t domid, int send_fd, int recv_fd,
>                               const libxl_asyncop_how *ao_how)
> {
> ...
>      if (info->netbuf) {
>          if (!libxl__netbuffer_enabled(gc)) {
>              LOG(ERROR, "Remus: No support for network buffering");
>              goto out;
>          }
>          rds->device_kind_flags |= LIBXL__REMUS_DEVICE_NIC;
>      }
> 
>      if (info->diskbuf)
>          rds->device_kind_flags |= LIBXL__REMUS_DEVICE_DISK;

So what is actually going on here is a mechanism for disabling remus
processing for certain kinds of device.

But wouldn't one perhaps want some devices of the same kind to have
remus enabled, and some disabled ?

I'm not sure that this grouping makes much sense.  But perhaps you can
explain why one might want to disable all network buffering and enable
all disk buffering, but why one would never want to enable buffering
only for a subset of nics, or a subset of disks.


If this _is_ desirable, I think this kind of explicit conversion from
one type to another is undesirable if it can be eliminated.

In particular, perhaps what this really demonstrates is that the
commonality between libxl_domain_remus_info.netbuf and .diskbuf ought
to be properly represented in the idl for libxl_domain_remus_info.

Perhaps we should replace these by an array of defbools indexed by the
device kind ?  The device kind would have to be in the IDL as part of
the API.


> >> and also if a subkind init failed, we can exit immediately, and not
> >> need to wait for match complete.
> >
> > I don't understand why this is important.  Performance isn't important
> > here, is it ?  And since we have to have the code for dealing with
> > asynchronous matching anyway, we don't gain any code simplicity from
> > separating init out.
> 
> a subkind init()/cleanup() only need to be called once per Remus instance.
> there're cases that this subkind is used for multiple devices, it we don't
> seperate it out, it will be called many times. I agree that I can make
> it capable to be called multiple times, but I think that's not what the
> design is. it is not supposed to be idempotent I think. and I think
> separate it makes the logic more comprehensive.

The cleanup needs to be idempotent anyway.

Making a setup function idempotent is normally easy.  You just need to
spot at the start whether the setup has already been done.

Thanks,
Ian.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v16 2/7] remus: introduce remus device
  2014-07-18 14:24           ` Ian Jackson
@ 2014-07-18 15:54             ` Shriram Rajagopalan
  2014-07-18 15:58               ` Ian Campbell
  0 siblings, 1 reply; 19+ messages in thread
From: Shriram Rajagopalan @ 2014-07-18 15:54 UTC (permalink / raw)
  To: Ian Jackson
  Cc: Ian Campbell, Wen Congyang, Andrew Cooper, Jiang Yunhong,
	Dong Eddie, xen-devel@lists.xen.org, Hongyang Yang, Lai Jiangshan


[-- Attachment #1.1: Type: text/plain, Size: 5403 bytes --]

On Fri, Jul 18, 2014 at 10:24 AM, Ian Jackson <Ian.Jackson@eu.citrix.com>
wrote:

> Hongyang Yang writes ("Re: [PATCH v16 2/7] remus: introduce remus device"):
> >    There're two comments left that I wanted to comment further on.
> > Apart from this two comments, I think I have addressed all your
> > comments on v16, but before I send a v18 out, Shriram and I will
> > have more detailed review on it, in case I have missed some of
> > your comments.
>
> Right.  Thanks.
>
> > My latest working tree that we will review on is hosted at github:
> > https://github.com/macrosheep/xen/tree/remus-v17.6
> > Refer to the latest 10 commits.
> > What I think is the big changes compared to v16:
> >   - Merge match() and setup() api.
> >   - Reuse libxl__multidev and libxl__ao_device
>
> Oh, good.
>
>
> > >> Thanks for the detailed comments. this enum also used for match() and
> > >> other uses, so I will keep it.
> > >
> > > My proposal above would do away with the need for match to look at the
> > > kind at all.  It would only iterate over the subkinds which are
> > > relevant to that kind.
> > >
> > > What are the other purposes you mention ?  I didn't notice them.
> >
> > I intend to keep this *kind* because I will use them for filter out
> > what kind of device the remus abstract layer need to handle.
>
> Ah, I see, thanks for the explanation.  I'm still not entirely
> convinced though:
>
> > and this flag will be set by upper layer(remus save/restore) depend
> > on the comand switch:
> > int libxl_domain_remus_start(libxl_ctx *ctx, libxl_domain_remus_info
> *info,
> >                               uint32_t domid, int send_fd, int recv_fd,
> >                               const libxl_asyncop_how *ao_how)
> > {
> > ...
> >      if (info->netbuf) {
> >          if (!libxl__netbuffer_enabled(gc)) {
> >              LOG(ERROR, "Remus: No support for network buffering");
> >              goto out;
> >          }
> >          rds->device_kind_flags |= LIBXL__REMUS_DEVICE_NIC;
> >      }
> >
> >      if (info->diskbuf)
> >          rds->device_kind_flags |= LIBXL__REMUS_DEVICE_DISK;
>
> So what is actually going on here is a mechanism for disabling remus
> processing for certain kinds of device.
>
> But wouldn't one perhaps want some devices of the same kind to have
> remus enabled, and some disabled ?
>

No. By default, Remus would enable buffering for both disks and networks.
Leaving any one of these devices out would not guarantee consistency.

What is happening above is that Remus is trying to enable network buffering
if the info->netbuf flag is set (which is the default, unless the user
explicitly disables
network buffering).  A similar check is performed for disks as well.

I think things would be more clearer if the if blocks became something like
this:

if (info->netbuf) {
 ...
} else
  LOG(WARN, "Network buffering disabled. Failover may not be successful")

 A similar if-else block for the disk buffering should help.
 In fact, I would suggest pushing these warning statements out to the xl
cmdline
 code, so that even if the user were to filter out all the non-ERROR
messages
 this warning would always be displayed in the console. [My knowledge of the
libxl log filtering is poor. So correct me if I am wrong about this.]


> I'm not sure that this grouping makes much sense.  But perhaps you can
> explain why one might want to disable all network buffering and enable
> all disk buffering, but why one would never want to enable buffering
> only for a subset of nics, or a subset of disks.
>
>
Answered above. There is no way to guarantee consistency upon recovery.
I can certainly see use cases for this selective protection. But I consider
these as
"exceptions" to the normal usage scenarios. Nice to haves, IMO and not a
priority ATM.


>
> If this _is_ desirable, I think this kind of explicit conversion from
> one type to another is undesirable if it can be eliminated.
>
> In particular, perhaps what this really demonstrates is that the
> commonality between libxl_domain_remus_info.netbuf and .diskbuf ought
> to be properly represented in the idl for libxl_domain_remus_info.
>
> Perhaps we should replace these by an array of defbools indexed by the
> device kind ?  The device kind would have to be in the IDL as part of
> the API.
>
>
>
I am not upto speed on the implementation front. So I'll defer it to
Hongyang.


>  > >> and also if a subkind init failed, we can exit immediately, and not
> > >> need to wait for match complete.
> > >
> > > I don't understand why this is important.  Performance isn't important
> > > here, is it ?  And since we have to have the code for dealing with
> > > asynchronous matching anyway, we don't gain any code simplicity from
> > > separating init out.
> >
> > a subkind init()/cleanup() only need to be called once per Remus
> instance.
> > there're cases that this subkind is used for multiple devices, it we
> don't
> > seperate it out, it will be called many times. I agree that I can make
> > it capable to be called multiple times, but I think that's not what the
> > design is. it is not supposed to be idempotent I think. and I think
> > separate it makes the logic more comprehensive.
>
> The cleanup needs to be idempotent anyway.
>
> Making a setup function idempotent is normally easy.  You just need to
> spot at the start whether the setup has already been done.
>
> Thanks,
> Ian.
>
>

[-- Attachment #1.2: Type: text/html, Size: 7345 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v16 2/7] remus: introduce remus device
  2014-07-18 15:54             ` Shriram Rajagopalan
@ 2014-07-18 15:58               ` Ian Campbell
  2014-07-18 16:08                 ` Shriram Rajagopalan
  0 siblings, 1 reply; 19+ messages in thread
From: Ian Campbell @ 2014-07-18 15:58 UTC (permalink / raw)
  To: rshriram
  Cc: Lai Jiangshan, Wen Congyang, Andrew Cooper, Jiang Yunhong,
	Ian Jackson, xen-devel@lists.xen.org, Dong Eddie, Hongyang Yang

On Fri, 2014-07-18 at 11:54 -0400, Shriram Rajagopalan wrote:

> if (info->netbuf) {
>  ...
> } else
>   LOG(WARN, "Network buffering disabled. Failover may not be successful")

If this is the case then why are we offering that option to the user in
the first place?

What is the use case for disabling network buffering given that it
breaks failover?

Ian.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v16 2/7] remus: introduce remus device
  2014-07-18 15:58               ` Ian Campbell
@ 2014-07-18 16:08                 ` Shriram Rajagopalan
  2014-07-18 16:17                   ` Ian Campbell
  0 siblings, 1 reply; 19+ messages in thread
From: Shriram Rajagopalan @ 2014-07-18 16:08 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Lai Jiangshan, Wen Congyang, Andrew Cooper, Jiang Yunhong,
	Ian Jackson, xen-devel@lists.xen.org, Dong Eddie, Hongyang Yang


[-- Attachment #1.1: Type: text/plain, Size: 1011 bytes --]

On Fri, Jul 18, 2014 at 11:58 AM, Ian Campbell <Ian.Campbell@citrix.com>
wrote:

> On Fri, 2014-07-18 at 11:54 -0400, Shriram Rajagopalan wrote:
>
> > if (info->netbuf) {
> >  ...
> > } else
> >   LOG(WARN, "Network buffering disabled. Failover may not be successful")
>
> If this is the case then why are we offering that option to the user in
> the first place?
>
> What is the use case for disabling network buffering given that it
> breaks failover?
>
>
The keyword above is "may". Without network buffering, ongoing TCP
connections "may" be hung, depending on whether any communication
happened during the failed checkpoint. Same applies to disk.

If one can control the network interactions or if the application is
capable of recovering
from lost TCP connections (or lost UDP packets for that matter), then it
stands to benefit
from no-network buffering as it eliminates the latency overhead introduced
by buffering every
packet [ 0.5 x checkpoint-interval + RTT between primary-backup].


>  Ian.
>
>
>

[-- Attachment #1.2: Type: text/html, Size: 1784 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v16 2/7] remus: introduce remus device
  2014-07-18 16:08                 ` Shriram Rajagopalan
@ 2014-07-18 16:17                   ` Ian Campbell
  2014-07-18 16:50                     ` Shriram Rajagopalan
  0 siblings, 1 reply; 19+ messages in thread
From: Ian Campbell @ 2014-07-18 16:17 UTC (permalink / raw)
  To: rshriram
  Cc: Lai Jiangshan, Wen Congyang, Andrew Cooper, Jiang Yunhong,
	Ian Jackson, xen-devel@lists.xen.org, Dong Eddie, Hongyang Yang

On Fri, 2014-07-18 at 12:08 -0400, Shriram Rajagopalan wrote:

> The keyword above is "may". Without network buffering, ongoing TCP 
> connections "may" be hung, depending on whether any communication 
> happened during the failed checkpoint. Same applies to disk.

That doesn't sound like a good reason to offer this amount of rope to an
innocent user.

> If one can control the network interactions or if the application is
> capable of recovering
> from lost TCP connections (or lost UDP packets for that matter), then
> it stands to benefit
> from no-network buffering as it eliminates the latency overhead
> introduced by buffering every
> packet [ 0.5 x checkpoint-interval + RTT between primary-backup]. 

Now *this* might be a reason to offer such an option for network
traffic.

But is there an existing real world requirement to support this now or
is this just a theoretical desire? Seems like punting on it for the time
being would let the rest of the series make progress.

It seems harder to imagine a case where not replicating a disk would be
tolerable.

Ian.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v16 2/7] remus: introduce remus device
  2014-07-18 16:17                   ` Ian Campbell
@ 2014-07-18 16:50                     ` Shriram Rajagopalan
  0 siblings, 0 replies; 19+ messages in thread
From: Shriram Rajagopalan @ 2014-07-18 16:50 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Lai Jiangshan, Wen Congyang, Andrew Cooper, Jiang Yunhong,
	Ian Jackson, xen-devel@lists.xen.org, Dong Eddie, Hongyang Yang


[-- Attachment #1.1: Type: text/plain, Size: 2060 bytes --]

On Fri, Jul 18, 2014 at 12:17 PM, Ian Campbell <Ian.Campbell@citrix.com>
wrote:

> On Fri, 2014-07-18 at 12:08 -0400, Shriram Rajagopalan wrote:
>
> > The keyword above is "may". Without network buffering, ongoing TCP
> > connections "may" be hung, depending on whether any communication
> > happened during the failed checkpoint. Same applies to disk.
>
> That doesn't sound like a good reason to offer this amount of rope to an
> innocent user.
>
> > If one can control the network interactions or if the application is
> > capable of recovering
> > from lost TCP connections (or lost UDP packets for that matter), then
> > it stands to benefit
> > from no-network buffering as it eliminates the latency overhead
> > introduced by buffering every
> > packet [ 0.5 x checkpoint-interval + RTT between primary-backup].
>
> Now *this* might be a reason to offer such an option for network
> traffic.
>
> But is there an existing real world requirement to support this now or
> is this just a theoretical desire? Seems like punting on it for the time
> being would let the rest of the series make progress.
>
>
Most of the if(netbuf_enabled) {} else {} is an artifact of
whether or not the host system has the correct version of libnl installed.
If autoconf said no, then libxl_nonetbuffer.c would be compiled in, but
the existence (or not) of network buffers had to be checked at various
points
in code (in libxl), especially for teardown path.

It seems harder to imagine a case where not replicating a disk would be
> tolerable.
>

Read only root file system, tmpfs based scratch file system?

Its useful to just use memory replication for testing/profiling purposes.

So i think a better alternative would be to have an explicit test flag (-t)
followed
by the no-netbuf flag (-n) and/or the no disk buffering flag (-d).  Use of
a test flag
should be suggestive enough to the innocent user that this invocation is
not
meant for production, while flexible enough for someone with knowledge of
the system to disable network or disk replication for other needs.

[-- Attachment #1.2: Type: text/html, Size: 2880 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2014-07-18 16:50 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-07-15  6:46 [PATCH v16 0/7] Remus/Libxl: Remus network buffering and drbd disk Yang Hongyang
2014-07-15  6:46 ` [PATCH v16 1/7] remus: add libnl3 dependency for network buffering support Yang Hongyang
2014-07-15  6:46 ` [PATCH v16 2/7] remus: introduce remus device Yang Hongyang
2014-07-15 18:28   ` Ian Jackson
2014-07-16 10:50     ` Hongyang Yang
2014-07-16 11:46       ` Ian Jackson
2014-07-18 10:06         ` Hongyang Yang
2014-07-18 10:14           ` Ian Campbell
2014-07-18 14:24           ` Ian Jackson
2014-07-18 15:54             ` Shriram Rajagopalan
2014-07-18 15:58               ` Ian Campbell
2014-07-18 16:08                 ` Shriram Rajagopalan
2014-07-18 16:17                   ` Ian Campbell
2014-07-18 16:50                     ` Shriram Rajagopalan
2014-07-15  6:46 ` [PATCH v16 3/7] remus netbuffer: implement remus network buffering for nic devices Yang Hongyang
2014-07-15  6:46 ` [PATCH v16 4/7] remus drbd: Implement remus drbd replicated disk Yang Hongyang
2014-07-15  6:46 ` [PATCH v16 5/7] libxl: network buffering cmdline switch Yang Hongyang
2014-07-15  6:46 ` [PATCH v16 6/7] libxl: disk " Yang Hongyang
2014-07-15  6:46 ` [PATCH v16 7/7] MAINTAINERS: Update maintained files of REMUS Yang Hongyang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).