[PATCH 00/17] blktap2 related bugfix patches

xen-devel.lists.xenproject.org archive mirror

* [PATCH 00/17] blktap2 related bugfix patches
@ 2014-10-14  2:13 Wen Congyang
  2014-10-14  2:13 ` [PATCH 01/17] tools: blktap2: dynamic allocate aio_requests to avoid -EBUSY error Wen Congyang
                   ` (18 more replies)
  0 siblings, 19 replies; 50+ messages in thread
From: Wen Congyang @ 2014-10-14  2:13 UTC (permalink / raw)
  To: xen devel
  Cc: Ian Campbell, Wen Congyang, Ian Jackson, Jiang Yunhong,
	Dong Eddie, Yang Hongyang, Lai Jiangshan

We found these bugs while implementing COLO and while rebasing
COLO onto upstream Xen. The fixes are independent of COLO, so we
are posting them as a separate series.

The code is also hosted on GitHub:
https://github.com/wencongyang/xen/commits/bugfix-v4

Lai Jiangshan (1):
  tools: blktap2: dynamic allocate aio_requests to avoid -EBUSY error

Wen Congyang (16):
  tools: block-remus: pass uuid to the callback td_open
  tools: block-remus: use correct way to get remus_image
  tools: block-remus: fix bug in tdremus_close()
  tools: block-remus: fix memory leak
  tools: blktap2: return the correct dev path
  tools: blktap2: use correct way to get free event id
  tools: blktap2: don't return negative event id
  tools: blktap2: use correct way to define array.
  tools: block-remus: fix bug in ctl_request()
  tools: block-remus: clean unused functions
  tools: blktap2: implement an API to create a connection asynchronously
  tools: block-remus: connect to backup asynchronously
  block-remus: switch to unprotected mode before closing
  tools: blktap2: move ramdisk related codes to block-replication.c
  support blktap remus in xl
  HACK: libxl/remus: setup and control disk replication for blktap2
    backends

 tools/blktap2/drivers/Makefile            |    1 +
 tools/blktap2/drivers/block-aio.c         |   41 +-
 tools/blktap2/drivers/block-cache.c       |    4 +-
 tools/blktap2/drivers/block-log.c         |    4 +-
 tools/blktap2/drivers/block-qcow.c        |    5 +-
 tools/blktap2/drivers/block-ram.c         |    5 +-
 tools/blktap2/drivers/block-remus.c       | 1201 +++++++----------------------
 tools/blktap2/drivers/block-replication.c |  928 ++++++++++++++++++++++
 tools/blktap2/drivers/block-replication.h |  178 +++++
 tools/blktap2/drivers/block-vhd.c         |    5 +-
 tools/blktap2/drivers/scheduler.c         |   33 +-
 tools/blktap2/drivers/tapdisk-control.c   |   17 +-
 tools/blktap2/drivers/tapdisk-disktype.c  |   12 +-
 tools/blktap2/drivers/tapdisk-disktype.h  |    2 +-
 tools/blktap2/drivers/tapdisk-interface.c |   21 +-
 tools/blktap2/drivers/tapdisk-interface.h |    1 +
 tools/blktap2/drivers/tapdisk-vbd.c       |    9 +
 tools/blktap2/drivers/tapdisk-vbd.h       |    1 +
 tools/blktap2/drivers/tapdisk.h           |    3 +-
 tools/libxl/Makefile                      |    2 +-
 tools/libxl/libxl.c                       |   25 +-
 tools/libxl/libxl_blktap2.c               |   38 +-
 tools/libxl/libxl_create.c                |    8 +
 tools/libxl/libxl_device.c                |   35 +-
 tools/libxl/libxl_dm.c                    |    4 +-
 tools/libxl/libxl_internal.h              |   10 +-
 tools/libxl/libxl_noblktap2.c             |    8 +-
 tools/libxl/libxl_remus_device.c          |    6 +
 tools/libxl/libxl_remus_disk_blktap.c     |  209 +++++
 tools/libxl/libxl_types.idl               |    2 +
 tools/libxl/libxlu_disk_l.l               |    2 +
 31 files changed, 1857 insertions(+), 963 deletions(-)
 create mode 100644 tools/blktap2/drivers/block-replication.c
 create mode 100644 tools/blktap2/drivers/block-replication.h
 create mode 100644 tools/libxl/libxl_remus_disk_blktap.c

-- 
1.9.3


* [PATCH 01/17] tools: blktap2: dynamic allocate aio_requests to avoid -EBUSY error
  2014-10-14  2:13 [PATCH 00/17] blktap2 related bugfix patches Wen Congyang
@ 2014-10-14  2:13 ` Wen Congyang
  2014-10-14  2:13 ` [PATCH 02/17] tools: block-remus: pass uuid to the callback td_open Wen Congyang
                   ` (17 subsequent siblings)
  18 siblings, 0 replies; 50+ messages in thread
From: Wen Congyang @ 2014-10-14  2:13 UTC (permalink / raw)
  To: xen devel
  Cc: Ian Campbell, Wen Congyang, Ian Jackson, Jiang Yunhong,
	Dong Eddie, Yang Hongyang, Lai Jiangshan

From: Lai Jiangshan <laijs@cn.fujitsu.com>

In the normal case, there are at most TAPDISK_DATA_REQUESTS requests
in flight at the same time. In remus mode, however, write requests
are forwarded from the master side and cached in block-remus. All
cached requests are forwarded to the aio driver when the primary VM
and the backup VM are synced, so the number of requests may exceed
TAPDISK_DATA_REQUESTS. The aio driver cannot handle that many
requests at once, and tapdisk2 exits.

We don't know in advance how many requests will have to be handled,
so allocate aio_requests dynamically to avoid this error.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Jiang Yunhong <yunhong.jiang@intel.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Acked-by: Shriram Rajagopalan <rshriram@cs.ubc.ca>
---
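Note (illustration only, not part of the patch): the grow-on-demand
free list described in the commit message above can be sketched in
isolation as follows. All names here (req_pool, pool_get, CHUNK_REQS,
and so on) are hypothetical and do not appear in block-aio.c. Unlike
the patch, this sketch also tracks each allocated chunk so that the
pool can be torn down again.

```c
#include <assert.h>
#include <stdlib.h>

#define CHUNK_REQS 4	/* hypothetical; plays the role of MAX_AIO_REQS */

struct req { int busy; };

struct chunk {
	struct chunk *next;
	struct req    reqs[CHUNK_REQS];
};

struct req_pool {
	struct chunk *chunks;		/* every chunk allocated so far */
	struct req  **free_list;	/* stack of free requests */
	int           max_count;	/* capacity of free_list */
	int           free_count;	/* entries currently on free_list */
};

static void pool_init(struct req_pool *p)
{
	p->chunks = NULL;
	p->free_list = NULL;
	p->max_count = 0;
	p->free_count = 0;
}

/* Grow the pool by one chunk. Only called when the free list is empty. */
static int pool_refill(struct req_pool *p)
{
	int i, max = p->max_count + CHUNK_REQS;
	struct req **list;
	struct chunk *c;

	assert(p->free_count == 0);

	/* The free list must be able to hold every request ever handed out. */
	list = realloc(p->free_list, max * sizeof(*list));
	if (!list)
		return -1;
	p->free_list = list;

	c = calloc(1, sizeof(*c));
	if (!c)
		return -1;
	c->next = p->chunks;
	p->chunks = c;

	for (i = 0; i < CHUNK_REQS; i++)
		p->free_list[i] = &c->reqs[i];
	p->max_count = max;
	p->free_count = CHUNK_REQS;
	return 0;
}

/* Take a request, growing the pool instead of failing when it is empty. */
static struct req *pool_get(struct req_pool *p)
{
	if (p->free_count == 0 && pool_refill(p) < 0)
		return NULL;
	return p->free_list[--p->free_count];
}

static void pool_put(struct req_pool *p, struct req *r)
{
	assert(p->free_count < p->max_count);
	p->free_list[p->free_count++] = r;
}

static void pool_destroy(struct req_pool *p)
{
	while (p->chunks) {
		struct chunk *c = p->chunks;
		p->chunks = c->next;
		free(c);
	}
	free(p->free_list);
	pool_init(p);
}
```

A caller would run pool_init() once, pool_get() per request and
pool_put() on completion. pool_get() returns NULL only when memory is
exhausted, not when a fixed limit is reached.
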
 tools/blktap2/drivers/block-aio.c | 36 +++++++++++++++++++++++++++++++++---
 1 file changed, 33 insertions(+), 3 deletions(-)

diff --git a/tools/blktap2/drivers/block-aio.c b/tools/blktap2/drivers/block-aio.c
index f398da2..10ab20b 100644
--- a/tools/blktap2/drivers/block-aio.c
+++ b/tools/blktap2/drivers/block-aio.c
@@ -55,9 +55,10 @@ struct tdaio_state {
 	int                  fd;
 	td_driver_t         *driver;
 
+	int                  aio_max_count;
 	int                  aio_free_count;	
 	struct aio_request   aio_requests[MAX_AIO_REQS];
-	struct aio_request  *aio_free_list[MAX_AIO_REQS];
+	struct aio_request   **aio_free_list;
 };
 
 /*Get Image size, secsize*/
@@ -122,6 +123,11 @@ int tdaio_open(td_driver_t *driver, const char *name, td_flag_t flags)
 
 	memset(prv, 0, sizeof(struct tdaio_state));
 
+	prv->aio_free_list = malloc(MAX_AIO_REQS * sizeof(*prv->aio_free_list));
+	if (!prv->aio_free_list)
+		return -ENOMEM;
+
+	prv->aio_max_count = MAX_AIO_REQS;
 	prv->aio_free_count = MAX_AIO_REQS;
 	for (i = 0; i < MAX_AIO_REQS; i++)
 		prv->aio_free_list[i] = &prv->aio_requests[i];
@@ -159,6 +165,28 @@ done:
 	return ret;	
 }
 
+static int tdaio_refill(struct tdaio_state *prv)
+{
+	struct aio_request **new, *new_req;
+	int i, max = prv->aio_max_count + MAX_AIO_REQS;
+
+	new = realloc(prv->aio_free_list, max * sizeof(*prv->aio_free_list));
+	if (!new)
+		return -1;
+	prv->aio_free_list = new;
+
+	new_req = calloc(MAX_AIO_REQS, sizeof(*new_req));
+	if (!new_req)
+		return -1;
+
+	prv->aio_max_count = max;
+	prv->aio_free_count = MAX_AIO_REQS;
+	for (i = 0; i < MAX_AIO_REQS; i++)
+		prv->aio_free_list[i] = &new_req[i];
+
+	return 0;
+}
+
 void tdaio_complete(void *arg, struct tiocb *tiocb, int err)
 {
 	struct aio_request *aio = (struct aio_request *)arg;
@@ -207,8 +235,10 @@ void tdaio_queue_write(td_driver_t *driver, td_request_t treq)
 	size    = treq.secs * driver->info.sector_size;
 	offset  = treq.sec  * (uint64_t)driver->info.sector_size;
 
-	if (prv->aio_free_count == 0)
-		goto fail;
+	if (prv->aio_free_count == 0) {
+		if (tdaio_refill(prv) < 0)
+			goto fail;
+	}
 
 	aio        = prv->aio_free_list[--prv->aio_free_count];
 	aio->treq  = treq;
-- 
1.9.3


* [PATCH 02/17] tools: block-remus: pass uuid to the callback td_open
  2014-10-14  2:13 [PATCH 00/17] blktap2 related bugfix patches Wen Congyang
  2014-10-14  2:13 ` [PATCH 01/17] tools: blktap2: dynamic allocate aio_requests to avoid -EBUSY error Wen Congyang
@ 2014-10-14  2:13 ` Wen Congyang
  2014-10-20  2:58   ` Shriram Rajagopalan
  2014-10-14  2:13 ` [PATCH 03/17] tools: block-remus: use correct way to get remus_image Wen Congyang
                   ` (16 subsequent siblings)
  18 siblings, 1 reply; 50+ messages in thread
From: Wen Congyang @ 2014-10-14  2:13 UTC (permalink / raw)
  To: xen devel
  Cc: Ian Campbell, Wen Congyang, Ian Jackson, Jiang Yunhong,
	Dong Eddie, Yang Hongyang, Lai Jiangshan

The remus td_open callback needs the vbd's uuid, but the uuid is
hard-coded as 0. Since commit 4b1af8, the vbd's uuid is the minor
number of the blktap device, not 0. Pass the uuid to td_open so
that remus can look up the correct vbd.

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Acked-by: Shriram Rajagopalan <rshriram@cs.ubc.ca>
---
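Note (illustration only, not part of the patch): a minimal sketch of
why the open callback needs the uuid. Every name below (demo_vbd,
demo_get_vbd, demo_remus_open, ...) is hypothetical. The sketch only
mirrors the idea of looking a vbd up by the id passed in by the
caller instead of assuming id 0.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_VBDS 4

struct demo_vbd {
	int         in_use;
	int         uuid;	/* e.g. the minor number of the device */
	const char *name;
};

struct demo_driver {
	struct demo_vbd *vbd;	/* filled in by the open callback */
	int (*open)(struct demo_driver *, const char *name, int uuid);
};

static struct demo_vbd demo_vbds[DEMO_MAX_VBDS];

static struct demo_vbd *demo_add_vbd(int uuid, const char *name)
{
	int i;

	for (i = 0; i < DEMO_MAX_VBDS; i++) {
		if (!demo_vbds[i].in_use) {
			demo_vbds[i].in_use = 1;
			demo_vbds[i].uuid = uuid;
			demo_vbds[i].name = name;
			return &demo_vbds[i];
		}
	}
	return NULL;
}

static struct demo_vbd *demo_get_vbd(int uuid)
{
	int i;

	for (i = 0; i < DEMO_MAX_VBDS; i++)
		if (demo_vbds[i].in_use && demo_vbds[i].uuid == uuid)
			return &demo_vbds[i];
	return NULL;
}

/* The callback uses the uuid it is given rather than a constant 0. */
static int demo_remus_open(struct demo_driver *drv, const char *name, int uuid)
{
	(void)name;
	drv->vbd = demo_get_vbd(uuid);
	return drv->vbd ? 0 : -1;
}

/* The generic layer knows the vbd, so it can forward the uuid. */
static int demo_td_open(struct demo_driver *drv, const struct demo_vbd *vbd)
{
	assert(drv->open != NULL);
	return drv->open(drv, vbd->name, vbd->uuid);
}
```

With a vbd registered under uuid 3, a lookup of uuid 0 finds nothing,
while the lookup through the forwarded uuid succeeds.
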
 tools/blktap2/drivers/block-aio.c         | 3 ++-
 tools/blktap2/drivers/block-cache.c       | 3 ++-
 tools/blktap2/drivers/block-log.c         | 3 ++-
 tools/blktap2/drivers/block-qcow.c        | 3 ++-
 tools/blktap2/drivers/block-ram.c         | 3 ++-
 tools/blktap2/drivers/block-remus.c       | 8 ++------
 tools/blktap2/drivers/block-vhd.c         | 3 ++-
 tools/blktap2/drivers/tapdisk-interface.c | 4 +++-
 tools/blktap2/drivers/tapdisk.h           | 2 +-
 9 files changed, 18 insertions(+), 14 deletions(-)

diff --git a/tools/blktap2/drivers/block-aio.c b/tools/blktap2/drivers/block-aio.c
index 10ab20b..1b560e5 100644
--- a/tools/blktap2/drivers/block-aio.c
+++ b/tools/blktap2/drivers/block-aio.c
@@ -111,7 +111,8 @@ static int tdaio_get_image_info(int fd, td_disk_info_t *info)
 }
 
 /* Open the disk file and initialize aio state. */
-int tdaio_open(td_driver_t *driver, const char *name, td_flag_t flags)
+int tdaio_open(td_driver_t *driver, const char *name, td_flag_t flags,
+	       td_uuid_t uuid)
 {
 	int i, fd, ret, o_flags;
 	struct tdaio_state *prv;
diff --git a/tools/blktap2/drivers/block-cache.c b/tools/blktap2/drivers/block-cache.c
index 1d2f4eb..cd6ea6a 100644
--- a/tools/blktap2/drivers/block-cache.c
+++ b/tools/blktap2/drivers/block-cache.c
@@ -517,7 +517,8 @@ block_cache_put_request(block_cache_t *cache, block_cache_request_t *breq)
 }
 
 static int
-block_cache_open(td_driver_t *driver, const char *name, td_flag_t flags)
+block_cache_open(td_driver_t *driver, const char *name, td_flag_t flags,
+		 td_uuid_t uuid)
 {
 	int i, err;
 	radix_tree_t *tree;
diff --git a/tools/blktap2/drivers/block-log.c b/tools/blktap2/drivers/block-log.c
index 5330cdc..7b33b63 100644
--- a/tools/blktap2/drivers/block-log.c
+++ b/tools/blktap2/drivers/block-log.c
@@ -585,7 +585,8 @@ static void ctl_request(event_id_t id, char mode, void *private)
 
 static int tdlog_close(td_driver_t*);
 
-static int tdlog_open(td_driver_t* driver, const char* name, td_flag_t flags)
+static int tdlog_open(td_driver_t* driver, const char* name, td_flag_t flags,
+		      td_uuid_t uuid)
 {
   struct tdlog_state* s = (struct tdlog_state*)driver->data;
   int rc;
diff --git a/tools/blktap2/drivers/block-qcow.c b/tools/blktap2/drivers/block-qcow.c
index b45bcaa..64dfafc 100644
--- a/tools/blktap2/drivers/block-qcow.c
+++ b/tools/blktap2/drivers/block-qcow.c
@@ -865,7 +865,8 @@ out:
 }
 
 /* Open the disk file and initialize qcow state. */
-int tdqcow_open (td_driver_t *driver, const char *name, td_flag_t flags)
+int tdqcow_open (td_driver_t *driver, const char *name, td_flag_t flags,
+		 td_uuid_t uuid)
 {
 	int fd, len, i, ret, size, o_flags;
 	td_disk_info_t *bs = &(driver->info);
diff --git a/tools/blktap2/drivers/block-ram.c b/tools/blktap2/drivers/block-ram.c
index a859481..b64a194 100644
--- a/tools/blktap2/drivers/block-ram.c
+++ b/tools/blktap2/drivers/block-ram.c
@@ -108,7 +108,8 @@ static int get_image_info(int fd, td_disk_info_t *info)
 }
 
 /* Open the disk file and initialize ram state. */
-int tdram_open (td_driver_t *driver, const char *name, td_flag_t flags)
+int tdram_open (td_driver_t *driver, const char *name, td_flag_t flags,
+		td_uuid_t uuid)
 {
 	char *p;
 	uint64_t size;
diff --git a/tools/blktap2/drivers/block-remus.c b/tools/blktap2/drivers/block-remus.c
index 079588d..eb8c0ed 100644
--- a/tools/blktap2/drivers/block-remus.c
+++ b/tools/blktap2/drivers/block-remus.c
@@ -1633,18 +1633,14 @@ static int ctl_register(struct tdremus_state *s)
 /* interface */
 
 static int tdremus_open(td_driver_t *driver, const char *name,
-			td_flag_t flags)
+			td_flag_t flags, td_uuid_t uuid)
 {
 	struct tdremus_state *s = (struct tdremus_state *)driver->data;
 	int rc;
 
 	RPRINTF("opening %s\n", name);
 
-	/* first we need to get the underlying vbd for this driver stack. To do so we
-	 * need to know the vbd's id. Fortunately, for tapdisk2 this is hard-coded as
-	 * 0 (see tapdisk2.c)
-	 */
-	device_vbd = tapdisk_server_get_vbd(0);
+	device_vbd = tapdisk_server_get_vbd(uuid);
 
 	memset(s, 0, sizeof(*s));
 	s->server_fd.fd = -1;
diff --git a/tools/blktap2/drivers/block-vhd.c b/tools/blktap2/drivers/block-vhd.c
index 76ea5bd..06e9c89 100644
--- a/tools/blktap2/drivers/block-vhd.c
+++ b/tools/blktap2/drivers/block-vhd.c
@@ -675,7 +675,8 @@ __vhd_open(td_driver_t *driver, const char *name, vhd_flag_t flags)
 }
 
 static int
-_vhd_open(td_driver_t *driver, const char *name, td_flag_t flags)
+_vhd_open(td_driver_t *driver, const char *name, td_flag_t flags,
+	  td_uuid_t uuid)
 {
 	vhd_flag_t vhd_flags = 0;
 
diff --git a/tools/blktap2/drivers/tapdisk-interface.c b/tools/blktap2/drivers/tapdisk-interface.c
index 2e51883..36b5393 100644
--- a/tools/blktap2/drivers/tapdisk-interface.c
+++ b/tools/blktap2/drivers/tapdisk-interface.c
@@ -63,6 +63,7 @@ __td_open(td_image_t *image, td_disk_info_t *info)
 {
 	int err;
 	td_driver_t *driver;
+	td_vbd_t *vbd = image->private;
 
 	driver = image->driver;
 	if (!driver) {
@@ -78,7 +79,8 @@ __td_open(td_image_t *image, td_disk_info_t *info)
 	}
 
 	if (!td_flag_test(driver->state, TD_DRIVER_OPEN)) {
-		err = driver->ops->td_open(driver, image->name, image->flags);
+		err = driver->ops->td_open(driver, image->name, image->flags,
+					   vbd->uuid);
 		if (err) {
 			if (!image->driver)
 				tapdisk_driver_free(driver);
diff --git a/tools/blktap2/drivers/tapdisk.h b/tools/blktap2/drivers/tapdisk.h
index 66d508e..459eaec 100644
--- a/tools/blktap2/drivers/tapdisk.h
+++ b/tools/blktap2/drivers/tapdisk.h
@@ -157,7 +157,7 @@ struct tap_disk {
 	const char                  *disk_type;
 	td_flag_t                    flags;
 	int                          private_data_size;
-	int (*td_open)               (td_driver_t *, const char *, td_flag_t);
+	int (*td_open)               (td_driver_t *, const char *, td_flag_t, td_uuid_t);
 	int (*td_close)              (td_driver_t *);
 	int (*td_get_parent_id)      (td_driver_t *, td_disk_id_t *);
 	int (*td_validate_parent)    (td_driver_t *, td_driver_t *, td_flag_t);
-- 
1.9.3


* [PATCH 03/17] tools: block-remus: use correct way to get remus_image
  2014-10-14  2:13 [PATCH 00/17] blktap2 related bugfix patches Wen Congyang
  2014-10-14  2:13 ` [PATCH 01/17] tools: blktap2: dynamic allocate aio_requests to avoid -EBUSY error Wen Congyang
  2014-10-14  2:13 ` [PATCH 02/17] tools: block-remus: pass uuid to the callback td_open Wen Congyang
@ 2014-10-14  2:13 ` Wen Congyang
  2014-10-20  3:02   ` Shriram Rajagopalan
  2014-10-14  2:13 ` [PATCH 04/17] tools: block-remus: fix bug in tdremus_close() Wen Congyang
                   ` (15 subsequent siblings)
  18 siblings, 1 reply; 50+ messages in thread
From: Wen Congyang @ 2014-10-14  2:13 UTC (permalink / raw)
  To: xen devel
  Cc: Ian Campbell, Wen Congyang, Ian Jackson, Jiang Yunhong,
	Dong Eddie, Yang Hongyang, Lai Jiangshan

remus_image is only set in backup_queue_read(). If a flush happens
before the first read operation, remus_image is still NULL. Pass
the image to remus via the td_open() callback instead.

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Acked-by: Shriram Rajagopalan <rshriram@cs.ubc.ca>
---
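Note (illustration only, not part of the patch): the ordering problem
in the commit message above, reduced to a hypothetical two-function
driver. demo_state, demo_open and demo_flush are made-up names. The
point is only that state captured at open time is available to
whichever operation runs first.

```c
#include <assert.h>
#include <stddef.h>

struct demo_image {
	const char *name;
};

struct demo_state {
	struct demo_image *image;	/* NULL until the driver is opened */
	int                flushes;
};

/* Capture the image when the driver is opened, not on the first read. */
static int demo_open(struct demo_state *s, struct demo_image *image)
{
	assert(image != NULL);
	s->image = image;
	s->flushes = 0;
	return 0;
}

/* A flush needs the image. It fails if nothing has set it yet. */
static int demo_flush(struct demo_state *s)
{
	if (!s->image)
		return -1;
	s->flushes++;
	return 0;
}
```

If the image were only recorded by a read handler, a flush issued
before any read would hit the NULL case. Recording it in demo_open()
removes that dependency on request order.
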
 tools/blktap2/drivers/block-aio.c         | 6 ++++--
 tools/blktap2/drivers/block-cache.c       | 5 +++--
 tools/blktap2/drivers/block-log.c         | 5 +++--
 tools/blktap2/drivers/block-qcow.c        | 6 ++++--
 tools/blktap2/drivers/block-ram.c         | 6 ++++--
 tools/blktap2/drivers/block-remus.c       | 8 ++++----
 tools/blktap2/drivers/block-vhd.c         | 6 ++++--
 tools/blktap2/drivers/tapdisk-interface.c | 3 +--
 tools/blktap2/drivers/tapdisk.h           | 2 +-
 9 files changed, 28 insertions(+), 19 deletions(-)

diff --git a/tools/blktap2/drivers/block-aio.c b/tools/blktap2/drivers/block-aio.c
index 1b560e5..27ba07d 100644
--- a/tools/blktap2/drivers/block-aio.c
+++ b/tools/blktap2/drivers/block-aio.c
@@ -40,6 +40,7 @@
 #include "tapdisk.h"
 #include "tapdisk-driver.h"
 #include "tapdisk-interface.h"
+#include "tapdisk-image.h"
 
 #define MAX_AIO_REQS         TAPDISK_DATA_REQUESTS
 
@@ -111,11 +112,12 @@ static int tdaio_get_image_info(int fd, td_disk_info_t *info)
 }
 
 /* Open the disk file and initialize aio state. */
-int tdaio_open(td_driver_t *driver, const char *name, td_flag_t flags,
-	       td_uuid_t uuid)
+int tdaio_open(td_driver_t *driver, td_image_t *image, td_uuid_t uuid)
 {
 	int i, fd, ret, o_flags;
 	struct tdaio_state *prv;
+	const char *name = image->name;
+	td_flag_t flags = image->flags;
 
 	ret = 0;
 	prv = (struct tdaio_state *)driver->data;
diff --git a/tools/blktap2/drivers/block-cache.c b/tools/blktap2/drivers/block-cache.c
index cd6ea6a..ff2c773 100644
--- a/tools/blktap2/drivers/block-cache.c
+++ b/tools/blktap2/drivers/block-cache.c
@@ -517,12 +517,13 @@ block_cache_put_request(block_cache_t *cache, block_cache_request_t *breq)
 }
 
 static int
-block_cache_open(td_driver_t *driver, const char *name, td_flag_t flags,
-		 td_uuid_t uuid)
+block_cache_open(td_driver_t *driver, td_image_t *image, td_uuid_t uuid)
 {
 	int i, err;
 	radix_tree_t *tree;
 	block_cache_t *cache;
+	const char *name = image->name;
+	td_flag_t flags = image->flags;
 
 	if (!td_flag_test(flags, TD_OPEN_RDONLY))
 		return -EINVAL;
diff --git a/tools/blktap2/drivers/block-log.c b/tools/blktap2/drivers/block-log.c
index 7b33b63..80351d3 100644
--- a/tools/blktap2/drivers/block-log.c
+++ b/tools/blktap2/drivers/block-log.c
@@ -585,11 +585,12 @@ static void ctl_request(event_id_t id, char mode, void *private)
 
 static int tdlog_close(td_driver_t*);
 
-static int tdlog_open(td_driver_t* driver, const char* name, td_flag_t flags,
-		      td_uuid_t uuid)
+static int tdlog_open(td_driver_t* driver, td_image_t *image, td_uuid_t uuid)
 {
   struct tdlog_state* s = (struct tdlog_state*)driver->data;
   int rc;
+  const char *name = image->name;
+  td_flag_t flags = image->flags;
 
   memset(s, 0, sizeof(*s));
 
diff --git a/tools/blktap2/drivers/block-qcow.c b/tools/blktap2/drivers/block-qcow.c
index 64dfafc..c63bd9d 100644
--- a/tools/blktap2/drivers/block-qcow.c
+++ b/tools/blktap2/drivers/block-qcow.c
@@ -45,6 +45,7 @@
 #include "qcow.h"
 #include "blk.h"
 #include "atomicio.h"
+#include "tapdisk-image.h"
 
 /* *BSD has no O_LARGEFILE */
 #ifndef O_LARGEFILE
@@ -865,14 +866,15 @@ out:
 }
 
 /* Open the disk file and initialize qcow state. */
-int tdqcow_open (td_driver_t *driver, const char *name, td_flag_t flags,
-		 td_uuid_t uuid)
+int tdqcow_open (td_driver_t *driver, td_image_t *image, td_uuid_t uuid)
 {
 	int fd, len, i, ret, size, o_flags;
 	td_disk_info_t *bs = &(driver->info);
 	struct tdqcow_state   *s  = (struct tdqcow_state *)driver->data;
 	QCowHeader header;
 	uint64_t final_cluster = 0;
+	const char *name = image->name;
+	td_flag_t flags = image->flags;
 
  	DPRINTF("QCOW: Opening %s\n", name);
 
diff --git a/tools/blktap2/drivers/block-ram.c b/tools/blktap2/drivers/block-ram.c
index b64a194..3e148ab 100644
--- a/tools/blktap2/drivers/block-ram.c
+++ b/tools/blktap2/drivers/block-ram.c
@@ -40,6 +40,7 @@
 #include "tapdisk.h"
 #include "tapdisk-driver.h"
 #include "tapdisk-interface.h"
+#include "tapdisk-image.h"
 
 char *img;
 long int   disksector_size;
@@ -108,13 +109,14 @@ static int get_image_info(int fd, td_disk_info_t *info)
 }
 
 /* Open the disk file and initialize ram state. */
-int tdram_open (td_driver_t *driver, const char *name, td_flag_t flags,
-		td_uuid_t uuid)
+int tdram_open (td_driver_t *driver, td_image_t *image, td_uuid_t uuid)
 {
 	char *p;
 	uint64_t size;
 	int i, fd, ret = 0, count = 0, o_flags;
 	struct tdram_state *prv = (struct tdram_state *)driver->data;
+	const char *name = image->name;
+	td_flag_t flags = image->flags;
 
 	connections++;
 
diff --git a/tools/blktap2/drivers/block-remus.c b/tools/blktap2/drivers/block-remus.c
index eb8c0ed..a2c08d8 100644
--- a/tools/blktap2/drivers/block-remus.c
+++ b/tools/blktap2/drivers/block-remus.c
@@ -1152,8 +1152,6 @@ void backup_queue_read(td_driver_t *driver, td_request_t treq)
 {
 	struct tdremus_state *s = (struct tdremus_state *)driver->data;
 	int i;
-	if(!remus_image)
-		remus_image = treq.image;
 	
 	/* check if this read is queued in any currently ongoing flush */
 	if (ramdisk_read(&s->ramdisk, treq.sec, treq.secs, treq.buf)) {
@@ -1632,15 +1630,17 @@ static int ctl_register(struct tdremus_state *s)
 
 /* interface */
 
-static int tdremus_open(td_driver_t *driver, const char *name,
-			td_flag_t flags, td_uuid_t uuid)
+static int tdremus_open(td_driver_t *driver, td_image_t *image, td_uuid_t uuid)
 {
 	struct tdremus_state *s = (struct tdremus_state *)driver->data;
 	int rc;
+	const char *name = image->name;
+	td_flag_t flags = image->flags;
 
 	RPRINTF("opening %s\n", name);
 
 	device_vbd = tapdisk_server_get_vbd(uuid);
+	remus_image = image;
 
 	memset(s, 0, sizeof(*s));
 	s->server_fd.fd = -1;
diff --git a/tools/blktap2/drivers/block-vhd.c b/tools/blktap2/drivers/block-vhd.c
index 06e9c89..b20f724 100644
--- a/tools/blktap2/drivers/block-vhd.c
+++ b/tools/blktap2/drivers/block-vhd.c
@@ -59,6 +59,7 @@
 #include "tapdisk-driver.h"
 #include "tapdisk-interface.h"
 #include "tapdisk-disktype.h"
+#include "tapdisk-image.h"
 
 unsigned int SPB;
 
@@ -675,10 +676,11 @@ __vhd_open(td_driver_t *driver, const char *name, vhd_flag_t flags)
 }
 
 static int
-_vhd_open(td_driver_t *driver, const char *name, td_flag_t flags,
-	  td_uuid_t uuid)
+_vhd_open(td_driver_t *driver, td_image_t *image, td_uuid_t uuid)
 {
 	vhd_flag_t vhd_flags = 0;
+	const char *name = image->name;
+	td_flag_t flags = image->flags;
 
 	if (flags & TD_OPEN_RDONLY)
 		vhd_flags |= VHD_FLAG_OPEN_RDONLY;
diff --git a/tools/blktap2/drivers/tapdisk-interface.c b/tools/blktap2/drivers/tapdisk-interface.c
index 36b5393..a29de64 100644
--- a/tools/blktap2/drivers/tapdisk-interface.c
+++ b/tools/blktap2/drivers/tapdisk-interface.c
@@ -79,8 +79,7 @@ __td_open(td_image_t *image, td_disk_info_t *info)
 	}
 
 	if (!td_flag_test(driver->state, TD_DRIVER_OPEN)) {
-		err = driver->ops->td_open(driver, image->name, image->flags,
-					   vbd->uuid);
+		err = driver->ops->td_open(driver, image, vbd->uuid);
 		if (err) {
 			if (!image->driver)
 				tapdisk_driver_free(driver);
diff --git a/tools/blktap2/drivers/tapdisk.h b/tools/blktap2/drivers/tapdisk.h
index 459eaec..3c3b51d 100644
--- a/tools/blktap2/drivers/tapdisk.h
+++ b/tools/blktap2/drivers/tapdisk.h
@@ -157,7 +157,7 @@ struct tap_disk {
 	const char                  *disk_type;
 	td_flag_t                    flags;
 	int                          private_data_size;
-	int (*td_open)               (td_driver_t *, const char *, td_flag_t, td_uuid_t);
+	int (*td_open)               (td_driver_t *, td_image_t *, td_uuid_t);
 	int (*td_close)              (td_driver_t *);
 	int (*td_get_parent_id)      (td_driver_t *, td_disk_id_t *);
 	int (*td_validate_parent)    (td_driver_t *, td_driver_t *, td_flag_t);
-- 
1.9.3


* [PATCH 04/17] tools: block-remus: fix bug in tdremus_close()
  2014-10-14  2:13 [PATCH 00/17] blktap2 related bugfix patches Wen Congyang
                   ` (2 preceding siblings ...)
  2014-10-14  2:13 ` [PATCH 03/17] tools: block-remus: use correct way to get remus_image Wen Congyang
@ 2014-10-14  2:13 ` Wen Congyang
  2014-10-20  3:01   ` Shriram Rajagopalan
  2014-10-14  2:13 ` [PATCH 05/17] tools: block-remus: fix memory leak Wen Congyang
                   ` (14 subsequent siblings)
  18 siblings, 1 reply; 50+ messages in thread
From: Wen Congyang @ 2014-10-14  2:13 UTC (permalink / raw)
  To: xen devel
  Cc: Ian Campbell, Wen Congyang, Ian Jackson, Jiang Yunhong,
	Dong Eddie, Yang Hongyang, Lai Jiangshan

We close ctl_fd.fd but never unregister ctl_fd.id. As a result,
select() fails and the user can no longer talk to tapdisk2.

This patch also does some cleanup.

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Acked-by: Shriram Rajagopalan <rshriram@cs.ubc.ca>
---
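Note (illustration only, not part of the patch): a sketch of the
"unregister the event, then close the descriptor" pattern that the
commit message above calls for. The event table and all demo_* names
are hypothetical stand-ins for the tapdisk server's event
registration. Only close() is a real call.

```c
#include <assert.h>
#include <unistd.h>

#define DEMO_MAX_EVENTS 8

struct demo_poll_fd {
	int fd;		/* file descriptor, or -1 */
	int id;		/* event id, or -1 */
};

static int demo_event_used[DEMO_MAX_EVENTS];

static int demo_register_event(int fd)
{
	int i;

	(void)fd;	/* a real event loop would remember the descriptor */
	for (i = 0; i < DEMO_MAX_EVENTS; i++) {
		if (!demo_event_used[i]) {
			demo_event_used[i] = 1;
			return i;
		}
	}
	return -1;
}

static void demo_unregister_event(int id)
{
	assert(id >= 0 && id < DEMO_MAX_EVENTS);
	demo_event_used[id] = 0;
}

/*
 * Drop the event before closing the descriptor, so the event loop
 * never polls a closed fd. Resetting both fields makes a second call
 * a harmless no-op.
 */
static void demo_close_poll_fd(struct demo_poll_fd *p)
{
	if (p->id >= 0) {
		demo_unregister_event(p->id);
		p->id = -1;
	}
	if (p->fd >= 0) {
		close(p->fd);
		p->fd = -1;
	}
}
```

Because the helper is idempotent, it can be called from both the
error path of an open routine and the regular close routine.
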
 tools/blktap2/drivers/block-remus.c | 90 ++++++++++++++++++++++---------------
 1 file changed, 53 insertions(+), 37 deletions(-)

diff --git a/tools/blktap2/drivers/block-remus.c b/tools/blktap2/drivers/block-remus.c
index a2c08d8..fd5f209 100644
--- a/tools/blktap2/drivers/block-remus.c
+++ b/tools/blktap2/drivers/block-remus.c
@@ -151,9 +151,6 @@ typedef struct poll_fd {
 } poll_fd_t;
 
 struct tdremus_state {
-//  struct tap_disk* driver;
-	void* driver_data;
-
   /* XXX: this is needed so that the server can perform operations on
    * the driver from the stream_fd event handler. fix this. */
 	td_driver_t *tdremus_driver;
@@ -731,12 +728,26 @@ static int mwrite(int fd, void* buf, size_t len)
 
 static void inline close_stream_fd(struct tdremus_state *s)
 {
+	if (s->stream_fd.fd < 0)
+		return;
+
 	/* XXX: -2 is magic. replace with macro perhaps? */
 	tapdisk_server_unregister_event(s->stream_fd.id);
 	close(s->stream_fd.fd);
 	s->stream_fd.fd = -2;
 }
 
+static void close_server_fd(struct tdremus_state *s)
+{
+	if (s->server_fd.fd < 0)
+		return;
+
+	tapdisk_server_unregister_event(s->server_fd.id);
+	s->server_fd.id = -1;
+	close(s->server_fd.fd);
+	s->server_fd.fd = -1;
+}
+
 /* primary functions */
 static void remus_client_event(event_id_t, char mode, void *private);
 static void remus_connect_event(event_id_t id, char mode, void *private);
@@ -1347,12 +1358,7 @@ static int unprotected_start(td_driver_t *driver)
 	/* close the server socket */
 	close_stream_fd(s);
 
-	/* unregister the replication stream */
-	tapdisk_server_unregister_event(s->server_fd.id);
-
-	/* close the replication stream */
-	close(s->server_fd.fd);
-	s->server_fd.fd = -1;
+	close_server_fd(s);
 
 	/* install the unprotected read/write handlers */
 	tapdisk_remus.td_queue_read = unprotected_queue_read;
@@ -1553,27 +1559,27 @@ static int ctl_open(td_driver_t *driver, const char* name)
 			s->ctl_path[i] = '_';
 	}
 	if (asprintf(&s->msg_path, "%s.msg", s->ctl_path) < 0)
-		goto err_ctlfifo;
+		goto err_setmsgfifo;
 
 	if (mkfifo(s->ctl_path, S_IRWXU|S_IRWXG|S_IRWXO) && errno != EEXIST) {
 		RPRINTF("error creating control FIFO %s: %d\n", s->ctl_path, errno);
-		goto err_msgfifo;
+		goto err_mkctlfifo;
 	}
 
 	if (mkfifo(s->msg_path, S_IRWXU|S_IRWXG|S_IRWXO) && errno != EEXIST) {
 		RPRINTF("error creating message FIFO %s: %d\n", s->msg_path, errno);
-		goto err_msgfifo;
+		goto err_mkmsgfifo;
 	}
 
 	/* RDWR so that fd doesn't block select when no writer is present */
 	if ((s->ctl_fd.fd = open(s->ctl_path, O_RDWR)) < 0) {
 		RPRINTF("error opening control FIFO %s: %d\n", s->ctl_path, errno);
-		goto err_msgfifo;
+		goto err_openctlfifo;
 	}
 
 	if ((s->msg_fd.fd = open(s->msg_path, O_RDWR)) < 0) {
 		RPRINTF("error opening message FIFO %s: %d\n", s->msg_path, errno);
-		goto err_openctlfifo;
+		goto err_openmsgfifo;
 	}
 
 	RPRINTF("control FIFO %s\n", s->ctl_path);
@@ -1581,36 +1587,45 @@ static int ctl_open(td_driver_t *driver, const char* name)
 
 	return 0;
 
- err_openctlfifo:
+err_openmsgfifo:
 	close(s->ctl_fd.fd);
- err_msgfifo:
+	s->ctl_fd.fd = -1;
+err_openctlfifo:
+	unlink(s->ctl_path);
+err_mkmsgfifo:
+	unlink(s->msg_path);
+err_mkctlfifo:
 	free(s->msg_path);
 	s->msg_path = NULL;
- err_ctlfifo:
+err_setmsgfifo:
 	free(s->ctl_path);
 	s->ctl_path = NULL;
 	return -1;
 }
 
-static void ctl_close(td_driver_t *driver)
+static void ctl_close(struct tdremus_state *s)
 {
-	struct tdremus_state *s = (struct tdremus_state *)driver->data;
-
-	/* TODO: close *all* connections */
-
-	if(s->ctl_fd.fd)
+	if(s->ctl_fd.fd) {
 		close(s->ctl_fd.fd);
+		s->ctl_fd.fd = -1;
+	}
 
 	if (s->ctl_path) {
 		unlink(s->ctl_path);
 		free(s->ctl_path);
 		s->ctl_path = NULL;
 	}
+
 	if (s->msg_path) {
 		unlink(s->msg_path);
 		free(s->msg_path);
 		s->msg_path = NULL;
 	}
+
+	if (s->msg_fd.fd) {
+		close(s->msg_fd.fd);
+		s->msg_fd.fd = -1;
+	}
 }
 
 static int ctl_register(struct tdremus_state *s)
@@ -1628,6 +1643,16 @@ static int ctl_register(struct tdremus_state *s)
 	return 0;
 }
 
+static void ctl_unregister(struct tdremus_state *s)
+{
+	RPRINTF("unregistering ctl fifo\n");
+
+	if (s->ctl_fd.id >= 0) {
+		tapdisk_server_unregister_event(s->ctl_fd.id);
+		s->ctl_fd.id = -1;
+	}
+}
+
 /* interface */
 
 static int tdremus_open(td_driver_t *driver, td_image_t *image, td_uuid_t uuid)
@@ -1658,13 +1683,12 @@ static int tdremus_open(td_driver_t *driver, td_image_t *image, td_uuid_t uuid)
 
 	if ((rc = ctl_open(driver, name))) {
 		RPRINTF("error setting up control channel\n");
-		free(s->driver_data);
 		return rc;
 	}
 
 	if ((rc = ctl_register(s))) {
 		RPRINTF("error registering control channel\n");
-		free(s->driver_data);
+		ctl_close(s);
 		return rc;
 	}
 
@@ -1687,19 +1711,11 @@ static int tdremus_close(td_driver_t *driver)
 	RPRINTF("closing\n");
 	if (s->ramdisk.inprogress)
 		hashtable_destroy(s->ramdisk.inprogress, 0);
-	
-	if (s->driver_data) {
-		free(s->driver_data);
-		s->driver_data = NULL;
-	}
-	if (s->server_fd.fd >= 0) {
-		close(s->server_fd.fd);
-		s->server_fd.fd = -1;
-	}
-	if (s->stream_fd.fd >= 0)
-		close_stream_fd(s);
 
-	ctl_close(driver);
+	close_server_fd(s);
+	close_stream_fd(s);
+	ctl_unregister(s);
+	ctl_close(s);
 
 	return 0;
 }
-- 
1.9.3


* [PATCH 05/17] tools: block-remus: fix memory leak
  2014-10-14  2:13 [PATCH 00/17] blktap2 related bugfix patches Wen Congyang
                   ` (3 preceding siblings ...)
  2014-10-14  2:13 ` [PATCH 04/17] tools: block-remus: fix bug in tdremus_close() Wen Congyang
@ 2014-10-14  2:13 ` Wen Congyang
  2014-10-20  2:33   ` Shriram Rajagopalan
  2014-10-14  2:13 ` [PATCH 06/17] tools: blktap2: return the correct dev path Wen Congyang
                   ` (13 subsequent siblings)
  18 siblings, 1 reply; 50+ messages in thread
From: Wen Congyang @ 2014-10-14  2:13 UTC (permalink / raw)
  To: xen devel
  Cc: Ian Campbell, Wen Congyang, Ian Jackson, Jiang Yunhong,
	Dong Eddie, Shriram Rajagopalan, Yang Hongyang, Lai Jiangshan

Fix the following memory leak:
If s->ramdisk.prev is not NULL, we merge the write requests in
s->ramdisk.h into s->ramdisk.prev and then destroy s->ramdisk.h,
but we forget to free the hash values when destroying s->ramdisk.h.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Jiang Yunhong <yunhong.jiang@intel.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Shriram Rajagopalan <rshriram@cs.ubc.ca>
---
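Note (illustration only, not part of the patch): a toy table showing
what a "free the values too" flag on a destroy function means. It
assumes, as the one-line diff suggests, that the second argument of
hashtable_destroy() selects whether stored values are freed. The
demo_* names and the live-value counter are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

struct demo_entry {
	struct demo_entry *next;
	unsigned long      key;
	void              *value;	/* heap buffer owned by the table */
};

struct demo_table {
	struct demo_entry *head;
};

static int demo_live_values;	/* values allocated and not yet freed */

static void *demo_value_new(size_t len)
{
	void *v = calloc(1, len);

	if (v)
		demo_live_values++;
	return v;
}

static int demo_table_insert(struct demo_table *t, unsigned long key,
			     void *value)
{
	struct demo_entry *e = malloc(sizeof(*e));

	if (!e)
		return -1;
	e->key = key;
	e->value = value;
	e->next = t->head;
	t->head = e;
	return 0;
}

/* With free_values == 0 the entries go away but the values leak. */
static void demo_table_destroy(struct demo_table *t, int free_values)
{
	while (t->head) {
		struct demo_entry *e = t->head;

		t->head = e->next;
		if (free_values && e->value) {
			assert(demo_live_values > 0);
			free(e->value);
			demo_live_values--;
		}
		free(e);
	}
}
```

Destroying the table with free_values set to 1 brings the live-value
counter back to zero. With 0, the counter stays where it was, which
is the leak.
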
 tools/blktap2/drivers/block-remus.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/blktap2/drivers/block-remus.c b/tools/blktap2/drivers/block-remus.c
index fd5f209..55363a3 100644
--- a/tools/blktap2/drivers/block-remus.c
+++ b/tools/blktap2/drivers/block-remus.c
@@ -599,7 +599,7 @@ static int ramdisk_start_flush(td_driver_t *driver)
 		}
 		free(sectors);
 
-		hashtable_destroy (s->ramdisk.h, 0);
+		hashtable_destroy (s->ramdisk.h, 1);
 	} else
 		s->ramdisk.prev = s->ramdisk.h;
 
-- 
1.9.3


* [PATCH 06/17] tools: blktap2: return the correct dev path
  2014-10-14  2:13 [PATCH 00/17] blktap2 related bugfix patches Wen Congyang
                   ` (4 preceding siblings ...)
  2014-10-14  2:13 ` [PATCH 05/17] tools: block-remus: fix memory leak Wen Congyang
@ 2014-10-14  2:13 ` Wen Congyang
  2014-10-14  2:13 ` [PATCH 07/17] tools: blktap2: use correct way to get free event id Wen Congyang
                   ` (12 subsequent siblings)
  18 siblings, 0 replies; 50+ messages in thread
From: Wen Congyang @ 2014-10-14  2:13 UTC (permalink / raw)
  To: xen devel
  Cc: Ian Campbell, Wen Congyang, Ian Jackson, Jiang Yunhong,
	Dong Eddie, Shriram Rajagopalan, Yang Hongyang, Lai Jiangshan

The user passes the devpath to tapdisk2 with TAPDISK_MESSAGE_OPEN,
and later uses TAPDISK_MESSAGE_LIST to query and get the pid of that
tapdisk2.

The devpath's format is: driver:params[|driver:params[...]].
The first vbd image only holds the first params, so the list response
currently contains only the first driver:params, not the whole devpath.
The devpath is stored in vbd->name, so return vbd->name instead of
image->name.
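
To illustrate the mismatch, here is a small sketch that extracts the
first driver:params element from a chained devpath. The helper name
first_element() and the sample devpath in the usage note are made up for
this example.

```c
#include <string.h>

/*
 * Copy the first "driver:params" element of a devpath chain into out.
 * Returns the element length, or -1 if out is too small.
 */
static int first_element(const char *devpath, char *out, size_t outlen)
{
	size_t len = strcspn(devpath, "|");	/* stop at the first '|' */

	if (len >= outlen)
		return -1;
	memcpy(out, devpath, len);
	out[len] = '\0';
	return (int)len;
}
```

For a devpath such as "remus:host:9000|aio:/img/disk.raw" this yields
"remus:host:9000", which no longer compares equal to the string the user
originally passed in.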

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Shriram Rajagopalan <rshriram@cs.ubc.ca>
---
 tools/blktap2/drivers/tapdisk-control.c | 9 ++-------
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/tools/blktap2/drivers/tapdisk-control.c b/tools/blktap2/drivers/tapdisk-control.c
index 0b5cf3c..3a4ec8e 100644
--- a/tools/blktap2/drivers/tapdisk-control.c
+++ b/tools/blktap2/drivers/tapdisk-control.c
@@ -270,15 +270,10 @@ tapdisk_control_list(struct tapdisk_control_connection *connection,
 		response.u.list.state   = vbd->state;
 		response.u.list.path[0] = 0;
 
-		if (!list_empty(&vbd->images)) {
-			td_image_t *image = list_entry(vbd->images.next,
-						       td_image_t, next);
+		if (vbd->name)
 			snprintf(response.u.list.path,
 				 sizeof(response.u.list.path),
-				 "%s:%s",
-				 tapdisk_disk_types[image->type]->name,
-				 image->name);
-		}
+				 "%s", vbd->name);
 
 		tapdisk_control_write_message(connection->socket, &response, 2);
 	}
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 07/17] tools: blktap2: use correct way to get free event id
  2014-10-14  2:13 [PATCH 00/17] blktap2 related bugfix patches Wen Congyang
                   ` (5 preceding siblings ...)
  2014-10-14  2:13 ` [PATCH 06/17] tools: blktap2: return the correct dev path Wen Congyang
@ 2014-10-14  2:13 ` Wen Congyang
  2014-10-14  2:13 ` [PATCH 08/17] tools: blktap2: don't return negative " Wen Congyang
                   ` (11 subsequent siblings)
  18 siblings, 0 replies; 50+ messages in thread
From: Wen Congyang @ 2014-10-14  2:13 UTC (permalink / raw)
  To: xen devel
  Cc: Ian Campbell, Wen Congyang, Ian Jackson, Jiang Yunhong,
	Dong Eddie, Shriram Rajagopalan, Yang Hongyang, Lai Jiangshan

If we register/unregister events too many times, s->uuid wraps around
and event ids are handed out from 1 again. But we don't check whether
such an id is still in use.
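
The idea can be sketched independently of the scheduler: walk the id
space from the current position, wrap around, and give up after one full
lap. All toy_* names and the tiny id range are assumptions made for this
example; the patch itself scans the scheduler's event list instead of an
array.

```c
#include <stddef.h>

#define TOY_ID_MAX 7	/* tiny range so wraparound is easy to exercise */

/* Returns 1 if id is present in used[0..n-1]. */
static int toy_id_in_use(const int *used, size_t n, int id)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (used[i] == id)
			return 1;
	return 0;
}

/*
 * Pick the next free id in [1, TOY_ID_MAX], starting at *next and
 * wrapping around. Returns 0 when every id is taken.
 */
static int toy_get_free_id(int *next, const int *used, size_t n)
{
	int tries;

	for (tries = 0; tries < TOY_ID_MAX; tries++) {
		int id = *next;

		*next = (*next == TOY_ID_MAX) ? 1 : *next + 1;
		if (!toy_id_in_use(used, n, id))
			return id;
	}
	return 0;
}
```

Returning 0 for "no free id" mirrors the patch, where 0 is never a valid
event id.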

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Shriram Rajagopalan <rshriram@cs.ubc.ca>
---
 tools/blktap2/drivers/scheduler.c | 33 ++++++++++++++++++++++++++++++---
 1 file changed, 30 insertions(+), 3 deletions(-)

diff --git a/tools/blktap2/drivers/scheduler.c b/tools/blktap2/drivers/scheduler.c
index 6b8d009..dd608dd 100644
--- a/tools/blktap2/drivers/scheduler.c
+++ b/tools/blktap2/drivers/scheduler.c
@@ -160,6 +160,31 @@ scheduler_run_events(scheduler_t *s)
 	}
 }
 
+static int
+get_free_id(scheduler_t *s)
+{
+	event_t *event, *tmp;
+	int old_uuid = s->uuid;
+	int id = s->uuid++;
+
+	if (!s->uuid)
+		s->uuid++;
+
+retry:
+	scheduler_for_each_event(s, event, tmp)
+		if (event->id == id) {
+			id = s->uuid++;
+			if (!s->uuid)
+				s->uuid++;
+			if (id == old_uuid)
+				return 0;
+
+			goto retry;
+		}
+
+	return id;
+}
+
 int
 scheduler_register_event(scheduler_t *s, char mode, int fd,
 			 int timeout, event_cb_t cb, void *private)
@@ -187,10 +212,12 @@ scheduler_register_event(scheduler_t *s, char mode, int fd,
 	event->deadline = now.tv_sec + timeout;
 	event->cb       = cb;
 	event->private  = private;
-	event->id       = s->uuid++;
+	event->id       = get_free_id(s);
 
-	if (!s->uuid)
-		s->uuid++;
+	if (!event->id) {
+		free(event);
+		return -EBUSY;
+	}
 
 	list_add_tail(&event->next, &s->events);
 
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 08/17] tools: blktap2: don't return negative event id
  2014-10-14  2:13 [PATCH 00/17] blktap2 related bugfix patches Wen Congyang
                   ` (6 preceding siblings ...)
  2014-10-14  2:13 ` [PATCH 07/17] tools: blktap2: use correct way to get free event id Wen Congyang
@ 2014-10-14  2:13 ` Wen Congyang
  2014-10-14  2:13 ` [PATCH 09/17] tools: blktap2: use correct way to define array Wen Congyang
                   ` (10 subsequent siblings)
  18 siblings, 0 replies; 50+ messages in thread
From: Wen Congyang @ 2014-10-14  2:13 UTC (permalink / raw)
  To: xen devel
  Cc: Ian Campbell, Wen Congyang, Ian Jackson, Jiang Yunhong,
	Dong Eddie, Shriram Rajagopalan, Yang Hongyang, Lai Jiangshan

If registering a new event fails, we return a negative value, so a
valid event id must never be negative. Skip negative event ids when
picking a free one.

Also fix a wrong check of the return value.
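
A side note in sketch form: incrementing a signed int past INT_MAX is
undefined behaviour in ISO C, so a counter can also be wrapped before it
overflows rather than tested for a negative value afterwards. The helper
next_positive_id() is hypothetical and is not what the patch below does.

```c
#include <limits.h>

/* Advance a positive id counter, wrapping back to 1 instead of
 * overflowing; out-of-range input also restarts at 1. */
static int next_positive_id(int cur)
{
	return (cur >= INT_MAX || cur < 1) ? 1 : cur + 1;
}
```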

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Shriram Rajagopalan <rshriram@cs.ubc.ca>
---
 tools/blktap2/drivers/scheduler.c       | 8 ++++----
 tools/blktap2/drivers/tapdisk-control.c | 2 +-
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/tools/blktap2/drivers/scheduler.c b/tools/blktap2/drivers/scheduler.c
index dd608dd..e07528b 100644
--- a/tools/blktap2/drivers/scheduler.c
+++ b/tools/blktap2/drivers/scheduler.c
@@ -167,15 +167,15 @@ get_free_id(scheduler_t *s)
 	int old_uuid = s->uuid;
 	int id = s->uuid++;
 
-	if (!s->uuid)
-		s->uuid++;
+	if (s->uuid < 0)
+		s->uuid = 1;
 
 retry:
 	scheduler_for_each_event(s, event, tmp)
 		if (event->id == id) {
 			id = s->uuid++;
-			if (!s->uuid)
-				s->uuid++;
+			if (s->uuid < 0)
+				s->uuid = 1;
 			if (id == old_uuid)
 				return 0;
 
diff --git a/tools/blktap2/drivers/tapdisk-control.c b/tools/blktap2/drivers/tapdisk-control.c
index 3a4ec8e..4e5f748 100644
--- a/tools/blktap2/drivers/tapdisk-control.c
+++ b/tools/blktap2/drivers/tapdisk-control.c
@@ -700,7 +700,7 @@ tapdisk_control_accept(event_id_t id, char mode, void *private)
 					    connection->socket, 0,
 					    tapdisk_control_handle_request,
 					    connection);
-	if (err == -1) {
+	if (err < 0) {
 		close(fd);
 		free(connection);
 		EPRINTF("failed to register new control event: %d\n", err);
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 09/17] tools: blktap2: use correct way to define array.
  2014-10-14  2:13 [PATCH 00/17] blktap2 related bugfix patches Wen Congyang
                   ` (7 preceding siblings ...)
  2014-10-14  2:13 ` [PATCH 08/17] tools: blktap2: don't return negative " Wen Congyang
@ 2014-10-14  2:13 ` Wen Congyang
  2014-10-20  2:37   ` Shriram Rajagopalan
  2014-10-14  2:13 ` [PATCH 10/17] tools: block-remus: fix bug in ctl_request() Wen Congyang
                   ` (9 subsequent siblings)
  18 siblings, 1 reply; 50+ messages in thread
From: Wen Congyang @ 2014-10-14  2:13 UTC (permalink / raw)
  To: xen devel
  Cc: Ian Campbell, Wen Congyang, Ian Jackson, Jiang Yunhong,
	Dong Eddie, Shriram Rajagopalan, Yang Hongyang, Lai Jiangshan

Currently, we use the following way to define an array:
type array[] = {
    [index] = xxx,
    0,
};
So array[index+1] will be NULL. If index is not the last
index, the trailing 0 will override another entry.

tapdisk_vhd_index is not defined, but array[DISK_TYPE_VINDEX]
is overridden, so we don't find this problem when building
the source.
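
A stand-alone sketch of the effect, using made-up TYPE_* indices and
array names: a positional initializer continues from the most recently
designated index, so it can silently replace an entry that was set
earlier. Some compilers warn about the overridden initializer.

```c
#include <stddef.h>

enum { TYPE_A = 0, TYPE_B = 1, TYPE_C = 2, TYPE_MAX = 3 };

/* Problematic form: the trailing 0 follows [TYPE_B], so it lands on
 * index TYPE_B + 1 == TYPE_C and replaces "c" with NULL. */
const char *toy_names[] = {
	[TYPE_A] = "a",
	[TYPE_C] = "c",
	[TYPE_B] = "b",
	0,
};

/* Fixed form: the terminator gets its own explicit index. */
const char *toy_names_fixed[] = {
	[TYPE_A]   = "a",
	[TYPE_C]   = "c",
	[TYPE_B]   = "b",
	[TYPE_MAX] = NULL,
};
```

In the first array toy_names[TYPE_C] ends up NULL and the array has only
three elements; in the second, all three entries survive and the NULL
terminator sits at index TYPE_MAX.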

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Shriram Rajagopalan <rshriram@cs.ubc.ca>
---
 tools/blktap2/drivers/tapdisk-disktype.c | 12 ++----------
 tools/blktap2/drivers/tapdisk-disktype.h |  2 +-
 2 files changed, 3 insertions(+), 11 deletions(-)

diff --git a/tools/blktap2/drivers/tapdisk-disktype.c b/tools/blktap2/drivers/tapdisk-disktype.c
index e9a6890..8d1383b 100644
--- a/tools/blktap2/drivers/tapdisk-disktype.c
+++ b/tools/blktap2/drivers/tapdisk-disktype.c
@@ -82,12 +82,6 @@ static const disk_info_t block_cache_disk = {
        1,
 };
 
-static const disk_info_t vhd_index_disk = {
-       "vhdi",
-       "vhd index image (vhdi)",
-       1,
-};
-
 static const disk_info_t log_disk = {
 	"log",
 	"write logger (log)",
@@ -110,9 +104,8 @@ const disk_info_t *tapdisk_disk_types[] = {
 	[DISK_TYPE_QCOW]	= &qcow_disk,
 	[DISK_TYPE_BLOCK_CACHE] = &block_cache_disk,
 	[DISK_TYPE_LOG]	= &log_disk,
-	[DISK_TYPE_VINDEX]	= &vhd_index_disk,
 	[DISK_TYPE_REMUS]	= &remus_disk,
-	0,
+	[DISK_TYPE_MAX]		= NULL,
 };
 
 extern struct tap_disk tapdisk_aio;
@@ -137,10 +130,9 @@ const struct tap_disk *tapdisk_disk_drivers[] = {
 	[DISK_TYPE_RAM]         = &tapdisk_ram,
 	[DISK_TYPE_QCOW]        = &tapdisk_qcow,
 	[DISK_TYPE_BLOCK_CACHE] = &tapdisk_block_cache,
-	[DISK_TYPE_VINDEX]      = &tapdisk_vhd_index,
 	[DISK_TYPE_LOG]         = &tapdisk_log,
 	[DISK_TYPE_REMUS]       = &tapdisk_remus,
-	0,
+	[DISK_TYPE_MAX]         = NULL,
 };
 
 int
diff --git a/tools/blktap2/drivers/tapdisk-disktype.h b/tools/blktap2/drivers/tapdisk-disktype.h
index b697eea..c574990 100644
--- a/tools/blktap2/drivers/tapdisk-disktype.h
+++ b/tools/blktap2/drivers/tapdisk-disktype.h
@@ -39,7 +39,7 @@
 #define DISK_TYPE_BLOCK_CACHE 7
 #define DISK_TYPE_LOG         8
 #define DISK_TYPE_REMUS       9
-#define DISK_TYPE_VINDEX      10
+#define DISK_TYPE_MAX         10
 
 #define DISK_TYPE_NAME_MAX    32
 
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 10/17] tools: block-remus: fix bug in ctl_request()
  2014-10-14  2:13 [PATCH 00/17] blktap2 related bugfix patches Wen Congyang
                   ` (8 preceding siblings ...)
  2014-10-14  2:13 ` [PATCH 09/17] tools: blktap2: use correct way to define array Wen Congyang
@ 2014-10-14  2:13 ` Wen Congyang
  2014-10-20  2:38   ` Shriram Rajagopalan
  2014-10-14  2:13 ` [PATCH 11/17] tools: block-remus: clean unused functions Wen Congyang
                   ` (8 subsequent siblings)
  18 siblings, 1 reply; 50+ messages in thread
From: Wen Congyang @ 2014-10-14  2:13 UTC (permalink / raw)
  To: xen devel
  Cc: Ian Campbell, Wen Congyang, Ian Jackson, Jiang Yunhong,
	Dong Eddie, Shriram Rajagopalan, Yang Hongyang, Lai Jiangshan

ctl_request() handles the command which the user writes to the ctl fifo. The
user will read the response from the msg fifo. This patch fixes the following bugs:
1. If the command is not "flush", we don't respond, and the user will wait
   forever.
2. If the current mode is not mode_primary, we don't respond in s->queue_flush(),
   so call s->queue_flush() only if the mode is mode_primary, and respond
   with TDREMUS_FAIL otherwise.
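
The rule "every command gets a reply" can be sketched as a pure decision
function. The toy_* names are hypothetical, and the sketch assumes that
the reply to a successfully queued flush is sent later, when the flush
completes.

```c
#include <string.h>

enum toy_mode { TOY_MODE_UNPROTECTED, TOY_MODE_PRIMARY, TOY_MODE_BACKUP };

/*
 * Decide the immediate reply for a command read from the control fifo.
 * Returns "fail" when the request is rejected right away, or NULL when
 * the flush was queued and the reply is deferred until it completes.
 */
static const char *toy_ctl_reply(const char *msg, enum toy_mode mode,
				 int (*queue_flush)(void))
{
	if (strncmp(msg, "flush", 5) != 0)
		return "fail";		/* unknown command */
	if (mode != TOY_MODE_PRIMARY)
		return "fail";		/* not in primary mode */
	if (queue_flush() != 0)
		return "fail";		/* could not pass the flush on */
	return NULL;			/* reply deferred */
}

/* Stub flush callbacks for trying the function out. */
static int toy_flush_ok(void)  { return 0; }
static int toy_flush_err(void) { return -1; }
```

No path returns without either an immediate "fail" or a queued flush, so
the reader of the msg fifo cannot be left waiting.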

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Shriram Rajagopalan <rshriram@cs.ubc.ca>
---
 tools/blktap2/drivers/block-remus.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/tools/blktap2/drivers/block-remus.c b/tools/blktap2/drivers/block-remus.c
index 55363a3..9be47f6 100644
--- a/tools/blktap2/drivers/block-remus.c
+++ b/tools/blktap2/drivers/block-remus.c
@@ -1513,13 +1513,18 @@ static void ctl_request(event_id_t id, char mode, void *private)
 	/* TODO: need to get driver somehow */
 	msg[rc] = '\0';
 	if (!strncmp(msg, "flush", 5)) {
-		if (s->queue_flush)
+		if (s->mode == mode_primary) {
 			if ((rc = s->queue_flush(driver))) {
 				RPRINTF("error passing flush request to backup");
 				ctl_respond(s, TDREMUS_FAIL);
 			}
+		} else {
+			RPRINTF("We are not in primary mode\n");
+			ctl_respond(s, TDREMUS_FAIL);
+		}
 	} else {
 		RPRINTF("unknown command: %s\n", msg);
+		ctl_respond(s, TDREMUS_FAIL);
 	}
 }
 
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 11/17] tools: block-remus: clean unused functions
  2014-10-14  2:13 [PATCH 00/17] blktap2 related bugfix patches Wen Congyang
                   ` (9 preceding siblings ...)
  2014-10-14  2:13 ` [PATCH 10/17] tools: block-remus: fix bug in ctl_request() Wen Congyang
@ 2014-10-14  2:13 ` Wen Congyang
  2014-10-20  3:01   ` Shriram Rajagopalan
  2014-10-14  2:14 ` [PATCH 12/17] tools: blktap2: implement an API to create a connection asynchronously Wen Congyang
                   ` (7 subsequent siblings)
  18 siblings, 1 reply; 50+ messages in thread
From: Wen Congyang @ 2014-10-14  2:13 UTC (permalink / raw)
  To: xen devel
  Cc: Ian Campbell, Wen Congyang, Ian Jackson, Jiang Yunhong,
	Dong Eddie, Yang Hongyang, Lai Jiangshan

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
 tools/blktap2/drivers/block-remus.c | 142 ------------------------------------
 1 file changed, 142 deletions(-)

diff --git a/tools/blktap2/drivers/block-remus.c b/tools/blktap2/drivers/block-remus.c
index 9be47f6..e5ad782 100644
--- a/tools/blktap2/drivers/block-remus.c
+++ b/tools/blktap2/drivers/block-remus.c
@@ -186,7 +186,6 @@ typedef struct tdremus_wire {
 
 #define TDREMUS_READ "rreq"
 #define TDREMUS_WRITE "wreq"
-#define TDREMUS_SUBMIT "sreq"
 #define TDREMUS_COMMIT "creq"
 #define TDREMUS_DONE "done"
 #define TDREMUS_FAIL "fail"
@@ -750,42 +749,6 @@ static void close_server_fd(struct tdremus_state *s)
 
 /* primary functions */
 static void remus_client_event(event_id_t, char mode, void *private);
-static void remus_connect_event(event_id_t id, char mode, void *private);
-static void remus_retry_connect_event(event_id_t id, char mode, void *private);
-
-static int primary_do_connect(struct tdremus_state *state)
-{
-	event_id_t id;
-	int fd;
-	int rc;
-	int flags;
-
-	RPRINTF("client connecting to %s:%d...\n", inet_ntoa(state->sa.sin_addr), ntohs(state->sa.sin_port));
-
-	if ((fd = socket(PF_INET, SOCK_STREAM, 0)) < 0) {
-		RPRINTF("could not create client socket: %d\n", errno);
-		return -1;
-	}
-
-	/* make socket nonblocking */
-	if ((flags = fcntl(fd, F_GETFL, 0)) == -1)
-		flags = 0;
-	if (fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1)
-		return -1;
-
-	/* once we have created the socket and populated the address, we can now start
-	 * our non-blocking connect. rather than duplicating code we trigger a timeout
-	 * on the socket fd, which calls out nonblocking connect code
-	 */
-	if((id = tapdisk_server_register_event(SCHEDULER_POLL_TIMEOUT, fd, 0, remus_retry_connect_event, state)) < 0) {
-		RPRINTF("error registering timeout client connection event handler: %s\n", strerror(id));
-		/* TODO: we leak a fd here */
-		return -1;
-	}
-	state->stream_fd.fd = fd;
-	state->stream_fd.id = id;
-	return 0;
-}
 
 static int primary_blocking_connect(struct tdremus_state *state)
 {
@@ -939,100 +902,6 @@ static int primary_start(td_driver_t *driver)
 	return 0;
 }
 
-/* timeout callback */
-static void remus_retry_connect_event(event_id_t id, char mode, void *private)
-{
-	struct tdremus_state *s = (struct tdremus_state *)private;
-
-	/* do a non-blocking connect */
-	if (connect(s->stream_fd.fd, (struct sockaddr *)&s->sa, sizeof(s->sa))
-	    && errno != EINPROGRESS)
-	{
-		if(errno == ECONNREFUSED || errno == ENETUNREACH || errno == EAGAIN || errno == ECONNABORTED)
-		{
-			/* try again in a second */
-			tapdisk_server_unregister_event(s->stream_fd.id);
-			if((id = tapdisk_server_register_event(SCHEDULER_POLL_TIMEOUT, s->stream_fd.fd, REMUS_CONNRETRY_TIMEOUT, remus_retry_connect_event, s)) < 0) {
-				RPRINTF("error registering timeout client connection event handler: %s\n", strerror(id));
-				return;
-			}
-			s->stream_fd.id = id;
-		}
-		else
-		{
-			/* not recoverable */
-			RPRINTF("error connection to server %s\n", strerror(errno));
-			return;
-		}
-	}
-	else
-	{
-		/* the connect returned EINPROGRESS (nonblocking connect) we must wait for the fd to be writeable to determine if the connect worked */
-
-		tapdisk_server_unregister_event(s->stream_fd.id);
-		if((id = tapdisk_server_register_event(SCHEDULER_POLL_WRITE_FD, s->stream_fd.fd, 0, remus_connect_event, s)) < 0) {
-			RPRINTF("error registering client connection event handler: %s\n", strerror(id));
-			return;
-		}
-		s->stream_fd.id = id;
-	}
-}
-
-/* callback when nonblocking connect() is finished */
-/* called only by primary in unprotected state */
-static void remus_connect_event(event_id_t id, char mode, void *private)
-{
-	int socket_errno;
-	socklen_t socket_errno_size;
-	struct tdremus_state *s = (struct tdremus_state *)private;
-
-	/* check to se if the connect succeeded */
-	socket_errno_size = sizeof(socket_errno);
-	if (getsockopt(s->stream_fd.fd, SOL_SOCKET, SO_ERROR, &socket_errno, &socket_errno_size)) {
-		RPRINTF("error getting socket errno\n");
-		return;
-	}
-
-	RPRINTF("socket connect returned %d\n", socket_errno);
-
-	if(socket_errno)
-	{
-		/* the connect did not succeed */
-
-		if(socket_errno == ECONNREFUSED || socket_errno == ENETUNREACH || socket_errno == ETIMEDOUT
-		   || socket_errno == ECONNABORTED || socket_errno == EAGAIN)
-		{
-			/* we can probably assume that the backup is down. just try again later */
-			tapdisk_server_unregister_event(s->stream_fd.id);
-			if((id = tapdisk_server_register_event(SCHEDULER_POLL_TIMEOUT, s->stream_fd.fd, REMUS_CONNRETRY_TIMEOUT, remus_retry_connect_event, s)) < 0) {
-				RPRINTF("error registering timeout client connection event handler: %s\n", strerror(id));
-				return;
-			}
-			s->stream_fd.id = id;
-		}
-		else
-		{
-			RPRINTF("socket connect returned %d, giving up\n", socket_errno);
-		}
-	}
-	else
-	{
-		/* the connect succeeded */
-
-		/* unregister this function and register a new event handler */
-		tapdisk_server_unregister_event(s->stream_fd.id);
-		if((id = tapdisk_server_register_event(SCHEDULER_POLL_READ_FD, s->stream_fd.fd, 0, remus_client_event, s)) < 0) {
-			RPRINTF("error registering client event handler: %s\n", strerror(id));
-			return;
-		}
-		s->stream_fd.id = id;
-
-		/* switch from unprotected to protected client */
-		switch_mode(s->tdremus_driver, mode_primary);
-	}
-}
-
-
 /* we install this event handler on the primary once we have connected to the backup */
 /* wait for "done" message to commit checkpoint */
 static void remus_client_event(event_id_t id, char mode, void *private)
@@ -1247,15 +1116,6 @@ static int server_do_wreq(td_driver_t *driver)
 	return -1;
 }
 
-static int server_do_sreq(td_driver_t *driver)
-{
-	/*
-	  RPRINTF("submit request received\n");
-  */
-
-	return 0;
-}
-
 /* at this point, the server can start applying the most recent
  * ramdisk. */
 static int server_do_creq(td_driver_t *driver)
@@ -1296,8 +1156,6 @@ static void remus_server_event(event_id_t id, char mode, void *private)
 
 	if (!strcmp(req, TDREMUS_WRITE))
 		server_do_wreq(driver);
-	else if (!strcmp(req, TDREMUS_SUBMIT))
-		server_do_sreq(driver);
 	else if (!strcmp(req, TDREMUS_COMMIT))
 		server_do_creq(driver);
 	else
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 12/17] tools: blktap2: implement an API to create a connection asynchronously
  2014-10-14  2:13 [PATCH 00/17] blktap2 related bugfix patches Wen Congyang
                   ` (10 preceding siblings ...)
  2014-10-14  2:13 ` [PATCH 11/17] tools: block-remus: clean unused functions Wen Congyang
@ 2014-10-14  2:14 ` Wen Congyang
  2014-10-14  2:14 ` [PATCH 13/17] tools: block-remus: connect to backup asynchronously Wen Congyang
                   ` (6 subsequent siblings)
  18 siblings, 0 replies; 50+ messages in thread
From: Wen Congyang @ 2014-10-14  2:14 UTC (permalink / raw)
  To: xen devel
  Cc: Ian Campbell, Wen Congyang, Ian Jackson, Jiang Yunhong,
	Dong Eddie, Yang Hongyang, Lai Jiangshan

tapdisk2 is a single-threaded process. If we use remus,
we will block in primary_blocking_connect(), and the
user will not have any chance to talk to tapdisk2.
So we should connect to the backup asynchronously. This patch
only implements an API to create a connection asynchronously.
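
The underlying POSIX pattern, as a rough sketch with hypothetical toy_*
helpers (the real API in this patch is event-driven and adds retries):
make the socket non-blocking, call connect(), treat EINPROGRESS as
"still connecting", and read SO_ERROR once the fd becomes writable.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/socket.h>
#include <sys/types.h>

enum toy_connect_state {
	TOY_FAILED = -1,
	TOY_IN_PROGRESS = 0,
	TOY_ESTABLISHED = 1,
};

/* Put fd into non-blocking mode; returns 0 on success, -1 on error. */
static int toy_set_nonblocking(int fd)
{
	int flags = fcntl(fd, F_GETFL, 0);

	if (flags == -1)
		return -1;
	return fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1 ? -1 : 0;
}

/* Start a connect on a non-blocking socket without waiting for it. */
static enum toy_connect_state
toy_connect_start(int fd, const struct sockaddr *sa, socklen_t len)
{
	if (connect(fd, sa, len) == 0)
		return TOY_ESTABLISHED;
	return errno == EINPROGRESS ? TOY_IN_PROGRESS : TOY_FAILED;
}

/* Call once the fd polls writable: SO_ERROR holds the connect result. */
static enum toy_connect_state toy_connect_finish(int fd)
{
	int err = 0;
	socklen_t errlen = sizeof(err);

	if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &errlen) == -1 || err)
		return TOY_FAILED;
	return TOY_ESTABLISHED;
}
```

Between toy_connect_start() and toy_connect_finish() the process is free
to serve other events, which is the point of the change.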

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
 tools/blktap2/drivers/Makefile            |   1 +
 tools/blktap2/drivers/block-replication.c | 468 ++++++++++++++++++++++++++++++
 tools/blktap2/drivers/block-replication.h | 111 +++++++
 3 files changed, 580 insertions(+)
 create mode 100644 tools/blktap2/drivers/block-replication.c
 create mode 100644 tools/blktap2/drivers/block-replication.h

diff --git a/tools/blktap2/drivers/Makefile b/tools/blktap2/drivers/Makefile
index 3476fc1..a7f45c7 100644
--- a/tools/blktap2/drivers/Makefile
+++ b/tools/blktap2/drivers/Makefile
@@ -29,6 +29,7 @@ REMUS-OBJS  := block-remus.o
 REMUS-OBJS  += hashtable.o
 REMUS-OBJS  += hashtable_itr.o
 REMUS-OBJS  += hashtable_utility.o
+REMUS-OBJS  += block-replication.o
 
 tapdisk2 tapdisk-stream tapdisk-diff $(QCOW_UTIL): AIOLIBS := -laio
 
diff --git a/tools/blktap2/drivers/block-replication.c b/tools/blktap2/drivers/block-replication.c
new file mode 100644
index 0000000..e4b2679
--- /dev/null
+++ b/tools/blktap2/drivers/block-replication.c
@@ -0,0 +1,468 @@
+/*
+ * Copyright (C) 2014 FUJITSU LIMITED
+ * Author: Wen Congyang <wency@cn.fujitsu.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU Lesser General Public License as published
+ * by the Free Software Foundation; version 2.1 only. with the special
+ * exception on linking described in file LICENSE.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU Lesser General Public License for more details.
+ */
+
+#include "tapdisk-server.h"
+#include "block-replication.h"
+
+#include <string.h>
+#include <errno.h>
+#include <sys/types.h>
+#include <unistd.h>
+#include <fcntl.h>
+#include <syslog.h>
+#include <stdlib.h>
+#include <arpa/inet.h>
+
+#undef DPRINTF
+#undef EPRINTF
+#define DPRINTF(_f, _a...) syslog (LOG_DEBUG, "%s: " _f, log_prefix, ## _a)
+#define EPRINTF(_f, _a...) syslog (LOG_ERR, "%s: " _f, log_prefix, ## _a)
+
+/* connection status */
+enum {
+	connection_none,
+	connection_in_progress,
+	connection_established,
+	connection_closed,
+};
+
+/* common functions */
+/* args should be host:port */
+static int get_args(td_replication_connect_t *t, const char* name)
+{
+	char* host;
+	const char* port;
+	int gai_status;
+	int valid_addr;
+	struct addrinfo gai_hints;
+	struct addrinfo *servinfo, *servinfo_itr;
+	const char *log_prefix = t->log_prefix;
+
+	memset(&gai_hints, 0, sizeof gai_hints);
+	gai_hints.ai_family = AF_UNSPEC;
+	gai_hints.ai_socktype = SOCK_STREAM;
+
+	port = strchr(name, ':');
+	if (!port) {
+		EPRINTF("missing port in %s\n", name);
+		return -ENOENT;
+	}
+	if (!(host = strndup(name, port - name))) {
+		EPRINTF("unable to allocate host\n");
+		return -ENOMEM;
+	}
+	port++;
+	if ((gai_status = getaddrinfo(host, port,
+				      &gai_hints, &servinfo)) != 0) {
+		EPRINTF("getaddrinfo error: %s\n", gai_strerror(gai_status));
+		free(host);
+		return -ENOENT;
+	}
+	free(host);
+
+	/* TODO: do something smarter here */
+	valid_addr = 0;
+	for (servinfo_itr = servinfo; servinfo_itr != NULL;
+	     servinfo_itr = servinfo_itr->ai_next) {
+		if (servinfo_itr->ai_family == AF_INET) {
+			valid_addr = 1;
+			memset(&t->sa, 0, sizeof(t->sa));
+			t->sa = *(struct sockaddr_in *)servinfo_itr->ai_addr;
+			break;
+		}
+	}
+	freeaddrinfo(servinfo);
+
+	if (!valid_addr)
+		return -ENOENT;
+
+	DPRINTF("host: %s, port: %d\n", inet_ntoa(t->sa.sin_addr),
+		ntohs(t->sa.sin_port));
+
+	return 0;
+}
+
+int td_replication_connect_init(td_replication_connect_t *t, const char *name)
+{
+	int rc;
+
+	rc = get_args(t, name);
+	if (rc)
+		return rc;
+
+	t->listen_fd = -1;
+	t->id = -1;
+	t->status = connection_none;
+	return 0;
+}
+
+int td_replication_connect_status(td_replication_connect_t *t)
+{
+	const char *log_prefix = t->log_prefix;
+
+	switch (t->status) {
+	case connection_none:
+	case connection_closed:
+		return -1;
+	case connection_in_progress:
+		return 0;
+	case connection_established:
+		return 1;
+	default:
+		EPRINTF("td_replication_connect is corrupted\n");
+		return -2;
+	}
+}
+
+void td_replication_connect_kill(td_replication_connect_t *t)
+{
+	if (t->status != connection_in_progress &&
+	    t->status != connection_established)
+		return;
+
+	UNREGISTER_EVENT(t->id);
+	CLOSE_FD(t->fd);
+	CLOSE_FD(t->listen_fd);
+	t->status = connection_closed;
+}
+
+/* server */
+static void td_replication_server_accept(event_id_t id, char mode,
+					 void *private);
+
+int td_replication_server_start(td_replication_connect_t *t)
+{
+	int opt;
+	int rc = -1;
+	event_id_t id;
+	int fd;
+	const char *log_prefix = t->log_prefix;
+
+	if (t->status == connection_in_progress ||
+	    t->status == connection_established)
+		return rc;
+
+	fd = socket(AF_INET, SOCK_STREAM, 0);
+	if (fd < 0) {
+		EPRINTF("could not create server socket: %d\n", errno);
+		return rc;
+	}
+
+	opt = 1;
+	if (setsockopt(fd, SOL_SOCKET,
+		       SO_REUSEADDR, &opt, sizeof(opt)) < 0)
+		DPRINTF("Error setting REUSEADDR on %d: %d\n", fd, errno);
+
+	if (bind(fd, (struct sockaddr *)&t->sa, sizeof(t->sa)) < 0) {
+		DPRINTF("could not bind server socket %d to %s:%d: %d %s\n",
+			fd, inet_ntoa(t->sa.sin_addr),
+			ntohs(t->sa.sin_port), errno, strerror(errno));
+		if (errno == EADDRNOTAVAIL)
+			rc = -2;
+		goto err;
+	}
+
+	if (listen(fd, t->max_connections)) {
+		EPRINTF("could not listen on socket: %d\n", errno);
+		goto err;
+	}
+
+	/*
+	 * The socket is now bound to the address and listening so we
+	 * may now register the fd with tapdisk
+	 */
+	id =  tapdisk_server_register_event(SCHEDULER_POLL_READ_FD,
+					    fd, 0,
+					    td_replication_server_accept, t);
+	if (id < 0) {
+		EPRINTF("error registering server connection event handler: %s",
+			strerror(-id));
+		goto err;
+	}
+	t->listen_fd = fd;
+	t->id = id;
+	t->status = connection_in_progress;
+
+	return 0;
+
+err:
+	close(fd);
+	return rc;
+}
+
+static void td_replication_server_accept(event_id_t id, char mode,
+					 void *private)
+{
+	td_replication_connect_t *t = private;
+	int fd;
+	const char *log_prefix = t->log_prefix;
+
+	/* XXX: add address-based black/white list */
+	fd = accept(t->listen_fd, NULL, NULL);
+	if (fd < 0) {
+		EPRINTF("error accepting connection: %d\n", errno);
+		return;
+	}
+
+	if (t->status == connection_established) {
+		EPRINTF("connection is already established\n");
+		close(fd);
+		return;
+	}
+
+	DPRINTF("server accepted connection\n");
+	t->fd = fd;
+	t->status = connection_established;
+	t->callback(t, 0);
+}
+
+int td_replication_server_restart(td_replication_connect_t *t)
+{
+	switch (t->status) {
+	case connection_in_progress:
+		return 0;
+	case connection_established:
+		CLOSE_FD(t->fd);
+		t->status = connection_in_progress;
+		return 0;
+	case connection_none:
+	case connection_closed:
+		return td_replication_server_start(t);
+	default:
+		/* not reached */
+		return -1;
+	}
+}
+
+/* client */
+static void td_replication_retry_connect_event(event_id_t id, char mode,
+					       void *private);
+static void td_replication_connect_event(event_id_t id, char mode,
+					 void *private);
+int td_replication_client_start(td_replication_connect_t *t)
+{
+	event_id_t id;
+	int fd;
+	int rc;
+	int flags;
+	const char *log_prefix = t->log_prefix;
+
+	if (t->status == connection_in_progress ||
+	    t->status == connection_established)
+		return ERROR_INTERNAL;
+
+	DPRINTF("client connecting to %s:%d...\n",
+		inet_ntoa(t->sa.sin_addr), ntohs(t->sa.sin_port));
+
+	if ((fd = socket(PF_INET, SOCK_STREAM, 0)) < 0) {
+		EPRINTF("could not create client socket: %d\n", errno);
+		return ERROR_INTERNAL;
+	}
+
+	/* make socket nonblocking */
+	if ((flags = fcntl(fd, F_GETFL, 0)) == -1)
+		flags = 0;
+	if (fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1) {
+		EPRINTF("error setting fd %d to non block mode\n", fd);
+		goto err;
+	}
+
+	/*
+	 * once we have created the socket and populated the address,
+	 * we can now start our non-blocking connect. rather than
+	 * duplicating code we trigger a timeout on the socket fd,
+	 * which calls our nonblocking connect code
+	 */
+	id = tapdisk_server_register_event(SCHEDULER_POLL_TIMEOUT, fd, 0,
+					   td_replication_retry_connect_event,
+					   t);
+	if(id < 0) {
+		EPRINTF("error registering timeout client connection event handler: %s\n",
+			strerror(-id));
+		goto err;
+	}
+
+	t->fd = fd;
+	t->id = id;
+	t->status = connection_in_progress;
+	return 0;
+
+err:
+	close(fd);
+	return ERROR_INTERNAL;
+}
+
+static void td_replication_client_failed(td_replication_connect_t *t, int rc)
+{
+	td_replication_connect_kill(t);
+	t->callback(t, rc);
+}
+
+static void td_replication_client_done(td_replication_connect_t *t)
+{
+	UNREGISTER_EVENT(t->id);
+	t->status = connection_established;
+	t->callback(t, 0);
+}
+
+static int td_replication_retry_connect(td_replication_connect_t *t)
+{
+	event_id_t id;
+	const char *log_prefix = t->log_prefix;
+
+	UNREGISTER_EVENT(t->id);
+
+	DPRINTF("connect to server 1 second later");
+	id = tapdisk_server_register_event(SCHEDULER_POLL_TIMEOUT,
+					   t->fd, t->retry_timeout_s,
+					   td_replication_retry_connect_event,
+					   t);
+	if (id < 0) {
+		EPRINTF("error registering timeout client connection event handler: %s\n",
+			strerror(-id));
+		return ERROR_INTERNAL;
+	}
+
+	t->id = id;
+	return 0;
+}
+
+static int td_replication_wait_connect_done(td_replication_connect_t *t)
+{
+	event_id_t id;
+	const char *log_prefix = t->log_prefix;
+
+	UNREGISTER_EVENT(t->id);
+
+	id = tapdisk_server_register_event(SCHEDULER_POLL_WRITE_FD,
+					   t->fd, 0,
+					   td_replication_connect_event, t);
+	if (id < 0) {
+		EPRINTF("error registering client connection event handler: %s\n",
+			strerror(-id));
+		return ERROR_INTERNAL;
+	}
+	t->id = id;
+
+	return 0;
+}
+
+/* return 1 if we need to reconnect to backup server */
+static int check_connect_errno(int err)
+{
+	/*
+	 * The fd is non-block, so we will not get ETIMEDOUT
+	 * after calling connect(). We only can get this errno
+	 * by getsockopt().
+	 */
+	if (err == ECONNREFUSED || err == ENETUNREACH ||
+	    err == EAGAIN || err == ECONNABORTED ||
+	    err == ETIMEDOUT)
+	    return 1;
+
+	return 0;
+}
+
+static void td_replication_retry_connect_event(event_id_t id, char mode,
+					       void *private)
+{
+	td_replication_connect_t *t = private;
+	int rc, ret;
+	const char *log_prefix = t->log_prefix;
+
+	/* do a non-blocking connect */
+	ret = connect(t->fd, (struct sockaddr *)&t->sa, sizeof(t->sa));
+	if (ret) {
+		if (errno == EINPROGRESS) {
+			/*
+			 * the connect returned EINPROGRESS (nonblocking
+			 * connect) we must wait for the fd to be writeable
+			 * to determine if the connect worked
+			 */
+			rc = td_replication_wait_connect_done(t);
+			if (rc)
+				goto fail;
+			return;
+		}
+
+		if (check_connect_errno(errno)) {
+			rc = td_replication_retry_connect(t);
+			if (rc)
+				goto fail;
+			return;
+		}
+
+		/* not recoverable */
+		EPRINTF("error connection to server %s\n", strerror(errno));
+		rc = ERROR_CONNECTION;
+		goto fail;
+	}
+
+	/* The connection is established unexpectedly */
+	td_replication_client_done(t);
+
+	return;
+
+fail:
+	td_replication_client_failed(t, rc);
+}
+
+/* callback when nonblocking connect() is finished */
+static void td_replication_connect_event(event_id_t id, char mode,
+					 void *private)
+{
+	int socket_errno;
+	socklen_t socket_errno_size;
+	td_replication_connect_t *t = private;
+	int rc;
+	const char *log_prefix = t->log_prefix;
+
+	/* check to see if the connect succeeded */
+	socket_errno_size = sizeof(socket_errno);
+	if (getsockopt(t->fd, SOL_SOCKET, SO_ERROR,
+		       &socket_errno, &socket_errno_size)) {
+		EPRINTF("error getting socket errno\n");
+		return;
+	}
+
+	DPRINTF("socket connect returned %d\n", socket_errno);
+
+	if (socket_errno) {
+		/* the connect did not succeed */
+		if (check_connect_errno(socket_errno)) {
+			/*
+			 * we can probably assume that the backup is down.
+			 * just try again later
+			 */
+			rc = td_replication_retry_connect(t);
+			if (rc)
+				goto fail;
+
+			return;
+		} else {
+			EPRINTF("socket connect returned %d, giving up\n",
+				socket_errno);
+			rc = ERROR_CONNECTION;
+			goto fail;
+		}
+	}
+
+	td_replication_client_done(t);
+
+	return;
+
+fail:
+	td_replication_client_failed(t, rc);
+}
diff --git a/tools/blktap2/drivers/block-replication.h b/tools/blktap2/drivers/block-replication.h
new file mode 100644
index 0000000..9e051cc
--- /dev/null
+++ b/tools/blktap2/drivers/block-replication.h
@@ -0,0 +1,111 @@
+/*
+ * Copyright (C) 2014 FUJITSU LIMITED
+ * Author: Wen Congyang <wency@cn.fujitsu.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU Lesser General Public License as published
+ * by the Free Software Foundation; version 2.1 only. with the special
+ * exception on linking described in file LICENSE.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU Lesser General Public License for more details.
+ */
+
+#ifndef BLOCK_REPLICATION_H
+#define BLOCK_REPLICATION_H
+
+#include "scheduler.h"
+#include <sys/socket.h>
+#include <netdb.h>
+
+#define CONTAINER_OF(inner_ptr, outer, member_name)			\
+	({								\
+		typeof(outer) *container_of_;				\
+		container_of_ = (void*)((char*)(inner_ptr) -		\
+				offsetof(typeof(outer), member_name));	\
+		(void)(&container_of_->member_name ==			\
+		       (typeof(inner_ptr))0) /* type check */;		\
+		container_of_;						\
+	})
+
+#define UNREGISTER_EVENT(id)					\
+	do {							\
+		if (id >= 0) {					\
+			tapdisk_server_unregister_event(id);	\
+			id = -1;				\
+		}						\
+	} while (0)
+#define CLOSE_FD(fd)			\
+	do {				\
+		if (fd >= 0) {		\
+			close(fd);	\
+			fd = -1;	\
+		}			\
+	} while (0)
+
+enum {
+	ERROR_INTERNAL = -1,
+	ERROR_CONNECTION = -2,
+};
+
+typedef struct td_replication_connect td_replication_connect_t;
+typedef void td_replication_callback(td_replication_connect_t *r, int rc);
+
+struct td_replication_connect {
+	/*
+	 * caller must fill these in before calling
+	 * td_replication_connect_init()
+	 */
+	const char *log_prefix;
+	td_replication_callback *callback;
+	int retry_timeout_s;
+	int max_connections;
+	/*
+	 * The caller uses this fd to read/write after
+	 * the connection is established
+	 */
+	int fd;
+
+	/* private */
+	struct sockaddr_in sa;
+	int listen_fd;
+	event_id_t id;
+
+	int status;
+};
+
+/* return -errno if failure happened, otherwise return 0 */
+int td_replication_connect_init(td_replication_connect_t *t, const char *name);
+/*
+ * Return value:
+ *   -1: connection is closed or not connected
+ *    0: connection is in progress
+ *    1: connection is established
+ */
+int td_replication_connect_status(td_replication_connect_t *t);
+void td_replication_connect_kill(td_replication_connect_t *t);
+
+/*
+ * Return value:
+ *   -2: this caller should be client
+ *   -1: error
+ *    0: connection is in progress
+ */
+int td_replication_server_start(td_replication_connect_t *t);
+/*
+ * Return value:
+ *   -2: this caller should be client
+ *   -1: error
+ *    0: connection is in progress
+ */
+int td_replication_server_restart(td_replication_connect_t *t);
+/*
+ * Return value:
+ *   -1: error
+ *    0: connection is in progress
+ */
+int td_replication_client_start(td_replication_connect_t *t);
+
+#endif
-- 
1.9.3

* [PATCH 13/17] tools: block-remus: connect to backup asynchronously
  2014-10-14  2:13 [PATCH 00/17] blktap2 related bugfix patches Wen Congyang
                   ` (11 preceding siblings ...)
  2014-10-14  2:14 ` [PATCH 12/17] tools: blktap2: implement an API to create a connection asynchronously Wen Congyang
@ 2014-10-14  2:14 ` Wen Congyang
  2014-10-20  2:50   ` Shriram Rajagopalan
  2014-10-14  2:14 ` [PATCH 14/17] block-remus: switch to unprotected mode before closing Wen Congyang
                   ` (5 subsequent siblings)
  18 siblings, 1 reply; 50+ messages in thread
From: Wen Congyang @ 2014-10-14  2:14 UTC (permalink / raw)
  To: xen devel
  Cc: Ian Campbell, Wen Congyang, Ian Jackson, Jiang Yunhong,
	Dong Eddie, Shriram Rajagopalan, Yang Hongyang, Lai Jiangshan

Use the new td_replication_connect API to connect to the backup
asynchronously. Until the connection is established, all I/O
requests are queued; they are handled once the connection is up.

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Shriram Rajagopalan <rshriram@cs.ubc.ca>
---
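As an illustration of the queueing described in the commit message above
(not part of the patch itself): a minimal sketch of a fixed-size ring that
wastes one slot to tell empty from full, filled while the connection is
pending and drained in FIFO order afterwards. All names here (demo_ring,
demo_req, DEMO_MAX_REQUESTS, demo_push, demo_pop, demo_drain) are
hypothetical and only loosely mirror req_ring in block-remus.c; the sketch
assumes requests can be copied by value.

```c
#include <assert.h>
#include <stdint.h>

/* hypothetical capacity; the patch sizes its ring with MAX_REMUS_REQUESTS */
#define DEMO_MAX_REQUESTS 4

struct demo_req {
	uint64_t sec;	/* first sector */
	int secs;	/* number of sectors */
};

struct demo_ring {
	/* one slot is wasted to distinguish empty from full */
	struct demo_req slots[DEMO_MAX_REQUESTS + 1];
	unsigned int prod;
	unsigned int cons;
};

static unsigned int demo_next(unsigned int pos)
{
	return (pos + 1) % (DEMO_MAX_REQUESTS + 1);
}

static int demo_isempty(const struct demo_ring *r)
{
	return r->cons == r->prod;
}

static int demo_isfull(const struct demo_ring *r)
{
	return demo_next(r->prod) == r->cons;
}

/* queue a request; returns 0 on success, -1 if the ring is full */
static int demo_push(struct demo_ring *r, const struct demo_req *req)
{
	if (demo_isfull(r))
		return -1;
	r->slots[r->prod] = *req;
	r->prod = demo_next(r->prod);
	return 0;
}

/* remove and return the oldest request; the ring must not be empty */
static struct demo_req demo_pop(struct demo_ring *r)
{
	struct demo_req req;

	assert(!demo_isempty(r));
	req = r->slots[r->cons];
	r->cons = demo_next(r->cons);
	return req;
}

/*
 * Drain the ring in FIFO order, e.g. once the connection is up.
 * handle may be NULL; returns the number of requests drained.
 */
static int demo_drain(struct demo_ring *r,
		      void (*handle)(const struct demo_req *))
{
	int n = 0;

	while (!demo_isempty(r)) {
		struct demo_req req = demo_pop(r);

		if (handle)
			handle(&req);
		n++;
	}
	return n;
}
```

Unlike ring_add_request() in the patch, which exits when the ring is full,
demo_push() reports the condition to the caller; that is a choice made for
the sketch only.
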
 tools/blktap2/drivers/block-remus.c       | 508 +++++++++++++-----------------
 tools/blktap2/drivers/block-replication.h |   1 +
 2 files changed, 221 insertions(+), 288 deletions(-)

diff --git a/tools/blktap2/drivers/block-remus.c b/tools/blktap2/drivers/block-remus.c
index e5ad782..a2b9f62 100644
--- a/tools/blktap2/drivers/block-remus.c
+++ b/tools/blktap2/drivers/block-remus.c
@@ -40,6 +40,7 @@
 #include "hashtable.h"
 #include "hashtable_itr.h"
 #include "hashtable_utility.h"
+#include "block-replication.h"
 
 #include <errno.h>
 #include <inttypes.h>
@@ -49,10 +50,7 @@
 #include <string.h>
 #include <sys/time.h>
 #include <sys/types.h>
-#include <sys/socket.h>
-#include <netdb.h>
 #include <netinet/in.h>
-#include <arpa/inet.h>
 #include <sys/param.h>
 #include <sys/sysctl.h>
 #include <unistd.h>
@@ -63,10 +61,12 @@
 #define RAMDISK_HASHSIZE 128
 
 /* connect retry timeout (seconds) */
-#define REMUS_CONNRETRY_TIMEOUT 10
+#define REMUS_CONNRETRY_TIMEOUT 1
 
 #define RPRINTF(_f, _a...) syslog (LOG_DEBUG, "remus: " _f, ## _a)
 
+#define MAX_REMUS_REQUESTS      TAPDISK_DATA_REQUESTS
+
 enum tdremus_mode {
 	mode_invalid = 0,
 	mode_unprotected,
@@ -75,16 +75,14 @@ enum tdremus_mode {
 };
 
 struct tdremus_req {
-	uint64_t sector;
-	int nb_sectors;
-	char buf[4096];
+	td_request_t treq;
 };
 
 struct req_ring {
 	/* waste one slot to distinguish between empty and full */
-	struct tdremus_req requests[MAX_REQUESTS * 2 + 1];
-	unsigned int head;
-	unsigned int tail;
+	struct tdremus_req pending_requests[MAX_REMUS_REQUESTS + 1];
+	unsigned int prod;
+	unsigned int cons;
 };
 
 /* TODO: This isn't very pretty, but to properly generate our own treqs (needed
@@ -161,13 +159,14 @@ struct tdremus_state {
 	char*     msg_path; /* output completion message here */
 	poll_fd_t msg_fd;
 
-  /* replication host */
-	struct sockaddr_in sa;
-	poll_fd_t server_fd;    /* server listen port */
+	td_replication_connect_t t;
 	poll_fd_t stream_fd;     /* replication channel */
 
-	/* queue write requests, batch-replicate at submit */
-	struct req_ring write_ring;
+	/*
+	 * queue I/O requests, batch-replicate when
+	 * the connection is established.
+	 */
+	struct req_ring queued_io;
 
 	/* ramdisk data*/
 	struct ramdisk ramdisk;
@@ -206,11 +205,13 @@ static int tdremus_close(td_driver_t *driver);
 
 static int switch_mode(td_driver_t *driver, enum tdremus_mode mode);
 static int ctl_respond(struct tdremus_state *s, const char *response);
+static int ctl_register(struct tdremus_state *s);
+static void ctl_unregister(struct tdremus_state *s);
 
 /* ring functions */
-static inline unsigned int ring_next(struct req_ring* ring, unsigned int pos)
+static inline unsigned int ring_next(unsigned int pos)
 {
-	if (++pos >= MAX_REQUESTS * 2 + 1)
+	if (++pos >= MAX_REMUS_REQUESTS + 1)
 		return 0;
 
 	return pos;
@@ -218,13 +219,26 @@ static inline unsigned int ring_next(struct req_ring* ring, unsigned int pos)
 
 static inline int ring_isempty(struct req_ring* ring)
 {
-	return ring->head == ring->tail;
+	return ring->cons == ring->prod;
 }
 
 static inline int ring_isfull(struct req_ring* ring)
 {
-	return ring_next(ring, ring->tail) == ring->head;
+	return ring_next(ring->prod) == ring->cons;
 }
+
+static void ring_add_request(struct req_ring *ring, const td_request_t *treq)
+{
+	/* If ring is full, it means that tapdisk2 has some bug */
+	if (ring_isfull(ring)) {
+		RPRINTF("OOPS, ring is full\n");
+		exit(1);
+	}
+
+	ring->pending_requests[ring->prod].treq = *treq;
+	ring->prod = ring_next(ring->prod);
+}
+
 /* Prototype declarations */
 static int ramdisk_flush(td_driver_t *driver, struct tdremus_state* s);
 
@@ -724,89 +738,113 @@ static int mwrite(int fd, void* buf, size_t len)
 	select(fd + 1, NULL, &wfds, NULL, &tv);
 }
 
-
-static void inline close_stream_fd(struct tdremus_state *s)
-{
-	if (s->stream_fd.fd < 0)
-		return;
-
-	/* XXX: -2 is magic. replace with macro perhaps? */
-	tapdisk_server_unregister_event(s->stream_fd.id);
-	close(s->stream_fd.fd);
-	s->stream_fd.fd = -2;
-}
-
-static void close_server_fd(struct tdremus_state *s)
-{
-	if (s->server_fd.fd < 0)
-		return;
-
-	tapdisk_server_unregister_event(s->server_fd.id);
-	s->server_fd.id = -1;
-	close(s->stream_fd.fd);
-	s->stream_fd.fd = -1;
-}
-
 /* primary functions */
 static void remus_client_event(event_id_t, char mode, void *private);
+static int primary_forward_request(struct tdremus_state *s,
+				   const td_request_t *treq);
 
-static int primary_blocking_connect(struct tdremus_state *state)
+/*
+ * Called when we cannot connect to the backup, or when an I/O error
+ * occurs while reading or writing.
+ */
+static void primary_failed(struct tdremus_state *s, int rc)
 {
-	int fd;
-	int id;
+	td_replication_connect_kill(&s->t);
+	if (rc == ERROR_INTERNAL)
+		RPRINTF("switch to unprotected mode due to internal error");
+	UNREGISTER_EVENT(s->stream_fd.id);
+	switch_mode(s->tdremus_driver, mode_unprotected);
+}
+
+static int remus_handle_queued_io(struct tdremus_state *s)
+{
+	struct req_ring *queued_io = &s->queued_io;
+	unsigned int cons;
+	td_request_t *treq;
 	int rc;
-	int flags;
 
-	RPRINTF("client connecting to %s:%d...\n", inet_ntoa(state->sa.sin_addr), ntohs(state->sa.sin_port));
+	while (!ring_isempty(queued_io)) {
+		cons = queued_io->cons;
+		treq = &queued_io->pending_requests[cons].treq;
 
-	if ((fd = socket(PF_INET, SOCK_STREAM, 0)) < 0) {
-		RPRINTF("could not create client socket: %d\n", errno);
-		return -1;
-	}
-
-	do {
-		if ((rc = connect(fd, (struct sockaddr *)&state->sa,
-		    sizeof(state->sa))) < 0)
-		{
-			if (errno == ECONNREFUSED) {
-				RPRINTF("connection refused -- retrying in 1 second\n");
-				sleep(1);
-			} else {
-				RPRINTF("connection failed: %d\n", errno);
-				close(fd);
-				return -1;
-			}
+		if (treq->op == TD_OP_WRITE) {
+			rc = primary_forward_request(s, treq);
+			if (rc)
+				return rc;
 		}
-	} while (rc < 0);
 
-	RPRINTF("client connected\n");
-
-	/* make socket nonblocking */
-	if ((flags = fcntl(fd, F_GETFL, 0)) == -1)
-		flags = 0;
-	if (fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1)
-	{
-		RPRINTF("error making socket nonblocking\n");
-		close(fd);
-		return -1;
+		td_forward_request(*treq);
+		queued_io->cons = ring_next(cons);
 	}
 
-	if((id = tapdisk_server_register_event(SCHEDULER_POLL_READ_FD, fd, 0, remus_client_event, state)) < 0) {
-		RPRINTF("error registering client event handler: %s\n", strerror(id));
-		close(fd);
-		return -1;
-	}
-
-	state->stream_fd.fd = fd;
-	state->stream_fd.id = id;
 	return 0;
 }
 
-/* on read, just pass request through */
+static void remus_client_established(td_replication_connect_t *t, int rc)
+{
+	struct tdremus_state *s = CONTAINER_OF(t, *s, t);
+	event_id_t id;
+
+	if (rc) {
+		primary_failed(s, rc);
+		return;
+	}
+
+	/* the connect succeeded */
+	id = tapdisk_server_register_event(SCHEDULER_POLL_READ_FD, t->fd,
+					   0, remus_client_event, s);
+	if (id < 0) {
+		RPRINTF("error registering client event handler: %s\n",
+			strerror(-id));
+		primary_failed(s, ERROR_INTERNAL);
+		return;
+	}
+
+	s->stream_fd.fd = t->fd;
+	s->stream_fd.id = id;
+
+	/* handle the queued requests */
+	rc = remus_handle_queued_io(s);
+	if (rc)
+		primary_failed(s, rc);
+}
+
 static void primary_queue_read(td_driver_t *driver, td_request_t treq)
 {
-	/* just pass read through */
-	td_forward_request(treq);
+	struct tdremus_state *s = (struct tdremus_state *)driver->data;
+	struct req_ring *ring = &s->queued_io;
+
+	if (ring_isempty(ring)) {
+		/* just pass read through */
+		td_forward_request(treq);
+		return;
+	}
+
+	ring_add_request(ring, &treq);
+}
+
+static int primary_forward_request(struct tdremus_state *s,
+				   const td_request_t *treq)
+{
+	char header[sizeof(uint32_t) + sizeof(uint64_t)];
+	uint32_t *sectors = (uint32_t *)header;
+	uint64_t *sector = (uint64_t *)(header + sizeof(uint32_t));
+	td_driver_t *driver = s->tdremus_driver;
+
+	*sectors = treq->secs;
+	*sector = treq->sec;
+
+	if (mwrite(s->stream_fd.fd, TDREMUS_WRITE, strlen(TDREMUS_WRITE)) < 0)
+		return ERROR_IO;
+
+	if (mwrite(s->stream_fd.fd, header, sizeof(header)) < 0)
+		return ERROR_IO;
+
+	if (mwrite(s->stream_fd.fd, treq->buf,
+	    treq->secs * driver->info.sector_size) < 0)
+		return ERROR_IO;
+
+	return 0;
 }
 
 /* TODO:
@@ -819,28 +857,28 @@ static void primary_queue_read(td_driver_t *driver, td_request_t treq)
 static void primary_queue_write(td_driver_t *driver, td_request_t treq)
 {
 	struct tdremus_state *s = (struct tdremus_state *)driver->data;
-
-	char header[sizeof(uint32_t) + sizeof(uint64_t)];
-	uint32_t *sectors = (uint32_t *)header;
-	uint64_t *sector = (uint64_t *)(header + sizeof(uint32_t));
+	int rc, ret;
 
 	// RPRINTF("write: stream_fd.fd: %d\n", s->stream_fd.fd);
 
-	/* -1 means we haven't connected yet, -2 means the connection was lost */
-	if(s->stream_fd.fd == -1) {
+	ret = td_replication_connect_status(&s->t);
+	if(ret == -1) {
 		RPRINTF("connecting to backup...\n");
-		primary_blocking_connect(s);
+		s->t.callback = remus_client_established;
+		rc = td_replication_client_start(&s->t);
+		if (rc)
+			goto fail;
 	}
 
-	*sectors = treq.secs;
-	*sector = treq.sec;
+	/* The connection is not established, just queue the request */
+	if (ret != 1) {
+		ring_add_request(&s->queued_io, &treq);
+		return;
+	}
 
-	if (mwrite(s->stream_fd.fd, TDREMUS_WRITE, strlen(TDREMUS_WRITE)) < 0)
-		goto fail;
-	if (mwrite(s->stream_fd.fd, header, sizeof(header)) < 0)
-		goto fail;
-
-	if (mwrite(s->stream_fd.fd, treq.buf, treq.secs * driver->info.sector_size) < 0)
+	/* The connection is established */
+	rc = primary_forward_request(s, &treq);
+	if (rc)
 		goto fail;
 
 	td_forward_request(treq);
@@ -850,7 +888,7 @@ static void primary_queue_write(td_driver_t *driver, td_request_t treq)
  fail:
 	/* switch to unprotected mode and tell tapdisk to retry */
 	RPRINTF("write request replication failed, switching to unprotected mode");
-	switch_mode(s->tdremus_driver, mode_unprotected);
+	primary_failed(s, rc);
 	td_complete_request(treq, -EBUSY);
 }
 
@@ -867,7 +905,7 @@ static int client_flush(td_driver_t *driver)
 
 	if (mwrite(s->stream_fd.fd, TDREMUS_COMMIT, strlen(TDREMUS_COMMIT)) < 0) {
 		RPRINTF("error flushing output");
-		close_stream_fd(s);
+		primary_failed(s, ERROR_IO);
 		return -1;
 	}
 
@@ -886,6 +924,26 @@ static int server_flush(td_driver_t *driver)
 	return ramdisk_flush(driver, s);	
 }
 
+/* It is called when switching the mode from primary to unprotected */
+static int primary_flush(td_driver_t *driver)
+{
+	struct tdremus_state *s = driver->data;
+	struct req_ring *ring = &s->queued_io;
+	unsigned int cons;
+
+	if (ring_isempty(ring))
+		return 0;
+
+	while (!ring_isempty(ring)) {
+		cons = ring->cons;
+		ring->cons = ring_next(cons);
+
+		td_forward_request(ring->pending_requests[cons].treq);
+	}
+
+	return client_flush(driver);
+}
+
 static int primary_start(td_driver_t *driver)
 {
 	struct tdremus_state *s = (struct tdremus_state *)driver->data;
@@ -894,7 +952,7 @@ static int primary_start(td_driver_t *driver)
 
 	tapdisk_remus.td_queue_read = primary_queue_read;
 	tapdisk_remus.td_queue_write = primary_queue_write;
-	s->queue_flush = client_flush;
+	s->queue_flush = primary_flush;
 
 	s->stream_fd.fd = -1;
 	s->stream_fd.id = -1;
@@ -913,7 +971,7 @@ static void remus_client_event(event_id_t id, char mode, void *private)
 	if (mread(s->stream_fd.fd, req, sizeof(req) - 1) < 0) {
 		/* replication stream closed or otherwise broken (timeout, reset, &c) */
 		RPRINTF("error reading from backup\n");
-		close_stream_fd(s);
+		primary_failed(s, ERROR_IO);
 		return;
 	}
 
@@ -924,7 +982,7 @@ static void remus_client_event(event_id_t id, char mode, void *private)
 		ctl_respond(s, TDREMUS_DONE);
 	else {
 		RPRINTF("received unknown message: %s\n", req);
-		close_stream_fd(s);
+		primary_failed(s, ERROR_IO);
 	}
 
 	return;
@@ -933,84 +991,36 @@ static void remus_client_event(event_id_t id, char mode, void *private)
 /* backup functions */
 static void remus_server_event(event_id_t id, char mode, void *private);
 
-/* returns the socket that receives write requests */
-static void remus_server_accept(event_id_t id, char mode, void* private)
+/* It is called when we find some I/O error */
+static void backup_failed(struct tdremus_state *s, int rc)
 {
-	struct tdremus_state* s = (struct tdremus_state *) private;
+	UNREGISTER_EVENT(s->stream_fd.id);
+	td_replication_connect_kill(&s->t);
+	/* We will switch to unprotected mode in backup_queue_write() */
+}
 
-	int stream_fd;
-	event_id_t cid;
+/* called on the backup once the replication stream is connected */
+static void remus_server_established(td_replication_connect_t *t, int rc)
+{
+	struct tdremus_state *s = CONTAINER_OF(t, *s, t);
+	event_id_t id;
 
-	/* XXX: add address-based black/white list */
-	if ((stream_fd = accept(s->server_fd.fd, NULL, NULL)) < 0) {
-		RPRINTF("error accepting connection: %d\n", errno);
-		return;
-	}
-
-	/* TODO: check to see if we are already replicating. if so just close the
-	 * connection (or do something smarter) */
-	RPRINTF("server accepted connection\n");
+	/* rc is always 0 */
 
 	/* add tapdisk event for replication stream */
-	cid = tapdisk_server_register_event(SCHEDULER_POLL_READ_FD, stream_fd, 0,
-					    remus_server_event, s);
+	id = tapdisk_server_register_event(SCHEDULER_POLL_READ_FD, t->fd, 0,
+					   remus_server_event, s);
 
-	if(cid < 0) {
-		RPRINTF("error registering connection event handler: %s\n", strerror(errno));
-		close(stream_fd);
+	if (id < 0) {
+		RPRINTF("error registering connection event handler: %s\n",
+			strerror(errno));
+		td_replication_server_restart(t);
 		return;
 	}
 
 	/* store replication file descriptor */
-	s->stream_fd.fd = stream_fd;
-	s->stream_fd.id = cid;
-}
-
-/* returns -2 if EADDRNOTAVAIL */
-static int remus_bind(struct tdremus_state* s)
-{
-//  struct sockaddr_in sa;
-	int opt;
-	int rc = -1;
-
-	if ((s->server_fd.fd = socket(AF_INET, SOCK_STREAM, 0)) < 0) {
-		RPRINTF("could not create server socket: %d\n", errno);
-		return rc;
-	}
-	opt = 1;
-	if (setsockopt(s->server_fd.fd, SOL_SOCKET, SO_REUSEADDR, &opt, sizeof(opt)) < 0)
-		RPRINTF("Error setting REUSEADDR on %d: %d\n", s->server_fd.fd, errno);
-
-	if (bind(s->server_fd.fd, (struct sockaddr *)&s->sa, sizeof(s->sa)) < 0) {
-		RPRINTF("could not bind server socket %d to %s:%d: %d %s\n", s->server_fd.fd,
-			inet_ntoa(s->sa.sin_addr), ntohs(s->sa.sin_port), errno, strerror(errno));
-		if (errno != EADDRINUSE)
-			rc = -2;
-		goto err_sfd;
-	}
-	if (listen(s->server_fd.fd, 10)) {
-		RPRINTF("could not listen on socket: %d\n", errno);
-		goto err_sfd;
-	}
-
-	/* The socket s now bound to the address and listening so we may now register
-   * the fd with tapdisk */
-
-	if((s->server_fd.id = tapdisk_server_register_event(SCHEDULER_POLL_READ_FD,
-							    s->server_fd.fd, 0,
-							    remus_server_accept, s)) < 0) {
-		RPRINTF("error registering server connection event handler: %s",
-			strerror(s->server_fd.id));
-		goto err_sfd;
-	}
-
-	return 0;
-
- err_sfd:
-	close(s->server_fd.fd);
-	s->server_fd.fd = -1;
-
-	return rc;
+	s->stream_fd.fd = t->fd;
+	s->stream_fd.id = id;
 }
 
 /* wait for latest checkpoint to be applied */
@@ -1053,6 +1063,8 @@ void backup_queue_write(td_driver_t *driver, td_request_t treq)
 	 * handle the write
 	 */
 
+	/* If we have called backup_failed, calling it again is harmless */
+	backup_failed(s, ERROR_INTERNAL);
 	switch_mode(driver, mode_unprotected);
 	/* TODO: call the appropriate write function rather than return EBUSY */
 	td_complete_request(treq, -EBUSY);
@@ -1061,7 +1073,6 @@ void backup_queue_write(td_driver_t *driver, td_request_t treq)
 static int backup_start(td_driver_t *driver)
 {
 	struct tdremus_state *s = (struct tdremus_state *)driver->data;
-	int fd;
 
 	if (ramdisk_start(driver) < 0)
 		return -1;
@@ -1073,12 +1084,12 @@ static int backup_start(td_driver_t *driver)
 	return 0;
 }
 
-static int server_do_wreq(td_driver_t *driver)
+static void server_do_wreq(td_driver_t *driver)
 {
 	struct tdremus_state *s = (struct tdremus_state *)driver->data;
 	static tdremus_wire_t twreq;
 	char buf[4096];
-	int len, rc;
+	int len, rc = ERROR_IO;
 
 	char header[sizeof(uint32_t) + sizeof(uint64_t)];
 	uint32_t *sectors = (uint32_t *) header;
@@ -1097,28 +1108,28 @@ static int server_do_wreq(td_driver_t *driver)
 	if (len > sizeof(buf)) {
 		/* freak out! */
 		RPRINTF("write request too large: %d/%u\n", len, (unsigned)sizeof(buf));
-		return -1;
+		goto err;
 	}
 
 	if (mread(s->stream_fd.fd, buf, len) < 0)
 		goto err;
 
-	if (ramdisk_write(&s->ramdisk, *sector, *sectors, buf) < 0)
+	if (ramdisk_write(&s->ramdisk, *sector, *sectors, buf) < 0) {
+		rc = ERROR_INTERNAL;
 		goto err;
+	}
 
-	return 0;
+	return;
 
  err:
 	/* should start failover */
 	RPRINTF("backup write request error\n");
-	close_stream_fd(s);
-
-	return -1;
+	backup_failed(s, rc);
 }
 
 /* at this point, the server can start applying the most recent
  * ramdisk. */
-static int server_do_creq(td_driver_t *driver)
+static void server_do_creq(td_driver_t *driver)
 {
 	struct tdremus_state *s = (struct tdremus_state *)driver->data;
 
@@ -1128,9 +1139,7 @@ static int server_do_creq(td_driver_t *driver)
 
 	/* XXX this message should not be sent until flush completes! */
 	if (write(s->stream_fd.fd, TDREMUS_DONE, strlen(TDREMUS_DONE)) != 4)
-		return -1;
-
-	return 0;
+		backup_failed(s, ERROR_IO);
 }
 
 
@@ -1213,11 +1222,6 @@ static int unprotected_start(td_driver_t *driver)
 
 	RPRINTF("failure detected, activating passthrough\n");
 
-	/* close the server socket */
-	close_stream_fd(s);
-
-	close_server_fd(s);
-
 	/* install the unprotected read/write handlers */
 	tapdisk_remus.td_queue_read = unprotected_queue_read;
 	tapdisk_remus.td_queue_write = unprotected_queue_write;
@@ -1227,90 +1231,6 @@ static int unprotected_start(td_driver_t *driver)
 
 
 /* control */
-
-static inline int resolve_address(const char* addr, struct in_addr* ia)
-{
-	struct hostent* he;
-	uint32_t ip;
-
-	if (!(he = gethostbyname(addr))) {
-		RPRINTF("error resolving %s: %d\n", addr, h_errno);
-		return -1;
-	}
-
-	if (!he->h_addr_list[0]) {
-		RPRINTF("no address found for %s\n", addr);
-		return -1;
-	}
-
-	/* network byte order */
-	ip = *((uint32_t**)he->h_addr_list)[0];
-	ia->s_addr = ip;
-
-	return 0;
-}
-
-static int get_args(td_driver_t *driver, const char* name)
-{
-	struct tdremus_state *state = (struct tdremus_state *)driver->data;
-	char* host;
-	char* port;
-//  char* driver_str;
-//  char* parent;
-//  int type;
-//  char* path;
-//  unsigned long ulport;
-//  int i;
-//  struct sockaddr_in server_addr_in;
-
-	int gai_status;
-	int valid_addr;
-	struct addrinfo gai_hints;
-	struct addrinfo *servinfo, *servinfo_itr;
-
-	memset(&gai_hints, 0, sizeof gai_hints);
-	gai_hints.ai_family = AF_UNSPEC;
-	gai_hints.ai_socktype = SOCK_STREAM;
-
-	port = strchr(name, ':');
-	if (!port) {
-		RPRINTF("missing host in %s\n", name);
-		return -ENOENT;
-	}
-	if (!(host = strndup(name, port - name))) {
-		RPRINTF("unable to allocate host\n");
-		return -ENOMEM;
-	}
-	port++;
-
-	if ((gai_status = getaddrinfo(host, port, &gai_hints, &servinfo)) != 0) {
-		RPRINTF("getaddrinfo error: %s\n", gai_strerror(gai_status));
-		return -ENOENT;
-	}
-
-	/* TODO: do something smarter here */
-	valid_addr = 0;
-	for(servinfo_itr = servinfo; servinfo_itr != NULL; servinfo_itr = servinfo_itr->ai_next) {
-		void *addr;
-		char *ipver;
-
-		if (servinfo_itr->ai_family == AF_INET) {
-			valid_addr = 1;
-			memset(&state->sa, 0, sizeof(state->sa));
-			state->sa = *(struct sockaddr_in *)servinfo_itr->ai_addr;
-			break;
-		}
-	}
-	freeaddrinfo(servinfo);
-
-	if (!valid_addr)
-		return -ENOENT;
-
-	RPRINTF("host: %s, port: %d\n", inet_ntoa(state->sa.sin_addr), ntohs(state->sa.sin_port));
-
-	return 0;
-}
-
 static int switch_mode(td_driver_t *driver, enum tdremus_mode mode)
 {
 	struct tdremus_state *s = (struct tdremus_state *)driver->data;
@@ -1343,6 +1263,20 @@ static int switch_mode(td_driver_t *driver, enum tdremus_mode mode)
 	return rc;
 }
 
+static void ctl_reopen(struct tdremus_state *s)
+{
+	ctl_unregister(s);
+	CLOSE_FD(s->ctl_fd.fd);
+	RPRINTF("FIFO closed\n");
+
+	if ((s->ctl_fd.fd = open(s->ctl_path, O_RDWR)) < 0) {
+		RPRINTF("error reopening FIFO: %d\n", errno);
+		return;
+	}
+
+	ctl_register(s);
+}
+
 static void ctl_request(event_id_t id, char mode, void *private)
 {
 	struct tdremus_state *s = (struct tdremus_state *)private;
@@ -1355,11 +1289,7 @@ static void ctl_request(event_id_t id, char mode, void *private)
 	if (!(rc = read(s->ctl_fd.fd, msg, sizeof(msg) - 1 /* append nul */))) {
 		RPRINTF("0-byte read received, reopening FIFO\n");
 		/*TODO: we may have to unregister/re-register with tapdisk_server */
-		close(s->ctl_fd.fd);
-		RPRINTF("FIFO closed\n");
-		if ((s->ctl_fd.fd = open(s->ctl_path, O_RDWR)) < 0) {
-			RPRINTF("error reopening FIFO: %d\n", errno);
-		}
+		ctl_reopen(s);
 		return;
 	}
 
@@ -1372,7 +1302,7 @@ static void ctl_request(event_id_t id, char mode, void *private)
 	msg[rc] = '\0';
 	if (!strncmp(msg, "flush", 5)) {
 		if (s->mode == mode_primary) {
-			if ((rc = s->queue_flush(driver))) {
+			if ((rc = client_flush(driver))) {
 				RPRINTF("error passing flush request to backup");
 				ctl_respond(s, TDREMUS_FAIL);
 			}
@@ -1521,6 +1451,7 @@ static void ctl_unregister(struct tdremus_state *s)
 static int tdremus_open(td_driver_t *driver, td_image_t *image, td_uuid_t uuid)
 {
 	struct tdremus_state *s = (struct tdremus_state *)driver->data;
+	td_replication_connect_t *t = &s->t;
 	int rc;
 	const char *name = image->name;
 	td_flag_t flags = image->flags;
@@ -1531,7 +1462,6 @@ static int tdremus_open(td_driver_t *driver, td_image_t *image, td_uuid_t uuid)
 	remus_image = image;
 
 	memset(s, 0, sizeof(*s));
-	s->server_fd.fd = -1;
 	s->stream_fd.fd = -1;
 	s->ctl_fd.fd = -1;
 	s->msg_fd.fd = -1;
@@ -1540,8 +1470,11 @@ static int tdremus_open(td_driver_t *driver, td_image_t *image, td_uuid_t uuid)
 	 * the driver stack from the stream_fd event handler */
 	s->tdremus_driver = driver;
 
-	/* parse name to get info etc */
-	if ((rc = get_args(driver, name)))
+	t->log_prefix = "remus";
+	t->retry_timeout_s = REMUS_CONNRETRY_TIMEOUT;
+	t->max_connections = 10;
+	t->callback = remus_server_established;
+	if ((rc = td_replication_connect_init(t, name)))
 		return rc;
 
 	if ((rc = ctl_open(driver, name))) {
@@ -1555,7 +1488,7 @@ static int tdremus_open(td_driver_t *driver, td_image_t *image, td_uuid_t uuid)
 		return rc;
 	}
 
-	if (!(rc = remus_bind(s)))
+	if (!(rc = td_replication_server_start(t)))
 		rc = switch_mode(driver, mode_backup);
 	else if (rc == -2)
 		rc = switch_mode(driver, mode_primary);
@@ -1575,8 +1508,7 @@ static int tdremus_close(td_driver_t *driver)
 	if (s->ramdisk.inprogress)
 		hashtable_destroy(s->ramdisk.inprogress, 0);
 
-	close_server_fd(s);
-	close_stream_fd(s);
+	td_replication_connect_kill(&s->t);
 	ctl_unregister(s);
 	ctl_close(s);
 
diff --git a/tools/blktap2/drivers/block-replication.h b/tools/blktap2/drivers/block-replication.h
index 9e051cc..07fd630 100644
--- a/tools/blktap2/drivers/block-replication.h
+++ b/tools/blktap2/drivers/block-replication.h
@@ -48,6 +48,7 @@
 enum {
 	ERROR_INTERNAL = -1,
 	ERROR_CONNECTION = -2,
+	ERROR_IO = -3,
 };
 
 typedef struct td_replication_connect td_replication_connect_t;
-- 
1.9.3

* [PATCH 14/17] block-remus: switch to unprotected mode before closing
  2014-10-14  2:13 [PATCH 00/17] blktap2 related bugfix patches Wen Congyang
                   ` (12 preceding siblings ...)
  2014-10-14  2:14 ` [PATCH 13/17] tools: block-remus: connect to backup asynchronously Wen Congyang
@ 2014-10-14  2:14 ` Wen Congyang
  2014-10-20  2:51   ` Shriram Rajagopalan
  2014-10-14  2:14 ` [PATCH 15/17] tools: blktap2: move ramdisk related codes to block-replication.c Wen Congyang
                   ` (4 subsequent siblings)
  18 siblings, 1 reply; 50+ messages in thread
From: Wen Congyang @ 2014-10-14  2:14 UTC (permalink / raw)
  To: xen devel
  Cc: Ian Campbell, Wen Congyang, Ian Jackson, Jiang Yunhong,
	Dong Eddie, Shriram Rajagopalan, Yang Hongyang, Lai Jiangshan

To stop tapdisk2, the user does the following:
1. close the image
2. detach from the blktap device

Closing fails if any I/O request is still pending, but
block-remus holds I/O requests until the connection to the
backup is established. Introduce a new callback,
td_pre_close(), to flush these requests before the image
is closed.

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Shriram Rajagopalan <rshriram@cs.ubc.ca>
---
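As an illustration of the ordering described in the commit message above
(not part of the patch itself): a minimal sketch of an optional pre-close
hook that is invoked before the pending-request check in the close path.
All names (demo_ops, demo_driver, demo_pre_close, demo_close_image,
demo_flush_all, demo_do_close) are hypothetical. The sketch also propagates
the hook's return value, which td_pre_close() in this patch does not do.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct demo_driver;

struct demo_ops {
	int (*pre_close)(struct demo_driver *);	/* optional, may be NULL */
	int (*close)(struct demo_driver *);
};

struct demo_driver {
	const struct demo_ops *ops;
	int pending;	/* requests still held by the driver */
};

/* let the driver flush what it holds; a missing hook is not an error */
static int demo_pre_close(struct demo_driver *d)
{
	if (!d)
		return -ENODEV;
	if (!d->ops->pre_close)
		return 0;
	return d->ops->pre_close(d);
}

/* flush first, then refuse to close while requests are still pending */
static int demo_close_image(struct demo_driver *d)
{
	int rc = demo_pre_close(d);

	if (rc)
		return rc;
	if (d->pending)
		return -EAGAIN;
	return d->ops->close(d);
}

/* example hook: pretend every held request has been forwarded */
static int demo_flush_all(struct demo_driver *d)
{
	d->pending = 0;
	return 0;
}

static int demo_do_close(struct demo_driver *d)
{
	(void)d;
	return 0;
}
```

A driver without the hook keeps its old behaviour: the close attempt still
returns -EAGAIN while requests are pending.
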
 tools/blktap2/drivers/block-remus.c       | 14 ++++++++++++++
 tools/blktap2/drivers/block-replication.h |  1 +
 tools/blktap2/drivers/tapdisk-control.c   |  6 ++++++
 tools/blktap2/drivers/tapdisk-interface.c | 18 ++++++++++++++++++
 tools/blktap2/drivers/tapdisk-interface.h |  1 +
 tools/blktap2/drivers/tapdisk-vbd.c       |  9 +++++++++
 tools/blktap2/drivers/tapdisk-vbd.h       |  1 +
 tools/blktap2/drivers/tapdisk.h           |  1 +
 8 files changed, 51 insertions(+)

diff --git a/tools/blktap2/drivers/block-remus.c b/tools/blktap2/drivers/block-remus.c
index a2b9f62..09dc46f 100644
--- a/tools/blktap2/drivers/block-remus.c
+++ b/tools/blktap2/drivers/block-remus.c
@@ -752,6 +752,8 @@ static void primary_failed(struct tdremus_state *s, int rc)
 	td_replication_connect_kill(&s->t);
 	if (rc == ERROR_INTERNAL)
 		RPRINTF("switch to unprotected mode due to internal error");
+	if (rc == ERROR_CLOSE)
+		RPRINTF("switch to unprotected mode before closing");
 	UNREGISTER_EVENT(s->stream_fd.id);
 	switch_mode(s->tdremus_driver, mode_unprotected);
 }
@@ -1500,6 +1502,17 @@ static int tdremus_open(td_driver_t *driver, td_image_t *image, td_uuid_t uuid)
 	return -EIO;
 }
 
+static int tdremus_pre_close(td_driver_t *driver)
+{
+	struct tdremus_state *s = (struct tdremus_state *)driver->data;
+
+	if (s->mode != mode_primary)
+		return 0;
+
+	primary_failed(s, ERROR_CLOSE);
+	return 0;
+}
+
 static int tdremus_close(td_driver_t *driver)
 {
 	struct tdremus_state *s = (struct tdremus_state *)driver->data;
@@ -1533,6 +1546,7 @@ struct tap_disk tapdisk_remus = {
 	.td_open            = tdremus_open,
 	.td_queue_read      = unprotected_queue_read,
 	.td_queue_write     = unprotected_queue_write,
+	.td_pre_close       = tdremus_pre_close,
 	.td_close           = tdremus_close,
 	.td_get_parent_id   = tdremus_get_parent_id,
 	.td_validate_parent = tdremus_validate_parent,
diff --git a/tools/blktap2/drivers/block-replication.h b/tools/blktap2/drivers/block-replication.h
index 07fd630..358c08b 100644
--- a/tools/blktap2/drivers/block-replication.h
+++ b/tools/blktap2/drivers/block-replication.h
@@ -49,6 +49,7 @@ enum {
 	ERROR_INTERNAL = -1,
 	ERROR_CONNECTION = -2,
 	ERROR_IO = -3,
+	ERROR_CLOSE = -4,
 };
 
 typedef struct td_replication_connect td_replication_connect_t;
diff --git a/tools/blktap2/drivers/tapdisk-control.c b/tools/blktap2/drivers/tapdisk-control.c
index 4e5f748..2fa4cbe 100644
--- a/tools/blktap2/drivers/tapdisk-control.c
+++ b/tools/blktap2/drivers/tapdisk-control.c
@@ -508,6 +508,12 @@ tapdisk_control_close_image(struct tapdisk_control_connection *connection,
 		goto out;
 	}
 
+	/*
+	 * Some I/O requests may still be pending in the driver;
+	 * flush them first.
+	 */
+	tapdisk_vbd_pre_close_vdi(vbd);
+
 	if (!list_empty(&vbd->pending_requests)) {
 		err = -EAGAIN;
 		goto out;
diff --git a/tools/blktap2/drivers/tapdisk-interface.c b/tools/blktap2/drivers/tapdisk-interface.c
index a29de64..ed92e12 100644
--- a/tools/blktap2/drivers/tapdisk-interface.c
+++ b/tools/blktap2/drivers/tapdisk-interface.c
@@ -105,6 +105,24 @@ td_open(td_image_t *image)
 }
 
 int
+td_pre_close(td_image_t *image)
+{
+	td_driver_t *driver;
+
+	driver = image->driver;
+	if (!driver)
+		return -ENODEV;
+
+	if (!driver->ops->td_pre_close)
+		return 0;
+
+	if (driver->refcnt && td_flag_test(driver->state, TD_DRIVER_OPEN))
+		driver->ops->td_pre_close(driver);
+
+	return 0;
+}
+
+int
 td_close(td_image_t *image)
 {
 	td_driver_t *driver;
diff --git a/tools/blktap2/drivers/tapdisk-interface.h b/tools/blktap2/drivers/tapdisk-interface.h
index adc4376..ba9b3ea 100644
--- a/tools/blktap2/drivers/tapdisk-interface.h
+++ b/tools/blktap2/drivers/tapdisk-interface.h
@@ -34,6 +34,7 @@
 int td_open(td_image_t *);
 int __td_open(td_image_t *, td_disk_info_t *);
 int td_load(td_image_t *);
+int td_pre_close(td_image_t *);
 int td_close(td_image_t *);
 int td_get_parent_id(td_image_t *, td_disk_id_t *);
 int td_validate_parent(td_image_t *, td_image_t *);
diff --git a/tools/blktap2/drivers/tapdisk-vbd.c b/tools/blktap2/drivers/tapdisk-vbd.c
index c665f27..aba545b 100644
--- a/tools/blktap2/drivers/tapdisk-vbd.c
+++ b/tools/blktap2/drivers/tapdisk-vbd.c
@@ -180,6 +180,15 @@ tapdisk_vbd_validate_chain(td_vbd_t *vbd)
 }
 
 void
+tapdisk_vbd_pre_close_vdi(td_vbd_t *vbd)
+{
+	td_image_t *image, *tmp;
+
+	tapdisk_vbd_for_each_image(vbd, image, tmp)
+		td_pre_close(image);
+}
+
+void
 tapdisk_vbd_close_vdi(td_vbd_t *vbd)
 {
 	td_image_t *image, *tmp;
diff --git a/tools/blktap2/drivers/tapdisk-vbd.h b/tools/blktap2/drivers/tapdisk-vbd.h
index be084b2..040f2b8 100644
--- a/tools/blktap2/drivers/tapdisk-vbd.h
+++ b/tools/blktap2/drivers/tapdisk-vbd.h
@@ -181,6 +181,7 @@ void tapdisk_vbd_free_stack(td_vbd_t *);
 int tapdisk_vbd_open_stack(td_vbd_t *, uint16_t, td_flag_t);
 int tapdisk_vbd_open_vdi(td_vbd_t *, const char *,
 			 uint16_t, uint16_t, td_flag_t);
+void tapdisk_vbd_pre_close_vdi(td_vbd_t *);
 void tapdisk_vbd_close_vdi(td_vbd_t *);
 
 int tapdisk_vbd_attach(td_vbd_t *, const char *, int);
diff --git a/tools/blktap2/drivers/tapdisk.h b/tools/blktap2/drivers/tapdisk.h
index 3c3b51d..16efd07 100644
--- a/tools/blktap2/drivers/tapdisk.h
+++ b/tools/blktap2/drivers/tapdisk.h
@@ -158,6 +158,7 @@ struct tap_disk {
 	td_flag_t                    flags;
 	int                          private_data_size;
 	int (*td_open)               (td_driver_t *, td_image_t *, td_uuid_t);
+	int (*td_pre_close)          (td_driver_t *);
 	int (*td_close)              (td_driver_t *);
 	int (*td_get_parent_id)      (td_driver_t *, td_disk_id_t *);
 	int (*td_validate_parent)    (td_driver_t *, td_driver_t *, td_flag_t);
-- 
1.9.3
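
For readers following the td_pre_close() plumbing in this patch, here is a
minimal standalone sketch of the same "optional hook in an ops table" pattern.
All demo_* names are hypothetical stand-ins and not part of blktap2. The
sketch assumes, as the patch does, that a missing driver is an error, a
missing hook is a no-op, and the hook only runs on an open, referenced driver.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for td_driver_t and struct tap_disk. */
struct demo_driver;

struct demo_ops {
	int (*pre_close)(struct demo_driver *);	/* optional, may be NULL */
	int (*close)(struct demo_driver *);	/* mandatory */
};

struct demo_driver {
	const struct demo_ops *ops;
	int refcnt;
	int is_open;
	int flushed;	/* set by the demo pre_close hook */
};

/*
 * Same shape as td_pre_close() in the patch: a missing driver is an
 * error, a missing hook is silently skipped, and the hook is invoked
 * only while the driver is referenced and open.
 */
static int demo_pre_close(struct demo_driver *driver)
{
	if (!driver)
		return -ENODEV;

	if (!driver->ops->pre_close)
		return 0;

	if (driver->refcnt && driver->is_open)
		driver->ops->pre_close(driver);

	return 0;
}

/* Demo hook: records that pending work was flushed. */
static int demo_flush_hook(struct demo_driver *driver)
{
	driver->flushed = 1;
	return 0;
}

static int demo_close_hook(struct demo_driver *driver)
{
	driver->is_open = 0;
	return 0;
}

static const struct demo_ops ops_with_hook = { demo_flush_hook, demo_close_hook };
static const struct demo_ops ops_without_hook = { NULL, demo_close_hook };
```

Keeping the hook optional means the other drivers do not need to change; only
a driver that buffers I/O internally, such as block-remus here, has to
implement it.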


* [PATCH 15/17] tools: blktap2: move ramdisk related codes to block-replication.c
  2014-10-14  2:13 [PATCH 00/17] blktap2 related bugfix patches Wen Congyang
                   ` (13 preceding siblings ...)
  2014-10-14  2:14 ` [PATCH 14/17] block-remus: switch to unprotected mode before closing Wen Congyang
@ 2014-10-14  2:14 ` Wen Congyang
  2014-10-20  2:52   ` Shriram Rajagopalan
  2014-10-14  2:14 ` [PATCH 16/17] support blktap remus in xl Wen Congyang
                   ` (3 subsequent siblings)
  18 siblings, 1 reply; 50+ messages in thread
From: Wen Congyang @ 2014-10-14  2:14 UTC (permalink / raw)
  To: xen devel
  Cc: Ian Campbell, Wen Congyang, Ian Jackson, Jiang Yunhong,
	Dong Eddie, Shriram Rajagopalan, Yang Hongyang, Lai Jiangshan

Move the ramdisk related code from block-remus.c to
block-replication.c so that COLO can reuse it.

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Shriram Rajagopalan <rshriram@cs.ubc.ca>
---
 tools/blktap2/drivers/block-remus.c       | 480 +-----------------------------
 tools/blktap2/drivers/block-replication.c | 460 ++++++++++++++++++++++++++++
 tools/blktap2/drivers/block-replication.h |  65 ++++
 3 files changed, 539 insertions(+), 466 deletions(-)
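
The ramdisk being moved stages each write in three tables: primary_cache
(writes received since the last checkpoint), prev (writes waiting to be
issued) and inprogress (writes issued but not yet completed). The toy model
below illustrates that lifecycle only and is not the patch's code. It assumes
a 64-sector disk, replaces each hashtable with a bitmask, drops the sector
data, and checks overlap per sector, whereas the real code holds back a whole
merged batch. All toy_* names are hypothetical.

```c
#include <stdint.h>

/* Toy model: each stage is a bitmask over sectors 0..63 (data omitted). */
struct toy_ramdisk {
	uint64_t primary_cache;	/* writes received since the last checkpoint */
	uint64_t prev;		/* writes waiting to be issued to disk */
	uint64_t inprogress;	/* writes issued and not yet completed */
};

/* Stage one write; sector must be below 64 in this toy model. */
static void toy_cache_write(struct toy_ramdisk *r, unsigned sector)
{
	r->primary_cache |= UINT64_C(1) << sector;
}

/* Issue everything in prev that does not overlap an in-flight write. */
static void toy_flush_pended(struct toy_ramdisk *r)
{
	uint64_t issue = r->prev & ~r->inprogress;

	r->inprogress |= issue;
	r->prev &= ~issue;
}

/* Checkpoint: merge the cache into prev and start with an empty cache. */
static void toy_start_flush(struct toy_ramdisk *r)
{
	if (!r->primary_cache)
		return;

	r->prev |= r->primary_cache;
	r->primary_cache = 0;
	toy_flush_pended(r);
}

/* Completion of in-flight writes; retry whatever was held back. */
static void toy_complete(struct toy_ramdisk *r, uint64_t done)
{
	r->inprogress &= ~done;
	if (!r->inprogress && r->prev)
		toy_flush_pended(r);
}

/* Nonzero while anything is issued or still waiting to be issued. */
static int toy_writes_inflight(const struct toy_ramdisk *r)
{
	return r->inprogress || r->prev;
}
```

Holding an overlapping sector back in prev until the earlier write completes
is what avoids two writes to the same sector sitting in the disk queue at
the same time.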

diff --git a/tools/blktap2/drivers/block-remus.c b/tools/blktap2/drivers/block-remus.c
index 09dc46f..c7b429c 100644
--- a/tools/blktap2/drivers/block-remus.c
+++ b/tools/blktap2/drivers/block-remus.c
@@ -37,9 +37,6 @@
 #include "tapdisk-server.h"
 #include "tapdisk-driver.h"
 #include "tapdisk-interface.h"
-#include "hashtable.h"
-#include "hashtable_itr.h"
-#include "hashtable_utility.h"
 #include "block-replication.h"
 
 #include <errno.h>
@@ -58,7 +55,6 @@
 
 /* timeout for reads and writes in ms */
 #define HEARTBEAT_MS 1000
-#define RAMDISK_HASHSIZE 128
 
 /* connect retry timeout (seconds) */
 #define REMUS_CONNRETRY_TIMEOUT 1
@@ -97,51 +93,6 @@ td_vbd_t *device_vbd = NULL;
 td_image_t *remus_image = NULL;
 struct tap_disk tapdisk_remus;
 
-struct ramdisk {
-	size_t sector_size;
-	struct hashtable* h;
-	/* when a ramdisk is flushed, h is given a new empty hash for writes
-	 * while the old ramdisk (prev) is drained asynchronously.
-	 */
-	struct hashtable* prev;
-	/* count of outstanding requests to the base driver */
-	size_t inflight;
-	/* prev holds the requests to be flushed, while inprogress holds
-	 * requests being flushed. When requests complete, they are removed
-	 * from inprogress.
-	 * Whenever a new flush is merged with ongoing flush (i.e, prev),
-	 * we have to make sure that none of the new requests overlap with
-	 * ones in "inprogress". If it does, keep it back in prev and dont issue
-	 * IO until the current one finishes. If we allow this IO to proceed,
-	 * we might end up with two "overlapping" requests in the disk's queue and
-	 * the disk may not offer any guarantee on which one is written first.
-	 * IOW, make sure we dont create a write-after-write time ordering constraint.
-	 * 
-	 */
-	struct hashtable* inprogress;
-};
-
-/* the ramdisk intercepts the original callback for reads and writes.
- * This holds the original data. */
-/* Might be worth making this a static array in struct ramdisk to avoid
- * a malloc per request */
-
-struct tdremus_state;
-
-struct ramdisk_cbdata {
-	td_callback_t cb;
-	void* private;
-	char* buf;
-	struct tdremus_state* state;
-};
-
-struct ramdisk_write_cbdata {
-	struct tdremus_state* state;
-	char* buf;
-};
-
-typedef void (*queue_rw_t) (td_driver_t *driver, td_request_t treq);
-
 /* poll_fd type for blktap2 fd system. taken from block_log.c */
 typedef struct poll_fd {
 	int        fd;
@@ -168,7 +119,7 @@ struct tdremus_state {
 	 */
 	struct req_ring queued_io;
 
-	/* ramdisk data*/
+	/* ramdisk data */
 	struct ramdisk ramdisk;
 
 	/* mode methods */
@@ -239,404 +190,14 @@ static void ring_add_request(struct req_ring *ring, const td_request_t *treq)
 	ring->prod = ring_next(ring->prod);
 }
 
-/* Prototype declarations */
-static int ramdisk_flush(td_driver_t *driver, struct tdremus_state* s);
-
-/* functions to create and sumbit treq's */
-
-static void
-replicated_write_callback(td_request_t treq, int err)
-{
-	struct tdremus_state *s = (struct tdremus_state *) treq.cb_data;
-	td_vbd_request_t *vreq;
-	int i;
-	uint64_t start;
-	vreq = (td_vbd_request_t *) treq.private;
-
-	/* the write failed for now, lets panic. this is very bad */
-	if (err) {
-		RPRINTF("ramdisk write failed, disk image is not consistent\n");
-		exit(-1);
-	}
-
-	/* The write succeeded. let's pull the vreq off whatever request list
-	 * it is on and free() it */
-	list_del(&vreq->next);
-	free(vreq);
-
-	s->ramdisk.inflight--;
-	start = treq.sec;
-	for (i = 0; i < treq.secs; i++) {
-		hashtable_remove(s->ramdisk.inprogress, &start);
-		start++;
-	}
-	free(treq.buf);
-
-	if (!s->ramdisk.inflight && !s->ramdisk.prev) {
-		/* TODO: the ramdisk has been flushed */
-	}
-}
-
-static inline int
-create_write_request(struct tdremus_state *state, td_sector_t sec, int secs, char *buf)
-{
-	td_request_t treq;
-	td_vbd_request_t *vreq;
-
-	treq.op      = TD_OP_WRITE;
-	treq.buf     = buf;
-	treq.sec     = sec;
-	treq.secs    = secs;
-	treq.image   = remus_image;
-	treq.cb      = replicated_write_callback;
-	treq.cb_data = state;
-	treq.id      = 0;
-	treq.sidx    = 0;
-
-	vreq         = calloc(1, sizeof(td_vbd_request_t));
-	treq.private = vreq;
-
-	if(!vreq)
-		return -1;
-
-	vreq->submitting = 1;
-	INIT_LIST_HEAD(&vreq->next);
-	tapdisk_vbd_move_request(treq.private, &device_vbd->pending_requests);
-
-	/* TODO:
-	 * we should probably leave it up to the caller to forward the request */
-	td_forward_request(treq);
-
-	vreq->submitting--;
-
-	return 0;
-}
-
-
-/* http://www.concentric.net/~Ttwang/tech/inthash.htm */
-static unsigned int uint64_hash(void* k)
-{
-	uint64_t key = *(uint64_t*)k;
-
-	key = (~key) + (key << 18);
-	key = key ^ (key >> 31);
-	key = key * 21;
-	key = key ^ (key >> 11);
-	key = key + (key << 6);
-	key = key ^ (key >> 22);
-
-	return (unsigned int)key;
-}
-
-static int rd_hash_equal(void* k1, void* k2)
-{
-	uint64_t key1, key2;
-
-	key1 = *(uint64_t*)k1;
-	key2 = *(uint64_t*)k2;
-
-	return key1 == key2;
-}
-
-static int ramdisk_read(struct ramdisk* ramdisk, uint64_t sector,
-			int nb_sectors, char* buf)
-{
-	int i;
-	char* v;
-	uint64_t key;
-
-	for (i = 0; i < nb_sectors; i++) {
-		key = sector + i;
-		/* check whether it is queued in a previous flush request */
-		if (!(ramdisk->prev && (v = hashtable_search(ramdisk->prev, &key)))) {
-			/* check whether it is an ongoing flush */
-			if (!(ramdisk->inprogress && (v = hashtable_search(ramdisk->inprogress, &key))))
-				return -1;
-		}
-		memcpy(buf + i * ramdisk->sector_size, v, ramdisk->sector_size);
-	}
-
-	return 0;
-}
-
-static int ramdisk_write_hash(struct hashtable* h, uint64_t sector, char* buf,
-			      size_t len)
-{
-	char* v;
-	uint64_t* key;
-
-	if ((v = hashtable_search(h, &sector))) {
-		memcpy(v, buf, len);
-		return 0;
-	}
-
-	if (!(v = malloc(len))) {
-		DPRINTF("ramdisk_write_hash: malloc failed\n");
-		return -1;
-	}
-	memcpy(v, buf, len);
-	if (!(key = malloc(sizeof(*key)))) {
-		DPRINTF("ramdisk_write_hash: error allocating key\n");
-		free(v);
-		return -1;
-	}
-	*key = sector;
-	if (!hashtable_insert(h, key, v)) {
-		DPRINTF("ramdisk_write_hash failed on sector %" PRIu64 "\n", sector);
-		free(key);
-		free(v);
-		return -1;
-	}
-
-	return 0;
-}
-
-static inline int ramdisk_write(struct ramdisk* ramdisk, uint64_t sector,
-				int nb_sectors, char* buf)
-{
-	int i, rc;
-
-	for (i = 0; i < nb_sectors; i++) {
-		rc = ramdisk_write_hash(ramdisk->h, sector + i,
-					buf + i * ramdisk->sector_size,
-					ramdisk->sector_size);
-		if (rc)
-			return rc;
-	}
-
-	return 0;
-}
-
-static int uint64_compare(const void* k1, const void* k2)
-{
-	uint64_t u1 = *(uint64_t*)k1;
-	uint64_t u2 = *(uint64_t*)k2;
-
-	/* u1 - u2 is unsigned */
-	return u1 < u2 ? -1 : u1 > u2 ? 1 : 0;
-}
-
-/* set psectors to an array of the sector numbers in the hash, returning
- * the number of entries (or -1 on error) */
-static int ramdisk_get_sectors(struct hashtable* h, uint64_t** psectors)
-{
-	struct hashtable_itr* itr;
-	uint64_t* sectors;
-	int count;
-
-	if (!(count = hashtable_count(h)))
-		return 0;
-
-	if (!(*psectors = malloc(count * sizeof(uint64_t)))) {
-		DPRINTF("ramdisk_get_sectors: error allocating sector map\n");
-		return -1;
-	}
-	sectors = *psectors;
-
-	itr = hashtable_iterator(h);
-	count = 0;
-	do {
-		sectors[count++] = *(uint64_t*)hashtable_iterator_key(itr);
-	} while (hashtable_iterator_advance(itr));
-	free(itr);
-
-	return count;
-}
-
-/*
-  return -1 for OOM
-  return -2 for merge lookup failure
-  return -3 for WAW race
-  return 0 on success.
-*/
-static int merge_requests(struct ramdisk* ramdisk, uint64_t start,
-			size_t count, char **mergedbuf)
-{
-	char* buf;
-	char* sector;
-	int i;
-	uint64_t *key;
-	int rc = 0;
-
-	if (!(buf = valloc(count * ramdisk->sector_size))) {
-		DPRINTF("merge_request: allocation failed\n");
-		return -1;
-	}
-
-	for (i = 0; i < count; i++) {
-		if (!(sector = hashtable_search(ramdisk->prev, &start))) {
-			DPRINTF("merge_request: lookup failed on %"PRIu64"\n", start);
-			free(buf);
-			rc = -2;
-			goto fail;
-		}
-
-		/* Check inprogress requests to avoid waw non-determinism */
-		if (hashtable_search(ramdisk->inprogress, &start)) {
-			DPRINTF("merge_request: WAR RACE on %"PRIu64"\n", start);
-			free(buf);
-			rc = -3;
-			goto fail;
-		}
-		/* Insert req into inprogress (brief period of duplication of hash entries until
-		 * they are removed from prev. Read tracking would not be reading wrong entries)
-		 */
-		if (!(key = malloc(sizeof(*key)))) {
-			DPRINTF("%s: error allocating key\n", __FUNCTION__);
-			free(buf);			
-			rc = -1;
-			goto fail;
-		}
-		*key = start;
-		if (!hashtable_insert(ramdisk->inprogress, key, NULL)) {
-			DPRINTF("%s failed to insert sector %" PRIu64 " into inprogress hash\n", 
-				__FUNCTION__, start);
-			free(key);
-			free(buf);
-			rc = -1;
-			goto fail;
-		}
-		memcpy(buf + i * ramdisk->sector_size, sector, ramdisk->sector_size);
-		start++;
-	}
-
-	*mergedbuf = buf;
-	return 0;
-fail:
-	for (start--; i >0; i--, start--)
-		hashtable_remove(ramdisk->inprogress, &start);
-	return rc;
-}
-
-/* The underlying driver may not handle having the whole ramdisk queued at
- * once. We queue what we can and let the callbacks attempt to queue more. */
-/* NOTE: may be called from callback, while dd->private still belongs to
- * the underlying driver */
-static int ramdisk_flush(td_driver_t *driver, struct tdremus_state* s)
-{
-	uint64_t* sectors;
-	char* buf = NULL;
-	uint64_t base, batchlen;
-	int i, j, count = 0;
-
-	// RPRINTF("ramdisk flush\n");
-
-	if ((count = ramdisk_get_sectors(s->ramdisk.prev, &sectors)) <= 0)
-		return count;
-
-	/* Create the inprogress table if empty */
-	if (!s->ramdisk.inprogress)
-		s->ramdisk.inprogress = create_hashtable(RAMDISK_HASHSIZE,
-							uint64_hash,
-							rd_hash_equal);
-	
-	/*
-	  RPRINTF("ramdisk: flushing %d sectors\n", count);
-	*/
-
-	/* sort and merge sectors to improve disk performance */
-	qsort(sectors, count, sizeof(*sectors), uint64_compare);
-
-	for (i = 0; i < count;) {
-		base = sectors[i++];
-		while (i < count && sectors[i] == sectors[i-1] + 1)
-			i++;
-		batchlen = sectors[i-1] - base + 1;
-
-		j = merge_requests(&s->ramdisk, base, batchlen, &buf);
-			
-		if (j) {
-			RPRINTF("ramdisk_flush: merge_requests failed:%s\n",
-				j == -1? "OOM": (j==-2? "missing sector" : "WAW race"));
-			if (j == -3) continue;
-			free(sectors);
-			return -1;
-		}
-
-		/* NOTE: create_write_request() creates a treq AND forwards it down
-		 * the driver chain */
-		// RPRINTF("forwarding write request at %" PRIu64 ", length: %" PRIu64 "\n", base, batchlen);
-		create_write_request(s, base, batchlen, buf);
-		//RPRINTF("write request at %" PRIu64 ", length: %" PRIu64 " forwarded\n", base, batchlen);
-
-		s->ramdisk.inflight++;
-
-		for (j = 0; j < batchlen; j++) {
-			buf = hashtable_search(s->ramdisk.prev, &base);
-			free(buf);
-			hashtable_remove(s->ramdisk.prev, &base);
-			base++;
-		}
-	}
-
-	if (!hashtable_count(s->ramdisk.prev)) {
-		/* everything is in flight */
-		hashtable_destroy(s->ramdisk.prev, 0);
-		s->ramdisk.prev = NULL;
-	}
-
-	free(sectors);
-
-	// RPRINTF("ramdisk flush done\n");
-	return 0;
-}
-
-/* flush ramdisk contents to disk */
-static int ramdisk_start_flush(td_driver_t *driver)
-{
-	struct tdremus_state *s = (struct tdremus_state *)driver->data;
-	uint64_t* key;
-	char* buf;
-	int rc = 0;
-	int i, j, count, batchlen;
-	uint64_t* sectors;
-
-	if (!hashtable_count(s->ramdisk.h)) {
-		/*
-		  RPRINTF("Nothing to flush\n");
-		*/
-		return 0;
-	}
-
-	if (s->ramdisk.prev) {
-		/* a flush request issued while a previous flush is still in progress
-		 * will merge with the previous request. If you want the previous
-		 * request to be consistent, wait for it to complete. */
-		if ((count = ramdisk_get_sectors(s->ramdisk.h, &sectors)) < 0)
-			return count;
-
-		for (i = 0; i < count; i++) {
-			buf = hashtable_search(s->ramdisk.h, sectors + i);
-			ramdisk_write_hash(s->ramdisk.prev, sectors[i], buf,
-					   s->ramdisk.sector_size);
-		}
-		free(sectors);
-
-		hashtable_destroy (s->ramdisk.h, 1);
-	} else
-		s->ramdisk.prev = s->ramdisk.h;
-
-	/* We create a new hashtable so that new writes can be performed before
-	 * the old hashtable is completely drained. */
-	s->ramdisk.h = create_hashtable(RAMDISK_HASHSIZE, uint64_hash,
-					rd_hash_equal);
-
-	return ramdisk_flush(driver, s);
-}
-
-
 static int ramdisk_start(td_driver_t *driver)
 {
 	struct tdremus_state *s = (struct tdremus_state *)driver->data;
 
-	if (s->ramdisk.h) {
-		RPRINTF("ramdisk already allocated\n");
-		return 0;
-	}
-
 	s->ramdisk.sector_size = driver->info.sector_size;
-	s->ramdisk.h = create_hashtable(RAMDISK_HASHSIZE, uint64_hash,
-					rd_hash_equal);
+	s->ramdisk.log_prefix = "remus";
+	s->ramdisk.image = remus_image;
+	ramdisk_init(&s->ramdisk);
 
 	DPRINTF("Ramdisk started, %zu bytes/sector\n", s->ramdisk.sector_size);
 
@@ -917,13 +478,9 @@ static int client_flush(td_driver_t *driver)
 static int server_flush(td_driver_t *driver)
 {
 	struct tdremus_state *s = (struct tdremus_state *)driver->data;
-	/* 
-	 * Nothing to flush in beginning.
-	 */
-	if (!s->ramdisk.prev)
-		return 0;
+
 	/* Try to flush any remaining requests */
-	return ramdisk_flush(driver, s);	
+	return ramdisk_flush_pended_requests(&s->ramdisk);
 }
 
 /* It is called when switching the mode from primary to unprotected */
@@ -1030,10 +587,7 @@ static inline int server_writes_inflight(td_driver_t *driver)
 {
 	struct tdremus_state *s = (struct tdremus_state *)driver->data;
 
-	if (!s->ramdisk.inflight && !s->ramdisk.prev)
-		return 0;
-
-	return 1;
+	return ramdisk_writes_inflight(&s->ramdisk);
 }
 
 /* Due to block device prefetching this code may be called on the server side
@@ -1116,7 +670,9 @@ static void server_do_wreq(td_driver_t *driver)
 	if (mread(s->stream_fd.fd, buf, len) < 0)
 		goto err;
 
-	if (ramdisk_write(&s->ramdisk, *sector, *sectors, buf) < 0) {
+	if (ramdisk_cache_write_request(&s->ramdisk, *sector, *sectors,
+					driver->info.sector_size, buf,
+					"remus") < 0) {
 		rc = ERROR_INTERNAL;
 		goto err;
 	}
@@ -1137,7 +693,7 @@ static void server_do_creq(td_driver_t *driver)
 
 	// RPRINTF("committing buffer\n");
 
-	ramdisk_start_flush(driver);
+	ramdisk_start_flush(&s->ramdisk);
 
 	/* XXX this message should not be sent until flush completes! */
 	if (write(s->stream_fd.fd, TDREMUS_DONE, strlen(TDREMUS_DONE)) != 4)
@@ -1184,12 +740,7 @@ void unprotected_queue_read(td_driver_t *driver, td_request_t treq)
 
 	/* wait for previous ramdisk to flush  before servicing reads */
 	if (server_writes_inflight(driver)) {
-		/* for now lets just return EBUSY.
-		 * if there are any left-over requests in prev,
-		 * kick em again.
-		 */
-		if(!s->ramdisk.inflight) /* nothing in inprogress */
-			ramdisk_flush(driver, s);
+		ramdisk_flush_pended_requests(&s->ramdisk);
 
 		td_complete_request(treq, -EBUSY);
 	}
@@ -1207,8 +758,7 @@ void unprotected_queue_write(td_driver_t *driver, td_request_t treq)
 	/* wait for previous ramdisk to flush */
 	if (server_writes_inflight(driver)) {
 		RPRINTF("queue_write: waiting for queue to drain");
-		if(!s->ramdisk.inflight) /* nothing in inprogress. Kick prev */
-			ramdisk_flush(driver, s);
+		ramdisk_flush_pended_requests(&s->ramdisk);
 		td_complete_request(treq, -EBUSY);
 	}
 	else {
@@ -1518,9 +1068,7 @@ static int tdremus_close(td_driver_t *driver)
 	struct tdremus_state *s = (struct tdremus_state *)driver->data;
 
 	RPRINTF("closing\n");
-	if (s->ramdisk.inprogress)
-		hashtable_destroy(s->ramdisk.inprogress, 0);
-
+	ramdisk_destroy(&s->ramdisk);
 	td_replication_connect_kill(&s->t);
 	ctl_unregister(s);
 	ctl_close(s);
diff --git a/tools/blktap2/drivers/block-replication.c b/tools/blktap2/drivers/block-replication.c
index e4b2679..82d7609 100644
--- a/tools/blktap2/drivers/block-replication.c
+++ b/tools/blktap2/drivers/block-replication.c
@@ -15,6 +15,10 @@
 
 #include "tapdisk-server.h"
 #include "block-replication.h"
+#include "tapdisk-interface.h"
+#include "hashtable.h"
+#include "hashtable_itr.h"
+#include "hashtable_utility.h"
 
 #include <string.h>
 #include <errno.h>
@@ -30,6 +34,8 @@
 #define DPRINTF(_f, _a...) syslog (LOG_DEBUG, "%s: " _f, log_prefix, ## _a)
 #define EPRINTF(_f, _a...) syslog (LOG_ERR, "%s: " _f, log_prefix, ## _a)
 
+#define RAMDISK_HASHSIZE 128
+
 /* connection status */
 enum {
 	connection_none,
@@ -466,3 +472,457 @@ static void td_replication_connect_event(event_id_t id, char mode,
 fail:
 	td_replication_client_failed(t, rc);
 }
+
+
+/* I/O replication */
+static void replicated_write_callback(td_request_t treq, int err)
+{
+	ramdisk_t *ramdisk = treq.cb_data;
+	td_vbd_request_t *vreq = treq.private;
+	int i;
+	uint64_t start;
+	const char *log_prefix = ramdisk->log_prefix;
+
+	/* the write failed for now, lets panic. this is very bad */
+	if (err) {
+		EPRINTF("ramdisk write failed, disk image is not consistent\n");
+		exit(-1);
+	}
+
+	/*
+	 * The write succeeded. let's pull the vreq off whatever request list
+	 * it is on and free() it
+	 */
+	list_del(&vreq->next);
+	free(vreq);
+
+	ramdisk->inflight--;
+	start = treq.sec;
+	for (i = 0; i < treq.secs; i++) {
+		hashtable_remove(ramdisk->inprogress, &start);
+		start++;
+	}
+	free(treq.buf);
+
+	if (!ramdisk->inflight && ramdisk->prev)
+		ramdisk_flush_pended_requests(ramdisk);
+}
+
+static int
+create_write_request(ramdisk_t *ramdisk, td_sector_t sec, int secs, char *buf)
+{
+	td_request_t treq;
+	td_vbd_request_t *vreq;
+	td_vbd_t *vbd = ramdisk->image->private;
+
+	treq.op      = TD_OP_WRITE;
+	treq.buf     = buf;
+	treq.sec     = sec;
+	treq.secs    = secs;
+	treq.image   = ramdisk->image;
+	treq.cb      = replicated_write_callback;
+	treq.cb_data = ramdisk;
+	treq.id      = 0;
+	treq.sidx    = 0;
+
+	vreq         = calloc(1, sizeof(td_vbd_request_t));
+	treq.private = vreq;
+
+	if(!vreq)
+		return -1;
+
+	vreq->submitting = 1;
+	INIT_LIST_HEAD(&vreq->next);
+	tapdisk_vbd_move_request(treq.private, &vbd->pending_requests);
+
+	td_forward_request(treq);
+
+	vreq->submitting--;
+
+	return 0;
+}
+
+/* http://www.concentric.net/~Ttwang/tech/inthash.htm */
+static unsigned int uint64_hash(void *k)
+{
+	uint64_t key = *(uint64_t*)k;
+
+	key = (~key) + (key << 18);
+	key = key ^ (key >> 31);
+	key = key * 21;
+	key = key ^ (key >> 11);
+	key = key + (key << 6);
+	key = key ^ (key >> 22);
+
+	return (unsigned int)key;
+}
+
+static int rd_hash_equal(void *k1, void *k2)
+{
+	uint64_t key1, key2;
+
+	key1 = *(uint64_t*)k1;
+	key2 = *(uint64_t*)k2;
+
+	return key1 == key2;
+}
+
+static int uint64_compare(const void *k1, const void *k2)
+{
+	uint64_t u1 = *(uint64_t*)k1;
+	uint64_t u2 = *(uint64_t*)k2;
+
+	/* u1 - u2 is unsigned */
+	return u1 < u2 ? -1 : u1 > u2 ? 1 : 0;
+}
+
+static struct hashtable *ramdisk_new_hashtable(void)
+{
+	return create_hashtable(RAMDISK_HASHSIZE, uint64_hash, rd_hash_equal);
+}
+
+/*
+ * set psectors to an array of the sector numbers in the hash, returning
+ * the number of entries (or -1 on error)
+ */
+static int ramdisk_get_sectors(struct hashtable *h, uint64_t **psectors,
+			       const char *log_prefix)
+{
+	struct hashtable_itr* itr;
+	uint64_t* sectors;
+	int count;
+
+	if (!(count = hashtable_count(h)))
+		return 0;
+
+	if (!(*psectors = malloc(count * sizeof(uint64_t)))) {
+		DPRINTF("ramdisk_get_sectors: error allocating sector map\n");
+		return -1;
+	}
+	sectors = *psectors;
+
+	itr = hashtable_iterator(h);
+	count = 0;
+	do {
+		sectors[count++] = *(uint64_t*)hashtable_iterator_key(itr);
+	} while (hashtable_iterator_advance(itr));
+	free(itr);
+
+	return count;
+}
+
+static int ramdisk_write_hash(struct hashtable *h, uint64_t sector, char *buf,
+			      size_t len, const char *log_prefix)
+{
+	char *v;
+	uint64_t *key;
+
+	if ((v = hashtable_search(h, &sector))) {
+		memcpy(v, buf, len);
+		return 0;
+	}
+
+	if (!(v = malloc(len))) {
+		DPRINTF("ramdisk_write_hash: malloc failed\n");
+		return -1;
+	}
+	memcpy(v, buf, len);
+	if (!(key = malloc(sizeof(*key)))) {
+		DPRINTF("ramdisk_write_hash: error allocating key\n");
+		free(v);
+		return -1;
+	}
+	*key = sector;
+	if (!hashtable_insert(h, key, v)) {
+		DPRINTF("ramdisk_write_hash failed on sector %" PRIu64 "\n", sector);
+		free(key);
+		free(v);
+		return -1;
+	}
+
+	return 0;
+}
+
+/*
+ * return -1 for OOM
+ * return -2 for merge lookup failure (should not happen)
+ * return -3 for WAW race
+ * return 0 on success.
+ */
+static int merge_requests(ramdisk_t *ramdisk, uint64_t start,
+			  size_t count, char **mergedbuf)
+{
+	char* buf;
+	char* sector;
+	int i;
+	uint64_t *key;
+	int rc = 0;
+	const char *log_prefix = ramdisk->log_prefix;
+
+	if (!(buf = valloc(count * ramdisk->sector_size))) {
+		DPRINTF("merge_request: allocation failed\n");
+		return -1;
+	}
+
+	for (i = 0; i < count; i++) {
+		if (!(sector = hashtable_search(ramdisk->prev, &start))) {
+			EPRINTF("merge_request: lookup failed on %"PRIu64"\n",
+				start);
+			free(buf);
+			rc = -2;
+			goto fail;
+		}
+
+		/* Check inprogress requests to avoid waw non-determinism */
+		if (hashtable_search(ramdisk->inprogress, &start)) {
+			DPRINTF("merge_request: WAW race on %"PRIu64"\n",
+				start);
+			free(buf);
+			rc = -3;
+			goto fail;
+		}
+
+		/*
+		 * Insert req into inprogress (brief period of duplication of
+		 * hash entries until they are removed from prev. Read tracking
+		 * would not be reading wrong entries)
+		 */
+		if (!(key = malloc(sizeof(*key)))) {
+			EPRINTF("%s: error allocating key\n", __FUNCTION__);
+			free(buf);
+			rc = -1;
+			goto fail;
+		}
+		*key = start;
+		if (!hashtable_insert(ramdisk->inprogress, key, NULL)) {
+			EPRINTF("%s failed to insert sector %" PRIu64 " into inprogress hash\n",
+				__FUNCTION__, start);
+			free(key);
+			free(buf);
+			rc = -1;
+			goto fail;
+		}
+
+		memcpy(buf + i * ramdisk->sector_size, sector, ramdisk->sector_size);
+		start++;
+	}
+
+	*mergedbuf = buf;
+	return 0;
+fail:
+	for (start--; i > 0; i--, start--)
+		hashtable_remove(ramdisk->inprogress, &start);
+	return rc;
+}
+
+#define HASHTABLE_DESTROY(hashtable, free)			\
+	do {							\
+		if (hashtable) {				\
+			hashtable_destroy(hashtable, free);	\
+			hashtable = NULL;			\
+		}						\
+	} while (0)
+
+int ramdisk_init(ramdisk_t *ramdisk)
+{
+	ramdisk->inflight = 0;
+	ramdisk->prev = NULL;
+	ramdisk->inprogress = NULL;
+	ramdisk->primary_cache = ramdisk_new_hashtable();
+	if (!ramdisk->primary_cache)
+		return -1;
+
+	return 0;
+}
+
+void ramdisk_destroy(ramdisk_t *ramdisk)
+{
+	const char *log_prefix = ramdisk->log_prefix;
+
+	/*
+	 * ramdisk_destroy() is called only when the tapdisk image is about
+	 * to be closed. In this case, there are no pending requests in vbd.
+	 *
+	 * If ramdisk->inflight is not 0, it means that the requests created by
+	 * us are still in vbd->pending_requests.
+	 */
+	if (ramdisk->inflight) {
+		/* should not happen */
+		EPRINTF("cannot destroy ramdisk\n");
+		return;
+	}
+
+	HASHTABLE_DESTROY(ramdisk->inprogress, 0);
+	HASHTABLE_DESTROY(ramdisk->prev, 1);
+	HASHTABLE_DESTROY(ramdisk->primary_cache, 1);
+}
+
+int ramdisk_read(ramdisk_t *ramdisk, uint64_t sector,
+		 int nb_sectors, char *buf)
+{
+	int i;
+	char *v;
+	uint64_t key;
+
+	for (i = 0; i < nb_sectors; i++) {
+		key = sector + i;
+		/* check whether it is queued in a previous flush request */
+		if (!(ramdisk->prev &&
+		    (v = hashtable_search(ramdisk->prev, &key)))) {
+			/* check whether it is an ongoing flush */
+			if (!(ramdisk->inprogress &&
+			    (v = hashtable_search(ramdisk->inprogress, &key))))
+				return -1;
+		}
+		memcpy(buf + i * ramdisk->sector_size, v, ramdisk->sector_size);
+	}
+
+	return 0;
+}
+
+int ramdisk_cache_write_request(ramdisk_t *ramdisk, uint64_t sector,
+				int nb_sectors, size_t sector_size,
+				char *buf, const char *log_prefix)
+{
+	int i, rc;
+
+	for (i = 0; i < nb_sectors; i++) {
+		rc = ramdisk_write_hash(ramdisk->primary_cache, sector + i,
+					buf + i * sector_size,
+					sector_size, log_prefix);
+		if (rc)
+			return rc;
+	}
+
+	return 0;
+}
+
+int ramdisk_flush_pended_requests(ramdisk_t *ramdisk)
+{
+	uint64_t *sectors;
+	char *buf = NULL;
+	uint64_t base, batchlen;
+	int i, j, count = 0;
+	const char *log_prefix = ramdisk->log_prefix;
+
+	/* everything is in flight */
+	if (!ramdisk->prev)
+		return 0;
+
+	count = ramdisk_get_sectors(ramdisk->prev, &sectors, log_prefix);
+	if (count <= 0)
+		/* should not happen */
+		return count;
+
+	/* Create the inprogress table if empty */
+	if (!ramdisk->inprogress) {
+		ramdisk->inprogress = ramdisk_new_hashtable();
+		if (!ramdisk->inprogress) {
+			EPRINTF("ramdisk_flush: creating the inprogress table failed:OOM\n");
+			return -1;
+		}
+	}
+
+	/* sort and merge sectors to improve disk performance */
+	qsort(sectors, count, sizeof(*sectors), uint64_compare);
+
+	for (i = 0; i < count;) {
+		base = sectors[i++];
+		while (i < count && sectors[i] == sectors[i-1] + 1)
+			i++;
+		batchlen = sectors[i-1] - base + 1;
+
+		j = merge_requests(ramdisk, base, batchlen, &buf);
+		if (j) {
+			EPRINTF("ramdisk_flush: merge_requests failed:%s\n",
+				j == -1 ? "OOM" :
+					(j == -2 ? "missing sector" :
+						 "WAW race"));
+			if (j == -3)
+				continue;
+			free(sectors);
+			return -1;
+		}
+
+		/*
+		 * NOTE: create_write_request() creates a treq AND forwards
+		 * it down the driver chain
+		 *
+		 * TODO: handle create_write_request()'s error.
+		 */
+		create_write_request(ramdisk, base, batchlen, buf);
+
+		ramdisk->inflight++;
+
+		for (j = 0; j < batchlen; j++) {
+			buf = hashtable_search(ramdisk->prev, &base);
+			free(buf);
+			hashtable_remove(ramdisk->prev, &base);
+			base++;
+		}
+	}
+
+	if (!hashtable_count(ramdisk->prev))
+		/* everything is in flight */
+		HASHTABLE_DESTROY(ramdisk->prev, 0);
+
+	free(sectors);
+	return 0;
+}
+
+int ramdisk_start_flush(ramdisk_t *ramdisk)
+{
+	uint64_t *key;
+	char *buf;
+	int rc = 0;
+	int i, j, count, batchlen;
+	uint64_t *sectors;
+	const char *log_prefix = ramdisk->log_prefix;
+	struct hashtable *cache;
+
+	cache = ramdisk->primary_cache;
+	if (!hashtable_count(cache))
+		return 0;
+
+	if (ramdisk->prev) {
+		/*
+		 * a flush request issued while a previous flush is still in
+		 * progress will merge with the previous request. If you want
+		 * the previous request to be consistent, wait for it to
+		 * complete.
+		 */
+		count = ramdisk_get_sectors(cache, &sectors, log_prefix);
+		if (count < 0)
+			return count;
+
+		for (i = 0; i < count; i++) {
+			buf = hashtable_search(cache, sectors + i);
+			ramdisk_write_hash(ramdisk->prev, sectors[i], buf,
+					   ramdisk->sector_size, log_prefix);
+		}
+		free(sectors);
+
+		hashtable_destroy(cache, 1);
+	} else
+		ramdisk->prev = cache;
+
+	/*
+	 * We create a new hashtable so that new writes can be performed before
+	 * the old hashtable is completely drained.
+	 */
+	ramdisk->primary_cache = ramdisk_new_hashtable();
+	if (!ramdisk->primary_cache) {
+		EPRINTF("ramdisk_start_flush: creating cache table failed: OOM\n");
+		return -1;
+	}
+
+	return ramdisk_flush_pended_requests(ramdisk);
+}
+
+int ramdisk_writes_inflight(ramdisk_t *ramdisk)
+{
+	if (!ramdisk->inflight && !ramdisk->prev)
+		return 0;
+
+	return 1;
+}
diff --git a/tools/blktap2/drivers/block-replication.h b/tools/blktap2/drivers/block-replication.h
index 358c08b..cbdac3c 100644
--- a/tools/blktap2/drivers/block-replication.h
+++ b/tools/blktap2/drivers/block-replication.h
@@ -110,4 +110,69 @@ int td_replication_server_restart(td_replication_connect_t *t);
  */
 int td_replication_client_start(td_replication_connect_t *t);
 
+/* I/O replication */
+typedef struct ramdisk ramdisk_t;
+struct ramdisk {
+	size_t sector_size;
+	const char *log_prefix;
+	td_image_t *image;
+
+	/* private */
+	/* count of outstanding requests to the base driver */
+	size_t inflight;
+	/* prev holds the requests to be flushed, while inprogress holds
+	 * requests being flushed. When requests complete, they are removed
+	 * from inprogress.
+	 * Whenever a new flush is merged with an ongoing flush (i.e., prev),
+	 * we have to make sure that none of the new requests overlap with
+	 * ones in "inprogress". If one does, keep it in prev and don't issue
+	 * IO until the current one finishes. If we allow this IO to proceed,
+	 * we might end up with two "overlapping" requests in the disk's queue and
+	 * the disk may not offer any guarantee on which one is written first.
+	 * IOW, make sure we don't create a write-after-write time ordering constraint.
+	 */
+	struct hashtable *prev;
+	struct hashtable *inprogress;
+	/*
+	 * The primary write request is queued in this
+	 * hashtable, and will be flushed to ramdisk when
+	 * the checkpoint finishes.
+	 */
+	struct hashtable *primary_cache;
+};
+
+int ramdisk_init(ramdisk_t *ramdisk);
+void ramdisk_destroy(ramdisk_t *ramdisk);
+
+/*
+ * try to read from ramdisk. Return -1 if some sectors are not in
+ * ramdisk. Otherwise, return 0.
+ */
+int ramdisk_read(ramdisk_t *ramdisk, uint64_t sector,
+		 int nb_sectors, char *buf);
+
+/*
+ * cache the write request; it will be flushed after a
+ * new checkpoint finishes
+ */
+int ramdisk_cache_write_request(ramdisk_t *ramdisk, uint64_t sector,
+				int nb_sectors, size_t sector_size,
+				char* buf, const char *log_prefix);
+
+/* flush pended write requests to disk */
+int ramdisk_flush_pended_requests(ramdisk_t *ramdisk);
+/*
+ * flush cached write requests to disk. If WAW is detected, the cached
+ * write requests will be moved to the pended queue. The pended write
+ * requests will be flushed automatically after all inprogress write
+ * requests are flushed to disk. This function doesn't wait until all
+ * write requests are flushed to disk.
+ */
+int ramdisk_start_flush(ramdisk_t *ramdisk);
+/*
+ * Return true if some write requests are inprogress or pended,
+ * otherwise return false
+ */
+int ramdisk_writes_inflight(ramdisk_t *ramdisk);
+
 #endif
-- 
1.9.3
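
The comment on prev/inprogress in struct ramdisk describes a
write-after-write (WAW) check: a pending write must not be issued while
a write to the same sector is still in flight. A minimal sketch of that
check follows. It assumes the sector sets are flat arrays instead of
hashtables and treats a pending batch as one unit; sector_set_t,
sets_overlap() and can_issue_flush() are hypothetical names that do not
appear in the patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the hashtables in struct ramdisk:
 * a flat list of sector numbers. */
typedef struct {
    const uint64_t *sectors;
    size_t count;
} sector_set_t;

/* Return 1 if any sector appears in both sets, 0 otherwise. */
static int sets_overlap(const sector_set_t *a, const sector_set_t *b)
{
    size_t i, j;

    for (i = 0; i < a->count; i++)
        for (j = 0; j < b->count; j++)
            if (a->sectors[i] == b->sectors[j])
                return 1;
    return 0;
}

/*
 * Decide whether the pending batch may be issued now.
 * Returns 1 if it is safe to issue, or 0 if it has to stay pending
 * until the in-progress writes complete (WAW hazard).
 */
static int can_issue_flush(const sector_set_t *pending,
                           const sector_set_t *inprogress)
{
    return !sets_overlap(pending, inprogress);
}
```

The real driver may hold back individual requests, not a whole batch;
the sketch only illustrates the ordering rule itself.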

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 16/17] support blktap remus in xl
  2014-10-14  2:13 [PATCH 00/17] blktap2 related bugfix patches Wen Congyang
                   ` (14 preceding siblings ...)
  2014-10-14  2:14 ` [PATCH 15/17] tools: blktap2: move ramdisk related codes to block-replication.c Wen Congyang
@ 2014-10-14  2:14 ` Wen Congyang
  2014-10-14  2:14 ` [PATCH 17/17] HACK: libxl/remus: setup and control disk replication for blktap2 backends Wen Congyang
                   ` (2 subsequent siblings)
  18 siblings, 0 replies; 50+ messages in thread
From: Wen Congyang @ 2014-10-14  2:14 UTC (permalink / raw)
  To: xen devel
  Cc: Ian Campbell, Wen Congyang, Ian Jackson, Jiang Yunhong,
	Dong Eddie, Shriram Rajagopalan, Yang Hongyang, Lai Jiangshan

With this patch, we can use blktap remus like this:
disk = [ 'format=raw,devtype=disk,access=w,vdev=hda,backendtype=tap,filter=remus,filter-params=192.168.3.1:9000,target=filename' ]

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Shriram Rajagopalan <rshriram@cs.ubc.ca>
---
 tools/libxl/libxl.c           | 25 +++++++++++++++++++++++--
 tools/libxl/libxl_blktap2.c   | 38 +++++++++++++++++++++++++++++++++-----
 tools/libxl/libxl_device.c    | 35 ++++++++++++++++++++++++++++++++++-
 tools/libxl/libxl_dm.c        |  4 +++-
 tools/libxl/libxl_internal.h  |  8 ++++++--
 tools/libxl/libxl_noblktap2.c |  8 ++++++--
 tools/libxl/libxl_types.idl   |  2 ++
 tools/libxl/libxlu_disk_l.l   |  2 ++
 8 files changed, 109 insertions(+), 13 deletions(-)

diff --git a/tools/libxl/libxl.c b/tools/libxl/libxl.c
index efc3ca6..f96c73b 100644
--- a/tools/libxl/libxl.c
+++ b/tools/libxl/libxl.c
@@ -2394,7 +2394,8 @@ static void device_disk_add(libxl__egc *egc, uint32_t domid,
             case LIBXL_DISK_BACKEND_TAP:
                 if (dev == NULL) {
                     dev = libxl__blktap_devpath(gc, disk->pdev_path,
-                                                disk->format);
+                                                disk->format, disk->filter,
+                                                disk->filter_params);
                     if (!dev) {
                         LOG(ERROR, "failed to get blktap devpath for %p\n",
                             disk->pdev_path);
@@ -2406,6 +2407,11 @@ static void device_disk_add(libxl__egc *egc, uint32_t domid,
                 flexarray_append(back, libxl__sprintf(gc, "%s:%s",
                     libxl_disk_format_to_string(disk->format),
                     disk->pdev_path));
+                if (disk->filter) {
+                    flexarray_append(back, "filter-params");
+                    flexarray_append(back, libxl__sprintf(gc, "%s:%s",
+                        disk->filter, disk->filter_params));
+                }
 
                 /* tap backends with scripts are rejected by
                  * libxl__device_disk_set_backend */
@@ -2607,6 +2613,20 @@ static int libxl__device_disk_from_xs_be(libxl__gc *gc,
          * phy in type(see device_disk_add())
          */
         disk->backend = LIBXL_DISK_BACKEND_TAP;
+
+        rc = read_params(gc, GCSPRINTF("%s/filter-params", be_path),
+                         &tmp, &disk->filter_params);
+        if (rc)
+            goto cleanup;
+        if (!tmp) {
+            LOG(ERROR, "corrupted filter-params: %s", disk->filter_params);
+            goto cleanup;
+        }
+        disk->filter = strdup(tmp);
+        if (!disk->filter) {
+            LOGE(ERROR, "no memory to store filter");
+            goto cleanup;
+        }
     } else {
         /* "params" may not be present; but everything else must be. */
         rc = read_params(gc, GCSPRINTF("%s/params", be_path),
@@ -3059,7 +3079,8 @@ void libxl__device_disk_local_initiate_attach(libxl__egc *egc,
                 break;
             case LIBXL_DISK_FORMAT_VHD:
                 dev = libxl__blktap_devpath(gc, disk->pdev_path,
-                                            disk->format);
+                                            disk->format, disk->filter,
+                                            disk->filter_params);
                 break;
             case LIBXL_DISK_FORMAT_QCOW:
             case LIBXL_DISK_FORMAT_QCOW2:
diff --git a/tools/libxl/libxl_blktap2.c b/tools/libxl/libxl_blktap2.c
index 7656fe4..ebe0271 100644
--- a/tools/libxl/libxl_blktap2.c
+++ b/tools/libxl/libxl_blktap2.c
@@ -25,22 +25,33 @@ int libxl__blktap_enabled(libxl__gc *gc)
 
 char *libxl__blktap_devpath(libxl__gc *gc,
                             const char *disk,
-                            libxl_disk_format format)
+                            libxl_disk_format format,
+                            const char *filter,
+                            const char *filter_params)
 {
-    const char *type;
+    const char *type, *disk_params;
     char *params, *devname = NULL;
     tap_list_t tap;
     int err;
 
     type = libxl__device_disk_string_of_format(format);
-    err = tap_ctl_find(type, disk, &tap);
+    if (!type)
+        return NULL;
+
+    if (filter) {
+        disk_params = libxl__sprintf(gc, "%s|%s:%s", filter_params, type, disk);
+        type = filter;
+    } else {
+        disk_params = disk;
+    }
+    err = tap_ctl_find(type, disk_params, &tap);
     if (err == 0) {
         devname = libxl__sprintf(gc, "/dev/xen/blktap-2/tapdev%d", tap.minor);
         if (devname)
             return devname;
     }
 
-    params = libxl__sprintf(gc, "%s:%s", type, disk);
+    params = libxl__sprintf(gc, "%s:%s", type, disk_params);
     err = tap_ctl_create(params, &devname);
     if (!err) {
         libxl__ptr_add(gc, devname);
@@ -51,7 +62,9 @@ char *libxl__blktap_devpath(libxl__gc *gc,
 }
 
 
-int libxl__device_destroy_tapdisk(libxl__gc *gc, const char *params)
+int libxl__device_destroy_tapdisk(libxl__gc *gc,
+                                  const char *params,
+                                  const char *filter_params)
 {
     char *type, *disk;
     int err, rc;
@@ -77,6 +90,21 @@ int libxl__device_destroy_tapdisk(libxl__gc *gc, const char *params)
 
     type = libxl__device_disk_string_of_format(format);
 
+    if (filter_params) {
+        char *tmp;
+        char *tmp_type = type, *tmp_disk = disk;
+
+        type = libxl__strdup(gc, filter_params);
+        tmp = strchr(type, ':');
+
+        if (!tmp) {
+            LOG(ERROR, "Unable to parse filter-params %s", filter_params);
+            return ERROR_FAIL;
+        }
+        *tmp++ = '\0';
+        disk = libxl__sprintf(gc, "%s|%s:%s", tmp, tmp_type, tmp_disk);
+    }
+
     err = tap_ctl_find(type, disk, &tap);
     if (err < 0) {
         /* returns -errno */
diff --git a/tools/libxl/libxl_device.c b/tools/libxl/libxl_device.c
index 4b51ded..0b2a68d 100644
--- a/tools/libxl/libxl_device.c
+++ b/tools/libxl/libxl_device.c
@@ -196,6 +196,9 @@ static int disk_try_backend(disk_try_backend_args *a,
             goto bad_format;
         }
 
+        if (a->disk->filter) goto bad_filter;
+        if (a->disk->filter_params) goto bad_filter_params;
+
         if (a->disk->backend_domid != LIBXL_TOOLSTACK_DOMID) {
             LOG(DEBUG, "Disk vdev=%s, is using a storage driver domain, "
                        "skipping physical device check", a->disk->vdev);
@@ -232,10 +235,25 @@ static int disk_try_backend(disk_try_backend_args *a,
               a->disk->format == LIBXL_DISK_FORMAT_VHD)) {
             goto bad_format;
         }
+
+        if (a->disk->filter && !a->disk->filter_params) {
+            LOG(DEBUG, "Disk vdev=%s, backend tap unsuitable due to missing "
+                "filter_params=...", a->disk->vdev);
+            return 0;
+        }
+
+        if (!a->disk->filter && a->disk->filter_params) {
+            LOG(DEBUG, "Disk vdev=%s, backend tap unsuitable due to missing "
+                "filter=...", a->disk->vdev);
+            return 0;
+        }
+
         return backend;
 
     case LIBXL_DISK_BACKEND_QDISK:
         if (a->disk->script) goto bad_script;
+        if (a->disk->filter) goto bad_filter;
+        if (a->disk->filter_params) goto bad_filter_params;
         return backend;
 
     default:
@@ -256,6 +274,16 @@ static int disk_try_backend(disk_try_backend_args *a,
     LOG(DEBUG, "Disk vdev=%s, backend %s not compatible with script=...",
         a->disk->vdev, libxl_disk_backend_to_string(backend));
     return 0;
+
+ bad_filter:
+    LOG(DEBUG, "Disk vdev=%s, backend %s not compatible with filter=...",
+        a->disk->vdev, libxl_disk_backend_to_string(backend));
+    return 0;
+
+ bad_filter_params:
+    LOG(DEBUG, "Disk vdev=%s, backend %s not compatible with filter-params=...",
+        a->disk->vdev, libxl_disk_backend_to_string(backend));
+    return 0;
 }
 
 int libxl__device_disk_set_backend(libxl__gc *gc, libxl_device_disk *disk) {
@@ -572,6 +600,8 @@ int libxl__device_destroy(libxl__gc *gc, libxl__device *dev)
     const char *fe_path = libxl__device_frontend_path(gc, dev);
     const char *tapdisk_path = GCSPRINTF("%s/%s", be_path, "tapdisk-params");
     const char *tapdisk_params;
+    const char *filter_path = GCSPRINTF("%s/%s", be_path, "filter-params");
+    const char *filter_params;
     xs_transaction_t t = 0;
     int rc;
     uint32_t domid;
@@ -587,6 +617,9 @@ int libxl__device_destroy(libxl__gc *gc, libxl__device *dev)
         rc = libxl__xs_read_checked(gc, t, tapdisk_path, &tapdisk_params);
         if (rc) goto out;
 
+        rc = libxl__xs_read_checked(gc, t, filter_path, &filter_params);
+        if (rc) goto out;
+
         if (domid == LIBXL_TOOLSTACK_DOMID) {
             /*
              * The toolstack domain is in charge for removing both the
@@ -608,7 +641,7 @@ int libxl__device_destroy(libxl__gc *gc, libxl__device *dev)
     }
 
     if (tapdisk_params)
-        rc = libxl__device_destroy_tapdisk(gc, tapdisk_params);
+        rc = libxl__device_destroy_tapdisk(gc, tapdisk_params, filter_params);
 
 out:
     libxl__xs_transaction_abort(gc, &t);
diff --git a/tools/libxl/libxl_dm.c b/tools/libxl/libxl_dm.c
index d8992bb..a39d46c 100644
--- a/tools/libxl/libxl_dm.c
+++ b/tools/libxl/libxl_dm.c
@@ -750,7 +750,9 @@ static char ** libxl__build_device_model_args_new(libxl__gc *gc,
 
                 if (disks[i].backend == LIBXL_DISK_BACKEND_TAP)
                     pdev_path = libxl__blktap_devpath(gc, disks[i].pdev_path,
-                                                      disks[i].format);
+                                                      disks[i].format,
+                                                      disks[i].filter,
+                                                      disks[i].filter_params);
                 else
                     pdev_path = disks[i].pdev_path;
 
diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
index 83bef59..282b03f 100644
--- a/tools/libxl/libxl_internal.h
+++ b/tools/libxl/libxl_internal.h
@@ -1541,14 +1541,18 @@ _hidden int libxl__blktap_enabled(libxl__gc *gc);
  */
 _hidden char *libxl__blktap_devpath(libxl__gc *gc,
                                     const char *disk,
-                                    libxl_disk_format format);
+                                    libxl_disk_format format,
+                                    const char *filter,
+                                    const char *filter_params);
 
 /* libxl__device_destroy_tapdisk:
  *   Destroys any tapdisk process associated with the backend represented
  *   by be_path.
  *   Always logs on failure.
  */
-_hidden int libxl__device_destroy_tapdisk(libxl__gc *gc, const char *params);
+_hidden int libxl__device_destroy_tapdisk(libxl__gc *gc,
+                                          const char *params,
+                                          const char *filter_params);
 
 _hidden int libxl__device_from_disk(libxl__gc *gc, uint32_t domid,
                                    libxl_device_disk *disk,
diff --git a/tools/libxl/libxl_noblktap2.c b/tools/libxl/libxl_noblktap2.c
index 5a86ed1..ba3120b 100644
--- a/tools/libxl/libxl_noblktap2.c
+++ b/tools/libxl/libxl_noblktap2.c
@@ -23,12 +23,16 @@ int libxl__blktap_enabled(libxl__gc *gc)
 
 char *libxl__blktap_devpath(libxl__gc *gc,
                             const char *disk,
-                            libxl_disk_format format)
+                            libxl_disk_format format,
+                            const char *filter,
+                            const char *filter_params)
 {
     return NULL;
 }
 
-int libxl__device_destroy_tapdisk(libxl__gc *gc, const char *params)
+int libxl__device_destroy_tapdisk(libxl__gc *gc,
+                                  const char *params,
+                                  const char *filter_params)
 {
     return 0;
 }
diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl
index bbb03e2..275be93 100644
--- a/tools/libxl/libxl_types.idl
+++ b/tools/libxl/libxl_types.idl
@@ -465,6 +465,8 @@ libxl_device_disk = Struct("device_disk", [
     ("is_cdrom", integer),
     ("direct_io_safe", bool),
     ("discard_enable", libxl_defbool),
+    ("filter", string),
+    ("filter_params", string),
     ])
 
 libxl_device_nic = Struct("device_nic", [
diff --git a/tools/libxl/libxlu_disk_l.l b/tools/libxl/libxlu_disk_l.l
index 1a5deb5..cfd2e3f 100644
--- a/tools/libxl/libxlu_disk_l.l
+++ b/tools/libxl/libxlu_disk_l.l
@@ -176,6 +176,8 @@ script=[^,]*,?	{ STRIP(','); SAVESTRING("script", script, FROMEQUALS); }
 direct-io-safe,? { DPC->disk->direct_io_safe = 1; }
 discard,?	{ libxl_defbool_set(&DPC->disk->discard_enable, true); }
 no-discard,?	{ libxl_defbool_set(&DPC->disk->discard_enable, false); }
+filter=[^,]*,?	{ STRIP(','); SAVESTRING("filter", filter, FROMEQUALS); }
+filter-params=[^,]*,?	{ STRIP(','); SAVESTRING("filter-params", filter_params, FROMEQUALS); }
 
  /* the target magic parameter, eats the rest of the string */
 
-- 
1.9.3
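
For reference, the string handling in this patch can be sketched in
isolation. With a filter set, libxl__blktap_devpath() ends up passing
"<filter>:<filter_params>|<type>:<disk>" to tap_ctl_create(), and the
xenstore node "filter-params" stores "<filter>:<filter_params>", which
libxl__device_destroy_tapdisk() splits at the first ':'. The helpers
below are hypothetical (build_tap_params() and split_filter_value() are
not libxl functions), and the "aio" type string in the usage note is an
assumption about how the raw format is named.

```c
#include <stdio.h>
#include <string.h>

/*
 * Build the tapdisk parameter string. With a filter the layout is
 * "<filter>:<filter_params>|<type>:<disk>", otherwise "<type>:<disk>".
 * Returns 0 on success, -1 if the result does not fit in out.
 */
static int build_tap_params(char *out, size_t outlen,
                            const char *type, const char *disk,
                            const char *filter, const char *filter_params)
{
    int n;

    if (filter)
        n = snprintf(out, outlen, "%s:%s|%s:%s",
                     filter, filter_params, type, disk);
    else
        n = snprintf(out, outlen, "%s:%s", type, disk);

    return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}

/*
 * Split a "<filter>:<filter_params>" value in place at the first ':'.
 * Returns 0 on success, -1 if no ':' is present.
 */
static int split_filter_value(char *value, char **filter,
                              char **filter_params)
{
    char *sep = strchr(value, ':');

    if (!sep)
        return -1;
    *sep = '\0';
    *filter = value;
    *filter_params = sep + 1;
    return 0;
}
```

With type "aio", disk "/images/disk.img", filter "remus" and
filter_params "192.168.3.1:9000", build_tap_params() produces
"remus:192.168.3.1:9000|aio:/images/disk.img". Splitting at the first
':' matters because the filter parameters themselves contain a ':'.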

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 17/17] HACK: libxl/remus: setup and control disk replication for blktap2 backends
  2014-10-14  2:13 [PATCH 00/17] blktap2 related bugfix patches Wen Congyang
                   ` (15 preceding siblings ...)
  2014-10-14  2:14 ` [PATCH 16/17] support blktap remus in xl Wen Congyang
@ 2014-10-14  2:14 ` Wen Congyang
  2014-10-20  3:00   ` Shriram Rajagopalan
  2014-10-14 15:48 ` [PATCH 00/17] blktap2 related bugfix patches Ian Jackson
  2014-10-27 18:32 ` Konrad Rzeszutek Wilk
  18 siblings, 1 reply; 50+ messages in thread
From: Wen Congyang @ 2014-10-14  2:14 UTC (permalink / raw)
  To: xen devel
  Cc: Ian Campbell, Wen Congyang, Ian Jackson, Jiang Yunhong,
	Dong Eddie, Yang Hongyang, Lai Jiangshan

This patch is for testing only.

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
 tools/libxl/Makefile                  |   2 +-
 tools/libxl/libxl_create.c            |   8 ++
 tools/libxl/libxl_internal.h          |   2 +
 tools/libxl/libxl_remus_device.c      |   6 +
 tools/libxl/libxl_remus_disk_blktap.c | 209 ++++++++++++++++++++++++++++++++++
 5 files changed, 226 insertions(+), 1 deletion(-)
 create mode 100644 tools/libxl/libxl_remus_disk_blktap.c

diff --git a/tools/libxl/Makefile b/tools/libxl/Makefile
index 0bf666f..b58c2ff 100644
--- a/tools/libxl/Makefile
+++ b/tools/libxl/Makefile
@@ -56,7 +56,7 @@ else
 LIBXL_OBJS-y += libxl_nonetbuffer.o
 endif
 
-LIBXL_OBJS-y += libxl_remus_device.o libxl_remus_disk_drbd.o
+LIBXL_OBJS-y += libxl_remus_device.o libxl_remus_disk_drbd.o libxl_remus_disk_blktap.o
 
 LIBXL_OBJS-$(CONFIG_X86) += libxl_cpuid.o libxl_x86.o
 LIBXL_OBJS-$(CONFIG_ARM) += libxl_nocpuid.o libxl_arm.o
diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c
index 8b82584..e634694 100644
--- a/tools/libxl/libxl_create.c
+++ b/tools/libxl/libxl_create.c
@@ -853,6 +853,14 @@ static void initiate_domain_create(libxl__egc *egc,
     for (i = 0; i < d_config->num_disks; i++) {
         ret = libxl__device_disk_setdefault(gc, &d_config->disks[i]);
         if (ret) goto error_out;
+
+        /* TODO: cleanup it when destroying the domain */
+        if (d_config->disks[i].backend == LIBXL_DISK_BACKEND_TAP &&
+            d_config->disks[i].filter)
+            libxl__blktap_devpath(gc, d_config->disks[i].pdev_path,
+                                  d_config->disks[i].format,
+                                  d_config->disks[i].filter,
+                                  d_config->disks[i].filter_params);
     }
 
     dcs->bl.ao = ao;
diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
index 282b03f..a7c2334 100644
--- a/tools/libxl/libxl_internal.h
+++ b/tools/libxl/libxl_internal.h
@@ -2672,6 +2672,8 @@ int init_subkind_nic(libxl__remus_devices_state *rds);
 void cleanup_subkind_nic(libxl__remus_devices_state *rds);
 int init_subkind_drbd_disk(libxl__remus_devices_state *rds);
 void cleanup_subkind_drbd_disk(libxl__remus_devices_state *rds);
+int init_subkind_blktap_disk(libxl__remus_devices_state *rds);
+void cleanup_subkind_blktap_disk(libxl__remus_devices_state *rds);
 
 typedef void libxl__remus_callback(libxl__egc *,
                                    libxl__remus_devices_state *, int rc);
diff --git a/tools/libxl/libxl_remus_device.c b/tools/libxl/libxl_remus_device.c
index a6cb7f6..ef272ac 100644
--- a/tools/libxl/libxl_remus_device.c
+++ b/tools/libxl/libxl_remus_device.c
@@ -19,9 +19,11 @@
 
 extern const libxl__remus_device_instance_ops remus_device_nic;
 extern const libxl__remus_device_instance_ops remus_device_drbd_disk;
+extern const libxl__remus_device_instance_ops remus_device_blktap2_disk;
 static const libxl__remus_device_instance_ops *remus_ops[] = {
     &remus_device_nic,
     &remus_device_drbd_disk,
+    &remus_device_blktap2_disk,
     NULL,
 };
 
@@ -41,6 +43,9 @@ static int init_device_subkind(libxl__remus_devices_state *rds)
     rc = init_subkind_drbd_disk(rds);
     if (rc) goto out;
 
+    rc = init_subkind_blktap_disk(rds);
+    if (rc) goto out;
+
     rc = 0;
 out:
     return rc;
@@ -55,6 +60,7 @@ static void cleanup_device_subkind(libxl__remus_devices_state *rds)
         cleanup_subkind_nic(rds);
 
     cleanup_subkind_drbd_disk(rds);
+    cleanup_subkind_blktap_disk(rds);
 }
 
 /*----- setup() and teardown() -----*/
diff --git a/tools/libxl/libxl_remus_disk_blktap.c b/tools/libxl/libxl_remus_disk_blktap.c
new file mode 100644
index 0000000..3ae77d6
--- /dev/null
+++ b/tools/libxl/libxl_remus_disk_blktap.c
@@ -0,0 +1,209 @@
+/*
+ * Copyright (C) 2014 FUJITSU LIMITED
+ * Author Wen Congyang <wency@cn.fujitsu.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU Lesser General Public License as published
+ * by the Free Software Foundation; version 2.1 only. with the special
+ * exception on linking described in file LICENSE.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU Lesser General Public License for more details.
+ */
+
+#include "libxl_osdeps.h" /* must come before any other headers */
+
+#include "libxl_internal.h"
+
+#include <string.h>
+#include <sys/un.h>
+
+#define     BLKTAP2_REQUEST     "flush"
+#define     BLKTAP2_RESPONSE    "done"
+#define     BLKTAP_CTRL_DIR     "/var/run/tap"
+
+typedef struct libxl__remus_blktap2_disk {
+    char *name;
+    char *ctl_fifo_path;
+    char *msg_fifo_path;
+    int ctl_fd;
+    int msg_fd;
+    libxl__ev_fd ev;
+    libxl__remus_device *dev;
+} libxl__remus_blktap2_disk;
+
+int init_subkind_blktap_disk(libxl__remus_devices_state *rds)
+{
+    return 0;
+}
+
+void cleanup_subkind_blktap_disk(libxl__remus_devices_state *rds)
+{
+    return;
+}
+/* ========== setup() and teardown() ========== */
+static void blktap2_remus_setup(libxl__egc *egc, libxl__remus_device *dev)
+{
+    const libxl_device_disk *disk = dev->backend_dev;
+    libxl__remus_blktap2_disk *blktap2_disk;
+    int rc;
+    int i, l;
+
+    STATE_AO_GC(dev->rds->ao);
+
+    if (disk->backend != LIBXL_DISK_BACKEND_TAP ||
+        !disk->filter ||
+        strcmp(disk->filter, "remus")) {
+        rc = ERROR_REMUS_DEVOPS_DOES_NOT_MATCH;
+        goto out;
+    }
+
+    dev->matched = 1;
+    GCNEW(blktap2_disk);
+    dev->concrete_data = blktap2_disk;
+    blktap2_disk->ctl_fd = -1;
+    blktap2_disk->msg_fd = -1;
+    blktap2_disk->dev = dev;
+
+    blktap2_disk->name = libxl__strdup(gc, disk->filter_params);
+    blktap2_disk->ctl_fifo_path = GCSPRINTF("%s/remus_%s",
+                                            BLKTAP_CTRL_DIR,
+                                            blktap2_disk->name);
+    /* scrub fifo pathname */
+    l = strlen(blktap2_disk->ctl_fifo_path);
+    for (i = strlen(BLKTAP_CTRL_DIR) + 1; i < l; i++) {
+        if (strchr(":/", blktap2_disk->ctl_fifo_path[i]))
+            blktap2_disk->ctl_fifo_path[i] = '_';
+    }
+    blktap2_disk->msg_fifo_path = GCSPRINTF("%s.msg",
+                                            blktap2_disk->ctl_fifo_path);
+
+    blktap2_disk->ctl_fd = open(blktap2_disk->ctl_fifo_path, O_WRONLY);
+    blktap2_disk->msg_fd = open(blktap2_disk->msg_fifo_path, O_RDONLY);
+    if (blktap2_disk->ctl_fd < 0 || blktap2_disk->msg_fd < 0) {
+        rc = ERROR_FAIL;
+        goto out;
+    }
+
+    libxl__ev_fd_init(&blktap2_disk->ev);
+
+    rc = 0;
+
+out:
+    dev->aodev.rc = rc;
+    dev->aodev.callback(egc, &dev->aodev);
+}
+
+static void blktap2_remus_teardown(libxl__egc *egc,
+                                   libxl__remus_device *dev)
+{
+    libxl__remus_blktap2_disk *blktap2_disk = dev->concrete_data;
+
+    if (blktap2_disk->ctl_fd >= 0) {
+        close(blktap2_disk->ctl_fd);
+        blktap2_disk->ctl_fd = -1;
+    }
+
+    if (blktap2_disk->msg_fd >= 0) {
+        close(blktap2_disk->msg_fd);
+        blktap2_disk->msg_fd = -1;
+    }
+
+    dev->aodev.rc = 0;
+    dev->aodev.callback(egc, &dev->aodev);
+}
+
+/* ========== checkpointing APIs ========== */
+/*
+ * When a new checkpoint is triggered, we do the following:
+ *  1. send BLKTAP2_REQUEST to tapdisk2
+ *  2. tapdisk2 sends "creq"
+ *  3. the secondary VM's tapdisk2 replies "done"
+ *  4. tapdisk2 writes BLKTAP2_RESPONSE to the message FIFO
+ *  5. read BLKTAP2_RESPONSE from the message FIFO
+ * Steps 1 and 5 are implemented here.
+ */
+static void blktap2_control_readable(libxl__egc *egc, libxl__ev_fd *ev,
+                                     int fd, short events, short revents);
+
+static void blktap2_remus_postsuspend(libxl__egc *egc,
+                                      libxl__remus_device *dev)
+{
+    int ret;
+    libxl__remus_blktap2_disk *blktap2_disk = dev->concrete_data;
+    int rc = 0;
+
+    /* ctl_fd is a fifo; the write is not expected to block */
+    ret = write(blktap2_disk->ctl_fd, BLKTAP2_REQUEST, strlen(BLKTAP2_REQUEST));
+    if (ret != (int)strlen(BLKTAP2_REQUEST))
+        rc = ERROR_FAIL;
+
+    dev->aodev.rc = rc;
+    dev->aodev.callback(egc, &dev->aodev);
+}
+
+static void blktap2_remus_commit(libxl__egc *egc,
+                                 libxl__remus_device *dev)
+{
+    libxl__remus_blktap2_disk *blktap2_disk = dev->concrete_data;
+    int rc;
+
+    /* Convenience aliases */
+    const int fd = blktap2_disk->msg_fd;
+    libxl__ev_fd *const ev = &blktap2_disk->ev;
+
+    STATE_AO_GC(dev->rds->ao);
+
+    rc = libxl__ev_fd_register(gc, ev, blktap2_control_readable, fd, POLLIN);
+    if (rc) {
+        dev->aodev.rc = rc;
+        dev->aodev.callback(egc, &dev->aodev);
+    }
+}
+
+static void blktap2_control_readable(libxl__egc *egc, libxl__ev_fd *ev,
+                                     int fd, short events, short revents)
+{
+    libxl__remus_blktap2_disk *blktap2_disk =
+                CONTAINER_OF(ev, *blktap2_disk, ev);
+    int rc = 0, ret;
+    char response[5];
+
+    /* Convenience aliases */
+    libxl__remus_device *const dev = blktap2_disk->dev;
+
+    EGC_GC;
+
+    libxl__ev_fd_deregister(gc, ev);
+
+    if (revents & ~POLLIN) {
+        LOG(ERROR, "unexpected poll event 0x%x (should be POLLIN)", revents);
+        rc = ERROR_FAIL;
+        goto out;
+    }
+
+    ret = read(fd, response, sizeof(response) - 1);
+    if (ret != (int)(sizeof(response) - 1)) {
+        rc = ERROR_FAIL;
+        goto out;
+    }
+
+    response[4] = '\0';
+    if (strcmp(response, BLKTAP2_RESPONSE))
+        rc = ERROR_FAIL;
+
+out:
+    dev->aodev.rc = rc;
+    dev->aodev.callback(egc, &dev->aodev);
+}
+
+
+const libxl__remus_device_instance_ops remus_device_blktap2_disk = {
+    .kind = LIBXL__DEVICE_KIND_VBD,
+    .setup = blktap2_remus_setup,
+    .teardown = blktap2_remus_teardown,
+    .postsuspend = blktap2_remus_postsuspend,
+    .commit = blktap2_remus_commit,
+};
-- 
1.9.3
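
Steps 1 and 5 of the checkpoint sequence described in
libxl_remus_disk_blktap.c can be sketched as two small helpers. This is
only an illustration under stated assumptions: send_flush_request() and
read_flush_response() are hypothetical names, the descriptors can be
any pipe or FIFO, and the libxl event loop is left out. The return
values of write() and read() are checked as signed quantities, so a
result of -1 is treated as a failure and is not promoted to a large
unsigned value.

```c
#include <string.h>
#include <unistd.h>

#define FLUSH_REQUEST  "flush"
#define FLUSH_RESPONSE "done"

/*
 * Step 1: send the flush request on the control descriptor.
 * Returns 0 on success, -1 on error or short write.
 */
static int send_flush_request(int ctl_fd)
{
    const size_t len = strlen(FLUSH_REQUEST);
    ssize_t ret = write(ctl_fd, FLUSH_REQUEST, len);

    /* Check the sign first so that ret == -1 is seen as a failure. */
    if (ret < 0 || (size_t)ret != len)
        return -1;
    return 0;
}

/*
 * Step 5: read the reply from the message descriptor.
 * Returns 0 if the reply is FLUSH_RESPONSE, -1 otherwise.
 */
static int read_flush_response(int msg_fd)
{
    char response[sizeof(FLUSH_RESPONSE)];
    const size_t len = sizeof(response) - 1;
    ssize_t ret = read(msg_fd, response, len);

    if (ret < 0 || (size_t)ret != len)
        return -1;
    response[len] = '\0';
    return strcmp(response, FLUSH_RESPONSE) ? -1 : 0;
}
```

A pair of pipe(2) descriptors is enough to exercise both helpers
without a running tapdisk2.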

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [PATCH 00/17] blktap2 related bugfix patches
  2014-10-14  2:13 [PATCH 00/17] blktap2 related bugfix patches Wen Congyang
                   ` (16 preceding siblings ...)
  2014-10-14  2:14 ` [PATCH 17/17] HACK: libxl/remus: setup and control disk replication for blktap2 backends Wen Congyang
@ 2014-10-14 15:48 ` Ian Jackson
  2014-10-15  1:05   ` Wen Congyang
  2014-10-27 18:32 ` Konrad Rzeszutek Wilk
  18 siblings, 1 reply; 50+ messages in thread
From: Ian Jackson @ 2014-10-14 15:48 UTC (permalink / raw)
  To: Wen Congyang
  Cc: Lai Jiangshan, Jiang Yunhong, Dong Eddie, xen devel,
	Yang Hongyang, Ian Campbell

Wen Congyang writes ("[PATCH 00/17] blktap2 related bugfix patches"):
> These bugs are found when we implement COLO, or rebase
> COLO to upstream xen. They are independent patches, so
> post them in separate series.

blktap2 is unmaintained AFAICT.

In the last year there has been only one commit which shows evidence
of someone caring even slightly about tools/blktap2/.  The last
substantial attention was in March 2013.

(I'm disregarding commits which touch tools/blktap2/ to fix up compile
problems with new compilers, sort out build system and file
rearrangements, etc.)

The file you are touching in your 01/17 was last edited (by anyone, at
all) in January 2010.

Under the circumstances, we should probably take all these changes
without looking for anyone to ack them.

Perhaps you would like to become the maintainers of blktap2 ? :-)

Ian.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 00/17] blktap2 related bugfix patches
  2014-10-14 15:48 ` [PATCH 00/17] blktap2 related bugfix patches Ian Jackson
@ 2014-10-15  1:05   ` Wen Congyang
  2014-10-19 20:34     ` Shriram Rajagopalan
  2014-10-20 14:25     ` George Dunlap
  0 siblings, 2 replies; 50+ messages in thread
From: Wen Congyang @ 2014-10-15  1:05 UTC (permalink / raw)
  To: Ian Jackson
  Cc: Lai Jiangshan, Jiang Yunhong, Dong Eddie, xen devel,
	Yang Hongyang, Ian Campbell

On 10/14/2014 11:48 PM, Ian Jackson wrote:
> Wen Congyang writes ("[PATCH 00/17] blktap2 related bugfix patches"):
>> These bugs are found when we implement COLO, or rebase
>> COLO to upstream xen. They are independent patches, so
>> post them in separate series.
> 
> blktap2 is unmaintained AFAICT.
> 
> In the last year there has been only one commit which shows evidence
> of someone caring even slightly about tools/blktap2/.  The last
> substantial attention was in March 2013.
> 
> (I'm disregarding commits which touch tools/blktap2/ to fix up compile
> problems with new compilers, sort out build system and file
> rearrangements, etc.)
> 
> The file you are touching in your 01/17 was last edited (by anyone, at
> all) in January 2010.
> 
> Under the circumstances, we should probably take all these changes
> without looking for anyone to ack them.
> 
> Perhaps you would like to become the maintainers of blktap2 ? :-)

Hmm, I don't have any knowledge of disk formats, but blktap2 contains
such code (for example, block-vhd.c, block-qcow.c...). I think I can
maintain the rest of the code.

The block-remus related modifications should be reviewed and acked by the
Remus maintainers.

Thanks
Wen Congyang

> 
> Ian.
> .
> 

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 00/17] blktap2 related bugfix patches
  2014-10-15  1:05   ` Wen Congyang
@ 2014-10-19 20:34     ` Shriram Rajagopalan
  2014-10-20 14:25     ` George Dunlap
  1 sibling, 0 replies; 50+ messages in thread
From: Shriram Rajagopalan @ 2014-10-19 20:34 UTC (permalink / raw)
  To: Wen Congyang
  Cc: Ian Campbell, Dong Eddie, Jiang Yunhong, Ian Jackson, xen devel,
	Yang Hongyang, Lai Jiangshan


I am fine with the block Remus mods. If you are going to modify that file
for COLO, you might as well maintain them, as no one seems to be using
blktap2 with Remus and libxl.

Shriram
On Oct 14, 2014 9:06 PM, "Wen Congyang" <wency@cn.fujitsu.com> wrote:

> On 10/14/2014 11:48 PM, Ian Jackson wrote:
> > Wen Congyang writes ("[PATCH 00/17] blktap2 related bugfix patches"):
> >> These bugs are found when we implement COLO, or rebase
> >> COLO to upstream xen. They are independent patches, so
> >> post them in separate series.
> >
> > blktap2 is unmaintained AFAICT.
> >
> > In the last year there has been only one commit which shows evidence
> > of someone caring even slightly about tools/blktap2/.  The last
> > substantial attention was in March 2013.
> >
> > (I'm disregarding commits which touch tools/blktap2/ to fix up compile
> > problems with new compilers, sort out build system and file
> > rearrangements, etc.)
> >
> > The file you are touching in your 01/17 was last edited (by anyone, at
> > all) in January 2010.
> >
> > Under the circumstances, we should probably take all these changes
> > without looking for anyone to ack them.
> >
> > Perhaps you would like to become the maintainers of blktap2 ? :-)
>
> Hmm, I don't have any knowledge of disk formats, but blktap2 contains
> such code (for example, block-vhd.c, block-qcow.c...). I think I can
> maintain the rest of the code.
>
> The block-remus related modifications should be reviewed and acked by the
> Remus maintainers.
>
> Thanks
> Wen Congyang
>
> >
> > Ian.
> > .
> >
>
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel
>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 05/17] tools: block-remus: fix memory leak
  2014-10-14  2:13 ` [PATCH 05/17] tools: block-remus: fix memory leak Wen Congyang
@ 2014-10-20  2:33   ` Shriram Rajagopalan
  0 siblings, 0 replies; 50+ messages in thread
From: Shriram Rajagopalan @ 2014-10-20  2:33 UTC (permalink / raw)
  To: Wen Congyang
  Cc: Lai Jiangshan, Ian Jackson, Jiang Yunhong, Dong Eddie, xen devel,
	Yang Hongyang, Ian Campbell


On Oct 13, 2014 10:13 PM, "Wen Congyang" <wency@cn.fujitsu.com> wrote:
>
> Fix the following memory leak:
> 1. If s->ramdisk.prev is not NULL, we merge the write requests in
>    s->ramdisk.h into s->ramdisk.prev, and then destroy s->ramdisk.h.
>    But we forget to free the hash values when destroying s->ramdisk.h.
>
> Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
> Signed-off-by: Jiang Yunhong <yunhong.jiang@intel.com>
> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> Cc: Shriram Rajagopalan <rshriram@cs.ubc.ca>
> ---
>  tools/blktap2/drivers/block-remus.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/tools/blktap2/drivers/block-remus.c b/tools/blktap2/drivers/block-remus.c
> index fd5f209..55363a3 100644
> --- a/tools/blktap2/drivers/block-remus.c
> +++ b/tools/blktap2/drivers/block-remus.c
> @@ -599,7 +599,7 @@ static int ramdisk_start_flush(td_driver_t *driver)
>                 }
>                 free(sectors);
>
> -               hashtable_destroy (s->ramdisk.h, 0);
> +               hashtable_destroy (s->ramdisk.h, 1);
>         } else
>                 s->ramdisk.prev = s->ramdisk.h;
>
> --
> 1.9.3
>

Acked-by: Shriram Rajagopalan <rshriram@cs.ubc.ca>
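
For readers following the thread, the effect of the second argument can be
sketched with a toy table. This is a minimal illustration, not the blktap2
hashtable code: toy_table, toy_create() and toy_destroy() are hypothetical
names, and the assumption (taken from the commit message above) is that a
non-zero flag makes the destroy routine free the stored values as well as
the table itself.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical toy table (not the blktap2 hashtable): every slot owns a
 * heap-allocated value, much as s->ramdisk.h owns its sector buffers. */
struct toy_table {
    void  **values;
    size_t  count;
};

/* Destroy the table. When free_values is non-zero the stored values are
 * released too; when it is zero only the table goes away and any values
 * it owned are leaked. Returns the number of values freed. */
static size_t toy_destroy(struct toy_table *t, int free_values)
{
    size_t freed = 0;

    assert(t != NULL);
    if (free_values) {
        for (size_t i = 0; i < t->count; i++) {
            if (t->values[i]) {
                free(t->values[i]);
                freed++;
            }
        }
    }
    free(t->values);
    free(t);
    return freed;
}

/* Create a table with `count` slots, each holding its own buffer. */
static struct toy_table *toy_create(size_t count, size_t value_size)
{
    struct toy_table *t = malloc(sizeof(*t));

    if (!t)
        return NULL;
    t->values = calloc(count, sizeof(*t->values));
    if (!t->values) {
        free(t);
        return NULL;
    }
    t->count = count;
    for (size_t i = 0; i < count; i++) {
        t->values[i] = malloc(value_size);
        if (!t->values[i]) {
            toy_destroy(t, 1);   /* undo the partial allocation */
            return NULL;
        }
    }
    return t;
}
```

Calling toy_destroy(t, 0) on a table that owns its values would release only
the table and leak every buffer, which is the situation the one-line change
above avoids.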


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 09/17] tools: blktap2: use correct way to define array.
  2014-10-14  2:13 ` [PATCH 09/17] tools: blktap2: use correct way to define array Wen Congyang
@ 2014-10-20  2:37   ` Shriram Rajagopalan
  2014-10-20  2:52     ` Wen Congyang
  0 siblings, 1 reply; 50+ messages in thread
From: Shriram Rajagopalan @ 2014-10-20  2:37 UTC (permalink / raw)
  To: Wen Congyang
  Cc: Lai Jiangshan, Ian Jackson, Jiang Yunhong, Dong Eddie, xen devel,
	Yang Hongyang, Ian Campbell


On Oct 13, 2014 10:13 PM, "Wen Congyang" <wency@cn.fujitsu.com> wrote:
>
> Currently, we define an array in the following way:
> type array[] = {
>     [index] = xxx,
>     0,
> };
> The trailing 0 initializes array[index+1]. If index is not the
> largest index, the trailing 0 overrides another entry.
>
> tapdisk_vhd_index is not defined, but array[DISK_TYPE_VINDEX]
> is overridden, so we don't find this problem when building
> the source.
>
> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> Cc: Shriram Rajagopalan <rshriram@cs.ubc.ca>
> ---
>  tools/blktap2/drivers/tapdisk-disktype.c | 12 ++----------
>  tools/blktap2/drivers/tapdisk-disktype.h |  2 +-
>  2 files changed, 3 insertions(+), 11 deletions(-)
>
> diff --git a/tools/blktap2/drivers/tapdisk-disktype.c b/tools/blktap2/drivers/tapdisk-disktype.c
> index e9a6890..8d1383b 100644
> --- a/tools/blktap2/drivers/tapdisk-disktype.c
> +++ b/tools/blktap2/drivers/tapdisk-disktype.c
> @@ -82,12 +82,6 @@ static const disk_info_t block_cache_disk = {
>         1,
>  };
>
> -static const disk_info_t vhd_index_disk = {
> -       "vhdi",
> -       "vhd index image (vhdi)",
> -       1,
> -};
> -
>  static const disk_info_t log_disk = {
>         "log",
>         "write logger (log)",
> @@ -110,9 +104,8 @@ const disk_info_t *tapdisk_disk_types[] = {
>         [DISK_TYPE_QCOW]        = &qcow_disk,
>         [DISK_TYPE_BLOCK_CACHE] = &block_cache_disk,
>         [DISK_TYPE_LOG] = &log_disk,
> -       [DISK_TYPE_VINDEX]      = &vhd_index_disk,
>         [DISK_TYPE_REMUS]       = &remus_disk,
> -       0,
> +       [DISK_TYPE_MAX]         = NULL,
>  };
>
>  extern struct tap_disk tapdisk_aio;
> @@ -137,10 +130,9 @@ const struct tap_disk *tapdisk_disk_drivers[] = {
>         [DISK_TYPE_RAM]         = &tapdisk_ram,
>         [DISK_TYPE_QCOW]        = &tapdisk_qcow,
>         [DISK_TYPE_BLOCK_CACHE] = &tapdisk_block_cache,
> -       [DISK_TYPE_VINDEX]      = &tapdisk_vhd_index,
>         [DISK_TYPE_LOG]         = &tapdisk_log,
>         [DISK_TYPE_REMUS]       = &tapdisk_remus,
> -       0,
> +       [DISK_TYPE_MAX]         = NULL,
>  };
>
>  int
> diff --git a/tools/blktap2/drivers/tapdisk-disktype.h b/tools/blktap2/drivers/tapdisk-disktype.h
> index b697eea..c574990 100644
> --- a/tools/blktap2/drivers/tapdisk-disktype.h
> +++ b/tools/blktap2/drivers/tapdisk-disktype.h
> @@ -39,7 +39,7 @@
>  #define DISK_TYPE_BLOCK_CACHE 7
>  #define DISK_TYPE_LOG         8
>  #define DISK_TYPE_REMUS       9
> -#define DISK_TYPE_VINDEX      10
> +#define DISK_TYPE_MAX         10
>
>  #define DISK_TYPE_NAME_MAX    32
>
> --
> 1.9.3
>

I can only ack changes to the block-remus file. I cannot ack changes to other
parts of the blktap2 subsystem, as I am not their maintainer, nor do I know
much about that code. So I leave it to IanJ's or IanC's discretion.
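
For anyone else reviewing this one, the pitfall in the commit message can be
reproduced in a few lines. This is only a sketch: the TYPE_* indices and the
name strings are hypothetical stand-ins for the DISK_TYPE_* values and the
disk_info_t entries. In C, a positional initializer that follows a designated
one fills the next index, and a later initializer for the same element
replaces the earlier one.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical indices mirroring the shape of the DISK_TYPE_* defines. */
#define TYPE_LOG    0
#define TYPE_REMUS  1
#define TYPE_VINDEX 2   /* old define: a real entry in the last slot   */
#define TYPE_MAX    2   /* new define: same slot, explicit terminator  */

static const char log_name[]    = "log";
static const char remus_name[]  = "remus";
static const char vindex_name[] = "vhdi";

/* Old pattern: the bare 0 follows [TYPE_REMUS], so it initializes index
 * TYPE_REMUS + 1, which is TYPE_VINDEX, and silently replaces that entry.
 * Compilers may warn about the overridden initializer. */
static const char *const old_style[] = {
    [TYPE_LOG]    = log_name,
    [TYPE_VINDEX] = vindex_name,
    [TYPE_REMUS]  = remus_name,
    0,
};

/* New pattern: the terminator names its own index, so ordering no longer
 * matters and nothing is overridden by accident. */
static const char *const new_style[] = {
    [TYPE_LOG]   = log_name,
    [TYPE_REMUS] = remus_name,
    [TYPE_MAX]   = NULL,
};

/* Count entries up to the NULL terminator. */
static size_t count_entries(const char *const *types)
{
    size_t n = 0;

    assert(types != NULL);
    while (types[n] != NULL)
        n++;
    return n;
}
```

With the old pattern old_style[TYPE_VINDEX] ends up NULL even though an
entry was written for it, which would also explain why a missing symbol
behind that entry could go unnoticed at build time.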


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 10/17] tools: block-remus: fix bug in ctl_request()
  2014-10-14  2:13 ` [PATCH 10/17] tools: block-remus: fix bug in ctl_request() Wen Congyang
@ 2014-10-20  2:38   ` Shriram Rajagopalan
  0 siblings, 0 replies; 50+ messages in thread
From: Shriram Rajagopalan @ 2014-10-20  2:38 UTC (permalink / raw)
  To: Wen Congyang
  Cc: Lai Jiangshan, Ian Jackson, Jiang Yunhong, Dong Eddie, xen devel,
	Yang Hongyang, Ian Campbell


On Oct 13, 2014 10:13 PM, "Wen Congyang" <wency@cn.fujitsu.com> wrote:
>
> ctl_request() handles the command that the user writes to the ctl fifo. The
> user then reads the response from the msg fifo. This patch fixes the
> following bugs:
> 1. If the command is not "flush", we don't respond, and the user waits
>    forever.
> 2. If the current mode is not mode_primary, s->queue_flush() doesn't
>    respond, so call s->queue_flush() only if the mode is mode_primary.
>
> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> Cc: Shriram Rajagopalan <rshriram@cs.ubc.ca>
> ---
>  tools/blktap2/drivers/block-remus.c | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/tools/blktap2/drivers/block-remus.c b/tools/blktap2/drivers/block-remus.c
> index 55363a3..9be47f6 100644
> --- a/tools/blktap2/drivers/block-remus.c
> +++ b/tools/blktap2/drivers/block-remus.c
> @@ -1513,13 +1513,18 @@ static void ctl_request(event_id_t id, char mode, void *private)
>         /* TODO: need to get driver somehow */
>         msg[rc] = '\0';
>         if (!strncmp(msg, "flush", 5)) {
> -               if (s->queue_flush)
> +               if (s->mode == mode_primary) {
>                         if ((rc = s->queue_flush(driver))) {
>                                 RPRINTF("error passing flush request to backup");
>                                 ctl_respond(s, TDREMUS_FAIL);
>                         }
> +               } else {
> +                       RPRINTF("We are not in primary mode\n");
> +                       ctl_respond(s, TDREMUS_FAIL);
> +               }
>         } else {
>                 RPRINTF("unknown command: %s\n", msg);
> +               ctl_respond(s, TDREMUS_FAIL);
>         }
>  }
>
> --
> 1.9.3
>

Acked-by: Shriram Rajagopalan <rshriram@cs.ubc.ca>
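
The rule this patch enforces is that every command written to the ctl fifo
gets an answer, so the reader of the msg fifo never blocks forever. A minimal
sketch of that decision logic follows, under stated assumptions: the toy_*
names are hypothetical, flush_rc stands in for the return value of
s->queue_flush(), and "deferred" models the case where the reply is sent
later, once the backup acknowledges the flush.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-ins for the driver mode and the control reply. */
enum toy_mode  { TOY_UNPROTECTED, TOY_PRIMARY, TOY_BACKUP };
enum toy_reply { TOY_REPLY_FAIL, TOY_REPLY_DEFERRED };

/* Decide what to answer for one control message.
 * flush_rc models the result of queueing the flush (0 means success). */
static enum toy_reply toy_ctl_request(enum toy_mode mode, const char *msg,
                                      int flush_rc)
{
    assert(msg != NULL);

    /* Unknown command: answer with a failure instead of staying silent. */
    if (strncmp(msg, "flush", 5) != 0)
        return TOY_REPLY_FAIL;

    /* Only the primary can flush to a backup; fail fast otherwise. */
    if (mode != TOY_PRIMARY)
        return TOY_REPLY_FAIL;

    /* The flush could not be passed on: report the failure now. */
    if (flush_rc != 0)
        return TOY_REPLY_FAIL;

    /* Flush queued: the reply follows when the backup acknowledges it. */
    return TOY_REPLY_DEFERRED;
}
```

Every path returns a reply code, which is the property the two added
ctl_respond() calls give the real handler.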


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 13/17] tools: block-remus: connect to backup asynchronously
  2014-10-14  2:14 ` [PATCH 13/17] tools: block-remus: connect to backup asynchronously Wen Congyang
@ 2014-10-20  2:50   ` Shriram Rajagopalan
  2014-10-20  3:00     ` Wen Congyang
  0 siblings, 1 reply; 50+ messages in thread
From: Shriram Rajagopalan @ 2014-10-20  2:50 UTC (permalink / raw)
  To: Wen Congyang
  Cc: Lai Jiangshan, Ian Jackson, Jiang Yunhong, Dong Eddie, xen devel,
	Yang Hongyang, Ian Campbell


On Oct 13, 2014 10:13 PM, "Wen Congyang" <wency@cn.fujitsu.com> wrote:
>
> Use the API to connect to backup asynchronously.
> Before the connection is established, we queue
> all I/O requests, and handle them when the connection
> is established.
>
> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> Cc: Shriram Rajagopalan <rshriram@cs.ubc.ca>
> ---
>  tools/blktap2/drivers/block-remus.c       | 508 +++++++++++++-----------------
>  tools/blktap2/drivers/block-replication.h |   1 +
>  2 files changed, 221 insertions(+), 288 deletions(-)
>
> diff --git a/tools/blktap2/drivers/block-remus.c b/tools/blktap2/drivers/block-remus.c
> index e5ad782..a2b9f62 100644
> --- a/tools/blktap2/drivers/block-remus.c
> +++ b/tools/blktap2/drivers/block-remus.c
> @@ -40,6 +40,7 @@
>  #include "hashtable.h"
>  #include "hashtable_itr.h"
>  #include "hashtable_utility.h"
> +#include "block-replication.h"
>
>  #include <errno.h>
>  #include <inttypes.h>
> @@ -49,10 +50,7 @@
>  #include <string.h>
>  #include <sys/time.h>
>  #include <sys/types.h>
> -#include <sys/socket.h>
> -#include <netdb.h>
>  #include <netinet/in.h>
> -#include <arpa/inet.h>
>  #include <sys/param.h>
>  #include <sys/sysctl.h>
>  #include <unistd.h>
> @@ -63,10 +61,12 @@
>  #define RAMDISK_HASHSIZE 128
>
>  /* connect retry timeout (seconds) */
> -#define REMUS_CONNRETRY_TIMEOUT 10
> +#define REMUS_CONNRETRY_TIMEOUT 1
>
>  #define RPRINTF(_f, _a...) syslog (LOG_DEBUG, "remus: " _f, ## _a)
>
> +#define MAX_REMUS_REQUESTS      TAPDISK_DATA_REQUESTS
> +
>  enum tdremus_mode {
>         mode_invalid = 0,
>         mode_unprotected,
> @@ -75,16 +75,14 @@ enum tdremus_mode {
>  };
>
>  struct tdremus_req {
> -       uint64_t sector;
> -       int nb_sectors;
> -       char buf[4096];
> +       td_request_t treq;
>  };
>
>  struct req_ring {
>         /* waste one slot to distinguish between empty and full */
> -       struct tdremus_req requests[MAX_REQUESTS * 2 + 1];
> -       unsigned int head;
> -       unsigned int tail;
> +       struct tdremus_req pending_requests[MAX_REMUS_REQUESTS + 1];
> +       unsigned int prod;
> +       unsigned int cons;
>  };
>
>  /* TODO: This isn't very pretty, but to properly generate our own treqs (needed
> @@ -161,13 +159,14 @@ struct tdremus_state {
>         char*     msg_path; /* output completion message here */
>         poll_fd_t msg_fd;
>
> -  /* replication host */
> -       struct sockaddr_in sa;
> -       poll_fd_t server_fd;    /* server listen port */
> +       td_replication_connect_t t;
>         poll_fd_t stream_fd;     /* replication channel */
>
> -       /* queue write requests, batch-replicate at submit */
> -       struct req_ring write_ring;
> +       /*
> +        * queue I/O requests, batch-replicate when
> +        * the connection is established.
> +        */
> +       struct req_ring queued_io;
>
>         /* ramdisk data*/
>         struct ramdisk ramdisk;
> @@ -206,11 +205,13 @@ static int tdremus_close(td_driver_t *driver);
>
>  static int switch_mode(td_driver_t *driver, enum tdremus_mode mode);
>  static int ctl_respond(struct tdremus_state *s, const char *response);
> +static int ctl_register(struct tdremus_state *s);
> +static void ctl_unregister(struct tdremus_state *s);
>
>  /* ring functions */
> -static inline unsigned int ring_next(struct req_ring* ring, unsigned int pos)
> +static inline unsigned int ring_next(unsigned int pos)
>  {
> -       if (++pos >= MAX_REQUESTS * 2 + 1)
> +       if (++pos >= MAX_REMUS_REQUESTS + 1)
>                 return 0;
>
>         return pos;
> @@ -218,13 +219,26 @@ static inline unsigned int ring_next(struct req_ring* ring, unsigned int pos)
>
>  static inline int ring_isempty(struct req_ring* ring)
>  {
> -       return ring->head == ring->tail;
> +       return ring->cons == ring->prod;
>  }
>
>  static inline int ring_isfull(struct req_ring* ring)
>  {
> -       return ring_next(ring, ring->tail) == ring->head;
> +       return ring_next(ring->prod) == ring->cons;
>  }
> +
> +static void ring_add_request(struct req_ring *ring, const td_request_t *treq)
> +{
> +       /* If ring is full, it means that tapdisk2 has some bug */
> +       if (ring_isfull(ring)) {
> +               RPRINTF("OOPS, ring is full\n");
> +               exit(1);
> +       }
> +
> +       ring->pending_requests[ring->prod].treq = *treq;
> +       ring->prod = ring_next(ring->prod);
> +}
> +
>  /* Prototype declarations */
>  static int ramdisk_flush(td_driver_t *driver, struct tdremus_state* s);
>
> @@ -724,89 +738,113 @@ static int mwrite(int fd, void* buf, size_t len)
>         select(fd + 1, NULL, &wfds, NULL, &tv);
>  }
>
> -
> -static void inline close_stream_fd(struct tdremus_state *s)
> -{
> -       if (s->stream_fd.fd < 0)
> -               return;
> -
> -       /* XXX: -2 is magic. replace with macro perhaps? */
> -       tapdisk_server_unregister_event(s->stream_fd.id);
> -       close(s->stream_fd.fd);
> -       s->stream_fd.fd = -2;
> -}
> -
> -static void close_server_fd(struct tdremus_state *s)
> -{
> -       if (s->server_fd.fd < 0)
> -               return;
> -
> -       tapdisk_server_unregister_event(s->server_fd.id);
> -       s->server_fd.id = -1;
> -       close(s->stream_fd.fd);
> -       s->stream_fd.fd = -1;
> -}
> -
>  /* primary functions */
>  static void remus_client_event(event_id_t, char mode, void *private);
> +static int primary_forward_request(struct tdremus_state *s,
> +                                  const td_request_t *treq);
>
> -static int primary_blocking_connect(struct tdremus_state *state)
> +/*
> + * It is called when we cannot connect to backup, or find I/O error when
> + * reading/writing.
> + */
> +static void primary_failed(struct tdremus_state *s, int rc)
>  {
> -       int fd;
> -       int id;
> +       td_replication_connect_kill(&s->t);
> +       if (rc == ERROR_INTERNAL)
> +               RPRINTF("switch to unprotected mode due to internal error");
> +       UNREGISTER_EVENT(s->stream_fd.id);
> +       switch_mode(s->tdremus_driver, mode_unprotected);
> +}
> +
> +static int remus_handle_queued_io(struct tdremus_state *s)
> +{
> +       struct req_ring *queued_io = &s->queued_io;
> +       unsigned int cons;
> +       td_request_t *treq;
>         int rc;
> -       int flags;
>
> -       RPRINTF("client connecting to %s:%d...\n", inet_ntoa(state->sa.sin_addr), ntohs(state->sa.sin_port));
> +       while (!ring_isempty(queued_io)) {
> +               cons = queued_io->cons;
> +               treq = &queued_io->pending_requests[cons].treq;
>
> -       if ((fd = socket(PF_INET, SOCK_STREAM, 0)) < 0) {
> -               RPRINTF("could not create client socket: %d\n", errno);
> -               return -1;
> -       }
> -
> -       do {
> -               if ((rc = connect(fd, (struct sockaddr *)&state->sa,
> -                   sizeof(state->sa))) < 0)
> -               {
> -                       if (errno == ECONNREFUSED) {
> -                               RPRINTF("connection refused -- retrying in 1 second\n");
> -                               sleep(1);
> -                       } else {
> -                               RPRINTF("connection failed: %d\n", errno);
> -                               close(fd);
> -                               return -1;
> -                       }
> +               if (treq->op == TD_OP_WRITE) {
> +                       rc = primary_forward_request(s, treq);
> +                       if (rc)
> +                               return rc;
>                 }
> -       } while (rc < 0);
>
> -       RPRINTF("client connected\n");
> -
> -       /* make socket nonblocking */
> -       if ((flags = fcntl(fd, F_GETFL, 0)) == -1)
> -               flags = 0;
> -       if (fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1)
> -       {
> -               RPRINTF("error making socket nonblocking\n");
> -               close(fd);
> -               return -1;
> +               td_forward_request(*treq);
> +               queued_io->cons = ring_next(cons);
>         }
>
> -       if((id = tapdisk_server_register_event(SCHEDULER_POLL_READ_FD, fd, 0, remus_client_event, state)) < 0) {
> -               RPRINTF("error registering client event handler: %s\n", strerror(id));
> -               close(fd);
> -               return -1;
> -       }
> -
> -       state->stream_fd.fd = fd;
> -       state->stream_fd.id = id;
>         return 0;
>  }
>
> -/* on read, just pass request through */
> +static void remus_client_established(td_replication_connect_t *t, int rc)
> +{
> +       struct tdremus_state *s = CONTAINER_OF(t, *s, t);
> +       event_id_t id;
> +
> +       if (rc) {
> +               primary_failed(s, rc);
> +               return;
> +       }
> +
> +       /* the connect succeeded */
> +       id = tapdisk_server_register_event(SCHEDULER_POLL_READ_FD, t->fd,
> +                                          0, remus_client_event, s);
> +       if(id < 0) {
> +               RPRINTF("error registering client event handler: %s\n",
> +                       strerror(id));
> +               primary_failed(s, ERROR_INTERNAL);
> +               return;
> +       }
> +
> +       s->stream_fd.fd = t->fd;
> +       s->stream_fd.id = id;
> +
> +       /* handle the queued requests */
> +       rc = remus_handle_queued_io(s);
> +       if (rc)
> +               primary_failed(s, rc);
> +}
> +
>  static void primary_queue_read(td_driver_t *driver, td_request_t treq)
>  {
> -       /* just pass read through */
> -       td_forward_request(treq);
> +       struct tdremus_state *s = (struct tdremus_state *)driver->data;
> +       struct req_ring *ring = &s->queued_io;
> +
> +       if (ring_isempty(ring)) {
> +               /* just pass read through */
> +               td_forward_request(treq);
> +               return;
> +       }
> +
> +       ring_add_request(ring, &treq);
> +}
> +
> +static int primary_forward_request(struct tdremus_state *s,
> +                                  const td_request_t *treq)
> +{
> +       char header[sizeof(uint32_t) + sizeof(uint64_t)];
> +       uint32_t *sectors = (uint32_t *)header;
> +       uint64_t *sector = (uint64_t *)(header + sizeof(uint32_t));
> +       td_driver_t *driver = s->tdremus_driver;
> +
> +       *sectors = treq->secs;
> +       *sector = treq->sec;
> +
> +       if (mwrite(s->stream_fd.fd, TDREMUS_WRITE, strlen(TDREMUS_WRITE)) < 0)
> +               return ERROR_IO;
> +
> +       if (mwrite(s->stream_fd.fd, header, sizeof(header)) < 0)
> +               return ERROR_IO;
> +
> +       if (mwrite(s->stream_fd.fd, treq->buf,
> +           treq->secs * driver->info.sector_size) < 0)
> +               return ERROR_IO;
> +
> +       return 0;
>  }
>
>  /* TODO:
> @@ -819,28 +857,28 @@ static void primary_queue_read(td_driver_t *driver, td_request_t treq)
>  static void primary_queue_write(td_driver_t *driver, td_request_t treq)
>  {
>         struct tdremus_state *s = (struct tdremus_state *)driver->data;
> -
> -       char header[sizeof(uint32_t) + sizeof(uint64_t)];
> -       uint32_t *sectors = (uint32_t *)header;
> -       uint64_t *sector = (uint64_t *)(header + sizeof(uint32_t));
> +       int rc, ret;
>
>         // RPRINTF("write: stream_fd.fd: %d\n", s->stream_fd.fd);
>
> -       /* -1 means we haven't connected yet, -2 means the connection was lost */
> -       if(s->stream_fd.fd == -1) {
> +       ret = td_replication_connect_status(&s->t);
> +       if(ret == -1) {
>                 RPRINTF("connecting to backup...\n");
> -               primary_blocking_connect(s);
> +               s->t.callback = remus_client_established;
> +               rc = td_replication_client_start(&s->t);
> +               if (rc)
> +                       goto fail;
>         }
>
> -       *sectors = treq.secs;
> -       *sector = treq.sec;
> +       /* The connection is not established, just queue the request */
> +       if (ret != 1) {
> +               ring_add_request(&s->queued_io, &treq);
> +               return;
> +       }
>
> -       if (mwrite(s->stream_fd.fd, TDREMUS_WRITE, strlen(TDREMUS_WRITE)) < 0)
> -               goto fail;
> -       if (mwrite(s->stream_fd.fd, header, sizeof(header)) < 0)
> -               goto fail;
> -
> -       if (mwrite(s->stream_fd.fd, treq.buf, treq.secs * driver->info.sector_size) < 0)
> +       /* The connection is established */
> +       rc = primary_forward_request(s, &treq);
> +       if (rc)
>                 goto fail;
>
>         td_forward_request(treq);
> @@ -850,7 +888,7 @@ static void primary_queue_write(td_driver_t *driver, td_request_t treq)
>   fail:
>         /* switch to unprotected mode and tell tapdisk to retry */
>         RPRINTF("write request replication failed, switching to unprotected mode");
> -       switch_mode(s->tdremus_driver, mode_unprotected);
> +       primary_failed(s, rc);
>         td_complete_request(treq, -EBUSY);
>  }
>
> @@ -867,7 +905,7 @@ static int client_flush(td_driver_t *driver)
>
>         if (mwrite(s->stream_fd.fd, TDREMUS_COMMIT, strlen(TDREMUS_COMMIT)) < 0) {
>                 RPRINTF("error flushing output");
> -               close_stream_fd(s);
> +               primary_failed(s, ERROR_IO);
>                 return -1;
>         }
>
> @@ -886,6 +924,26 @@ static int server_flush(td_driver_t *driver)
>         return ramdisk_flush(driver, s);
>  }
>
> +/* It is called when switching the mode from primary to unprotected */
> +static int primary_flush(td_driver_t *driver)
> +{
> +       struct tdremus_state *s = driver->data;
> +       struct req_ring *ring = &s->queued_io;
> +       unsigned int cons;
> +
> +       if (ring_isempty(ring))
> +               return 0;
> +
> +       while (!ring_isempty(ring)) {
> +               cons = ring->cons;
> +               ring->cons = ring_next(cons);
> +
> +               td_forward_request(ring->pending_requests[cons].treq);
> +       }
> +
> +       return client_flush(driver);
> +}
> +
>  static int primary_start(td_driver_t *driver)
>  {
>         struct tdremus_state *s = (struct tdremus_state *)driver->data;
> @@ -894,7 +952,7 @@ static int primary_start(td_driver_t *driver)
>
>         tapdisk_remus.td_queue_read = primary_queue_read;
>         tapdisk_remus.td_queue_write = primary_queue_write;
> -       s->queue_flush = client_flush;
> +       s->queue_flush = primary_flush;
>
>         s->stream_fd.fd = -1;
>         s->stream_fd.id = -1;
> @@ -913,7 +971,7 @@ static void remus_client_event(event_id_t id, char mode, void *private)
>         if (mread(s->stream_fd.fd, req, sizeof(req) - 1) < 0) {
>                 /* replication stream closed or otherwise broken (timeout, reset, &c) */
>                 RPRINTF("error reading from backup\n");
> -               close_stream_fd(s);
> +               primary_failed(s, ERROR_IO);
>                 return;
>         }
>
> @@ -924,7 +982,7 @@ static void remus_client_event(event_id_t id, char mode, void *private)
>                 ctl_respond(s, TDREMUS_DONE);
>         else {
>                 RPRINTF("received unknown message: %s\n", req);
> -               close_stream_fd(s);
> +               primary_failed(s, ERROR_IO);
>         }
>
>         return;
> @@ -933,84 +991,36 @@ static void remus_client_event(event_id_t id, char mode, void *private)
>  /* backup functions */
>  static void remus_server_event(event_id_t id, char mode, void *private);
>
> -/* returns the socket that receives write requests */
> -static void remus_server_accept(event_id_t id, char mode, void* private)
> +/* It is called when we find some I/O error */
> +static void backup_failed(struct tdremus_state *s, int rc)
>  {
> -       struct tdremus_state* s = (struct tdremus_state *) private;
> +       UNREGISTER_EVENT(s->stream_fd.id);
> +       td_replication_connect_kill(&s->t);
> +       /* We will switch to unprotected mode in backup_queue_write() */
> +}
>
> -       int stream_fd;
> -       event_id_t cid;
> +/* returns the socket that receives write requests */
> +static void remus_server_established(td_replication_connect_t *t, int rc)
> +{
> +       struct tdremus_state *s = CONTAINER_OF(t, *s, t);
> +       event_id_t id;
>
> -       /* XXX: add address-based black/white list */
> -       if ((stream_fd = accept(s->server_fd.fd, NULL, NULL)) < 0) {
> -               RPRINTF("error accepting connection: %d\n", errno);
> -               return;
> -       }
> -
> -       /* TODO: check to see if we are already replicating. if so just close the
> -        * connection (or do something smarter) */
> -       RPRINTF("server accepted connection\n");
> +       /* rc is always 0 */
>
>         /* add tapdisk event for replication stream */
> -       cid = tapdisk_server_register_event(SCHEDULER_POLL_READ_FD, stream_fd, 0,
> -                                           remus_server_event, s);
> +       id = tapdisk_server_register_event(SCHEDULER_POLL_READ_FD, t->fd, 0,
> +                                          remus_server_event, s);
>
> -       if(cid < 0) {
> -               RPRINTF("error registering connection event handler: %s\n", strerror(errno));
> -               close(stream_fd);
> +       if (id < 0) {
> +               RPRINTF("error registering connection event handler: %s\n",
> +                       strerror(errno));
> +               td_replication_server_restart(t);
>                 return;
>         }
>
>         /* store replication file descriptor */
> -       s->stream_fd.fd = stream_fd;
> -       s->stream_fd.id = cid;
> -}
> -
> -/* returns -2 if EADDRNOTAVAIL */
> -static int remus_bind(struct tdremus_state* s)
> -{
> -//  struct sockaddr_in sa;
> -       int opt;
> -       int rc = -1;
> -
> -       if ((s->server_fd.fd = socket(AF_INET, SOCK_STREAM, 0)) < 0) {
> -               RPRINTF("could not create server socket: %d\n", errno);
> -               return rc;
> -       }
> -       opt = 1;
> -       if (setsockopt(s->server_fd.fd, SOL_SOCKET, SO_REUSEADDR, &opt, sizeof(opt)) < 0)
> -               RPRINTF("Error setting REUSEADDR on %d: %d\n", s->server_fd.fd, errno);
> -
> -       if (bind(s->server_fd.fd, (struct sockaddr *)&s->sa, sizeof(s->sa)) < 0) {
> -               RPRINTF("could not bind server socket %d to %s:%d: %d %s\n", s->server_fd.fd,
> -                       inet_ntoa(s->sa.sin_addr), ntohs(s->sa.sin_port), errno, strerror(errno));
> -               if (errno != EADDRINUSE)
> -                       rc = -2;
> -               goto err_sfd;
> -       }
> -       if (listen(s->server_fd.fd, 10)) {
> -               RPRINTF("could not listen on socket: %d\n", errno);
> -               goto err_sfd;
> -       }
> -
> -       /* The socket s now bound to the address and listening so we may now register
> -   * the fd with tapdisk */
> -
> -       if((s->server_fd.id = tapdisk_server_register_event(SCHEDULER_POLL_READ_FD,
> -                                                            s->server_fd.fd, 0,
> -                                                            remus_server_accept, s)) < 0) {
> -               RPRINTF("error registering server connection event handler: %s",
> -                       strerror(s->server_fd.id));
> -               goto err_sfd;
> -       }
> -
> -       return 0;
> -
> - err_sfd:
> -       close(s->server_fd.fd);
> -       s->server_fd.fd = -1;
> -
> -       return rc;
> +       s->stream_fd.fd = t->fd;
> +       s->stream_fd.id = id;
>  }
>
>  /* wait for latest checkpoint to be applied */
> @@ -1053,6 +1063,8 @@ void backup_queue_write(td_driver_t *driver, td_request_t treq)
>          * handle the write
>          */
>
> +       /* If we have called backup_failed, calling it again is harmless */
> +       backup_failed(s, ERROR_INTERNAL);
>         switch_mode(driver, mode_unprotected);
>         /* TODO: call the appropriate write function rather than return EBUSY */
>         td_complete_request(treq, -EBUSY);
> @@ -1061,7 +1073,6 @@ void backup_queue_write(td_driver_t *driver, td_request_t treq)
>  static int backup_start(td_driver_t *driver)
>  {
>         struct tdremus_state *s = (struct tdremus_state *)driver->data;
> -       int fd;
>
>         if (ramdisk_start(driver) < 0)
>                 return -1;
> @@ -1073,12 +1084,12 @@ static int backup_start(td_driver_t *driver)
>         return 0;
>  }
>
> -static int server_do_wreq(td_driver_t *driver)
> +static void server_do_wreq(td_driver_t *driver)
>  {
>         struct tdremus_state *s = (struct tdremus_state *)driver->data;
>         static tdremus_wire_t twreq;
>         char buf[4096];
> -       int len, rc;
> +       int len, rc = ERROR_IO;
>
>         char header[sizeof(uint32_t) + sizeof(uint64_t)];
>         uint32_t *sectors = (uint32_t *) header;
> @@ -1097,28 +1108,28 @@ static int server_do_wreq(td_driver_t *driver)
>         if (len > sizeof(buf)) {
>                 /* freak out! */
>                 RPRINTF("write request too large: %d/%u\n", len, (unsigned)sizeof(buf));
> -               return -1;
> +               goto err;
>         }
>
>         if (mread(s->stream_fd.fd, buf, len) < 0)
>                 goto err;
>
> -       if (ramdisk_write(&s->ramdisk, *sector, *sectors, buf) < 0)
> +       if (ramdisk_write(&s->ramdisk, *sector, *sectors, buf) < 0) {
> +               rc = ERROR_INTERNAL;
>                 goto err;
> +       }
>
> -       return 0;
> +       return;
>
>   err:
>         /* should start failover */
>         RPRINTF("backup write request error\n");
> -       close_stream_fd(s);
> -
> -       return -1;
> +       backup_failed(s, rc);
>  }
>
>  /* at this point, the server can start applying the most recent
>   * ramdisk. */
> -static int server_do_creq(td_driver_t *driver)
> +static void server_do_creq(td_driver_t *driver)
>  {
>         struct tdremus_state *s = (struct tdremus_state *)driver->data;
>
> @@ -1128,9 +1139,7 @@ static int server_do_creq(td_driver_t *driver)
>
>         /* XXX this message should not be sent until flush completes! */
>         if (write(s->stream_fd.fd, TDREMUS_DONE, strlen(TDREMUS_DONE)) != 4)
> -               return -1;
> -
> -       return 0;
> +               backup_failed(s, ERROR_IO);
>  }
>
>
> @@ -1213,11 +1222,6 @@ static int unprotected_start(td_driver_t *driver)
>
>         RPRINTF("failure detected, activating passthrough\n");
>
> -       /* close the server socket */
> -       close_stream_fd(s);
> -
> -       close_server_fd(s);
> -
>         /* install the unprotected read/write handlers */
>         tapdisk_remus.td_queue_read = unprotected_queue_read;
>         tapdisk_remus.td_queue_write = unprotected_queue_write;
> @@ -1227,90 +1231,6 @@ static int unprotected_start(td_driver_t *driver)
>
>
>  /* control */
> -
> -static inline int resolve_address(const char* addr, struct in_addr* ia)
> -{
> -       struct hostent* he;
> -       uint32_t ip;
> -
> -       if (!(he = gethostbyname(addr))) {
> -               RPRINTF("error resolving %s: %d\n", addr, h_errno);
> -               return -1;
> -       }
> -
> -       if (!he->h_addr_list[0]) {
> -               RPRINTF("no address found for %s\n", addr);
> -               return -1;
> -       }
> -
> -       /* network byte order */
> -       ip = *((uint32_t**)he->h_addr_list)[0];
> -       ia->s_addr = ip;
> -
> -       return 0;
> -}
> -
> -static int get_args(td_driver_t *driver, const char* name)
> -{
> -       struct tdremus_state *state = (struct tdremus_state *)driver->data;
> -       char* host;
> -       char* port;
> -//  char* driver_str;
> -//  char* parent;
> -//  int type;
> -//  char* path;
> -//  unsigned long ulport;
> -//  int i;
> -//  struct sockaddr_in server_addr_in;
> -
> -       int gai_status;
> -       int valid_addr;
> -       struct addrinfo gai_hints;
> -       struct addrinfo *servinfo, *servinfo_itr;
> -
> -       memset(&gai_hints, 0, sizeof gai_hints);
> -       gai_hints.ai_family = AF_UNSPEC;
> -       gai_hints.ai_socktype = SOCK_STREAM;
> -
> -       port = strchr(name, ':');
> -       if (!port) {
> -               RPRINTF("missing host in %s\n", name);
> -               return -ENOENT;
> -       }
> -       if (!(host = strndup(name, port - name))) {
> -               RPRINTF("unable to allocate host\n");
> -               return -ENOMEM;
> -       }
> -       port++;
> -
> -       if ((gai_status = getaddrinfo(host, port, &gai_hints, &servinfo)) != 0) {
> -               RPRINTF("getaddrinfo error: %s\n", gai_strerror(gai_status));
> -               return -ENOENT;
> -       }
> -
> -       /* TODO: do something smarter here */
> -       valid_addr = 0;
> -       for(servinfo_itr = servinfo; servinfo_itr != NULL; servinfo_itr = servinfo_itr->ai_next) {
> -               void *addr;
> -               char *ipver;
> -
> -               if (servinfo_itr->ai_family == AF_INET) {
> -                       valid_addr = 1;
> -                       memset(&state->sa, 0, sizeof(state->sa));
> -                       state->sa = *(struct sockaddr_in *)servinfo_itr->ai_addr;
> -                       break;
> -               }
> -       }
> -       freeaddrinfo(servinfo);
> -
> -       if (!valid_addr)
> -               return -ENOENT;
> -
> -       RPRINTF("host: %s, port: %d\n", inet_ntoa(state->sa.sin_addr), ntohs(state->sa.sin_port));
> -
> -       return 0;
> -}
> -
>  static int switch_mode(td_driver_t *driver, enum tdremus_mode mode)
>  {
>         struct tdremus_state *s = (struct tdremus_state *)driver->data;
> @@ -1343,6 +1263,20 @@ static int switch_mode(td_driver_t *driver, enum tdremus_mode mode)
>         return rc;
>  }
>
> +static void ctl_reopen(struct tdremus_state *s)
> +{
> +       ctl_unregister(s);
> +       CLOSE_FD(s->ctl_fd.fd);
> +       RPRINTF("FIFO closed\n");
> +
> +       if ((s->ctl_fd.fd = open(s->ctl_path, O_RDWR)) < 0) {
> +               RPRINTF("error reopening FIFO: %d\n", errno);
> +               return;
> +       }
> +
> +       ctl_register(s);
> +}
> +
>  static void ctl_request(event_id_t id, char mode, void *private)
>  {
>         struct tdremus_state *s = (struct tdremus_state *)private;
> @@ -1355,11 +1289,7 @@ static void ctl_request(event_id_t id, char mode, void *private)
>         if (!(rc = read(s->ctl_fd.fd, msg, sizeof(msg) - 1 /* append nul */))) {
>                 RPRINTF("0-byte read received, reopening FIFO\n");
>                 /*TODO: we may have to unregister/re-register with tapdisk_server */
> -               close(s->ctl_fd.fd);
> -               RPRINTF("FIFO closed\n");
> -               if ((s->ctl_fd.fd = open(s->ctl_path, O_RDWR)) < 0) {
> -                       RPRINTF("error reopening FIFO: %d\n", errno);
> -               }
> +               ctl_reopen(s);
>                 return;
>         }
>
> @@ -1372,7 +1302,7 @@ static void ctl_request(event_id_t id, char mode, void *private)
>         msg[rc] = '\0';
>         if (!strncmp(msg, "flush", 5)) {
>                 if (s->mode == mode_primary) {
> -                       if ((rc = s->queue_flush(driver))) {
> +                       if ((rc = client_flush(driver))) {
>                                 RPRINTF("error passing flush request to backup");
>                                 ctl_respond(s, TDREMUS_FAIL);
>                         }
> @@ -1521,6 +1451,7 @@ static void ctl_unregister(struct tdremus_state *s)
>  static int tdremus_open(td_driver_t *driver, td_image_t *image, td_uuid_t uuid)
>  {
>         struct tdremus_state *s = (struct tdremus_state *)driver->data;
> +       td_replication_connect_t *t = &s->t;
>         int rc;
>         const char *name = image->name;
>         td_flag_t flags = image->flags;
> @@ -1531,7 +1462,6 @@ static int tdremus_open(td_driver_t *driver, td_image_t *image, td_uuid_t uuid)
>         remus_image = image;
>
>         memset(s, 0, sizeof(*s));
> -       s->server_fd.fd = -1;
>         s->stream_fd.fd = -1;
>         s->ctl_fd.fd = -1;
>         s->msg_fd.fd = -1;
> @@ -1540,8 +1470,11 @@ static int tdremus_open(td_driver_t *driver, td_image_t *image, td_uuid_t uuid)
>          * the driver stack from the stream_fd event handler */
>         s->tdremus_driver = driver;
>
> -       /* parse name to get info etc */
> -       if ((rc = get_args(driver, name)))
> +       t->log_prefix = "remus";
> +       t->retry_timeout_s = REMUS_CONNRETRY_TIMEOUT;
> +       t->max_connections = 10;
> +       t->callback = remus_server_established;
> +       if ((rc = td_replication_connect_init(t, name)))
>                 return rc;
>
>         if ((rc = ctl_open(driver, name))) {
> @@ -1555,7 +1488,7 @@ static int tdremus_open(td_driver_t *driver, td_image_t *image, td_uuid_t uuid)
>                 return rc;
>         }
>
> -       if (!(rc = remus_bind(s)))
> +       if (!(rc = td_replication_server_start(t)))
>                 rc = switch_mode(driver, mode_backup);
>         else if (rc == -2)
>                 rc = switch_mode(driver, mode_primary);
> @@ -1575,8 +1508,7 @@ static int tdremus_close(td_driver_t *driver)
>         if (s->ramdisk.inprogress)
>                 hashtable_destroy(s->ramdisk.inprogress, 0);
>
> -       close_server_fd(s);
> -       close_stream_fd(s);
> +       td_replication_connect_kill(&s->t);
>         ctl_unregister(s);
>         ctl_close(s);
>
> diff --git a/tools/blktap2/drivers/block-replication.h b/tools/blktap2/drivers/block-replication.h
> index 9e051cc..07fd630 100644
> --- a/tools/blktap2/drivers/block-replication.h
> +++ b/tools/blktap2/drivers/block-replication.h
> @@ -48,6 +48,7 @@
>  enum {
>         ERROR_INTERNAL = -1,
>         ERROR_CONNECTION = -2,
> +       ERROR_IO = -3,
>  };
>
>  typedef struct td_replication_connect td_replication_connect_t;
> --
> 1.9.3
>

The code looks OK. Have you tested this with a read/write workload inside
the guest? In particular, have you run read-after-write sanity checks to
ensure that there is no data corruption (caused by stale ramdisk data being
flushed to disk or served to the guest) before a connection to the backup
has been established?
I am acking this piece in good faith that you have tested all of these
cases.

Acked-by: Shriram Rajagopalan <rshriram@cs.ubc.ca>
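
For reference, a minimal sketch of the kind of read-after-write check asked
about above. It is not part of the patch: the helper name
read_after_write_check(), the 512-byte sector size and the scratch file path
are assumptions made for illustration. A real test inside the guest would
probably also open the device with O_DIRECT so that reads are not served from
the guest page cache.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define SECTOR_SIZE 512  /* assumed sector size, for illustration only */

/* Write a distinct pattern to each sector, read it straight back and
 * compare. Returns 0 if every sector matches, -1 on I/O error or on a
 * mismatch (which would indicate stale data being served). */
static int read_after_write_check(const char *path, int nsectors)
{
    unsigned char wbuf[SECTOR_SIZE], rbuf[SECTOR_SIZE];
    int fd, i, rc = -1;

    fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return -1;

    for (i = 0; i < nsectors; i++) {
        off_t off = (off_t)i * SECTOR_SIZE;

        /* per-sector pattern so a misplaced sector is detected too */
        memset(wbuf, 'A' + (i % 26), sizeof(wbuf));

        if (pwrite(fd, wbuf, sizeof(wbuf), off) != (ssize_t)sizeof(wbuf))
            goto out;
        if (pread(fd, rbuf, sizeof(rbuf), off) != (ssize_t)sizeof(rbuf))
            goto out;
        if (memcmp(wbuf, rbuf, sizeof(wbuf)) != 0)
            goto out;
    }
    rc = 0;
out:
    close(fd);
    return rc;
}
```

Running such a loop while the primary is still waiting for the backup
connection would exercise the window the question is about.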

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 14/17] block-remus: switch to unprotected mode before closing
  2014-10-14  2:14 ` [PATCH 14/17] block-remus: switch to unprotected mode before closing Wen Congyang
@ 2014-10-20  2:51   ` Shriram Rajagopalan
  0 siblings, 0 replies; 50+ messages in thread
From: Shriram Rajagopalan @ 2014-10-20  2:51 UTC (permalink / raw)
  To: Wen Congyang
  Cc: Lai Jiangshan, Ian Jackson, Jiang Yunhong, Dong Eddie, xen devel,
	Yang Hongyang, Ian Campbell


On Oct 13, 2014 10:13 PM, "Wen Congyang" <wency@cn.fujitsu.com> wrote:
>
> To stop tapdisk2, the user does the following:
> 1. close the image
> 2. detach from the blktap device
>
> If there are pending I/O requests, close will fail.
> But remus holds I/O requests until the connection
> is established. Introduce a new callback,
> td_pre_close(), to flush these I/O requests.
>
> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> Cc: Shriram Rajagopalan <rshriram@cs.ubc.ca>
> ---
>  tools/blktap2/drivers/block-remus.c       | 14 ++++++++++++++
>  tools/blktap2/drivers/block-replication.h |  1 +
>  tools/blktap2/drivers/tapdisk-control.c   |  6 ++++++
>  tools/blktap2/drivers/tapdisk-interface.c | 18 ++++++++++++++++++
>  tools/blktap2/drivers/tapdisk-interface.h |  1 +
>  tools/blktap2/drivers/tapdisk-vbd.c       |  9 +++++++++
>  tools/blktap2/drivers/tapdisk-vbd.h       |  1 +
>  tools/blktap2/drivers/tapdisk.h           |  1 +
>  8 files changed, 51 insertions(+)
>
> diff --git a/tools/blktap2/drivers/block-remus.c b/tools/blktap2/drivers/block-remus.c
> index a2b9f62..09dc46f 100644
> --- a/tools/blktap2/drivers/block-remus.c
> +++ b/tools/blktap2/drivers/block-remus.c
> @@ -752,6 +752,8 @@ static void primary_failed(struct tdremus_state *s, int rc)
>         td_replication_connect_kill(&s->t);
>         if (rc == ERROR_INTERNAL)
>                 RPRINTF("switch to unprotected mode due to internal error");
> +       if (rc == ERROR_CLOSE)
> +               RPRINTF("switch to unprotected mode before closing");
>         UNREGISTER_EVENT(s->stream_fd.id);
>         switch_mode(s->tdremus_driver, mode_unprotected);
>  }
> @@ -1500,6 +1502,17 @@ static int tdremus_open(td_driver_t *driver, td_image_t *image, td_uuid_t uuid)
>         return -EIO;
>  }
>
> +static int tdremus_pre_close(td_driver_t *driver)
> +{
> +       struct tdremus_state *s = (struct tdremus_state *)driver->data;
> +
> +       if (s->mode != mode_primary)
> +               return 0;
> +
> +       primary_failed(s, ERROR_CLOSE);
> +       return 0;
> +}
> +
>  static int tdremus_close(td_driver_t *driver)
>  {
>         struct tdremus_state *s = (struct tdremus_state *)driver->data;
> @@ -1533,6 +1546,7 @@ struct tap_disk tapdisk_remus = {
>         .td_open            = tdremus_open,
>         .td_queue_read      = unprotected_queue_read,
>         .td_queue_write     = unprotected_queue_write,
> +       .td_pre_close       = tdremus_pre_close,
>         .td_close           = tdremus_close,
>         .td_get_parent_id   = tdremus_get_parent_id,
>         .td_validate_parent = tdremus_validate_parent,
> diff --git a/tools/blktap2/drivers/block-replication.h b/tools/blktap2/drivers/block-replication.h
> index 07fd630..358c08b 100644
> --- a/tools/blktap2/drivers/block-replication.h
> +++ b/tools/blktap2/drivers/block-replication.h
> @@ -49,6 +49,7 @@ enum {
>         ERROR_INTERNAL = -1,
>         ERROR_CONNECTION = -2,
>         ERROR_IO = -3,
> +       ERROR_CLOSE = -4,
>  };
>
>  typedef struct td_replication_connect td_replication_connect_t;
> diff --git a/tools/blktap2/drivers/tapdisk-control.c b/tools/blktap2/drivers/tapdisk-control.c
> index 4e5f748..2fa4cbe 100644
> --- a/tools/blktap2/drivers/tapdisk-control.c
> +++ b/tools/blktap2/drivers/tapdisk-control.c
> @@ -508,6 +508,12 @@ tapdisk_control_close_image(struct tapdisk_control_connection *connection,
>                 goto out;
>         }
>
> +       /*
> +        * Some I/O requests may still be held in the driver;
> +        * flush these requests first.
> +        */
> +       tapdisk_vbd_pre_close_vdi(vbd);
> +
>         if (!list_empty(&vbd->pending_requests)) {
>                 err = -EAGAIN;
>                 goto out;
> diff --git a/tools/blktap2/drivers/tapdisk-interface.c b/tools/blktap2/drivers/tapdisk-interface.c
> index a29de64..ed92e12 100644
> --- a/tools/blktap2/drivers/tapdisk-interface.c
> +++ b/tools/blktap2/drivers/tapdisk-interface.c
> @@ -105,6 +105,24 @@ td_open(td_image_t *image)
>  }
>
>  int
> +td_pre_close(td_image_t *image)
> +{
> +       td_driver_t *driver;
> +
> +       driver = image->driver;
> +       if (!driver)
> +               return -ENODEV;
> +
> +       if (!driver->ops->td_pre_close)
> +               return 0;
> +
> +       if (driver->refcnt && td_flag_test(driver->state, TD_DRIVER_OPEN))
> +               driver->ops->td_pre_close(driver);
> +
> +       return 0;
> +}
> +
> +int
>  td_close(td_image_t *image)
>  {
>         td_driver_t *driver;
> diff --git a/tools/blktap2/drivers/tapdisk-interface.h b/tools/blktap2/drivers/tapdisk-interface.h
> index adc4376..ba9b3ea 100644
> --- a/tools/blktap2/drivers/tapdisk-interface.h
> +++ b/tools/blktap2/drivers/tapdisk-interface.h
> @@ -34,6 +34,7 @@
>  int td_open(td_image_t *);
>  int __td_open(td_image_t *, td_disk_info_t *);
>  int td_load(td_image_t *);
> +int td_pre_close(td_image_t *);
>  int td_close(td_image_t *);
>  int td_get_parent_id(td_image_t *, td_disk_id_t *);
>  int td_validate_parent(td_image_t *, td_image_t *);
> diff --git a/tools/blktap2/drivers/tapdisk-vbd.c b/tools/blktap2/drivers/tapdisk-vbd.c
> index c665f27..aba545b 100644
> --- a/tools/blktap2/drivers/tapdisk-vbd.c
> +++ b/tools/blktap2/drivers/tapdisk-vbd.c
> @@ -180,6 +180,15 @@ tapdisk_vbd_validate_chain(td_vbd_t *vbd)
>  }
>
>  void
> +tapdisk_vbd_pre_close_vdi(td_vbd_t *vbd)
> +{
> +       td_image_t *image, *tmp;
> +
> +       tapdisk_vbd_for_each_image(vbd, image, tmp)
> +               td_pre_close(image);
> +}
> +
> +void
>  tapdisk_vbd_close_vdi(td_vbd_t *vbd)
>  {
>         td_image_t *image, *tmp;
> diff --git a/tools/blktap2/drivers/tapdisk-vbd.h b/tools/blktap2/drivers/tapdisk-vbd.h
> index be084b2..040f2b8 100644
> --- a/tools/blktap2/drivers/tapdisk-vbd.h
> +++ b/tools/blktap2/drivers/tapdisk-vbd.h
> @@ -181,6 +181,7 @@ void tapdisk_vbd_free_stack(td_vbd_t *);
>  int tapdisk_vbd_open_stack(td_vbd_t *, uint16_t, td_flag_t);
>  int tapdisk_vbd_open_vdi(td_vbd_t *, const char *,
>                          uint16_t, uint16_t, td_flag_t);
> +void tapdisk_vbd_pre_close_vdi(td_vbd_t *);
>  void tapdisk_vbd_close_vdi(td_vbd_t *);
>
>  int tapdisk_vbd_attach(td_vbd_t *, const char *, int);
> diff --git a/tools/blktap2/drivers/tapdisk.h b/tools/blktap2/drivers/tapdisk.h
> index 3c3b51d..16efd07 100644
> --- a/tools/blktap2/drivers/tapdisk.h
> +++ b/tools/blktap2/drivers/tapdisk.h
> @@ -158,6 +158,7 @@ struct tap_disk {
>         td_flag_t                    flags;
>         int                          private_data_size;
>         int (*td_open)               (td_driver_t *, td_image_t *, td_uuid_t);
> +       int (*td_pre_close)          (td_driver_t *);
>         int (*td_close)              (td_driver_t *);
>         int (*td_get_parent_id)      (td_driver_t *, td_disk_id_t *);
>         int (*td_validate_parent)    (td_driver_t *, td_driver_t *, td_flag_t);
> --
> 1.9.3
>

Acked-by: Shriram Rajagopalan <rshriram@cs.ubc.ca>
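
The close sequence described in the commit message can be modelled in
isolation. The following is only a sketch under stated assumptions: the
toy_* names are hypothetical stand-ins and not blktap2 APIs, and the
"flush" simply drops the pending count. It shows an optional pre-close hook
being called before the pending-request check, so that a driver holding
requests gets a chance to release them instead of close failing with
-EAGAIN.

```c
#include <errno.h>
#include <stddef.h>

struct toy_driver;

/* Hypothetical driver ops table; pre_close is optional and may be NULL. */
struct toy_ops {
    int (*pre_close)(struct toy_driver *);
};

struct toy_driver {
    const struct toy_ops *ops;
    int pending;              /* I/O requests currently held by the driver */
};

/* Example hook: release everything the driver is holding. */
static int toy_flush_pending(struct toy_driver *d)
{
    d->pending = 0;
    return 0;
}

/* Call the hook only if the driver provides one. */
static int toy_pre_close(struct toy_driver *d)
{
    if (!d)
        return -ENODEV;
    if (d->ops && d->ops->pre_close)
        d->ops->pre_close(d);
    return 0;
}

/* Close path: give the driver a chance to flush, then refuse to close
 * while requests are still pending. */
static int toy_close_image(struct toy_driver *d)
{
    int rc = toy_pre_close(d);

    if (rc)
        return rc;
    if (d->pending)
        return -EAGAIN;
    return 0;
}
```

A driver without the hook behaves as before: close keeps returning -EAGAIN
until its pending requests complete on their own.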

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 09/17] tools: blktap2: use correct way to define array.
  2014-10-20  2:37   ` Shriram Rajagopalan
@ 2014-10-20  2:52     ` Wen Congyang
  0 siblings, 0 replies; 50+ messages in thread
From: Wen Congyang @ 2014-10-20  2:52 UTC (permalink / raw)
  To: rshriram
  Cc: Lai Jiangshan, Ian Jackson, Jiang Yunhong, Dong Eddie, xen devel,
	Yang Hongyang, Ian Campbell

On 10/20/2014 10:37 AM, Shriram Rajagopalan wrote:
> On Oct 13, 2014 10:13 PM, "Wen Congyang" <wency@cn.fujitsu.com> wrote:
>>
>> Currently, we define an array like this:
>> type array[] = {
>>     [index] = xxx,
>>     0,
>> };
>> The trailing 0 initializes array[index+1]. If index is not
>> the largest index, it overrides the entry of another index.
>>
>> tapdisk_vhd_index is not defined, but array[DISK_TYPE_VINDEX]
>> is overridden, so we don't notice this problem when building
>> the source.
>>
>> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
>> Cc: Shriram Rajagopalan <rshriram@cs.ubc.ca>
>> ---
>>  tools/blktap2/drivers/tapdisk-disktype.c | 12 ++----------
>>  tools/blktap2/drivers/tapdisk-disktype.h |  2 +-
>>  2 files changed, 3 insertions(+), 11 deletions(-)
>>
>> diff --git a/tools/blktap2/drivers/tapdisk-disktype.c b/tools/blktap2/drivers/tapdisk-disktype.c
>> index e9a6890..8d1383b 100644
>> --- a/tools/blktap2/drivers/tapdisk-disktype.c
>> +++ b/tools/blktap2/drivers/tapdisk-disktype.c
>> @@ -82,12 +82,6 @@ static const disk_info_t block_cache_disk = {
>>         1,
>>  };
>>
>> -static const disk_info_t vhd_index_disk = {
>> -       "vhdi",
>> -       "vhd index image (vhdi)",
>> -       1,
>> -};
>> -
>>  static const disk_info_t log_disk = {
>>         "log",
>>         "write logger (log)",
>> @@ -110,9 +104,8 @@ const disk_info_t *tapdisk_disk_types[] = {
>>         [DISK_TYPE_QCOW]        = &qcow_disk,
>>         [DISK_TYPE_BLOCK_CACHE] = &block_cache_disk,
>>         [DISK_TYPE_LOG] = &log_disk,
>> -       [DISK_TYPE_VINDEX]      = &vhd_index_disk,
>>         [DISK_TYPE_REMUS]       = &remus_disk,
>> -       0,
>> +       [DISK_TYPE_MAX]         = NULL,
>>  };
>>
>>  extern struct tap_disk tapdisk_aio;
>> @@ -137,10 +130,9 @@ const struct tap_disk *tapdisk_disk_drivers[] = {
>>         [DISK_TYPE_RAM]         = &tapdisk_ram,
>>         [DISK_TYPE_QCOW]        = &tapdisk_qcow,
>>         [DISK_TYPE_BLOCK_CACHE] = &tapdisk_block_cache,
>> -       [DISK_TYPE_VINDEX]      = &tapdisk_vhd_index,
>>         [DISK_TYPE_LOG]         = &tapdisk_log,
>>         [DISK_TYPE_REMUS]       = &tapdisk_remus,
>> -       0,
>> +       [DISK_TYPE_MAX]         = NULL,
>>  };
>>
>>  int
>> diff --git a/tools/blktap2/drivers/tapdisk-disktype.h b/tools/blktap2/drivers/tapdisk-disktype.h
>> index b697eea..c574990 100644
>> --- a/tools/blktap2/drivers/tapdisk-disktype.h
>> +++ b/tools/blktap2/drivers/tapdisk-disktype.h
>> @@ -39,7 +39,7 @@
>>  #define DISK_TYPE_BLOCK_CACHE 7
>>  #define DISK_TYPE_LOG         8
>>  #define DISK_TYPE_REMUS       9
>> -#define DISK_TYPE_VINDEX      10
>> +#define DISK_TYPE_MAX         10
>>
>>  #define DISK_TYPE_NAME_MAX    32
>>
>> --
>> 1.9.3
>>
> 
> I can only ack changes to the block-remus file. I cannot ack changes to
> other parts of the blktap2 subsystem, as I am not their maintainer, nor do
> I know much about that code. So I leave it to IanJ's or IanC's discretion.
> 

OK. Thanks for your review all the same.
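
The override described in the commit message can be reproduced in isolation.
A minimal sketch, assuming hypothetical TYPE_* constants that stand in for
the DISK_TYPE_* values; the array names are illustrative only. In C, a
positional initializer that follows a designated one fills the next index,
and a later initializer for the same element replaces the earlier one.

```c
#include <stddef.h>

#define TYPE_A   0
#define TYPE_B   1
#define TYPE_C   2
#define TYPE_MAX 3

/* Old pattern: the bare 0 follows [TYPE_B], so it lands on index
 * TYPE_B + 1 == TYPE_C and replaces the "c" entry set earlier. */
static const char *old_style[] = {
    [TYPE_A] = "a",
    [TYPE_C] = "c",
    [TYPE_B] = "b",
    0,
};

/* Patched pattern: the terminator has its own explicit index, so the
 * order of the designated entries no longer matters. */
static const char *new_style[] = {
    [TYPE_A]   = "a",
    [TYPE_C]   = "c",
    [TYPE_B]   = "b",
    [TYPE_MAX] = NULL,
};
```

Some compilers can warn about the first form (GCC has -Woverride-init), but
it is valid C, which is why the problem does not show up as a build failure.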

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 15/17] tools: blktap2: move ramdisk related codes to block-replication.c
  2014-10-14  2:14 ` [PATCH 15/17] tools: blktap2: move ramdisk related codes to block-replication.c Wen Congyang
@ 2014-10-20  2:52   ` Shriram Rajagopalan
  0 siblings, 0 replies; 50+ messages in thread
From: Shriram Rajagopalan @ 2014-10-20  2:52 UTC (permalink / raw)
  To: Wen Congyang
  Cc: Lai Jiangshan, Ian Jackson, Jiang Yunhong, Dong Eddie, xen devel,
	Yang Hongyang, Ian Campbell


On Oct 13, 2014 10:13 PM, "Wen Congyang" <wency@cn.fujitsu.com> wrote:
>
> COLO will reuse them
>
> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> Cc: Shriram Rajagopalan <rshriram@cs.ubc.ca>
> ---
>  tools/blktap2/drivers/block-remus.c       | 480 +-----------------------------
>  tools/blktap2/drivers/block-replication.c | 460 ++++++++++++++++++++++++++++
>  tools/blktap2/drivers/block-replication.h |  65 ++++
>  3 files changed, 539 insertions(+), 466 deletions(-)
>
> diff --git a/tools/blktap2/drivers/block-remus.c b/tools/blktap2/drivers/block-remus.c
> index 09dc46f..c7b429c 100644
> --- a/tools/blktap2/drivers/block-remus.c
> +++ b/tools/blktap2/drivers/block-remus.c
> @@ -37,9 +37,6 @@
>  #include "tapdisk-server.h"
>  #include "tapdisk-driver.h"
>  #include "tapdisk-interface.h"
> -#include "hashtable.h"
> -#include "hashtable_itr.h"
> -#include "hashtable_utility.h"
>  #include "block-replication.h"
>
>  #include <errno.h>
> @@ -58,7 +55,6 @@
>
>  /* timeout for reads and writes in ms */
>  #define HEARTBEAT_MS 1000
> -#define RAMDISK_HASHSIZE 128
>
>  /* connect retry timeout (seconds) */
>  #define REMUS_CONNRETRY_TIMEOUT 1
> @@ -97,51 +93,6 @@ td_vbd_t *device_vbd = NULL;
>  td_image_t *remus_image = NULL;
>  struct tap_disk tapdisk_remus;
>
> -struct ramdisk {
> -       size_t sector_size;
> -       struct hashtable* h;
> -       /* when a ramdisk is flushed, h is given a new empty hash for writes
> -        * while the old ramdisk (prev) is drained asynchronously.
> -        */
> -       struct hashtable* prev;
> -       /* count of outstanding requests to the base driver */
> -       size_t inflight;
> -       /* prev holds the requests to be flushed, while inprogress holds
> -        * requests being flushed. When requests complete, they are removed
> -        * from inprogress.
> -        * Whenever a new flush is merged with ongoing flush (i.e, prev),
> -        * we have to make sure that none of the new requests overlap with
> -        * ones in "inprogress". If it does, keep it back in prev and dont issue
> -        * IO until the current one finishes. If we allow this IO to proceed,
> -        * we might end up with two "overlapping" requests in the disk's queue and
> -        * the disk may not offer any guarantee on which one is written first.
> -        * IOW, make sure we dont create a write-after-write time ordering constraint.
> -        *
> -        */
> -       struct hashtable* inprogress;
> -};
> -
> -/* the ramdisk intercepts the original callback for reads and writes.
> - * This holds the original data. */
> -/* Might be worth making this a static array in struct ramdisk to avoid
> - * a malloc per request */
> -
> -struct tdremus_state;
> -
> -struct ramdisk_cbdata {
> -       td_callback_t cb;
> -       void* private;
> -       char* buf;
> -       struct tdremus_state* state;
> -};
> -
> -struct ramdisk_write_cbdata {
> -       struct tdremus_state* state;
> -       char* buf;
> -};
> -
> -typedef void (*queue_rw_t) (td_driver_t *driver, td_request_t treq);
> -
>  /* poll_fd type for blktap2 fd system. taken from block_log.c */
>  typedef struct poll_fd {
>         int        fd;
> @@ -168,7 +119,7 @@ struct tdremus_state {
>          */
>         struct req_ring queued_io;
>
> -       /* ramdisk data*/
> +       /* ramdisk data */
>         struct ramdisk ramdisk;
>
>         /* mode methods */
> @@ -239,404 +190,14 @@ static void ring_add_request(struct req_ring *ring, const td_request_t *treq)
>         ring->prod = ring_next(ring->prod);
>  }
>
> -/* Prototype declarations */
> -static int ramdisk_flush(td_driver_t *driver, struct tdremus_state* s);
> -
> -/* functions to create and sumbit treq's */
> -
> -static void
> -replicated_write_callback(td_request_t treq, int err)
> -{
> -       struct tdremus_state *s = (struct tdremus_state *) treq.cb_data;
> -       td_vbd_request_t *vreq;
> -       int i;
> -       uint64_t start;
> -       vreq = (td_vbd_request_t *) treq.private;
> -
> -       /* the write failed for now, lets panic. this is very bad */
> -       if (err) {
> -               RPRINTF("ramdisk write failed, disk image is not consistent\n");
> -               exit(-1);
> -       }
> -
> -       /* The write succeeded. let's pull the vreq off whatever request list
> -        * it is on and free() it */
> -       list_del(&vreq->next);
> -       free(vreq);
> -
> -       s->ramdisk.inflight--;
> -       start = treq.sec;
> -       for (i = 0; i < treq.secs; i++) {
> -               hashtable_remove(s->ramdisk.inprogress, &start);
> -               start++;
> -       }
> -       free(treq.buf);
> -
> -       if (!s->ramdisk.inflight && !s->ramdisk.prev) {
> -               /* TODO: the ramdisk has been flushed */
> -       }
> -}
> -
> -static inline int
> -create_write_request(struct tdremus_state *state, td_sector_t sec, int secs, char *buf)
> -{
> -       td_request_t treq;
> -       td_vbd_request_t *vreq;
> -
> -       treq.op      = TD_OP_WRITE;
> -       treq.buf     = buf;
> -       treq.sec     = sec;
> -       treq.secs    = secs;
> -       treq.image   = remus_image;
> -       treq.cb      = replicated_write_callback;
> -       treq.cb_data = state;
> -       treq.id      = 0;
> -       treq.sidx    = 0;
> -
> -       vreq         = calloc(1, sizeof(td_vbd_request_t));
> -       treq.private = vreq;
> -
> -       if(!vreq)
> -               return -1;
> -
> -       vreq->submitting = 1;
> -       INIT_LIST_HEAD(&vreq->next);
> -       tapdisk_vbd_move_request(treq.private, &device_vbd->pending_requests);
> -
> -       /* TODO:
> -        * we should probably leave it up to the caller to forward the request */
> -       td_forward_request(treq);
> -
> -       vreq->submitting--;
> -
> -       return 0;
> -}
> -
> -
> -/* http://www.concentric.net/~Ttwang/tech/inthash.htm */
> -static unsigned int uint64_hash(void* k)
> -{
> -       uint64_t key = *(uint64_t*)k;
> -
> -       key = (~key) + (key << 18);
> -       key = key ^ (key >> 31);
> -       key = key * 21;
> -       key = key ^ (key >> 11);
> -       key = key + (key << 6);
> -       key = key ^ (key >> 22);
> -
> -       return (unsigned int)key;
> -}
> -
> -static int rd_hash_equal(void* k1, void* k2)
> -{
> -       uint64_t key1, key2;
> -
> -       key1 = *(uint64_t*)k1;
> -       key2 = *(uint64_t*)k2;
> -
> -       return key1 == key2;
> -}
> -
> -static int ramdisk_read(struct ramdisk* ramdisk, uint64_t sector,
> -                       int nb_sectors, char* buf)
> -{
> -       int i;
> -       char* v;
> -       uint64_t key;
> -
> -       for (i = 0; i < nb_sectors; i++) {
> -               key = sector + i;
> -               /* check whether it is queued in a previous flush request */
> -               if (!(ramdisk->prev && (v = hashtable_search(ramdisk->prev, &key)))) {
> -                       /* check whether it is an ongoing flush */
> -                       if (!(ramdisk->inprogress && (v = hashtable_search(ramdisk->inprogress, &key))))
> -                               return -1;
> -               }
> -               memcpy(buf + i * ramdisk->sector_size, v, ramdisk->sector_size);
> -       }
> -
> -       return 0;
> -}
> -
> -static int ramdisk_write_hash(struct hashtable* h, uint64_t sector, char* buf,
> -                             size_t len)
> -{
> -       char* v;
> -       uint64_t* key;
> -
> -       if ((v = hashtable_search(h, &sector))) {
> -               memcpy(v, buf, len);
> -               return 0;
> -       }
> -
> -       if (!(v = malloc(len))) {
> -               DPRINTF("ramdisk_write_hash: malloc failed\n");
> -               return -1;
> -       }
> -       memcpy(v, buf, len);
> -       if (!(key = malloc(sizeof(*key)))) {
> -               DPRINTF("ramdisk_write_hash: error allocating key\n");
> -               free(v);
> -               return -1;
> -       }
> -       *key = sector;
> -       if (!hashtable_insert(h, key, v)) {
> -               DPRINTF("ramdisk_write_hash failed on sector %" PRIu64 "\n", sector);
> -               free(key);
> -               free(v);
> -               return -1;
> -       }
> -
> -       return 0;
> -}
> -
> -static inline int ramdisk_write(struct ramdisk* ramdisk, uint64_t sector,
> -                               int nb_sectors, char* buf)
> -{
> -       int i, rc;
> -
> -       for (i = 0; i < nb_sectors; i++) {
> -               rc = ramdisk_write_hash(ramdisk->h, sector + i,
> -                                       buf + i * ramdisk->sector_size,
> -                                       ramdisk->sector_size);
> -               if (rc)
> -                       return rc;
> -       }
> -
> -       return 0;
> -}
> -
> -static int uint64_compare(const void* k1, const void* k2)
> -{
> -       uint64_t u1 = *(uint64_t*)k1;
> -       uint64_t u2 = *(uint64_t*)k2;
> -
> -       /* u1 - u2 is unsigned */
> -       return u1 < u2 ? -1 : u1 > u2 ? 1 : 0;
> -}
> -
> -/* set psectors to an array of the sector numbers in the hash, returning
> - * the number of entries (or -1 on error) */
> -static int ramdisk_get_sectors(struct hashtable* h, uint64_t** psectors)
> -{
> -       struct hashtable_itr* itr;
> -       uint64_t* sectors;
> -       int count;
> -
> -       if (!(count = hashtable_count(h)))
> -               return 0;
> -
> -       if (!(*psectors = malloc(count * sizeof(uint64_t)))) {
> -               DPRINTF("ramdisk_get_sectors: error allocating sector map\n");
> -               return -1;
> -       }
> -       sectors = *psectors;
> -
> -       itr = hashtable_iterator(h);
> -       count = 0;
> -       do {
> -               sectors[count++] = *(uint64_t*)hashtable_iterator_key(itr);
> -       } while (hashtable_iterator_advance(itr));
> -       free(itr);
> -
> -       return count;
> -}
> -
> -/*
> -  return -1 for OOM
> -  return -2 for merge lookup failure
> -  return -3 for WAW race
> -  return 0 on success.
> -*/
> -static int merge_requests(struct ramdisk* ramdisk, uint64_t start,
> -                       size_t count, char **mergedbuf)
> -{
> -       char* buf;
> -       char* sector;
> -       int i;
> -       uint64_t *key;
> -       int rc = 0;
> -
> -       if (!(buf = valloc(count * ramdisk->sector_size))) {
> -               DPRINTF("merge_request: allocation failed\n");
> -               return -1;
> -       }
> -
> -       for (i = 0; i < count; i++) {
> -               if (!(sector = hashtable_search(ramdisk->prev, &start))) {
> -                       DPRINTF("merge_request: lookup failed on %"PRIu64"\n", start);
> -                       free(buf);
> -                       rc = -2;
> -                       goto fail;
> -               }
> -
> -               /* Check inprogress requests to avoid waw non-determinism */
> -               if (hashtable_search(ramdisk->inprogress, &start)) {
> -                       DPRINTF("merge_request: WAR RACE on %"PRIu64"\n", start);
> -                       free(buf);
> -                       rc = -3;
> -                       goto fail;
> -               }
> -               /* Insert req into inprogress (brief period of duplication of hash entries until
> -                * they are removed from prev. Read tracking would not be reading wrong entries)
> -                */
> -               if (!(key = malloc(sizeof(*key)))) {
> -                       DPRINTF("%s: error allocating key\n", __FUNCTION__);
> -                       free(buf);
> -                       rc = -1;
> -                       goto fail;
> -               }
> -               *key = start;
> -               if (!hashtable_insert(ramdisk->inprogress, key, NULL)) {
> -                       DPRINTF("%s failed to insert sector %" PRIu64 " into inprogress hash\n",
> -                               __FUNCTION__, start);
> -                       free(key);
> -                       free(buf);
> -                       rc = -1;
> -                       goto fail;
> -               }
> -               memcpy(buf + i * ramdisk->sector_size, sector, ramdisk->sector_size);
> -               start++;
> -       }
> -
> -       *mergedbuf = buf;
> -       return 0;
> -fail:
> -       for (start--; i >0; i--, start--)
> -               hashtable_remove(ramdisk->inprogress, &start);
> -       return rc;
> -}
> -
> -/* The underlying driver may not handle having the whole ramdisk queued at
> - * once. We queue what we can and let the callbacks attempt to queue more. */
> -/* NOTE: may be called from callback, while dd->private still belongs to
> - * the underlying driver */
> -static int ramdisk_flush(td_driver_t *driver, struct tdremus_state* s)
> -{
> -       uint64_t* sectors;
> -       char* buf = NULL;
> -       uint64_t base, batchlen;
> -       int i, j, count = 0;
> -
> -       // RPRINTF("ramdisk flush\n");
> -
> -       if ((count = ramdisk_get_sectors(s->ramdisk.prev, &sectors)) <= 0)
> -               return count;
> -
> -       /* Create the inprogress table if empty */
> -       if (!s->ramdisk.inprogress)
> -               s->ramdisk.inprogress = create_hashtable(RAMDISK_HASHSIZE,
> -                                                       uint64_hash,
> -                                                       rd_hash_equal);
> -
> -       /*
> -         RPRINTF("ramdisk: flushing %d sectors\n", count);
> -       */
> -
> -       /* sort and merge sectors to improve disk performance */
> -       qsort(sectors, count, sizeof(*sectors), uint64_compare);
> -
> -       for (i = 0; i < count;) {
> -               base = sectors[i++];
> -               while (i < count && sectors[i] == sectors[i-1] + 1)
> -                       i++;
> -               batchlen = sectors[i-1] - base + 1;
> -
> -               j = merge_requests(&s->ramdisk, base, batchlen, &buf);
> -
> -               if (j) {
> -                       RPRINTF("ramdisk_flush: merge_requests failed:%s\n",
> -                               j == -1? "OOM": (j==-2? "missing sector" : "WAW race"));
> -                       if (j == -3) continue;
> -                       free(sectors);
> -                       return -1;
> -               }
> -
> -               /* NOTE: create_write_request() creates a treq AND forwards it down
> -                * the driver chain */
> -               // RPRINTF("forwarding write request at %" PRIu64 ", length: %" PRIu64 "\n", base, batchlen);
> -               create_write_request(s, base, batchlen, buf);
> -               //RPRINTF("write request at %" PRIu64 ", length: %" PRIu64 " forwarded\n", base, batchlen);
> -
> -               s->ramdisk.inflight++;
> -
> -               for (j = 0; j < batchlen; j++) {
> -                       buf = hashtable_search(s->ramdisk.prev, &base);
> -                       free(buf);
> -                       hashtable_remove(s->ramdisk.prev, &base);
> -                       base++;
> -               }
> -       }
> -
> -       if (!hashtable_count(s->ramdisk.prev)) {
> -               /* everything is in flight */
> -               hashtable_destroy(s->ramdisk.prev, 0);
> -               s->ramdisk.prev = NULL;
> -       }
> -
> -       free(sectors);
> -
> -       // RPRINTF("ramdisk flush done\n");
> -       return 0;
> -}
> -
> -/* flush ramdisk contents to disk */
> -static int ramdisk_start_flush(td_driver_t *driver)
> -{
> -       struct tdremus_state *s = (struct tdremus_state *)driver->data;
> -       uint64_t* key;
> -       char* buf;
> -       int rc = 0;
> -       int i, j, count, batchlen;
> -       uint64_t* sectors;
> -
> -       if (!hashtable_count(s->ramdisk.h)) {
> -               /*
> -                 RPRINTF("Nothing to flush\n");
> -               */
> -               return 0;
> -       }
> -
> -       if (s->ramdisk.prev) {
> -               /* a flush request issued while a previous flush is still in progress
> -                * will merge with the previous request. If you want the previous
> -                * request to be consistent, wait for it to complete. */
> -               if ((count = ramdisk_get_sectors(s->ramdisk.h, &sectors)) < 0)
> -                       return count;
> -
> -               for (i = 0; i < count; i++) {
> -                       buf = hashtable_search(s->ramdisk.h, sectors + i);
> -                       ramdisk_write_hash(s->ramdisk.prev, sectors[i], buf,
> -                                          s->ramdisk.sector_size);
> -               }
> -               free(sectors);
> -
> -               hashtable_destroy (s->ramdisk.h, 1);
> -       } else
> -               s->ramdisk.prev = s->ramdisk.h;
> -
> -       /* We create a new hashtable so that new writes can be performed before
> -        * the old hashtable is completely drained. */
> -       s->ramdisk.h = create_hashtable(RAMDISK_HASHSIZE, uint64_hash,
> -                                       rd_hash_equal);
> -
> -       return ramdisk_flush(driver, s);
> -}
> -
> -
>  static int ramdisk_start(td_driver_t *driver)
>  {
>         struct tdremus_state *s = (struct tdremus_state *)driver->data;
>
> -       if (s->ramdisk.h) {
> -               RPRINTF("ramdisk already allocated\n");
> -               return 0;
> -       }
> -
>         s->ramdisk.sector_size = driver->info.sector_size;
> -       s->ramdisk.h = create_hashtable(RAMDISK_HASHSIZE, uint64_hash,
> -                                       rd_hash_equal);
> +       s->ramdisk.log_prefix = "remus";
> +       s->ramdisk.image = remus_image;
> +       ramdisk_init(&s->ramdisk);
>
>         DPRINTF("Ramdisk started, %zu bytes/sector\n", s->ramdisk.sector_size);
>
> @@ -917,13 +478,9 @@ static int client_flush(td_driver_t *driver)
>  static int server_flush(td_driver_t *driver)
>  {
>         struct tdremus_state *s = (struct tdremus_state *)driver->data;
> -       /*
> -        * Nothing to flush in beginning.
> -        */
> -       if (!s->ramdisk.prev)
> -               return 0;
> +
>         /* Try to flush any remaining requests */
> -       return ramdisk_flush(driver, s);
> +       return ramdisk_flush_pended_requests(&s->ramdisk);
>  }
>
>  /* It is called when switching the mode from primary to unprotected */
> @@ -1030,10 +587,7 @@ static inline int server_writes_inflight(td_driver_t *driver)
>  {
>         struct tdremus_state *s = (struct tdremus_state *)driver->data;
>
> -       if (!s->ramdisk.inflight && !s->ramdisk.prev)
> -               return 0;
> -
> -       return 1;
> +       return ramdisk_writes_inflight(&s->ramdisk);
>  }
>
>  /* Due to block device prefetching this code may be called on the server side
> @@ -1116,7 +670,9 @@ static void server_do_wreq(td_driver_t *driver)
>         if (mread(s->stream_fd.fd, buf, len) < 0)
>                 goto err;
>
> -       if (ramdisk_write(&s->ramdisk, *sector, *sectors, buf) < 0) {
> +       if (ramdisk_cache_write_request(&s->ramdisk, *sector, *sectors,
> +                                       driver->info.sector_size, buf,
> +                                       "remus") < 0) {
>                 rc = ERROR_INTERNAL;
>                 goto err;
>         }
> @@ -1137,7 +693,7 @@ static void server_do_creq(td_driver_t *driver)
>
>         // RPRINTF("committing buffer\n");
>
> -       ramdisk_start_flush(driver);
> +       ramdisk_start_flush(&s->ramdisk);
>
>         /* XXX this message should not be sent until flush completes! */
>         if (write(s->stream_fd.fd, TDREMUS_DONE, strlen(TDREMUS_DONE)) != 4)
> @@ -1184,12 +740,7 @@ void unprotected_queue_read(td_driver_t *driver, td_request_t treq)
>
>         /* wait for previous ramdisk to flush  before servicing reads */
>         if (server_writes_inflight(driver)) {
> -               /* for now lets just return EBUSY.
> -                * if there are any left-over requests in prev,
> -                * kick em again.
> -                */
> -               if(!s->ramdisk.inflight) /* nothing in inprogress */
> -                       ramdisk_flush(driver, s);
> +               ramdisk_flush_pended_requests(&s->ramdisk);
>
>                 td_complete_request(treq, -EBUSY);
>         }
> @@ -1207,8 +758,7 @@ void unprotected_queue_write(td_driver_t *driver, td_request_t treq)
>         /* wait for previous ramdisk to flush */
>         if (server_writes_inflight(driver)) {
>                 RPRINTF("queue_write: waiting for queue to drain");
> -               if(!s->ramdisk.inflight) /* nothing in inprogress. Kick prev */
> -                       ramdisk_flush(driver, s);
> +               ramdisk_flush_pended_requests(&s->ramdisk);
>                 td_complete_request(treq, -EBUSY);
>         }
>         else {
> @@ -1518,9 +1068,7 @@ static int tdremus_close(td_driver_t *driver)
>         struct tdremus_state *s = (struct tdremus_state *)driver->data;
>
>         RPRINTF("closing\n");
> -       if (s->ramdisk.inprogress)
> -               hashtable_destroy(s->ramdisk.inprogress, 0);
> -
> +       ramdisk_destroy(&s->ramdisk);
>         td_replication_connect_kill(&s->t);
>         ctl_unregister(s);
>         ctl_close(s);
> diff --git a/tools/blktap2/drivers/block-replication.c b/tools/blktap2/drivers/block-replication.c
> index e4b2679..82d7609 100644
> --- a/tools/blktap2/drivers/block-replication.c
> +++ b/tools/blktap2/drivers/block-replication.c
> @@ -15,6 +15,10 @@
>
>  #include "tapdisk-server.h"
>  #include "block-replication.h"
> +#include "tapdisk-interface.h"
> +#include "hashtable.h"
> +#include "hashtable_itr.h"
> +#include "hashtable_utility.h"
>
>  #include <string.h>
>  #include <errno.h>
> @@ -30,6 +34,8 @@
>  #define DPRINTF(_f, _a...) syslog (LOG_DEBUG, "%s: " _f, log_prefix, ## _a)
>  #define EPRINTF(_f, _a...) syslog (LOG_ERR, "%s: " _f, log_prefix, ## _a)
>
> +#define RAMDISK_HASHSIZE 128
> +
>  /* connection status */
>  enum {
>         connection_none,
> @@ -466,3 +472,457 @@ static void td_replication_connect_event(event_id_t id, char mode,
>  fail:
>         td_replication_client_failed(t, rc);
>  }
> +
> +
> +/* I/O replication */
> +static void replicated_write_callback(td_request_t treq, int err)
> +{
> +       ramdisk_t *ramdisk = treq.cb_data;
> +       td_vbd_request_t *vreq = treq.private;
> +       int i;
> +       uint64_t start;
> +       const char *log_prefix = ramdisk->log_prefix;
> +
> +       /* the write failed for now, lets panic. this is very bad */
> +       if (err) {
> +               EPRINTF("ramdisk write failed, disk image is not consistent\n");
> +               exit(-1);
> +       }
> +
> +       /*
> +        * The write succeeded. let's pull the vreq off whatever request list
> +        * it is on and free() it
> +        */
> +       list_del(&vreq->next);
> +       free(vreq);
> +
> +       ramdisk->inflight--;
> +       start = treq.sec;
> +       for (i = 0; i < treq.secs; i++) {
> +               hashtable_remove(ramdisk->inprogress, &start);
> +               start++;
> +       }
> +       free(treq.buf);
> +
> +       if (!ramdisk->inflight && ramdisk->prev)
> +               ramdisk_flush_pended_requests(ramdisk);
> +}
> +
> +static int
> +create_write_request(ramdisk_t *ramdisk, td_sector_t sec, int secs, char *buf)
> +{
> +       td_request_t treq;
> +       td_vbd_request_t *vreq;
> +       td_vbd_t *vbd = ramdisk->image->private;
> +
> +       treq.op      = TD_OP_WRITE;
> +       treq.buf     = buf;
> +       treq.sec     = sec;
> +       treq.secs    = secs;
> +       treq.image   = ramdisk->image;
> +       treq.cb      = replicated_write_callback;
> +       treq.cb_data = ramdisk;
> +       treq.id      = 0;
> +       treq.sidx    = 0;
> +
> +       vreq         = calloc(1, sizeof(td_vbd_request_t));
> +       treq.private = vreq;
> +
> +       if(!vreq)
> +               return -1;
> +
> +       vreq->submitting = 1;
> +       INIT_LIST_HEAD(&vreq->next);
> +       tapdisk_vbd_move_request(treq.private, &vbd->pending_requests);
> +
> +       td_forward_request(treq);
> +
> +       vreq->submitting--;
> +
> +       return 0;
> +}
> +
> +/* http://www.concentric.net/~Ttwang/tech/inthash.htm */
> +static unsigned int uint64_hash(void *k)
> +{
> +       uint64_t key = *(uint64_t*)k;
> +
> +       key = (~key) + (key << 18);
> +       key = key ^ (key >> 31);
> +       key = key * 21;
> +       key = key ^ (key >> 11);
> +       key = key + (key << 6);
> +       key = key ^ (key >> 22);
> +
> +       return (unsigned int)key;
> +}
> +
> +static int rd_hash_equal(void *k1, void *k2)
> +{
> +       uint64_t key1, key2;
> +
> +       key1 = *(uint64_t*)k1;
> +       key2 = *(uint64_t*)k2;
> +
> +       return key1 == key2;
> +}
> +
> +static int uint64_compare(const void *k1, const void *k2)
> +{
> +       uint64_t u1 = *(uint64_t*)k1;
> +       uint64_t u2 = *(uint64_t*)k2;
> +
> +       /* u1 - u2 is unsigned */
> +       return u1 < u2 ? -1 : u1 > u2 ? 1 : 0;
> +}
> +
> +static struct hashtable *ramdisk_new_hashtable(void)
> +{
> +       return create_hashtable(RAMDISK_HASHSIZE, uint64_hash, rd_hash_equal);
> +}
> +
> +/*
> + * set psectors to an array of the sector numbers in the hash, returning
> + * the number of entries (or -1 on error)
> + */
> +static int ramdisk_get_sectors(struct hashtable *h, uint64_t **psectors,
> +                              const char *log_prefix)
> +{
> +       struct hashtable_itr* itr;
> +       uint64_t* sectors;
> +       int count;
> +
> +       if (!(count = hashtable_count(h)))
> +               return 0;
> +
> +       if (!(*psectors = malloc(count * sizeof(uint64_t)))) {
> +               DPRINTF("ramdisk_get_sectors: error allocating sector map\n");
> +               return -1;
> +       }
> +       sectors = *psectors;
> +
> +       itr = hashtable_iterator(h);
> +       count = 0;
> +       do {
> +               sectors[count++] = *(uint64_t*)hashtable_iterator_key(itr);
> +       } while (hashtable_iterator_advance(itr));
> +       free(itr);
> +
> +       return count;
> +}
> +
> +static int ramdisk_write_hash(struct hashtable *h, uint64_t sector, char *buf,
> +                             size_t len, const char *log_prefix)
> +{
> +       char *v;
> +       uint64_t *key;
> +
> +       if ((v = hashtable_search(h, &sector))) {
> +               memcpy(v, buf, len);
> +               return 0;
> +       }
> +
> +       if (!(v = malloc(len))) {
> +               DPRINTF("ramdisk_write_hash: malloc failed\n");
> +               return -1;
> +       }
> +       memcpy(v, buf, len);
> +       if (!(key = malloc(sizeof(*key)))) {
> +               DPRINTF("ramdisk_write_hash: error allocating key\n");
> +               free(v);
> +               return -1;
> +       }
> +       *key = sector;
> +       if (!hashtable_insert(h, key, v)) {
> +               DPRINTF("ramdisk_write_hash failed on sector %" PRIu64 "\n", sector);
> +               free(key);
> +               free(v);
> +               return -1;
> +       }
> +
> +       return 0;
> +}
> +
> +/*
> + * return -1 for OOM
> + * return -2 for merge lookup failure(should not happen)
> + * return -3 for WAW race
> + * return 0 on success.
> + */
> +static int merge_requests(ramdisk_t *ramdisk, uint64_t start,
> +                         size_t count, char **mergedbuf)
> +{
> +       char* buf;
> +       char* sector;
> +       int i;
> +       uint64_t *key;
> +       int rc = 0;
> +       const char *log_prefix = ramdisk->log_prefix;
> +
> +       if (!(buf = valloc(count * ramdisk->sector_size))) {
> +               DPRINTF("merge_request: allocation failed\n");
> +               return -1;
> +       }
> +
> +       for (i = 0; i < count; i++) {
> +               if (!(sector = hashtable_search(ramdisk->prev, &start))) {
> +                       EPRINTF("merge_request: lookup failed on %"PRIu64"\n",
> +                               start);
> +                       free(buf);
> +                       rc = -2;
> +                       goto fail;
> +               }
> +
> +               /* Check inprogress requests to avoid waw non-determinism */
> +               if (hashtable_search(ramdisk->inprogress, &start)) {
> +                       DPRINTF("merge_request: WAW RACE on %"PRIu64"\n",
> +                               start);
> +                       free(buf);
> +                       rc = -3;
> +                       goto fail;
> +               }
> +
> +               /*
> +                * Insert req into inprogress (brief period of duplication of
> +                * hash entries until they are removed from prev. Read tracking
> +                * would not be reading wrong entries)
> +                */
> +               if (!(key = malloc(sizeof(*key)))) {
> +                       EPRINTF("%s: error allocating key\n", __FUNCTION__);
> +                       free(buf);
> +                       rc = -1;
> +                       goto fail;
> +               }
> +               *key = start;
> +               if (!hashtable_insert(ramdisk->inprogress, key, NULL)) {
> +                       EPRINTF("%s failed to insert sector %" PRIu64 " into inprogress hash\n",
> +                               __FUNCTION__, start);
> +                       free(key);
> +                       free(buf);
> +                       rc = -1;
> +                       goto fail;
> +               }
> +
> +               memcpy(buf + i * ramdisk->sector_size, sector, ramdisk->sector_size);
> +               start++;
> +       }
> +
> +       *mergedbuf = buf;
> +       return 0;
> +fail:
> +       for (start--; i > 0; i--, start--)
> +               hashtable_remove(ramdisk->inprogress, &start);
> +       return rc;
> +}
> +
> +#define HASHTABLE_DESTROY(hashtable, free)                     \
> +       do {                                                    \
> +               if (hashtable) {                                \
> +                       hashtable_destroy(hashtable, free);     \
> +                       hashtable = NULL;                       \
> +               }                                               \
> +       } while (0)
> +
> +int ramdisk_init(ramdisk_t *ramdisk)
> +{
> +       ramdisk->inflight = 0;
> +       ramdisk->prev = NULL;
> +       ramdisk->inprogress = NULL;
> +       ramdisk->primary_cache = ramdisk_new_hashtable();
> +       if (!ramdisk->primary_cache)
> +               return -1;
> +
> +       return 0;
> +}
> +
> +void ramdisk_destroy(ramdisk_t *ramdisk)
> +{
> +       const char *log_prefix = ramdisk->log_prefix;
> +
> +       /*
> +        * ramdisk_destroy() is called only when we will close the tapdisk image.
> +        * In this case, there are no pending requests in vbd.
> +        *
> +        * If ramdisk->inflight is not 0, it means that the requests created by
> +        * us are still in vbd->pending_requests.
> +        */
> +       if (ramdisk->inflight) {
> +               /* should not happen */
> +               EPRINTF("cannot destroy ramdisk\n");
> +               return;
> +       }
> +
> +       HASHTABLE_DESTROY(ramdisk->inprogress, 0);
> +       HASHTABLE_DESTROY(ramdisk->prev, 1);
> +       HASHTABLE_DESTROY(ramdisk->primary_cache, 1);
> +}
> +
> +int ramdisk_read(ramdisk_t *ramdisk, uint64_t sector,
> +                int nb_sectors, char *buf)
> +{
> +       int i;
> +       char *v;
> +       uint64_t key;
> +
> +       for (i = 0; i < nb_sectors; i++) {
> +               key = sector + i;
> +               /* check whether it is queued in a previous flush request */
> +               if (!(ramdisk->prev &&
> +                   (v = hashtable_search(ramdisk->prev, &key)))) {
> +                       /* check whether it is an ongoing flush */
> +                       if (!(ramdisk->inprogress &&
> +                           (v = hashtable_search(ramdisk->inprogress, &key))))
> +                               return -1;
> +               }
> +               memcpy(buf + i * ramdisk->sector_size, v, ramdisk->sector_size);
> +       }
> +
> +       return 0;
> +}
> +
> +int ramdisk_cache_write_request(ramdisk_t *ramdisk, uint64_t sector,
> +                               int nb_sectors, size_t sector_size,
> +                               char *buf, const char *log_prefix)
> +{
> +       int i, rc;
> +
> +       for (i = 0; i < nb_sectors; i++) {
> +               rc = ramdisk_write_hash(ramdisk->primary_cache, sector + i,
> +                                       buf + i * sector_size,
> +                                       sector_size, log_prefix);
> +               if (rc)
> +                       return rc;
> +       }
> +
> +       return 0;
> +}
> +
> +int ramdisk_flush_pended_requests(ramdisk_t *ramdisk)
> +{
> +       uint64_t *sectors;
> +       char *buf = NULL;
> +       uint64_t base, batchlen;
> +       int i, j, count = 0;
> +       const char *log_prefix = ramdisk->log_prefix;
> +
> +       /* everything is in flight */
> +       if (!ramdisk->prev)
> +               return 0;
> +
> +       count = ramdisk_get_sectors(ramdisk->prev, &sectors, log_prefix);
> +       if (count <= 0)
> +               /* should not happen */
> +               return count;
> +
> +       /* Create the inprogress table if empty */
> +       if (!ramdisk->inprogress) {
> +               ramdisk->inprogress = ramdisk_new_hashtable();
> +               if (!ramdisk->inprogress) {
> +                       EPRINTF("ramdisk_flush: creating the inprogress table failed:OOM\n");
> +                       return -1;
> +               }
> +       }
> +
> +       /* sort and merge sectors to improve disk performance */
> +       qsort(sectors, count, sizeof(*sectors), uint64_compare);
> +
> +       for (i = 0; i < count;) {
> +               base = sectors[i++];
> +               while (i < count && sectors[i] == sectors[i-1] + 1)
> +                       i++;
> +               batchlen = sectors[i-1] - base + 1;
> +
> +               j = merge_requests(ramdisk, base, batchlen, &buf);
> +               if (j) {
> +                       EPRINTF("ramdisk_flush: merge_requests failed:%s\n",
> +                               j == -1 ? "OOM" :
> +                                       (j == -2 ? "missing sector" :
> +                                                "WAW race"));
> +                       if (j == -3)
> +                               continue;
> +                       free(sectors);
> +                       return -1;
> +               }
> +
> +               /*
> +                * NOTE: create_write_request() creates a treq AND forwards
> +                * it down the driver chain
> +                *
> +                * TODO: handle create_write_request()'s error.
> +                */
> +               create_write_request(ramdisk, base, batchlen, buf);
> +
> +               ramdisk->inflight++;
> +
> +               for (j = 0; j < batchlen; j++) {
> +                       buf = hashtable_search(ramdisk->prev, &base);
> +                       free(buf);
> +                       hashtable_remove(ramdisk->prev, &base);
> +                       base++;
> +               }
> +       }
> +
> +       if (!hashtable_count(ramdisk->prev))
> +               /* everything is in flight */
> +               HASHTABLE_DESTROY(ramdisk->prev, 0);
> +
> +       free(sectors);
> +       return 0;
> +}
> +
> +int ramdisk_start_flush(ramdisk_t *ramdisk)
> +{
> +       uint64_t *key;
> +       char *buf;
> +       int rc = 0;
> +       int i, j, count, batchlen;
> +       uint64_t *sectors;
> +       const char *log_prefix = ramdisk->log_prefix;
> +       struct hashtable *cache;
> +
> +       cache = ramdisk->primary_cache;
> +       if (!hashtable_count(cache))
> +               return 0;
> +
> +       if (ramdisk->prev) {
> +               /*
> +                * a flush request issued while a previous flush is still in
> +                * progress will merge with the previous request. If you want
> +                * the previous request to be consistent, wait for it to
> +                * complete.
> +                */
> +               count = ramdisk_get_sectors(cache, &sectors, log_prefix);
> +               if (count < 0 )
> +                       return count;
> +
> +               for (i = 0; i < count; i++) {
> +                       buf = hashtable_search(cache, sectors + i);
> +                       ramdisk_write_hash(ramdisk->prev, sectors[i], buf,
> +                                          ramdisk->sector_size, log_prefix);
> +               }
> +               free(sectors);
> +
> +               hashtable_destroy(cache, 1);
> +       } else
> +               ramdisk->prev = cache;
> +
> +       /*
> +        * We create a new hashtable so that new writes can be performed before
> +        * the old hashtable is completely drained.
> +        */
> +       ramdisk->primary_cache = ramdisk_new_hashtable();
> +       if (!ramdisk->primary_cache) {
> +               EPRINTF("ramdisk_start_flush: creating cache table failed: OOM\n");
> +               return -1;
> +       }
> +
> +       return ramdisk_flush_pended_requests(ramdisk);
> +}
> +
> +int ramdisk_writes_inflight(ramdisk_t *ramdisk)
> +{
> +       if (!ramdisk->inflight && !ramdisk->prev)
> +               return 0;
> +
> +       return 1;
> +}
> diff --git a/tools/blktap2/drivers/block-replication.h b/tools/blktap2/drivers/block-replication.h
> index 358c08b..cbdac3c 100644
> --- a/tools/blktap2/drivers/block-replication.h
> +++ b/tools/blktap2/drivers/block-replication.h
> @@ -110,4 +110,69 @@ int td_replication_server_restart(td_replication_connect_t *t);
>   */
>  int td_replication_client_start(td_replication_connect_t *t);
>
> +/* I/O replication */
> +typedef struct ramdisk ramdisk_t;
> +struct ramdisk {
> +       size_t sector_size;
> +       const char *log_prefix;
> +       td_image_t *image;
> +
> +       /* private */
> +       /* count of outstanding requests to the base driver */
> +       size_t inflight;
> +       /* prev holds the requests to be flushed, while inprogress holds
> +        * requests being flushed. When requests complete, they are removed
> +        * from inprogress.
> +        * Whenever a new flush is merged with an ongoing flush (i.e., prev),
> +        * we have to make sure that none of the new requests overlap with
> +        * ones in "inprogress". If one does, keep it in prev and don't issue
> +        * IO until the current one finishes. If we allow this IO to proceed,
> +        * we might end up with two "overlapping" requests in the disk's queue and
> +        * the disk may not offer any guarantee on which one is written first.
> +        * IOW, make sure we don't create a write-after-write time ordering constraint.
> +        */
> +       struct hashtable *prev;
> +       struct hashtable *inprogress;
> +       /*
> +        * The primary write request is queued in this
> +        * hashtable, and will be flushed to ramdisk when
> +        * the checkpoint finishes.
> +        */
> +       struct hashtable *primary_cache;
> +};
> +
> +int ramdisk_init(ramdisk_t *ramdisk);
> +void ramdisk_destroy(ramdisk_t *ramdisk);
> +
> +/*
> + * try to read from ramdisk. Return -1 if some sectors are not in
> + * ramdisk. Otherwise, return 0.
> + */
> +int ramdisk_read(ramdisk_t *ramdisk, uint64_t sector,
> +                int nb_sectors, char *buf);
> +
> +/*
> + * cache the write requests; they will be flushed after a
> + * new checkpoint finishes
> + */
> +int ramdisk_cache_write_request(ramdisk_t *ramdisk, uint64_t sector,
> +                               int nb_sectors, size_t sector_size,
> +                               char* buf, const char *log_prefix);
> +
> +/* flush pended write requests to disk */
> +int ramdisk_flush_pended_requests(ramdisk_t *ramdisk);
> +/*
> + * flush cached write requests to disk. If WAW is detected, the cached
> + * write requests will be moved to pended queue. The pended write
> + * requests will be auto flushed after all inprogress write requests
> + * are flushed to disk. This function doesn't wait for all write requests
> + * to be flushed to disk.
> + */
> +int ramdisk_start_flush(ramdisk_t *ramdisk);
> +/*
> + * Return true if some write requests are inprogress or pended,
> + * otherwise return false
> + */
> +int ramdisk_writes_inflight(ramdisk_t *ramdisk);
> +
>  #endif
> --
> 1.9.3
>
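
For anyone following the thread, the "sort and merge" step in
ramdisk_flush_pended_requests() may be easier to see in isolation. Below is a
minimal standalone sketch, not code from the patch: struct run and find_runs()
are hypothetical names introduced here, and only uint64_compare() mirrors the
patch. It sorts the pending sector numbers and coalesces consecutive ones into
(base, length) batches, which is what lets one write request cover several
sectors.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Three-way comparison as in the patch: u1 - u2 is unsigned, so its sign
 * cannot be used as a qsort() result. */
static int uint64_compare(const void *k1, const void *k2)
{
	uint64_t u1 = *(const uint64_t *)k1;
	uint64_t u2 = *(const uint64_t *)k2;

	return u1 < u2 ? -1 : u1 > u2 ? 1 : 0;
}

/* Hypothetical type (not in the patch): one contiguous batch of sectors. */
struct run {
	uint64_t base;	/* first sector of the batch */
	uint64_t len;	/* number of consecutive sectors */
};

/*
 * Hypothetical helper (not in the patch): sort sectors[] in place and write
 * the contiguous batches to runs[], which must have room for count entries.
 * Returns the number of batches found.
 */
static size_t find_runs(uint64_t *sectors, size_t count, struct run *runs)
{
	size_t i, n = 0;

	assert(sectors && runs);
	qsort(sectors, count, sizeof(*sectors), uint64_compare);

	for (i = 0; i < count;) {
		uint64_t base = sectors[i++];

		/* extend the batch while the next sector is adjacent */
		while (i < count && sectors[i] == sectors[i - 1] + 1)
			i++;

		runs[n].base = base;
		runs[n].len = sectors[i - 1] - base + 1;
		n++;
	}

	return n;
}
```

In the patch each such batch is then handed to merge_requests() and
create_write_request(); the sketch stops at computing the batches.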

Acked-by: Shriram Rajagopalan <rshriram@cs.ubc.ca>
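
The comment on prev/inprogress in block-replication.h describes a small state
machine, so here is a toy model of it as I read it. This is only a sketch under
stated assumptions: every toy_* name is hypothetical, each hashtable is
replaced by a 64-bit mask over sectors 0..63, and overlap is checked per sector,
whereas merge_requests() in the patch defers a whole contiguous batch when any
of its sectors is still in inprogress.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for ramdisk_t: bit N set means sector N is in that table. */
struct toy_ramdisk {
	uint64_t primary_cache;	/* writes cached since the last checkpoint */
	uint64_t prev;		/* writes waiting to be issued */
	uint64_t inprogress;	/* writes issued but not yet completed */
};

static void toy_cache_write(struct toy_ramdisk *rd, unsigned int sector)
{
	assert(sector < 64);
	rd->primary_cache |= UINT64_C(1) << sector;
}

/* Issue every pended sector that does not overlap an in-progress write;
 * overlapping sectors stay in prev (the WAW case). Returns what was issued. */
static uint64_t toy_flush_pended(struct toy_ramdisk *rd)
{
	uint64_t issue = rd->prev & ~rd->inprogress;

	rd->inprogress |= issue;
	rd->prev &= ~issue;
	return issue;
}

/* Checkpoint commit: merge the cache into prev, start a fresh cache,
 * then issue whatever can be issued right now. */
static uint64_t toy_start_flush(struct toy_ramdisk *rd)
{
	rd->prev |= rd->primary_cache;
	rd->primary_cache = 0;
	return toy_flush_pended(rd);
}

/* Completion callback: once nothing is in progress, kick prev again. */
static void toy_complete(struct toy_ramdisk *rd, uint64_t done)
{
	rd->inprogress &= ~done;
	if (!rd->inprogress && rd->prev)
		toy_flush_pended(rd);
}

static bool toy_writes_inflight(const struct toy_ramdisk *rd)
{
	return rd->inprogress || rd->prev;
}
```

With sectors 3 and 4 in flight, a second commit that rewrites sector 4 and adds
sector 7 issues only sector 7; sector 4 waits in prev until the in-progress
writes drain, which is the write-after-write ordering the comment asks for.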

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 02/17] tools: block-remus: pass uuid to the callback td_open
  2014-10-14  2:13 ` [PATCH 02/17] tools: block-remus: pass uuid to the callback td_open Wen Congyang
@ 2014-10-20  2:58   ` Shriram Rajagopalan
  0 siblings, 0 replies; 50+ messages in thread
From: Shriram Rajagopalan @ 2014-10-20  2:58 UTC (permalink / raw)
  To: Wen Congyang
  Cc: Lai Jiangshan, Ian Jackson, Jiang Yunhong, Dong Eddie, xen devel,
	Yang Hongyang, Ian Campbell


On Oct 13, 2014 10:15 PM, "Wen Congyang" <wency@cn.fujitsu.com> wrote:
>
> remus's callback td_open needs uuid, but it is hard coded as 0.
> After commit 4b1af8, the vbd's uuid is the minor of the blktap
> device, not 0.
>
> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> Acked-by: Shriram Rajagopalan <rshriram@cs.ubc.ca>
> ---
>  tools/blktap2/drivers/block-aio.c         | 3 ++-
>  tools/blktap2/drivers/block-cache.c       | 3 ++-
>  tools/blktap2/drivers/block-log.c         | 3 ++-
>  tools/blktap2/drivers/block-qcow.c        | 3 ++-
>  tools/blktap2/drivers/block-ram.c         | 3 ++-
>  tools/blktap2/drivers/block-remus.c       | 8 ++------
>  tools/blktap2/drivers/block-vhd.c         | 3 ++-
>  tools/blktap2/drivers/tapdisk-interface.c | 4 +++-
>  tools/blktap2/drivers/tapdisk.h           | 2 +-
>  9 files changed, 18 insertions(+), 14 deletions(-)
>
> diff --git a/tools/blktap2/drivers/block-aio.c b/tools/blktap2/drivers/block-aio.c
> index 10ab20b..1b560e5 100644
> --- a/tools/blktap2/drivers/block-aio.c
> +++ b/tools/blktap2/drivers/block-aio.c
> @@ -111,7 +111,8 @@ static int tdaio_get_image_info(int fd, td_disk_info_t *info)
>  }
>
>  /* Open the disk file and initialize aio state. */
> -int tdaio_open(td_driver_t *driver, const char *name, td_flag_t flags)
> +int tdaio_open(td_driver_t *driver, const char *name, td_flag_t flags,
> +              td_uuid_t uuid)
>  {
>         int i, fd, ret, o_flags;
>         struct tdaio_state *prv;
> diff --git a/tools/blktap2/drivers/block-cache.c b/tools/blktap2/drivers/block-cache.c
> index 1d2f4eb..cd6ea6a 100644
> --- a/tools/blktap2/drivers/block-cache.c
> +++ b/tools/blktap2/drivers/block-cache.c
> @@ -517,7 +517,8 @@ block_cache_put_request(block_cache_t *cache, block_cache_request_t *breq)
>  }
>
>  static int
> -block_cache_open(td_driver_t *driver, const char *name, td_flag_t flags)
> +block_cache_open(td_driver_t *driver, const char *name, td_flag_t flags,
> +                td_uuid_t uuid)
>  {
>         int i, err;
>         radix_tree_t *tree;
> diff --git a/tools/blktap2/drivers/block-log.c b/tools/blktap2/drivers/block-log.c
> index 5330cdc..7b33b63 100644
> --- a/tools/blktap2/drivers/block-log.c
> +++ b/tools/blktap2/drivers/block-log.c
> @@ -585,7 +585,8 @@ static void ctl_request(event_id_t id, char mode, void *private)
>
>  static int tdlog_close(td_driver_t*);
>
> -static int tdlog_open(td_driver_t* driver, const char* name, td_flag_t flags)
> +static int tdlog_open(td_driver_t* driver, const char* name, td_flag_t flags,
> +                     td_uuid_t uuid)
>  {
>    struct tdlog_state* s = (struct tdlog_state*)driver->data;
>    int rc;
> diff --git a/tools/blktap2/drivers/block-qcow.c b/tools/blktap2/drivers/block-qcow.c
> index b45bcaa..64dfafc 100644
> --- a/tools/blktap2/drivers/block-qcow.c
> +++ b/tools/blktap2/drivers/block-qcow.c
> @@ -865,7 +865,8 @@ out:
>  }
>
>  /* Open the disk file and initialize qcow state. */
> -int tdqcow_open (td_driver_t *driver, const char *name, td_flag_t flags)
> +int tdqcow_open (td_driver_t *driver, const char *name, td_flag_t flags,
> +                td_uuid_t uuid)
>  {
>         int fd, len, i, ret, size, o_flags;
>         td_disk_info_t *bs = &(driver->info);
> diff --git a/tools/blktap2/drivers/block-ram.c b/tools/blktap2/drivers/block-ram.c
> index a859481..b64a194 100644
> --- a/tools/blktap2/drivers/block-ram.c
> +++ b/tools/blktap2/drivers/block-ram.c
> @@ -108,7 +108,8 @@ static int get_image_info(int fd, td_disk_info_t *info)
>  }
>
>  /* Open the disk file and initialize ram state. */
> -int tdram_open (td_driver_t *driver, const char *name, td_flag_t flags)
> +int tdram_open (td_driver_t *driver, const char *name, td_flag_t flags,
> +               td_uuid_t uuid)
>  {
>         char *p;
>         uint64_t size;
> diff --git a/tools/blktap2/drivers/block-remus.c b/tools/blktap2/drivers/block-remus.c
> index 079588d..eb8c0ed 100644
> --- a/tools/blktap2/drivers/block-remus.c
> +++ b/tools/blktap2/drivers/block-remus.c
> @@ -1633,18 +1633,14 @@ static int ctl_register(struct tdremus_state *s)
>  /* interface */
>
>  static int tdremus_open(td_driver_t *driver, const char *name,
> -                       td_flag_t flags)
> +                       td_flag_t flags, td_uuid_t uuid)
>  {
>         struct tdremus_state *s = (struct tdremus_state *)driver->data;
>         int rc;
>
>         RPRINTF("opening %s\n", name);
>
> -       /* first we need to get the underlying vbd for this driver stack. To do so we
> -        * need to know the vbd's id. Fortunately, for tapdisk2 this is hard-coded as
> -        * 0 (see tapdisk2.c)
> -        */
> -       device_vbd = tapdisk_server_get_vbd(0);
> +       device_vbd = tapdisk_server_get_vbd(uuid);
>
>         memset(s, 0, sizeof(*s));
>         s->server_fd.fd = -1;
> diff --git a/tools/blktap2/drivers/block-vhd.c b/tools/blktap2/drivers/block-vhd.c
> index 76ea5bd..06e9c89 100644
> --- a/tools/blktap2/drivers/block-vhd.c
> +++ b/tools/blktap2/drivers/block-vhd.c
> @@ -675,7 +675,8 @@ __vhd_open(td_driver_t *driver, const char *name, vhd_flag_t flags)
>  }
>
>  static int
> -_vhd_open(td_driver_t *driver, const char *name, td_flag_t flags)
> +_vhd_open(td_driver_t *driver, const char *name, td_flag_t flags,
> +         td_uuid_t uuid)
>  {
>         vhd_flag_t vhd_flags = 0;
>
> diff --git a/tools/blktap2/drivers/tapdisk-interface.c b/tools/blktap2/drivers/tapdisk-interface.c
> index 2e51883..36b5393 100644
> --- a/tools/blktap2/drivers/tapdisk-interface.c
> +++ b/tools/blktap2/drivers/tapdisk-interface.c
> @@ -63,6 +63,7 @@ __td_open(td_image_t *image, td_disk_info_t *info)
>  {
>         int err;
>         td_driver_t *driver;
> +       td_vbd_t *vbd = image->private;
>
>         driver = image->driver;
>         if (!driver) {
> @@ -78,7 +79,8 @@ __td_open(td_image_t *image, td_disk_info_t *info)
>         }
>
>         if (!td_flag_test(driver->state, TD_DRIVER_OPEN)) {
> -               err = driver->ops->td_open(driver, image->name, image->flags);
> +               err = driver->ops->td_open(driver, image->name, image->flags,
> +                                          vbd->uuid);
>                 if (err) {
>                         if (!image->driver)
>                                 tapdisk_driver_free(driver);
> diff --git a/tools/blktap2/drivers/tapdisk.h b/tools/blktap2/drivers/tapdisk.h
> index 66d508e..459eaec 100644
> --- a/tools/blktap2/drivers/tapdisk.h
> +++ b/tools/blktap2/drivers/tapdisk.h
> @@ -157,7 +157,7 @@ struct tap_disk {
>         const char                  *disk_type;
>         td_flag_t                    flags;
>         int                          private_data_size;
> -       int (*td_open)               (td_driver_t *, const char *, td_flag_t);
> +       int (*td_open)               (td_driver_t *, const char *, td_flag_t, td_uuid_t);
>         int (*td_close)              (td_driver_t *);
>         int (*td_get_parent_id)      (td_driver_t *, td_disk_id_t *);
>         int (*td_validate_parent)    (td_driver_t *, td_driver_t *, td_flag_t);
> --
> 1.9.3
>
>
>

Acked-by: Shriram Rajagopalan <rshriram@cs.ubc.ca>
_______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 13/17] tools: block-remus: connect to backup asynchronously
  2014-10-20  2:50   ` Shriram Rajagopalan
@ 2014-10-20  3:00     ` Wen Congyang
  2014-10-20  3:11       ` Shriram Rajagopalan
  0 siblings, 1 reply; 50+ messages in thread
From: Wen Congyang @ 2014-10-20  3:00 UTC (permalink / raw)
  To: rshriram
  Cc: Lai Jiangshan, Ian Jackson, Jiang Yunhong, Dong Eddie, xen devel,
	Yang Hongyang, Ian Campbell

On 10/20/2014 10:50 AM, Shriram Rajagopalan wrote:
> On Oct 13, 2014 10:13 PM, "Wen Congyang" <wency@cn.fujitsu.com> wrote:
>>
>> Use the asynchronous connection API to connect to the backup.
>> Until the connection is established, queue all I/O requests
>> and handle them once it is.
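
For readers following the thread, the queue-then-drain behaviour described
above can be modelled in a few lines of C. This is only a sketch under stated
assumptions, not the code from the patch: struct replicator, submit_write(),
on_established(), forward() and the CONN_* names are hypothetical, and
requests are plain ints rather than td_request_t. The status values follow
what the checks in primary_queue_write() suggest (-1 starts the connect,
anything other than 1 queues the request).

```c
#include <stddef.h>

#define MAX_PENDING   8		/* hypothetical queue bound */
#define MAX_FORWARDED 16	/* hypothetical size of the forwarding log */

/* Hypothetical status values modelled on the checks in the patch. */
enum conn_status {
	CONN_IDLE        = -1,	/* connect not started yet */
	CONN_CONNECTING  = 0,	/* connect in flight */
	CONN_ESTABLISHED = 1	/* replication channel usable */
};

struct replicator {
	enum conn_status status;
	int pending[MAX_PENDING];	/* requests seen before the connect finished */
	size_t npending;
	int forwarded[MAX_FORWARDED];	/* stand-in for "sent to backup and disk" */
	size_t nforwarded;
};

/* Record a request as forwarded; silently ignored once the log is full. */
static void forward(struct replicator *r, int req)
{
	if (r->nforwarded < MAX_FORWARDED)
		r->forwarded[r->nforwarded++] = req;
}

/* Returns 0 on success, -1 if the pending queue is full. */
static int submit_write(struct replicator *r, int req)
{
	/* The first write kicks off the (non-blocking) connect. */
	if (r->status == CONN_IDLE)
		r->status = CONN_CONNECTING;

	/* Not connected yet: queue instead of blocking the caller. */
	if (r->status != CONN_ESTABLISHED) {
		if (r->npending == MAX_PENDING)
			return -1;
		r->pending[r->npending++] = req;
		return 0;
	}

	forward(r, req);
	return 0;
}

/* Connect-completion callback: drain the queue in arrival order. */
static void on_established(struct replicator *r)
{
	size_t i;

	r->status = CONN_ESTABLISHED;
	for (i = 0; i < r->npending; i++)
		forward(r, r->pending[i]);
	r->npending = 0;
}
```

In this model, two writes submitted before on_established() runs are held
in pending[] and then forwarded in their original order; a write submitted
afterwards is forwarded immediately.
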
>>
>> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
>> Cc: Shriram Rajagopalan <rshriram@cs.ubc.ca>
>> ---
>>  tools/blktap2/drivers/block-remus.c       | 508 +++++++++++++-----------------
>>  tools/blktap2/drivers/block-replication.h |   1 +
>>  2 files changed, 221 insertions(+), 288 deletions(-)
>>
>> diff --git a/tools/blktap2/drivers/block-remus.c b/tools/blktap2/drivers/block-remus.c
>> index e5ad782..a2b9f62 100644
>> --- a/tools/blktap2/drivers/block-remus.c
>> +++ b/tools/blktap2/drivers/block-remus.c
>> @@ -40,6 +40,7 @@
>>  #include "hashtable.h"
>>  #include "hashtable_itr.h"
>>  #include "hashtable_utility.h"
>> +#include "block-replication.h"
>>
>>  #include <errno.h>
>>  #include <inttypes.h>
>> @@ -49,10 +50,7 @@
>>  #include <string.h>
>>  #include <sys/time.h>
>>  #include <sys/types.h>
>> -#include <sys/socket.h>
>> -#include <netdb.h>
>>  #include <netinet/in.h>
>> -#include <arpa/inet.h>
>>  #include <sys/param.h>
>>  #include <sys/sysctl.h>
>>  #include <unistd.h>
>> @@ -63,10 +61,12 @@
>>  #define RAMDISK_HASHSIZE 128
>>
>>  /* connect retry timeout (seconds) */
>> -#define REMUS_CONNRETRY_TIMEOUT 10
>> +#define REMUS_CONNRETRY_TIMEOUT 1
>>
>>  #define RPRINTF(_f, _a...) syslog (LOG_DEBUG, "remus: " _f, ## _a)
>>
>> +#define MAX_REMUS_REQUESTS      TAPDISK_DATA_REQUESTS
>> +
>>  enum tdremus_mode {
>>         mode_invalid = 0,
>>         mode_unprotected,
>> @@ -75,16 +75,14 @@ enum tdremus_mode {
>>  };
>>
>>  struct tdremus_req {
>> -       uint64_t sector;
>> -       int nb_sectors;
>> -       char buf[4096];
>> +       td_request_t treq;
>>  };
>>
>>  struct req_ring {
>>         /* waste one slot to distinguish between empty and full */
>> -       struct tdremus_req requests[MAX_REQUESTS * 2 + 1];
>> -       unsigned int head;
>> -       unsigned int tail;
>> +       struct tdremus_req pending_requests[MAX_REMUS_REQUESTS + 1];
>> +       unsigned int prod;
>> +       unsigned int cons;
>>  };
>>
>>  /* TODO: This isn't very pretty, but to properly generate our own treqs (needed
>> @@ -161,13 +159,14 @@ struct tdremus_state {
>>         char*     msg_path; /* output completion message here */
>>         poll_fd_t msg_fd;
>>
>> -  /* replication host */
>> -       struct sockaddr_in sa;
>> -       poll_fd_t server_fd;    /* server listen port */
>> +       td_replication_connect_t t;
>>         poll_fd_t stream_fd;     /* replication channel */
>>
>> -       /* queue write requests, batch-replicate at submit */
>> -       struct req_ring write_ring;
>> +       /*
>> +        * queue I/O requests, batch-replicate when
>> +        * the connection is established.
>> +        */
>> +       struct req_ring queued_io;
>>
>>         /* ramdisk data*/
>>         struct ramdisk ramdisk;
>> @@ -206,11 +205,13 @@ static int tdremus_close(td_driver_t *driver);
>>
>>  static int switch_mode(td_driver_t *driver, enum tdremus_mode mode);
>>  static int ctl_respond(struct tdremus_state *s, const char *response);
>> +static int ctl_register(struct tdremus_state *s);
>> +static void ctl_unregister(struct tdremus_state *s);
>>
>>  /* ring functions */
>> -static inline unsigned int ring_next(struct req_ring* ring, unsigned int pos)
>> +static inline unsigned int ring_next(unsigned int pos)
>>  {
>> -       if (++pos >= MAX_REQUESTS * 2 + 1)
>> +       if (++pos >= MAX_REMUS_REQUESTS + 1)
>>                 return 0;
>>
>>         return pos;
>> @@ -218,13 +219,26 @@ static inline unsigned int ring_next(struct req_ring* ring, unsigned int pos)
>>
>>  static inline int ring_isempty(struct req_ring* ring)
>>  {
>> -       return ring->head == ring->tail;
>> +       return ring->cons == ring->prod;
>>  }
>>
>>  static inline int ring_isfull(struct req_ring* ring)
>>  {
>> -       return ring_next(ring, ring->tail) == ring->head;
>> +       return ring_next(ring->prod) == ring->cons;
>>  }
>> +
>> +static void ring_add_request(struct req_ring *ring, const td_request_t *treq)
>> +{
>> +       /* If ring is full, it means that tapdisk2 has some bug */
>> +       if (ring_isfull(ring)) {
>> +               RPRINTF("OOPS, ring is full\n");
>> +               exit(1);
>> +       }
>> +
>> +       ring->pending_requests[ring->prod].treq = *treq;
>> +       ring->prod = ring_next(ring->prod);
>> +}
>> +
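
The "waste one slot" convention used by the ring above can be illustrated
with a standalone sketch. The names here (struct int_ring, RING_CAPACITY,
int_ring_push(), int_ring_pop()) are hypothetical and the elements are ints
rather than tdremus_req; unlike ring_add_request(), which exits when the ring
is full, this version reports a full ring to the caller.

```c
#include <stdbool.h>

/* Hypothetical capacity; the patch sizes its ring from MAX_REMUS_REQUESTS. */
#define RING_CAPACITY 4
/* One slot is never filled, so prod == cons can only mean "empty". */
#define RING_SLOTS (RING_CAPACITY + 1)

struct int_ring {
	int slots[RING_SLOTS];
	unsigned int prod;	/* next slot to write */
	unsigned int cons;	/* next slot to read */
};

/* Advance an index by one slot, wrapping at the end of the array. */
static unsigned int ring_next_pos(unsigned int pos)
{
	return (pos + 1 >= RING_SLOTS) ? 0 : pos + 1;
}

static bool int_ring_empty(const struct int_ring *r)
{
	return r->cons == r->prod;
}

/* Full when advancing the producer would make it collide with the consumer. */
static bool int_ring_full(const struct int_ring *r)
{
	return ring_next_pos(r->prod) == r->cons;
}

/* Returns false (and stores nothing) when the ring is full. */
static bool int_ring_push(struct int_ring *r, int value)
{
	if (int_ring_full(r))
		return false;
	r->slots[r->prod] = value;
	r->prod = ring_next_pos(r->prod);
	return true;
}

/* Returns false (and leaves *value untouched) when the ring is empty. */
static bool int_ring_pop(struct int_ring *r, int *value)
{
	if (int_ring_empty(r))
		return false;
	*value = r->slots[r->cons];
	r->cons = ring_next_pos(r->cons);
	return true;
}
```

With RING_CAPACITY set to 4 the array has five slots, and the fifth push
fails because accepting it would make a full ring indistinguishable from an
empty one.
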
>>  /* Prototype declarations */
>>  static int ramdisk_flush(td_driver_t *driver, struct tdremus_state* s);
>>
>> @@ -724,89 +738,113 @@ static int mwrite(int fd, void* buf, size_t len)
>>         select(fd + 1, NULL, &wfds, NULL, &tv);
>>  }
>>
>> -
>> -static void inline close_stream_fd(struct tdremus_state *s)
>> -{
>> -       if (s->stream_fd.fd < 0)
>> -               return;
>> -
>> -       /* XXX: -2 is magic. replace with macro perhaps? */
>> -       tapdisk_server_unregister_event(s->stream_fd.id);
>> -       close(s->stream_fd.fd);
>> -       s->stream_fd.fd = -2;
>> -}
>> -
>> -static void close_server_fd(struct tdremus_state *s)
>> -{
>> -       if (s->server_fd.fd < 0)
>> -               return;
>> -
>> -       tapdisk_server_unregister_event(s->server_fd.id);
>> -       s->server_fd.id = -1;
>> -       close(s->stream_fd.fd);
>> -       s->stream_fd.fd = -1;
>> -}
>> -
>>  /* primary functions */
>>  static void remus_client_event(event_id_t, char mode, void *private);
>> +static int primary_forward_request(struct tdremus_state *s,
>> +                                  const td_request_t *treq);
>>
>> -static int primary_blocking_connect(struct tdremus_state *state)
>> +/*
>> + * It is called when we cannot connect to backup, or find I/O error when
>> + * reading/writing.
>> + */
>> +static void primary_failed(struct tdremus_state *s, int rc)
>>  {
>> -       int fd;
>> -       int id;
>> +       td_replication_connect_kill(&s->t);
>> +       if (rc == ERROR_INTERNAL)
>> +               RPRINTF("switch to unprotected mode due to internal error");
>> +       UNREGISTER_EVENT(s->stream_fd.id);
>> +       switch_mode(s->tdremus_driver, mode_unprotected);
>> +}
>> +
>> +static int remus_handle_queued_io(struct tdremus_state *s)
>> +{
>> +       struct req_ring *queued_io = &s->queued_io;
>> +       unsigned int cons;
>> +       td_request_t *treq;
>>         int rc;
>> -       int flags;
>>
>> -       RPRINTF("client connecting to %s:%d...\n",
> inet_ntoa(state->sa.sin_addr), ntohs(state->sa.sin_port));
>> +       while (!ring_isempty(queued_io)) {
>> +               cons = queued_io->cons;
>> +               treq = &queued_io->pending_requests[cons].treq;
>>
>> -       if ((fd = socket(PF_INET, SOCK_STREAM, 0)) < 0) {
>> -               RPRINTF("could not create client socket: %d\n", errno);
>> -               return -1;
>> -       }
>> -
>> -       do {
>> -               if ((rc = connect(fd, (struct sockaddr *)&state->sa,
>> -                   sizeof(state->sa))) < 0)
>> -               {
>> -                       if (errno == ECONNREFUSED) {
>> -                               RPRINTF("connection refused -- retrying
> in 1 second\n");
>> -                               sleep(1);
>> -                       } else {
>> -                               RPRINTF("connection failed: %d\n", errno);
>> -                               close(fd);
>> -                               return -1;
>> -                       }
>> +               if (treq->op == TD_OP_WRITE) {
>> +                       rc = primary_forward_request(s, treq);
>> +                       if (rc)
>> +                               return rc;
>>                 }
>> -       } while (rc < 0);
>>
>> -       RPRINTF("client connected\n");
>> -
>> -       /* make socket nonblocking */
>> -       if ((flags = fcntl(fd, F_GETFL, 0)) == -1)
>> -               flags = 0;
>> -       if (fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1)
>> -       {
>> -               RPRINTF("error making socket nonblocking\n");
>> -               close(fd);
>> -               return -1;
>> +               td_forward_request(*treq);
>> +               queued_io->cons = ring_next(cons);
>>         }
>>
>> -       if((id = tapdisk_server_register_event(SCHEDULER_POLL_READ_FD,
> fd, 0, remus_client_event, state)) < 0) {
>> -               RPRINTF("error registering client event handler: %s\n",
> strerror(id));
>> -               close(fd);
>> -               return -1;
>> -       }
>> -
>> -       state->stream_fd.fd = fd;
>> -       state->stream_fd.id = id;
>>         return 0;
>>  }
>>
>> -/* on read, just pass request through */
>> +static void remus_client_established(td_replication_connect_t *t, int rc)
>> +{
>> +       struct tdremus_state *s = CONTAINER_OF(t, *s, t);
>> +       event_id_t id;
>> +
>> +       if (rc) {
>> +               primary_failed(s, rc);
>> +               return;
>> +       }
>> +
>> +       /* the connect succeeded */
>> +       id = tapdisk_server_register_event(SCHEDULER_POLL_READ_FD, t->fd,
>> +                                          0, remus_client_event, s);
>> +       if(id < 0) {
>> +               RPRINTF("error registering client event handler: %s\n",
>> +                       strerror(id));
>> +               primary_failed(s, ERROR_INTERNAL);
>> +               return;
>> +       }
>> +
>> +       s->stream_fd.fd = t->fd;
>> +       s->stream_fd.id = id;
>> +
>> +       /* handle the queued requests */
>> +       rc = remus_handle_queued_io(s);
>> +       if (rc)
>> +               primary_failed(s, rc);
>> +}
>> +
>>  static void primary_queue_read(td_driver_t *driver, td_request_t treq)
>>  {
>> -       /* just pass read through */
>> -       td_forward_request(treq);
>> +       struct tdremus_state *s = (struct tdremus_state *)driver->data;
>> +       struct req_ring *ring = &s->queued_io;
>> +
>> +       if (ring_isempty(ring)) {
>> +               /* just pass read through */
>> +               td_forward_request(treq);
>> +               return;
>> +       }
>> +
>> +       ring_add_request(ring, &treq);
>> +}
>> +
>> +static int primary_forward_request(struct tdremus_state *s,
>> +                                  const td_request_t *treq)
>> +{
>> +       char header[sizeof(uint32_t) + sizeof(uint64_t)];
>> +       uint32_t *sectors = (uint32_t *)header;
>> +       uint64_t *sector = (uint64_t *)(header + sizeof(uint32_t));
>> +       td_driver_t *driver = s->tdremus_driver;
>> +
>> +       *sectors = treq->secs;
>> +       *sector = treq->sec;
>> +
>> +       if (mwrite(s->stream_fd.fd, TDREMUS_WRITE, strlen(TDREMUS_WRITE)) < 0)
>> +               return ERROR_IO;
>> +
>> +       if (mwrite(s->stream_fd.fd, header, sizeof(header)) < 0)
>> +               return ERROR_IO;
>> +
>> +       if (mwrite(s->stream_fd.fd, treq->buf,
>> +           treq->secs * driver->info.sector_size) < 0)
>> +               return ERROR_IO;
>> +
>> +       return 0;
>>  }
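
As a side note on the header layout built in primary_forward_request():
a 32-bit sector count followed by a 64-bit start sector, both in host byte
order. The sketch below packs and unpacks the same 12 bytes with memcpy()
rather than through casted pointers, which sidesteps any alignment concern
about the uint64_t at offset 4 on architectures that care. This is an
illustration only; pack_write_header(), unpack_write_header() and
WIRE_HDR_LEN are hypothetical names, not part of the patch.

```c
#include <stdint.h>
#include <string.h>

/* Same size as the header[] array in the patch: 4 + 8 bytes. */
#define WIRE_HDR_LEN (sizeof(uint32_t) + sizeof(uint64_t))

/* Copy the sector count, then the start sector, in host byte order. */
static void pack_write_header(unsigned char *hdr, uint32_t secs, uint64_t sec)
{
	memcpy(hdr, &secs, sizeof(secs));
	memcpy(hdr + sizeof(secs), &sec, sizeof(sec));
}

/* Inverse of pack_write_header(); hdr must hold WIRE_HDR_LEN bytes. */
static void unpack_write_header(const unsigned char *hdr,
				uint32_t *secs, uint64_t *sec)
{
	memcpy(secs, hdr, sizeof(*secs));
	memcpy(sec, hdr + sizeof(*secs), sizeof(*sec));
}
```

Because the bytes are copied as-is, the byte order on the wire is whatever
the sending host uses, so both ends are assumed to share the same
endianness.
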
>>
>>  /* TODO:
>> @@ -819,28 +857,28 @@ static void primary_queue_read(td_driver_t *driver, td_request_t treq)
>>  static void primary_queue_write(td_driver_t *driver, td_request_t treq)
>>  {
>>         struct tdremus_state *s = (struct tdremus_state *)driver->data;
>> -
>> -       char header[sizeof(uint32_t) + sizeof(uint64_t)];
>> -       uint32_t *sectors = (uint32_t *)header;
>> -       uint64_t *sector = (uint64_t *)(header + sizeof(uint32_t));
>> +       int rc, ret;
>>
>>         // RPRINTF("write: stream_fd.fd: %d\n", s->stream_fd.fd);
>>
>> -       /* -1 means we haven't connected yet, -2 means the connection was lost */
>> -       if(s->stream_fd.fd == -1) {
>> +       ret = td_replication_connect_status(&s->t);
>> +       if(ret == -1) {
>>                 RPRINTF("connecting to backup...\n");
>> -               primary_blocking_connect(s);
>> +               s->t.callback = remus_client_established;
>> +               rc = td_replication_client_start(&s->t);
>> +               if (rc)
>> +                       goto fail;
>>         }
>>
>> -       *sectors = treq.secs;
>> -       *sector = treq.sec;
>> +       /* The connection is not established, just queue the request */
>> +       if (ret != 1) {
>> +               ring_add_request(&s->queued_io, &treq);
>> +               return;
>> +       }
>>
>> -       if (mwrite(s->stream_fd.fd, TDREMUS_WRITE, strlen(TDREMUS_WRITE))
> < 0)
>> -               goto fail;
>> -       if (mwrite(s->stream_fd.fd, header, sizeof(header)) < 0)
>> -               goto fail;
>> -
>> -       if (mwrite(s->stream_fd.fd, treq.buf, treq.secs *
> driver->info.sector_size) < 0)
>> +       /* The connection is established */
>> +       rc = primary_forward_request(s, &treq);
>> +       if (rc)
>>                 goto fail;
>>
>>         td_forward_request(treq);
>> @@ -850,7 +888,7 @@ static void primary_queue_write(td_driver_t *driver,
> td_request_t treq)
>>   fail:
>>         /* switch to unprotected mode and tell tapdisk to retry */
>>         RPRINTF("write request replication failed, switching to unprotected mode");
>> -       switch_mode(s->tdremus_driver, mode_unprotected);
>> +       primary_failed(s, rc);
>>         td_complete_request(treq, -EBUSY);
>>  }
>>
>> @@ -867,7 +905,7 @@ static int client_flush(td_driver_t *driver)
>>
>>         if (mwrite(s->stream_fd.fd, TDREMUS_COMMIT, strlen(TDREMUS_COMMIT)) < 0) {
>>                 RPRINTF("error flushing output");
>> -               close_stream_fd(s);
>> +               primary_failed(s, ERROR_IO);
>>                 return -1;
>>         }
>>
>> @@ -886,6 +924,26 @@ static int server_flush(td_driver_t *driver)
>>         return ramdisk_flush(driver, s);
>>  }
>>
>> +/* It is called when switching the mode from primary to unprotected */
>> +static int primary_flush(td_driver_t *driver)
>> +{
>> +       struct tdremus_state *s = driver->data;
>> +       struct req_ring *ring = &s->queued_io;
>> +       unsigned int cons;
>> +
>> +       if (ring_isempty(ring))
>> +               return 0;
>> +
>> +       while (!ring_isempty(ring)) {
>> +               cons = ring->cons;
>> +               ring->cons = ring_next(cons);
>> +
>> +               td_forward_request(ring->pending_requests[cons].treq);
>> +       }
>> +
>> +       return client_flush(driver);
>> +}
>> +
>>  static int primary_start(td_driver_t *driver)
>>  {
>>         struct tdremus_state *s = (struct tdremus_state *)driver->data;
>> @@ -894,7 +952,7 @@ static int primary_start(td_driver_t *driver)
>>
>>         tapdisk_remus.td_queue_read = primary_queue_read;
>>         tapdisk_remus.td_queue_write = primary_queue_write;
>> -       s->queue_flush = client_flush;
>> +       s->queue_flush = primary_flush;
>>
>>         s->stream_fd.fd = -1;
>>         s->stream_fd.id = -1;
>> @@ -913,7 +971,7 @@ static void remus_client_event(event_id_t id, char
> mode, void *private)
>>         if (mread(s->stream_fd.fd, req, sizeof(req) - 1) < 0) {
>>                 /* replication stream closed or otherwise broken
> (timeout, reset, &c) */
>>                 RPRINTF("error reading from backup\n");
>> -               close_stream_fd(s);
>> +               primary_failed(s, ERROR_IO);
>>                 return;
>>         }
>>
>> @@ -924,7 +982,7 @@ static void remus_client_event(event_id_t id, char
> mode, void *private)
>>                 ctl_respond(s, TDREMUS_DONE);
>>         else {
>>                 RPRINTF("received unknown message: %s\n", req);
>> -               close_stream_fd(s);
>> +               primary_failed(s, ERROR_IO);
>>         }
>>
>>         return;
>> @@ -933,84 +991,36 @@ static void remus_client_event(event_id_t id, char
> mode, void *private)
>>  /* backup functions */
>>  static void remus_server_event(event_id_t id, char mode, void *private);
>>
>> -/* returns the socket that receives write requests */
>> -static void remus_server_accept(event_id_t id, char mode, void* private)
>> +/* It is called when we find some I/O error */
>> +static void backup_failed(struct tdremus_state *s, int rc)
>>  {
>> -       struct tdremus_state* s = (struct tdremus_state *) private;
>> +       UNREGISTER_EVENT(s->stream_fd.id);
>> +       td_replication_connect_kill(&s->t);
>> +       /* We will switch to unprotected mode in backup_queue_write() */
>> +}
>>
>> -       int stream_fd;
>> -       event_id_t cid;
>> +/* returns the socket that receives write requests */
>> +static void remus_server_established(td_replication_connect_t *t, int rc)
>> +{
>> +       struct tdremus_state *s = CONTAINER_OF(t, *s, t);
>> +       event_id_t id;
>>
>> -       /* XXX: add address-based black/white list */
>> -       if ((stream_fd = accept(s->server_fd.fd, NULL, NULL)) < 0) {
>> -               RPRINTF("error accepting connection: %d\n", errno);
>> -               return;
>> -       }
>> -
>> -       /* TODO: check to see if we are already replicating. if so just
> close the
>> -        * connection (or do something smarter) */
>> -       RPRINTF("server accepted connection\n");
>> +       /* rc is always 0 */
>>
>>         /* add tapdisk event for replication stream */
>> -       cid = tapdisk_server_register_event(SCHEDULER_POLL_READ_FD,
> stream_fd, 0,
>> -                                           remus_server_event, s);
>> +       id = tapdisk_server_register_event(SCHEDULER_POLL_READ_FD, t->fd,
> 0,
>> +                                          remus_server_event, s);
>>
>> -       if(cid < 0) {
>> -               RPRINTF("error registering connection event handler:
> %s\n", strerror(errno));
>> -               close(stream_fd);
>> +       if (id < 0) {
>> +               RPRINTF("error registering connection event handler:
> %s\n",
>> +                       strerror(errno));
>> +               td_replication_server_restart(t);
>>                 return;
>>         }
>>
>>         /* store replication file descriptor */
>> -       s->stream_fd.fd = stream_fd;
>> -       s->stream_fd.id = cid;
>> -}
>> -
>> -/* returns -2 if EADDRNOTAVAIL */
>> -static int remus_bind(struct tdremus_state* s)
>> -{
>> -//  struct sockaddr_in sa;
>> -       int opt;
>> -       int rc = -1;
>> -
>> -       if ((s->server_fd.fd = socket(AF_INET, SOCK_STREAM, 0)) < 0) {
>> -               RPRINTF("could not create server socket: %d\n", errno);
>> -               return rc;
>> -       }
>> -       opt = 1;
>> -       if (setsockopt(s->server_fd.fd, SOL_SOCKET, SO_REUSEADDR, &opt,
> sizeof(opt)) < 0)
>> -               RPRINTF("Error setting REUSEADDR on %d: %d\n",
> s->server_fd.fd, errno);
>> -
>> -       if (bind(s->server_fd.fd, (struct sockaddr *)&s->sa,
> sizeof(s->sa)) < 0) {
>> -               RPRINTF("could not bind server socket %d to %s:%d: %d
> %s\n", s->server_fd.fd,
>> -                       inet_ntoa(s->sa.sin_addr), ntohs(s->sa.sin_port),
> errno, strerror(errno));
>> -               if (errno != EADDRINUSE)
>> -                       rc = -2;
>> -               goto err_sfd;
>> -       }
>> -       if (listen(s->server_fd.fd, 10)) {
>> -               RPRINTF("could not listen on socket: %d\n", errno);
>> -               goto err_sfd;
>> -       }
>> -
>> -       /* The socket s now bound to the address and listening so we may
> now register
>> -   * the fd with tapdisk */
>> -
>> -       if((s->server_fd.id =
> tapdisk_server_register_event(SCHEDULER_POLL_READ_FD,
>> -
>  s->server_fd.fd, 0,
>> -
>  remus_server_accept, s)) < 0) {
>> -               RPRINTF("error registering server connection event
> handler: %s",
>> -                       strerror(s->server_fd.id));
>> -               goto err_sfd;
>> -       }
>> -
>> -       return 0;
>> -
>> - err_sfd:
>> -       close(s->server_fd.fd);
>> -       s->server_fd.fd = -1;
>> -
>> -       return rc;
>> +       s->stream_fd.fd = t->fd;
>> +       s->stream_fd.id = id;
>>  }
>>
>>  /* wait for latest checkpoint to be applied */
>> @@ -1053,6 +1063,8 @@ void backup_queue_write(td_driver_t *driver,
> td_request_t treq)
>>          * handle the write
>>          */
>>
>> +       /* If we have called backup_failed, calling it again is harmless
> */
>> +       backup_failed(s, ERROR_INTERNAL);
>>         switch_mode(driver, mode_unprotected);
>>         /* TODO: call the appropriate write function rather than return
> EBUSY */
>>         td_complete_request(treq, -EBUSY);
>> @@ -1061,7 +1073,6 @@ void backup_queue_write(td_driver_t *driver,
> td_request_t treq)
>>  static int backup_start(td_driver_t *driver)
>>  {
>>         struct tdremus_state *s = (struct tdremus_state *)driver->data;
>> -       int fd;
>>
>>         if (ramdisk_start(driver) < 0)
>>                 return -1;
>> @@ -1073,12 +1084,12 @@ static int backup_start(td_driver_t *driver)
>>         return 0;
>>  }
>>
>> -static int server_do_wreq(td_driver_t *driver)
>> +static void server_do_wreq(td_driver_t *driver)
>>  {
>>         struct tdremus_state *s = (struct tdremus_state *)driver->data;
>>         static tdremus_wire_t twreq;
>>         char buf[4096];
>> -       int len, rc;
>> +       int len, rc = ERROR_IO;
>>
>>         char header[sizeof(uint32_t) + sizeof(uint64_t)];
>>         uint32_t *sectors = (uint32_t *) header;
>> @@ -1097,28 +1108,28 @@ static int server_do_wreq(td_driver_t *driver)
>>         if (len > sizeof(buf)) {
>>                 /* freak out! */
>>                 RPRINTF("write request too large: %d/%u\n", len,
> (unsigned)sizeof(buf));
>> -               return -1;
>> +               goto err;
>>         }
>>
>>         if (mread(s->stream_fd.fd, buf, len) < 0)
>>                 goto err;
>>
>> -       if (ramdisk_write(&s->ramdisk, *sector, *sectors, buf) < 0)
>> +       if (ramdisk_write(&s->ramdisk, *sector, *sectors, buf) < 0) {
>> +               rc = ERROR_INTERNAL;
>>                 goto err;
>> +       }
>>
>> -       return 0;
>> +       return;
>>
>>   err:
>>         /* should start failover */
>>         RPRINTF("backup write request error\n");
>> -       close_stream_fd(s);
>> -
>> -       return -1;
>> +       backup_failed(s, rc);
>>  }
>>
>>  /* at this point, the server can start applying the most recent
>>   * ramdisk. */
>> -static int server_do_creq(td_driver_t *driver)
>> +static void server_do_creq(td_driver_t *driver)
>>  {
>>         struct tdremus_state *s = (struct tdremus_state *)driver->data;
>>
>> @@ -1128,9 +1139,7 @@ static int server_do_creq(td_driver_t *driver)
>>
>>         /* XXX this message should not be sent until flush completes! */
>>         if (write(s->stream_fd.fd, TDREMUS_DONE, strlen(TDREMUS_DONE)) !=
> 4)
>> -               return -1;
>> -
>> -       return 0;
>> +               backup_failed(s, ERROR_IO);
>>  }
>>
>>
>> @@ -1213,11 +1222,6 @@ static int unprotected_start(td_driver_t *driver)
>>
>>         RPRINTF("failure detected, activating passthrough\n");
>>
>> -       /* close the server socket */
>> -       close_stream_fd(s);
>> -
>> -       close_server_fd(s);
>> -
>>         /* install the unprotected read/write handlers */
>>         tapdisk_remus.td_queue_read = unprotected_queue_read;
>>         tapdisk_remus.td_queue_write = unprotected_queue_write;
>> @@ -1227,90 +1231,6 @@ static int unprotected_start(td_driver_t *driver)
>>
>>
>>  /* control */
>> -
>> -static inline int resolve_address(const char* addr, struct in_addr* ia)
>> -{
>> -       struct hostent* he;
>> -       uint32_t ip;
>> -
>> -       if (!(he = gethostbyname(addr))) {
>> -               RPRINTF("error resolving %s: %d\n", addr, h_errno);
>> -               return -1;
>> -       }
>> -
>> -       if (!he->h_addr_list[0]) {
>> -               RPRINTF("no address found for %s\n", addr);
>> -               return -1;
>> -       }
>> -
>> -       /* network byte order */
>> -       ip = *((uint32_t**)he->h_addr_list)[0];
>> -       ia->s_addr = ip;
>> -
>> -       return 0;
>> -}
>> -
>> -static int get_args(td_driver_t *driver, const char* name)
>> -{
>> -       struct tdremus_state *state = (struct tdremus_state
> *)driver->data;
>> -       char* host;
>> -       char* port;
>> -//  char* driver_str;
>> -//  char* parent;
>> -//  int type;
>> -//  char* path;
>> -//  unsigned long ulport;
>> -//  int i;
>> -//  struct sockaddr_in server_addr_in;
>> -
>> -       int gai_status;
>> -       int valid_addr;
>> -       struct addrinfo gai_hints;
>> -       struct addrinfo *servinfo, *servinfo_itr;
>> -
>> -       memset(&gai_hints, 0, sizeof gai_hints);
>> -       gai_hints.ai_family = AF_UNSPEC;
>> -       gai_hints.ai_socktype = SOCK_STREAM;
>> -
>> -       port = strchr(name, ':');
>> -       if (!port) {
>> -               RPRINTF("missing host in %s\n", name);
>> -               return -ENOENT;
>> -       }
>> -       if (!(host = strndup(name, port - name))) {
>> -               RPRINTF("unable to allocate host\n");
>> -               return -ENOMEM;
>> -       }
>> -       port++;
>> -
>> -       if ((gai_status = getaddrinfo(host, port, &gai_hints, &servinfo))
> != 0) {
>> -               RPRINTF("getaddrinfo error: %s\n",
> gai_strerror(gai_status));
>> -               return -ENOENT;
>> -       }
>> -
>> -       /* TODO: do something smarter here */
>> -       valid_addr = 0;
>> -       for(servinfo_itr = servinfo; servinfo_itr != NULL; servinfo_itr =
> servinfo_itr->ai_next) {
>> -               void *addr;
>> -               char *ipver;
>> -
>> -               if (servinfo_itr->ai_family == AF_INET) {
>> -                       valid_addr = 1;
>> -                       memset(&state->sa, 0, sizeof(state->sa));
>> -                       state->sa = *(struct sockaddr_in
> *)servinfo_itr->ai_addr;
>> -                       break;
>> -               }
>> -       }
>> -       freeaddrinfo(servinfo);
>> -
>> -       if (!valid_addr)
>> -               return -ENOENT;
>> -
>> -       RPRINTF("host: %s, port: %d\n", inet_ntoa(state->sa.sin_addr),
> ntohs(state->sa.sin_port));
>> -
>> -       return 0;
>> -}
>> -
>>  static int switch_mode(td_driver_t *driver, enum tdremus_mode mode)
>>  {
>>         struct tdremus_state *s = (struct tdremus_state *)driver->data;
>> @@ -1343,6 +1263,20 @@ static int switch_mode(td_driver_t *driver, enum
> tdremus_mode mode)
>>         return rc;
>>  }
>>
>> +static void ctl_reopen(struct tdremus_state *s)
>> +{
>> +       ctl_unregister(s);
>> +       CLOSE_FD(s->ctl_fd.fd);
>> +       RPRINTF("FIFO closed\n");
>> +
>> +       if ((s->ctl_fd.fd = open(s->ctl_path, O_RDWR)) < 0) {
>> +               RPRINTF("error reopening FIFO: %d\n", errno);
>> +               return;
>> +       }
>> +
>> +       ctl_register(s);
>> +}
>> +
>>  static void ctl_request(event_id_t id, char mode, void *private)
>>  {
>>         struct tdremus_state *s = (struct tdremus_state *)private;
>> @@ -1355,11 +1289,7 @@ static void ctl_request(event_id_t id, char mode,
> void *private)
>>         if (!(rc = read(s->ctl_fd.fd, msg, sizeof(msg) - 1 /* append nul
> */))) {
>>                 RPRINTF("0-byte read received, reopening FIFO\n");
>>                 /*TODO: we may have to unregister/re-register with
> tapdisk_server */
>> -               close(s->ctl_fd.fd);
>> -               RPRINTF("FIFO closed\n");
>> -               if ((s->ctl_fd.fd = open(s->ctl_path, O_RDWR)) < 0) {
>> -                       RPRINTF("error reopening FIFO: %d\n", errno);
>> -               }
>> +               ctl_reopen(s);
>>                 return;
>>         }
>>
>> @@ -1372,7 +1302,7 @@ static void ctl_request(event_id_t id, char mode,
> void *private)
>>         msg[rc] = '\0';
>>         if (!strncmp(msg, "flush", 5)) {
>>                 if (s->mode == mode_primary) {
>> -                       if ((rc = s->queue_flush(driver))) {
>> +                       if ((rc = client_flush(driver))) {
>>                                 RPRINTF("error passing flush request to backup");
>>                                 ctl_respond(s, TDREMUS_FAIL);
>>                         }
>> @@ -1521,6 +1451,7 @@ static void ctl_unregister(struct tdremus_state *s)
>>  static int tdremus_open(td_driver_t *driver, td_image_t *image, td_uuid_t uuid)
>>  {
>>         struct tdremus_state *s = (struct tdremus_state *)driver->data;
>> +       td_replication_connect_t *t = &s->t;
>>         int rc;
>>         const char *name = image->name;
>>         td_flag_t flags = image->flags;
>> @@ -1531,7 +1462,6 @@ static int tdremus_open(td_driver_t *driver, td_image_t *image, td_uuid_t uuid)
>>         remus_image = image;
>>
>>         memset(s, 0, sizeof(*s));
>> -       s->server_fd.fd = -1;
>>         s->stream_fd.fd = -1;
>>         s->ctl_fd.fd = -1;
>>         s->msg_fd.fd = -1;
>> @@ -1540,8 +1470,11 @@ static int tdremus_open(td_driver_t *driver, td_image_t *image, td_uuid_t uuid)
>>          * the driver stack from the stream_fd event handler */
>>         s->tdremus_driver = driver;
>>
>> -       /* parse name to get info etc */
>> -       if ((rc = get_args(driver, name)))
>> +       t->log_prefix = "remus";
>> +       t->retry_timeout_s = REMUS_CONNRETRY_TIMEOUT;
>> +       t->max_connections = 10;
>> +       t->callback = remus_server_established;
>> +       if ((rc = td_replication_connect_init(t, name)))
>>                 return rc;
>>
>>         if ((rc = ctl_open(driver, name))) {
>> @@ -1555,7 +1488,7 @@ static int tdremus_open(td_driver_t *driver, td_image_t *image, td_uuid_t uuid)
>>                 return rc;
>>         }
>>
>> -       if (!(rc = remus_bind(s)))
>> +       if (!(rc = td_replication_server_start(t)))
>>                 rc = switch_mode(driver, mode_backup);
>>         else if (rc == -2)
>>                 rc = switch_mode(driver, mode_primary);
>> @@ -1575,8 +1508,7 @@ static int tdremus_close(td_driver_t *driver)
>>         if (s->ramdisk.inprogress)
>>                 hashtable_destroy(s->ramdisk.inprogress, 0);
>>
>> -       close_server_fd(s);
>> -       close_stream_fd(s);
>> +       td_replication_connect_kill(&s->t);
>>         ctl_unregister(s);
>>         ctl_close(s);
>>
>> diff --git a/tools/blktap2/drivers/block-replication.h b/tools/blktap2/drivers/block-replication.h
>> index 9e051cc..07fd630 100644
>> --- a/tools/blktap2/drivers/block-replication.h
>> +++ b/tools/blktap2/drivers/block-replication.h
>> @@ -48,6 +48,7 @@
>>  enum {
>>         ERROR_INTERNAL = -1,
>>         ERROR_CONNECTION = -2,
>> +       ERROR_IO = -3,
>>  };
>>
>>  typedef struct td_replication_connect td_replication_connect_t;
>> --
>> 1.9.3
>>
> 
> The code looks ok. Have you tested this, with some read/write workload
> inside the guest? Especially read after write style sanity checks to ensure
> that there is no data corruption (caused by stale ramdisk data flushed to
> disk or served to guest), before a connection to backup has been
> established.

Which existing test tool can check this?
Before the connection to the backup has been established, the guest is blocked
as soon as the first write operation happens, so you cannot log in and run a test program.
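
For reference, once the guest is reachable, a read-after-write check of the kind asked about above could look roughly like the sketch below. This is not an existing test tool: the function name, the 4096-byte block size and the pattern formula are assumptions made for illustration, and because it goes through the filesystem it may be served from the page cache rather than from the disk backend.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

#define BLOCK_SIZE 4096

/* Fill one block with a pattern that depends on block index and generation.
 * For fewer than 256 generations, data left over from an older generation
 * never compares equal to the newest pattern. */
static void fill_block(unsigned char *buf, uint32_t index, uint32_t gen)
{
    for (size_t i = 0; i < BLOCK_SIZE; i++)
        buf[i] = (unsigned char)(index * 31u + gen * 7u + i);
}

/* Returns 0 if every block reads back as last written, 1 on a mismatch,
 * -1 on an I/O error. */
int read_after_write_check(const char *path, uint32_t nblocks, uint32_t gens)
{
    unsigned char want[BLOCK_SIZE], got[BLOCK_SIZE];
    int rc = 0;
    int fd;

    assert(path != NULL);
    fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    for (uint32_t g = 0; g < gens && rc == 0; g++) {
        /* Overwrite every block with this generation's pattern. */
        for (uint32_t b = 0; b < nblocks && rc == 0; b++) {
            fill_block(want, b, g);
            if (lseek(fd, (off_t)b * BLOCK_SIZE, SEEK_SET) < 0 ||
                write(fd, want, BLOCK_SIZE) != BLOCK_SIZE)
                rc = -1;
        }
        if (rc == 0 && fsync(fd) < 0)
            rc = -1;

        /* Read everything back; anything but the newest pattern is stale. */
        for (uint32_t b = 0; b < nblocks && rc == 0; b++) {
            fill_block(want, b, g);
            if (lseek(fd, (off_t)b * BLOCK_SIZE, SEEK_SET) < 0 ||
                read(fd, got, BLOCK_SIZE) != BLOCK_SIZE)
                rc = -1;
            else if (memcmp(want, got, BLOCK_SIZE) != 0)
                rc = 1;
        }
    }

    close(fd);
    return rc;
}
```

Calling read_after_write_check("/tmp/raw_check.bin", 8, 3) inside the guest writes three generations of eight blocks and reports whether any block still holds older data.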

> I am acking this piece under good faith that you have tested all these
> cases.

Yes. With the hack patch 17 applied, you can run remus with blktap2.

I have tested it with pgbench. IIRC, I found only one problem in that test:
select() times out in xc_domain_restore.c.

Thanks
Wen Congyang

> 
> Acked-by: Shriram Rajagopalan <rshriram@cs.ubc.ca>
> 

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 17/17] HACK: libxl/remus: setup and control disk replication for blktap2 backends
  2014-10-14  2:14 ` [PATCH 17/17] HACK: libxl/remus: setup and control disk replication for blktap2 backends Wen Congyang
@ 2014-10-20  3:00   ` Shriram Rajagopalan
  2014-10-20  3:09     ` Wen Congyang
  0 siblings, 1 reply; 50+ messages in thread
From: Shriram Rajagopalan @ 2014-10-20  3:00 UTC (permalink / raw)
  To: Wen Congyang
  Cc: Lai Jiangshan, Ian Jackson, Jiang Yunhong, Dong Eddie, xen devel,
	Yang Hongyang, Ian Campbell


On Oct 13, 2014 10:15 PM, "Wen Congyang" <wency@cn.fujitsu.com> wrote:
>
> Just for test

What do you mean? You would like these to be reviewed but not committed?

>
> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> ---
>  tools/libxl/Makefile                  |   2 +-
>  tools/libxl/libxl_create.c            |   8 ++
>  tools/libxl/libxl_internal.h          |   2 +
>  tools/libxl/libxl_remus_device.c      |   6 +
>  tools/libxl/libxl_remus_disk_blktap.c | 209 ++++++++++++++++++++++++++++++++++
>  5 files changed, 226 insertions(+), 1 deletion(-)
>  create mode 100644 tools/libxl/libxl_remus_disk_blktap.c
>
> diff --git a/tools/libxl/Makefile b/tools/libxl/Makefile
> index 0bf666f..b58c2ff 100644
> --- a/tools/libxl/Makefile
> +++ b/tools/libxl/Makefile
> @@ -56,7 +56,7 @@ else
>  LIBXL_OBJS-y += libxl_nonetbuffer.o
>  endif
>
> -LIBXL_OBJS-y += libxl_remus_device.o libxl_remus_disk_drbd.o
> +LIBXL_OBJS-y += libxl_remus_device.o libxl_remus_disk_drbd.o libxl_remus_disk_blktap.o
>
>  LIBXL_OBJS-$(CONFIG_X86) += libxl_cpuid.o libxl_x86.o
>  LIBXL_OBJS-$(CONFIG_ARM) += libxl_nocpuid.o libxl_arm.o
> diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c
> index 8b82584..e634694 100644
> --- a/tools/libxl/libxl_create.c
> +++ b/tools/libxl/libxl_create.c
> @@ -853,6 +853,14 @@ static void initiate_domain_create(libxl__egc *egc,
>      for (i = 0; i < d_config->num_disks; i++) {
>          ret = libxl__device_disk_setdefault(gc, &d_config->disks[i]);
>          if (ret) goto error_out;
> +
> +        /* TODO: cleanup it when destroying the domain */
> +        if (d_config->disks[i].backend == LIBXL_DISK_BACKEND_TAP &&
> +            d_config->disks[i].filter)
> +            libxl__blktap_devpath(gc, d_config->disks[i].pdev_path,
> +                                  d_config->disks[i].format,
> +                                  d_config->disks[i].filter,
> +                                  d_config->disks[i].filter_params);
>      }
>
>      dcs->bl.ao = ao;
> diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
> index 282b03f..a7c2334 100644
> --- a/tools/libxl/libxl_internal.h
> +++ b/tools/libxl/libxl_internal.h
> @@ -2672,6 +2672,8 @@ int init_subkind_nic(libxl__remus_devices_state *rds);
>  void cleanup_subkind_nic(libxl__remus_devices_state *rds);
>  int init_subkind_drbd_disk(libxl__remus_devices_state *rds);
>  void cleanup_subkind_drbd_disk(libxl__remus_devices_state *rds);
> +int init_subkind_blktap_disk(libxl__remus_devices_state *rds);
> +void cleanup_subkind_blktap_disk(libxl__remus_devices_state *rds);
>
>  typedef void libxl__remus_callback(libxl__egc *,
>                                     libxl__remus_devices_state *, int rc);
> diff --git a/tools/libxl/libxl_remus_device.c b/tools/libxl/libxl_remus_device.c
> index a6cb7f6..ef272ac 100644
> --- a/tools/libxl/libxl_remus_device.c
> +++ b/tools/libxl/libxl_remus_device.c
> @@ -19,9 +19,11 @@
>
>  extern const libxl__remus_device_instance_ops remus_device_nic;
>  extern const libxl__remus_device_instance_ops remus_device_drbd_disk;
> +extern const libxl__remus_device_instance_ops remus_device_blktap2_disk;
>  static const libxl__remus_device_instance_ops *remus_ops[] = {
>      &remus_device_nic,
>      &remus_device_drbd_disk,
> +    &remus_device_blktap2_disk,
>      NULL,
>  };
>
> @@ -41,6 +43,9 @@ static int init_device_subkind(libxl__remus_devices_state *rds)
>      rc = init_subkind_drbd_disk(rds);
>      if (rc) goto out;
>
> +    rc = init_subkind_blktap_disk(rds);
> +    if (rc) goto out;
> +
>      rc = 0;
>  out:
>      return rc;
> @@ -55,6 +60,7 @@ static void cleanup_device_subkind(libxl__remus_devices_state *rds)
>          cleanup_subkind_nic(rds);
>
>      cleanup_subkind_drbd_disk(rds);
> +    cleanup_subkind_blktap_disk(rds);
>  }
>
>  /*----- setup() and teardown() -----*/
> diff --git a/tools/libxl/libxl_remus_disk_blktap.c b/tools/libxl/libxl_remus_disk_blktap.c
> new file mode 100644
> index 0000000..3ae77d6
> --- /dev/null
> +++ b/tools/libxl/libxl_remus_disk_blktap.c
> @@ -0,0 +1,209 @@
> +/*
> + * Copyright (C) 2014 FUJITSU LIMITED
> + * Author Wen Congyang <wency@cn.fujitsu.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU Lesser General Public License as published
> + * by the Free Software Foundation; version 2.1 only. with the special
> + * exception on linking described in file LICENSE.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU Lesser General Public License for more details.
> + */
> +
> +#include "libxl_osdeps.h" /* must come before any other headers */
> +
> +#include "libxl_internal.h"
> +
> +#include <string.h>
> +#include <sys/un.h>
> +
> +#define     BLKTAP2_REQUEST     "flush"
> +#define     BLKTAP2_RESPONSE    "done"
> +#define     BLKTAP_CTRL_DIR     "/var/run/tap"
> +
> +typedef struct libxl__remus_blktap2_disk {
> +    char *name;
> +    char *ctl_fifo_path;
> +    char *msg_fifo_path;
> +    int ctl_fd;
> +    int msg_fd;
> +    libxl__ev_fd ev;
> +    libxl__remus_device *dev;
> +}libxl__remus_blktap2_disk;
> +
> +int init_subkind_blktap_disk(libxl__remus_devices_state *rds)
> +{
> +    return 0;
> +}
> +
> +void cleanup_subkind_blktap_disk(libxl__remus_devices_state *rds)
> +{
> +    return;
> +}
> +/* ========== setup() and teardown() ========== */
> +static void blktap2_remus_setup(libxl__egc *egc, libxl__remus_device *dev)
> +{
> +    const libxl_device_disk *disk = dev->backend_dev;
> +    libxl__remus_blktap2_disk *blktap2_disk;
> +    int rc;
> +    int i, l;
> +
> +    STATE_AO_GC(dev->rds->ao);
> +
> +    if (disk->backend != LIBXL_DISK_BACKEND_TAP ||
> +        !disk->filter ||
> +        strcmp(disk->filter, "remus")) {
> +        rc = ERROR_REMUS_DEVOPS_DOES_NOT_MATCH;
> +        goto out;
> +    }
> +
> +    dev->matched = 1;
> +    GCNEW(blktap2_disk);
> +    dev->concrete_data = blktap2_disk;
> +    blktap2_disk->ctl_fd = -1;
> +    blktap2_disk->msg_fd = -1;
> +    blktap2_disk->dev = dev;
> +
> +    blktap2_disk->name = libxl__strdup(gc, disk->filter_params);
> +    blktap2_disk->ctl_fifo_path = GCSPRINTF("%s/remus_%s",
> +                                            BLKTAP_CTRL_DIR,
> +                                            blktap2_disk->name);
> +    /* scrub fifo pathname */
> +    l = strlen(blktap2_disk->ctl_fifo_path);
> +    for (i = strlen(BLKTAP_CTRL_DIR) + 1; i < l; i++) {
> +        if (strchr(":/", blktap2_disk->ctl_fifo_path[i]))
> +            blktap2_disk->ctl_fifo_path[i] = '_';
> +    }
> +    blktap2_disk->msg_fifo_path = GCSPRINTF("%s.msg",
> +                                            blktap2_disk->ctl_fifo_path);
> +
> +    blktap2_disk->ctl_fd = open(blktap2_disk->ctl_fifo_path, O_WRONLY);
> +    blktap2_disk->msg_fd = open(blktap2_disk->msg_fifo_path, O_RDONLY);
> +    if (blktap2_disk->ctl_fd < 0 || blktap2_disk->msg_fd < 0) {
> +        rc = ERROR_FAIL;
> +        goto out;
> +    }
> +
> +    libxl__ev_fd_init(&blktap2_disk->ev);
> +
> +    rc = 0;
> +
> +out:
> +    dev->aodev.rc = rc;
> +    dev->aodev.callback(egc, &dev->aodev);
> +}
> +
> +static void blktap2_remus_teardown(libxl__egc *egc,
> +                                   libxl__remus_device *dev)
> +{
> +    libxl__remus_blktap2_disk *blktap2_disk = dev->concrete_data;
> +
> +    if (blktap2_disk->ctl_fd > 0) {
> +        close(blktap2_disk->ctl_fd);
> +        blktap2_disk->ctl_fd = -1;
> +    }
> +
> +    if (blktap2_disk->msg_fd > 0) {
> +        close(blktap2_disk->msg_fd);
> +        blktap2_disk->msg_fd = -1;
> +    }
> +
> +    dev->aodev.rc = 0;
> +    dev->aodev.callback(egc, &dev->aodev);
> +}
> +
> +/* ========== checkpointing APIs ========== */
> +/*
> + * When a new checkpoint is triggered, we do the following thing:
> + *  1. send BLKTAP2_REQUEST to tapdisk2
> + *  2. tapdisk2 send "creq"
> + *  3. secondary vm's tapdisk2 reply "done"
> + *  4. tapdisk2 writes BLKTAP2_RESPONSE to the socket
> + *  5. read BLKTAP2_RESPONSE from the socket
> + * Step1 and 5 are implemented here.
> + */
> +static void blktap2_control_readable(libxl__egc *egc, libxl__ev_fd *ev,
> +                                     int fd, short events, short revents);
> +
> +static void blktap2_remus_postsuspend(libxl__egc *egc,
> +                                      libxl__remus_device *dev)
> +{
> +    int ret;
> +    libxl__remus_blktap2_disk *blktap2_disk = dev->concrete_data;
> +    int rc = 0;
> +
> +    /* fifo fd, and not block */
> +    ret = write(blktap2_disk->ctl_fd, BLKTAP2_REQUEST, strlen(BLKTAP2_REQUEST));
> +    if (ret < strlen(BLKTAP2_REQUEST))
> +        rc = ERROR_FAIL;
> +
> +    dev->aodev.rc = rc;
> +    dev->aodev.callback(egc, &dev->aodev);
> +}
> +
> +static void blktap2_remus_commit(libxl__egc *egc,
> +                                 libxl__remus_device *dev)
> +{
> +    libxl__remus_blktap2_disk *blktap2_disk = dev->concrete_data;
> +    int rc;
> +
> +    /* Convenience aliases */
> +    const int fd = blktap2_disk->msg_fd;
> +    libxl__ev_fd *const ev = &blktap2_disk->ev;
> +
> +    STATE_AO_GC(dev->rds->ao);
> +
> +    rc = libxl__ev_fd_register(gc, ev, blktap2_control_readable, fd, POLLIN);
> +    if (rc) {
> +        dev->aodev.rc = rc;
> +        dev->aodev.callback(egc, &dev->aodev);
> +    }
> +}
> +
> +static void blktap2_control_readable(libxl__egc *egc, libxl__ev_fd *ev,
> +                                     int fd, short events, short revents)
> +{
> +    libxl__remus_blktap2_disk *blktap2_disk =
> +                CONTAINER_OF(ev, *blktap2_disk, ev);
> +    int rc = 0, ret;
> +    char response[5];
> +
> +    /* Convenience aliases */
> +    libxl__remus_device *const dev = blktap2_disk->dev;
> +
> +    EGC_GC;
> +
> +    libxl__ev_fd_deregister(gc, ev);
> +
> +    if (revents & ~POLLIN) {
> +        LOG(ERROR, "unexpected poll event 0x%x (should be POLLIN)", revents);
> +        rc = ERROR_FAIL;
> +        goto out;
> +    }
> +
> +    ret = read(fd, response, sizeof(response) - 1);
> +    if (ret < sizeof(response) - 1) {
> +        rc = ERROR_FAIL;
> +        goto out;
> +    }
> +
> +    response[4] = '\0';
> +    if (strcmp(response, BLKTAP2_RESPONSE))
> +        rc = ERROR_FAIL;
> +
> +out:
> +    dev->aodev.rc = rc;
> +    dev->aodev.callback(egc, &dev->aodev);
> +}
> +
> +
> +const libxl__remus_device_instance_ops remus_device_blktap2_disk = {
> +    .kind = LIBXL__DEVICE_KIND_VBD,
> +    .setup = blktap2_remus_setup,
> +    .teardown = blktap2_remus_teardown,
> +    .postsuspend = blktap2_remus_postsuspend,
> +    .commit = blktap2_remus_commit,
> +};
> --
> 1.9.3
>
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel
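
For readers following the checkpoint comment in the patch (steps 1 and 5), the exchange over the two FIFOs can be sketched as below. This is only an illustration under assumptions: plain pipes stand in for the control and message FIFOs, fake_tapdisk_reply() is a hypothetical stand-in for tapdisk2, and none of the function names come from libxl.

```c
#include <assert.h>
#include <string.h>
#include <unistd.h>

#define REQUEST  "flush"
#define RESPONSE "done"

/* Step 1: ask the (stand-in) tapdisk2 to flush. Returns 0 on success. */
static int send_flush(int ctl_fd)
{
    size_t len = strlen(REQUEST);
    ssize_t n;

    assert(ctl_fd >= 0);
    n = write(ctl_fd, REQUEST, len);
    return (n == (ssize_t)len) ? 0 : -1;
}

/* Step 5: read exactly strlen(RESPONSE) bytes and compare them. */
static int read_done(int msg_fd)
{
    char buf[sizeof(RESPONSE)];
    size_t want = sizeof(buf) - 1, got = 0;

    while (got < want) {
        ssize_t n = read(msg_fd, buf + got, want - got);
        if (n <= 0)          /* error or writer went away */
            return -1;
        got += (size_t)n;
    }
    buf[want] = '\0';
    return strcmp(buf, RESPONSE) == 0 ? 0 : -1;
}

/* Hypothetical stand-in for tapdisk2: consume one request and answer
 * "done" for a flush request, "fail" for anything else. */
static int fake_tapdisk_reply(int ctl_rd, int msg_wr)
{
    char req[sizeof(REQUEST)];
    const char *resp;
    ssize_t n = read(ctl_rd, req, sizeof(req) - 1);

    if (n < 0)
        return -1;
    req[n] = '\0';
    resp = strcmp(req, REQUEST) == 0 ? RESPONSE : "fail";
    return write(msg_wr, resp, strlen(resp)) == (ssize_t)strlen(resp) ? 0 : -1;
}

/* Run one checkpoint round trip over two pipes. Returns 0 on success. */
int demo_checkpoint_roundtrip(void)
{
    int ctl[2], msg[2], rc = -1;

    if (pipe(ctl) < 0)
        return -1;
    if (pipe(msg) < 0) {
        close(ctl[0]);
        close(ctl[1]);
        return -1;
    }

    if (send_flush(ctl[1]) == 0 &&
        fake_tapdisk_reply(ctl[0], msg[1]) == 0 &&
        read_done(msg[0]) == 0)
        rc = 0;

    close(ctl[0]);
    close(ctl[1]);
    close(msg[0]);
    close(msg[1]);
    return rc;
}
```

The sketch compares the byte counts as signed values so that a -1 from write() or read() is treated as a failure rather than being converted to a large unsigned number.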


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 04/17] tools: block-remus: fix bug in tdremus_close()
  2014-10-14  2:13 ` [PATCH 04/17] tools: block-remus: fix bug in tdremus_close() Wen Congyang
@ 2014-10-20  3:01   ` Shriram Rajagopalan
  2014-10-20  3:05     ` Wen Congyang
  0 siblings, 1 reply; 50+ messages in thread
From: Shriram Rajagopalan @ 2014-10-20  3:01 UTC (permalink / raw)
  To: Wen Congyang
  Cc: Lai Jiangshan, Ian Jackson, Jiang Yunhong, Dong Eddie, xen devel,
	Yang Hongyang, Ian Campbell


On Oct 13, 2014 10:15 PM, "Wen Congyang" <wency@cn.fujitsu.com> wrote:
>
> We close ctl_fd.fd, but we don't unregister ctl_fd.id. This
> causes select() to fail, and the user cannot talk to
> tapdisk2.
>
> This patch also does some cleanup.
>
> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> Acked-by: Shriram Rajagopalan <rshriram@cs.ubc.ca>
> ---
>  tools/blktap2/drivers/block-remus.c | 90 ++++++++++++++++++++++---------------
>  1 file changed, 53 insertions(+), 37 deletions(-)
>
> diff --git a/tools/blktap2/drivers/block-remus.c b/tools/blktap2/drivers/block-remus.c
> index a2c08d8..fd5f209 100644
> --- a/tools/blktap2/drivers/block-remus.c
> +++ b/tools/blktap2/drivers/block-remus.c
> @@ -151,9 +151,6 @@ typedef struct poll_fd {
>  } poll_fd_t;
>
>  struct tdremus_state {
> -//  struct tap_disk* driver;
> -       void* driver_data;
> -
>    /* XXX: this is needed so that the server can perform operations on
>     * the driver from the stream_fd event handler. fix this. */
>         td_driver_t *tdremus_driver;
> @@ -731,12 +728,26 @@ static int mwrite(int fd, void* buf, size_t len)
>
>  static void inline close_stream_fd(struct tdremus_state *s)
>  {
> +       if (s->stream_fd.fd < 0)
> +               return;
> +
>         /* XXX: -2 is magic. replace with macro perhaps? */
>         tapdisk_server_unregister_event(s->stream_fd.id);
>         close(s->stream_fd.fd);
>         s->stream_fd.fd = -2;
>  }
>
> +static void close_server_fd(struct tdremus_state *s)
> +{
> +       if (s->server_fd.fd < 0)
> +               return;
> +
> +       tapdisk_server_unregister_event(s->server_fd.id);
> +       s->server_fd.id = -1;
> +       close(s->stream_fd.fd);
> +       s->stream_fd.fd = -1;
> +}
> +
>  /* primary functions */
>  static void remus_client_event(event_id_t, char mode, void *private);
>  static void remus_connect_event(event_id_t id, char mode, void *private);
> @@ -1347,12 +1358,7 @@ static int unprotected_start(td_driver_t *driver)
>         /* close the server socket */
>         close_stream_fd(s);
>
> -       /* unregister the replication stream */
> -       tapdisk_server_unregister_event(s->server_fd.id);
> -
> -       /* close the replication stream */
> -       close(s->server_fd.fd);
> -       s->server_fd.fd = -1;
> +       close_server_fd(s);
>
>         /* install the unprotected read/write handlers */
>         tapdisk_remus.td_queue_read = unprotected_queue_read;
> @@ -1553,27 +1559,27 @@ static int ctl_open(td_driver_t *driver, const char* name)
>                         s->ctl_path[i] = '_';
>         }
>         if (asprintf(&s->msg_path, "%s.msg", s->ctl_path) < 0)
> -               goto err_ctlfifo;
> +               goto err_setmsgfifo;
>
>         if (mkfifo(s->ctl_path, S_IRWXU|S_IRWXG|S_IRWXO) && errno != EEXIST) {
>                 RPRINTF("error creating control FIFO %s: %d\n", s->ctl_path, errno);
> -               goto err_msgfifo;
> +               goto err_mkctlfifo;
>         }
>
>         if (mkfifo(s->msg_path, S_IRWXU|S_IRWXG|S_IRWXO) && errno != EEXIST) {
>                 RPRINTF("error creating message FIFO %s: %d\n", s->msg_path, errno);
> -               goto err_msgfifo;
> +               goto err_mkmsgfifo;
>         }
>
>         /* RDWR so that fd doesn't block select when no writer is present */
>         if ((s->ctl_fd.fd = open(s->ctl_path, O_RDWR)) < 0) {
>                 RPRINTF("error opening control FIFO %s: %d\n", s->ctl_path, errno);
> -               goto err_msgfifo;
> +               goto err_openctlfifo;
>         }
>
>         if ((s->msg_fd.fd = open(s->msg_path, O_RDWR)) < 0) {
>                 RPRINTF("error opening message FIFO %s: %d\n", s->msg_path, errno);
> -               goto err_openctlfifo;
> +               goto err_openmsgfifo;
>         }
>
>         RPRINTF("control FIFO %s\n", s->ctl_path);
> @@ -1581,36 +1587,45 @@ static int ctl_open(td_driver_t *driver, const char* name)
>
>         return 0;
>
> - err_openctlfifo:
> +err_openmsgfifo:
>         close(s->ctl_fd.fd);
> - err_msgfifo:
> +       s->ctl_fd.fd = -1;
> +err_openctlfifo:
> +       unlink(s->ctl_path);
> +err_mkmsgfifo:
> +       unlink(s->msg_path);
> +err_mkctlfifo:
>         free(s->msg_path);
>         s->msg_path = NULL;
> - err_ctlfifo:
> +err_setmsgfifo:
>         free(s->ctl_path);
>         s->ctl_path = NULL;
>         return -1;
>  }
>
> -static void ctl_close(td_driver_t *driver)
> +static void ctl_close(struct tdremus_state *s)
>  {
> -       struct tdremus_state *s = (struct tdremus_state *)driver->data;
> -
> -       /* TODO: close *all* connections */
> -
> -       if(s->ctl_fd.fd)
> +       if(s->ctl_fd.fd) {
>                 close(s->ctl_fd.fd);
> +               s->ctl_fd.fd = -1;
> +       }
>
>         if (s->ctl_path) {
>                 unlink(s->ctl_path);
>                 free(s->ctl_path);
>                 s->ctl_path = NULL;
>         }
> +
>         if (s->msg_path) {
>                 unlink(s->msg_path);
>                 free(s->msg_path);
>                 s->msg_path = NULL;
>         }
> +
> +       if (s->msg_fd.fd) {
> +               close(s->msg_fd.fd);
> +               s->msg_fd.fd = -1;
> +       }
>  }
>
>  static int ctl_register(struct tdremus_state *s)
> @@ -1628,6 +1643,16 @@ static int ctl_register(struct tdremus_state *s)
>         return 0;
>  }
>
> +static void ctl_unregister(struct tdremus_state *s)
> +{
> +       RPRINTF("unregistering ctl fifo\n");
> +
> +       if (s->ctl_fd.id >= 0) {
> +               tapdisk_server_unregister_event(s->ctl_fd.id);
> +               s->ctl_fd.id = -1;
> +       }
> +}
> +
>  /* interface */
>
>  static int tdremus_open(td_driver_t *driver, td_image_t *image, td_uuid_t uuid)
> @@ -1658,13 +1683,12 @@ static int tdremus_open(td_driver_t *driver, td_image_t *image, td_uuid_t uuid)
>
>         if ((rc = ctl_open(driver, name))) {
>                 RPRINTF("error setting up control channel\n");
> -               free(s->driver_data);
>                 return rc;
>         }
>
>         if ((rc = ctl_register(s))) {
>                 RPRINTF("error registering control channel\n");
> -               free(s->driver_data);
> +               ctl_close(s);
>                 return rc;
>         }
>
> @@ -1687,19 +1711,11 @@ static int tdremus_close(td_driver_t *driver)
>         RPRINTF("closing\n");
>         if (s->ramdisk.inprogress)
>                 hashtable_destroy(s->ramdisk.inprogress, 0);
> -
> -       if (s->driver_data) {
> -               free(s->driver_data);
> -               s->driver_data = NULL;
> -       }
> -       if (s->server_fd.fd >= 0) {
> -               close(s->server_fd.fd);
> -               s->server_fd.fd = -1;
> -       }
> -       if (s->stream_fd.fd >= 0)
> -               close_stream_fd(s);
>
> -       ctl_close(driver);
> +       close_server_fd(s);
> +       close_stream_fd(s);
> +       ctl_unregister(s);
> +       ctl_close(s);
>
>         return 0;
>  }
> --
> 1.9.3
>
>
>

Acked-by: Shriram Rajagopalan <rshriram@cs.ubc.ca>
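
The ordering this patch is about (drop the event registration first, then close the descriptor, then reset both fields) can be sketched in isolation as below. This is a hedged illustration only: mock_register_event() and mock_unregister_event() are hypothetical stand-ins for the tapdisk scheduler calls, and poll_fd_release() is not a function from block-remus.c. The checks use ">= 0" because 0 is a valid file descriptor.

```c
#include <assert.h>
#include <unistd.h>

typedef struct poll_fd {
    int fd;   /* descriptor, -1 when closed */
    int id;   /* event id, -1 when not registered */
} poll_fd_t;

/* Hypothetical stand-in for the scheduler: it only counts live events. */
static int live_events;

static int mock_register_event(void)
{
    return live_events++;
}

static void mock_unregister_event(int id)
{
    (void)id;
    live_events--;
}

/* Unregister before closing, and leave both fields at -1 so that a
 * second call is a harmless no-op. */
static void poll_fd_release(poll_fd_t *p)
{
    assert(p != NULL);

    if (p->id >= 0) {
        mock_unregister_event(p->id);
        p->id = -1;
    }
    if (p->fd >= 0) {
        close(p->fd);
        p->fd = -1;
    }
}

/* Returns 0 if the descriptor and its event were both released once. */
int demo_release(void)
{
    int p[2];
    poll_fd_t ctl = { -1, -1 };

    if (pipe(p) < 0)
        return -1;
    close(p[1]);

    ctl.fd = p[0];
    ctl.id = mock_register_event();

    poll_fd_release(&ctl);
    poll_fd_release(&ctl);   /* second call must not unregister again */

    return (live_events == 0 && ctl.fd == -1 && ctl.id == -1) ? 0 : -1;
}
```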

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 11/17] tools: block-remus: clean unused functions
  2014-10-14  2:13 ` [PATCH 11/17] tools: block-remus: clean unused functions Wen Congyang
@ 2014-10-20  3:01   ` Shriram Rajagopalan
  0 siblings, 0 replies; 50+ messages in thread
From: Shriram Rajagopalan @ 2014-10-20  3:01 UTC (permalink / raw)
  To: Wen Congyang
  Cc: Lai Jiangshan, Ian Jackson, Jiang Yunhong, Dong Eddie, xen devel,
	Yang Hongyang, Ian Campbell


On Oct 13, 2014 10:15 PM, "Wen Congyang" <wency@cn.fujitsu.com> wrote:
>
> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> ---
>  tools/blktap2/drivers/block-remus.c | 142 ------------------------------------
>  1 file changed, 142 deletions(-)
>
> diff --git a/tools/blktap2/drivers/block-remus.c b/tools/blktap2/drivers/block-remus.c
> index 9be47f6..e5ad782 100644
> --- a/tools/blktap2/drivers/block-remus.c
> +++ b/tools/blktap2/drivers/block-remus.c
> @@ -186,7 +186,6 @@ typedef struct tdremus_wire {
>
>  #define TDREMUS_READ "rreq"
>  #define TDREMUS_WRITE "wreq"
> -#define TDREMUS_SUBMIT "sreq"
>  #define TDREMUS_COMMIT "creq"
>  #define TDREMUS_DONE "done"
>  #define TDREMUS_FAIL "fail"
> @@ -750,42 +749,6 @@ static void close_server_fd(struct tdremus_state *s)
>
>  /* primary functions */
>  static void remus_client_event(event_id_t, char mode, void *private);
> -static void remus_connect_event(event_id_t id, char mode, void *private);
> -static void remus_retry_connect_event(event_id_t id, char mode, void *private);
> -
> -static int primary_do_connect(struct tdremus_state *state)
> -{
> -       event_id_t id;
> -       int fd;
> -       int rc;
> -       int flags;
> -
> -       RPRINTF("client connecting to %s:%d...\n", inet_ntoa(state->sa.sin_addr), ntohs(state->sa.sin_port));
> -
> -       if ((fd = socket(PF_INET, SOCK_STREAM, 0)) < 0) {
> -               RPRINTF("could not create client socket: %d\n", errno);
> -               return -1;
> -       }
> -
> -       /* make socket nonblocking */
> -       if ((flags = fcntl(fd, F_GETFL, 0)) == -1)
> -               flags = 0;
> -       if (fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1)
> -               return -1;
> -
> -       /* once we have created the socket and populated the address, we can now start
> -        * our non-blocking connect. rather than duplicating code we trigger a timeout
> -        * on the socket fd, which calls out nonblocking connect code
> -        */
> -       if((id = tapdisk_server_register_event(SCHEDULER_POLL_TIMEOUT, fd, 0, remus_retry_connect_event, state)) < 0) {
> -               RPRINTF("error registering timeout client connection event handler: %s\n", strerror(id));
> -               /* TODO: we leak a fd here */
> -               return -1;
> -       }
> -       state->stream_fd.fd = fd;
> -       state->stream_fd.id = id;
> -       return 0;
> -}
>
>  static int primary_blocking_connect(struct tdremus_state *state)
>  {
> @@ -939,100 +902,6 @@ static int primary_start(td_driver_t *driver)
>         return 0;
>  }
>
> -/* timeout callback */
> -static void remus_retry_connect_event(event_id_t id, char mode, void *private)
> -{
> -       struct tdremus_state *s = (struct tdremus_state *)private;
> -
> -       /* do a non-blocking connect */
> -       if (connect(s->stream_fd.fd, (struct sockaddr *)&s->sa, sizeof(s->sa))
> -           && errno != EINPROGRESS)
> -       {
> -               if(errno == ECONNREFUSED || errno == ENETUNREACH || errno == EAGAIN || errno == ECONNABORTED)
> -               {
> -                       /* try again in a second */
> -                       tapdisk_server_unregister_event(s->stream_fd.id);
> -                       if((id = tapdisk_server_register_event(SCHEDULER_POLL_TIMEOUT, s->stream_fd.fd, REMUS_CONNRETRY_TIMEOUT, remus_retry_connect_event, s)) < 0) {
> -                               RPRINTF("error registering timeout client connection event handler: %s\n", strerror(id));
> -                               return;
> -                       }
> -                       s->stream_fd.id = id;
> -               }
> -               else
> -               {
> -                       /* not recoverable */
> -                       RPRINTF("error connection to server %s\n", strerror(errno));
> -                       return;
> -               }
> -       }
> -       else
> -       {
> -               /* the connect returned EINPROGRESS (nonblocking connect) we must wait for the fd to be writeable to determine if the connect worked */
> -
> -               tapdisk_server_unregister_event(s->stream_fd.id);
> -               if((id = tapdisk_server_register_event(SCHEDULER_POLL_WRITE_FD, s->stream_fd.fd, 0, remus_connect_event, s)) < 0) {
> -                       RPRINTF("error registering client connection event handler: %s\n", strerror(id));
> -                       return;
> -               }
> -               s->stream_fd.id = id;
> -       }
> -}
> -
> -/* callback when nonblocking connect() is finished */
> -/* called only by primary in unprotected state */
> -static void remus_connect_event(event_id_t id, char mode, void *private)
> -{
> -       int socket_errno;
> -       socklen_t socket_errno_size;
> -       struct tdremus_state *s = (struct tdremus_state *)private;
> -
> -       /* check to se if the connect succeeded */
> -       socket_errno_size = sizeof(socket_errno);
> -       if (getsockopt(s->stream_fd.fd, SOL_SOCKET, SO_ERROR, &socket_errno, &socket_errno_size)) {
> -               RPRINTF("error getting socket errno\n");
> -               return;
> -       }
> -
> -       RPRINTF("socket connect returned %d\n", socket_errno);
> -
> -       if(socket_errno)
> -       {
> -               /* the connect did not succeed */
> -
> -               if(socket_errno == ECONNREFUSED || socket_errno == ENETUNREACH || socket_errno == ETIMEDOUT
> -                  || socket_errno == ECONNABORTED || socket_errno == EAGAIN)
> -               {
> -                       /* we can probably assume that the backup is down. just try again later */
> -                       tapdisk_server_unregister_event(s->stream_fd.id);
> -                       if((id = tapdisk_server_register_event(SCHEDULER_POLL_TIMEOUT, s->stream_fd.fd, REMUS_CONNRETRY_TIMEOUT, remus_retry_connect_event, s)) < 0) {
> -                               RPRINTF("error registering timeout client connection event handler: %s\n", strerror(id));
> -                               return;
> -                       }
> -                       s->stream_fd.id = id;
> -               }
> -               else
> -               {
> -                       RPRINTF("socket connect returned %d, giving up\n", socket_errno);
> -               }
> -       }
> -       else
> -       {
> -               /* the connect succeeded */
> -
> -               /* unregister this function and register a new event handler */
> -               tapdisk_server_unregister_event(s->stream_fd.id);
> -               if((id = tapdisk_server_register_event(SCHEDULER_POLL_READ_FD, s->stream_fd.fd, 0, remus_client_event, s)) < 0) {
> -                       RPRINTF("error registering client event handler: %s\n", strerror(id));
> -                       return;
> -               }
> -               s->stream_fd.id = id;
> -
> -               /* switch from unprotected to protected client */
> -               switch_mode(s->tdremus_driver, mode_primary);
> -       }
> -}
> -
> -
>  /* we install this event handler on the primary once we have connected to the backup */
>  /* wait for "done" message to commit checkpoint */
>  static void remus_client_event(event_id_t id, char mode, void *private)
> @@ -1247,15 +1116,6 @@ static int server_do_wreq(td_driver_t *driver)
>         return -1;
>  }
>
> -static int server_do_sreq(td_driver_t *driver)
> -{
> -       /*
> -         RPRINTF("submit request received\n");
> -  */
> -
> -       return 0;
> -}
> -
>  /* at this point, the server can start applying the most recent
>   * ramdisk. */
>  static int server_do_creq(td_driver_t *driver)
> @@ -1296,8 +1156,6 @@ static void remus_server_event(event_id_t id, char mode, void *private)
>
>         if (!strcmp(req, TDREMUS_WRITE))
>                 server_do_wreq(driver);
> -       else if (!strcmp(req, TDREMUS_SUBMIT))
> -               server_do_sreq(driver);
>         else if (!strcmp(req, TDREMUS_COMMIT))
>                 server_do_creq(driver);
>         else
> --
> 1.9.3
>
>
>

Acked-by: Shriram Rajagopalan <rshriram@cs.ubc.ca>
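
For context, the handlers removed here followed the usual nonblocking connect pattern: start connect() on an O_NONBLOCK socket, wait for the descriptor to become writable, then read SO_ERROR to learn the outcome. A generic sketch of that pattern outside the tapdisk scheduler is below. It uses poll() instead of tapdisk events, the function names are made up for illustration, and the demo connects to a throwaway listener on the loopback address, which assumes loopback networking is available.

```c
#include <assert.h>
#include <arpa/inet.h>
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <poll.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Start a nonblocking connect. Returns the socket, or -1 on error. */
static int start_connect(const struct sockaddr_in *sa)
{
    int fd, flags;

    assert(sa != NULL);
    fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0) {
        close(fd);
        return -1;
    }

    /* EINPROGRESS means the result arrives later, which is expected. */
    if (connect(fd, (const struct sockaddr *)sa, sizeof(*sa)) < 0 &&
        errno != EINPROGRESS) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Wait until the socket is writable, then fetch the deferred result.
 * Returns 0 if connected, otherwise an errno value. */
static int finish_connect(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLOUT };
    int err = 0;
    socklen_t len = sizeof(err);
    int n = poll(&pfd, 1, timeout_ms);

    if (n == 0)
        return ETIMEDOUT;
    if (n < 0)
        return errno;
    if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) < 0)
        return errno;
    return err;
}

/* Connect to a temporary loopback listener. Returns 0 on success. */
int demo_loopback_connect(void)
{
    struct sockaddr_in sa;
    socklen_t len = sizeof(sa);
    int rc = -1, cfd = -1;
    int lfd = socket(AF_INET, SOCK_STREAM, 0);

    if (lfd < 0)
        return -1;

    memset(&sa, 0, sizeof(sa));
    sa.sin_family = AF_INET;
    sa.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    sa.sin_port = 0;                      /* let the kernel pick a port */

    if (bind(lfd, (struct sockaddr *)&sa, sizeof(sa)) == 0 &&
        listen(lfd, 1) == 0 &&
        getsockname(lfd, (struct sockaddr *)&sa, &len) == 0 &&
        (cfd = start_connect(&sa)) >= 0)
        rc = finish_connect(cfd, 1000);

    if (cfd >= 0)
        close(cfd);
    close(lfd);
    return rc;
}
```

A retry policy like the removed code had (try again later on ECONNREFUSED, ENETUNREACH and similar) would sit on top of the value returned by finish_connect().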
_______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 03/17] tools: block-remus: use correct way to get remus_image
  2014-10-14  2:13 ` [PATCH 03/17] tools: block-remus: use correct way to get remus_image Wen Congyang
@ 2014-10-20  3:02   ` Shriram Rajagopalan
  0 siblings, 0 replies; 50+ messages in thread
From: Shriram Rajagopalan @ 2014-10-20  3:02 UTC (permalink / raw)
  To: Wen Congyang
  Cc: Lai Jiangshan, Ian Jackson, Jiang Yunhong, Dong Eddie, xen devel,
	Yang Hongyang, Ian Campbell


[-- Attachment #1.1: Type: text/plain, Size: 9269 bytes --]

On Oct 13, 2014 10:15 PM, "Wen Congyang" <wency@cn.fujitsu.com> wrote:
>
> We set remus_image in backup_read(). If we do flush
> before the first read operation, remus_image will be
> NULL. Pass image to remus via the callback td_open().
>
> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> Acked-by: Shriram Rajagopalan <rshriram@cs.ubc.ca>
> ---
>  tools/blktap2/drivers/block-aio.c         | 6 ++++--
>  tools/blktap2/drivers/block-cache.c       | 5 +++--
>  tools/blktap2/drivers/block-log.c         | 5 +++--
>  tools/blktap2/drivers/block-qcow.c        | 6 ++++--
>  tools/blktap2/drivers/block-ram.c         | 6 ++++--
>  tools/blktap2/drivers/block-remus.c       | 8 ++++----
>  tools/blktap2/drivers/block-vhd.c         | 6 ++++--
>  tools/blktap2/drivers/tapdisk-interface.c | 3 +--
>  tools/blktap2/drivers/tapdisk.h           | 2 +-
>  9 files changed, 28 insertions(+), 19 deletions(-)
>
> diff --git a/tools/blktap2/drivers/block-aio.c
b/tools/blktap2/drivers/block-aio.c
> index 1b560e5..27ba07d 100644
> --- a/tools/blktap2/drivers/block-aio.c
> +++ b/tools/blktap2/drivers/block-aio.c
> @@ -40,6 +40,7 @@
>  #include "tapdisk.h"
>  #include "tapdisk-driver.h"
>  #include "tapdisk-interface.h"
> +#include "tapdisk-image.h"
>
>  #define MAX_AIO_REQS         TAPDISK_DATA_REQUESTS
>
> @@ -111,11 +112,12 @@ static int tdaio_get_image_info(int fd,
td_disk_info_t *info)
>  }
>
>  /* Open the disk file and initialize aio state. */
> -int tdaio_open(td_driver_t *driver, const char *name, td_flag_t flags,
> -              td_uuid_t uuid)
> +int tdaio_open(td_driver_t *driver, td_image_t *image, td_uuid_t uuid)
>  {
>         int i, fd, ret, o_flags;
>         struct tdaio_state *prv;
> +       const char *name = image->name;
> +       td_flag_t flags = image->flags;
>
>         ret = 0;
>         prv = (struct tdaio_state *)driver->data;
> diff --git a/tools/blktap2/drivers/block-cache.c
b/tools/blktap2/drivers/block-cache.c
> index cd6ea6a..ff2c773 100644
> --- a/tools/blktap2/drivers/block-cache.c
> +++ b/tools/blktap2/drivers/block-cache.c
> @@ -517,12 +517,13 @@ block_cache_put_request(block_cache_t *cache,
block_cache_request_t *breq)
>  }
>
>  static int
> -block_cache_open(td_driver_t *driver, const char *name, td_flag_t flags,
> -                td_uuid_t uuid)
> +block_cache_open(td_driver_t *driver, td_image_t *image, td_uuid_t uuid)
>  {
>         int i, err;
>         radix_tree_t *tree;
>         block_cache_t *cache;
> +       const char *name = image->name;
> +       td_flag_t flags = image->flags;
>
>         if (!td_flag_test(flags, TD_OPEN_RDONLY))
>                 return -EINVAL;
> diff --git a/tools/blktap2/drivers/block-log.c
b/tools/blktap2/drivers/block-log.c
> index 7b33b63..80351d3 100644
> --- a/tools/blktap2/drivers/block-log.c
> +++ b/tools/blktap2/drivers/block-log.c
> @@ -585,11 +585,12 @@ static void ctl_request(event_id_t id, char mode,
void *private)
>
>  static int tdlog_close(td_driver_t*);
>
> -static int tdlog_open(td_driver_t* driver, const char* name, td_flag_t
flags,
> -                     td_uuid_t uuid)
> +static int tdlog_open(td_driver_t* driver, td_image_t *image, td_uuid_t
uuid)
>  {
>    struct tdlog_state* s = (struct tdlog_state*)driver->data;
>    int rc;
> +  const char *name = image->name;
> +  td_flag_t flags = image->flags;
>
>    memset(s, 0, sizeof(*s));
>
> diff --git a/tools/blktap2/drivers/block-qcow.c
b/tools/blktap2/drivers/block-qcow.c
> index 64dfafc..c63bd9d 100644
> --- a/tools/blktap2/drivers/block-qcow.c
> +++ b/tools/blktap2/drivers/block-qcow.c
> @@ -45,6 +45,7 @@
>  #include "qcow.h"
>  #include "blk.h"
>  #include "atomicio.h"
> +#include "tapdisk-image.h"
>
>  /* *BSD has no O_LARGEFILE */
>  #ifndef O_LARGEFILE
> @@ -865,14 +866,15 @@ out:
>  }
>
>  /* Open the disk file and initialize qcow state. */
> -int tdqcow_open (td_driver_t *driver, const char *name, td_flag_t flags,
> -                td_uuid_t uuid)
> +int tdqcow_open (td_driver_t *driver, td_image_t *image, td_uuid_t uuid)
>  {
>         int fd, len, i, ret, size, o_flags;
>         td_disk_info_t *bs = &(driver->info);
>         struct tdqcow_state   *s  = (struct tdqcow_state *)driver->data;
>         QCowHeader header;
>         uint64_t final_cluster = 0;
> +       const char *name = image->name;
> +       td_flag_t flags = image->flags;
>
>         DPRINTF("QCOW: Opening %s\n", name);
>
> diff --git a/tools/blktap2/drivers/block-ram.c
b/tools/blktap2/drivers/block-ram.c
> index b64a194..3e148ab 100644
> --- a/tools/blktap2/drivers/block-ram.c
> +++ b/tools/blktap2/drivers/block-ram.c
> @@ -40,6 +40,7 @@
>  #include "tapdisk.h"
>  #include "tapdisk-driver.h"
>  #include "tapdisk-interface.h"
> +#include "tapdisk-image.h"
>
>  char *img;
>  long int   disksector_size;
> @@ -108,13 +109,14 @@ static int get_image_info(int fd, td_disk_info_t
*info)
>  }
>
>  /* Open the disk file and initialize ram state. */
> -int tdram_open (td_driver_t *driver, const char *name, td_flag_t flags,
> -               td_uuid_t uuid)
> +int tdram_open (td_driver_t *driver, td_image_t *image, td_uuid_t uuid)
>  {
>         char *p;
>         uint64_t size;
>         int i, fd, ret = 0, count = 0, o_flags;
>         struct tdram_state *prv = (struct tdram_state *)driver->data;
> +       const char *name = image->name;
> +       td_flag_t flags = image->flags;
>
>         connections++;
>
> diff --git a/tools/blktap2/drivers/block-remus.c
b/tools/blktap2/drivers/block-remus.c
> index eb8c0ed..a2c08d8 100644
> --- a/tools/blktap2/drivers/block-remus.c
> +++ b/tools/blktap2/drivers/block-remus.c
> @@ -1152,8 +1152,6 @@ void backup_queue_read(td_driver_t *driver,
td_request_t treq)
>  {
>         struct tdremus_state *s = (struct tdremus_state *)driver->data;
>         int i;
> -       if(!remus_image)
> -               remus_image = treq.image;
>
>         /* check if this read is queued in any currently ongoing flush */
>         if (ramdisk_read(&s->ramdisk, treq.sec, treq.secs, treq.buf)) {
> @@ -1632,15 +1630,17 @@ static int ctl_register(struct tdremus_state *s)
>
>  /* interface */
>
> -static int tdremus_open(td_driver_t *driver, const char *name,
> -                       td_flag_t flags, td_uuid_t uuid)
> +static int tdremus_open(td_driver_t *driver, td_image_t *image,
td_uuid_t uuid)
>  {
>         struct tdremus_state *s = (struct tdremus_state *)driver->data;
>         int rc;
> +       const char *name = image->name;
> +       td_flag_t flags = image->flags;
>
>         RPRINTF("opening %s\n", name);
>
>         device_vbd = tapdisk_server_get_vbd(uuid);
> +       remus_image = image;
>
>         memset(s, 0, sizeof(*s));
>         s->server_fd.fd = -1;
> diff --git a/tools/blktap2/drivers/block-vhd.c
b/tools/blktap2/drivers/block-vhd.c
> index 06e9c89..b20f724 100644
> --- a/tools/blktap2/drivers/block-vhd.c
> +++ b/tools/blktap2/drivers/block-vhd.c
> @@ -59,6 +59,7 @@
>  #include "tapdisk-driver.h"
>  #include "tapdisk-interface.h"
>  #include "tapdisk-disktype.h"
> +#include "tapdisk-image.h"
>
>  unsigned int SPB;
>
> @@ -675,10 +676,11 @@ __vhd_open(td_driver_t *driver, const char *name,
vhd_flag_t flags)
>  }
>
>  static int
> -_vhd_open(td_driver_t *driver, const char *name, td_flag_t flags,
> -         td_uuid_t uuid)
> +_vhd_open(td_driver_t *driver, td_image_t *image, td_uuid_t uuid)
>  {
>         vhd_flag_t vhd_flags = 0;
> +       const char *name = image->name;
> +       td_flag_t flags = image->flags;
>
>         if (flags & TD_OPEN_RDONLY)
>                 vhd_flags |= VHD_FLAG_OPEN_RDONLY;
> diff --git a/tools/blktap2/drivers/tapdisk-interface.c
b/tools/blktap2/drivers/tapdisk-interface.c
> index 36b5393..a29de64 100644
> --- a/tools/blktap2/drivers/tapdisk-interface.c
> +++ b/tools/blktap2/drivers/tapdisk-interface.c
> @@ -79,8 +79,7 @@ __td_open(td_image_t *image, td_disk_info_t *info)
>         }
>
>         if (!td_flag_test(driver->state, TD_DRIVER_OPEN)) {
> -               err = driver->ops->td_open(driver, image->name,
image->flags,
> -                                          vbd->uuid);
> +               err = driver->ops->td_open(driver, image, vbd->uuid);
>                 if (err) {
>                         if (!image->driver)
>                                 tapdisk_driver_free(driver);
> diff --git a/tools/blktap2/drivers/tapdisk.h
b/tools/blktap2/drivers/tapdisk.h
> index 459eaec..3c3b51d 100644
> --- a/tools/blktap2/drivers/tapdisk.h
> +++ b/tools/blktap2/drivers/tapdisk.h
> @@ -157,7 +157,7 @@ struct tap_disk {
>         const char                  *disk_type;
>         td_flag_t                    flags;
>         int                          private_data_size;
> -       int (*td_open)               (td_driver_t *, const char *,
td_flag_t, td_uuid_t);
> +       int (*td_open)               (td_driver_t *, td_image_t *,
td_uuid_t);
>         int (*td_close)              (td_driver_t *);
>         int (*td_get_parent_id)      (td_driver_t *, td_disk_id_t *);
>         int (*td_validate_parent)    (td_driver_t *, td_driver_t *,
td_flag_t);
> --
> 1.9.3
>
>
>

Acked-by: Shriram Rajagopalan <rshriram@cs.ubc.ca>
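
For readers following the thread, here is a minimal sketch of the idea
behind this patch. It is not the tapdisk code: image_t, sketch_open() and
sketch_flush() are hypothetical stand-ins, and remus_image_sketch only plays
the role of remus_image. The point is that a pointer captured at open time is
valid before any read is queued, whereas a pointer assigned lazily in the
read path is still NULL if a flush arrives first.

```c
#include <stddef.h>

/* Hypothetical stand-in for td_image_t; not the real definition. */
typedef struct image {
    const char *name;
    unsigned    flags;
} image_t;

/* Plays the role of the global remus_image in block-remus.c. */
static image_t *remus_image_sketch;

/* New-style open: the image is handed over directly, so the pointer is
 * set as soon as the driver is opened. */
static int sketch_open(image_t *image)
{
    if (image == NULL)
        return -1;
    remus_image_sketch = image;
    return 0;
}

/* A flush that needs the image. It fails if the pointer was never set,
 * which is what could happen when it was only assigned on first read. */
static int sketch_flush(void)
{
    return remus_image_sketch != NULL ? 0 : -1;
}
```

With this shape, a flush issued right after open succeeds even if no read
has been seen yet.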
_______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 04/17] tools: block-remus: fix bug in tdremus_close()
  2014-10-20  3:01   ` Shriram Rajagopalan
@ 2014-10-20  3:05     ` Wen Congyang
  0 siblings, 0 replies; 50+ messages in thread
From: Wen Congyang @ 2014-10-20  3:05 UTC (permalink / raw)
  To: rshriram
  Cc: Lai Jiangshan, Ian Jackson, Jiang Yunhong, Dong Eddie, xen devel,
	Yang Hongyang, Ian Campbell

On 10/20/2014 11:01 AM, Shriram Rajagopalan wrote:
> On Oct 13, 2014 10:15 PM, "Wen Congyang" <wency@cn.fujitsu.com> wrote:
>>
>> We close ctl_fd.fd, but we don't unregister ctl_fd.id. This
>> causes select() to fail, and the user can no longer talk to
>> tapdisk2.
>>
>> This patch also does some cleanup.
>>
>> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
>> Acked-by: Shriram Rajagopalan <rshriram@cs.ubc.ca>
>> ---
>>  tools/blktap2/drivers/block-remus.c | 90
> ++++++++++++++++++++++---------------
>>  1 file changed, 53 insertions(+), 37 deletions(-)
>>
>> diff --git a/tools/blktap2/drivers/block-remus.c
> b/tools/blktap2/drivers/block-remus.c
>> index a2c08d8..fd5f209 100644
>> --- a/tools/blktap2/drivers/block-remus.c
>> +++ b/tools/blktap2/drivers/block-remus.c
>> @@ -151,9 +151,6 @@ typedef struct poll_fd {
>>  } poll_fd_t;
>>
>>  struct tdremus_state {
>> -//  struct tap_disk* driver;
>> -       void* driver_data;
>> -
>>    /* XXX: this is needed so that the server can perform operations on
>>     * the driver from the stream_fd event handler. fix this. */
>>         td_driver_t *tdremus_driver;
>> @@ -731,12 +728,26 @@ static int mwrite(int fd, void* buf, size_t len)
>>
>>  static void inline close_stream_fd(struct tdremus_state *s)
>>  {
>> +       if (s->stream_fd.fd < 0)
>> +               return;
>> +
>>         /* XXX: -2 is magic. replace with macro perhaps? */
>>         tapdisk_server_unregister_event(s->stream_fd.id);
>>         close(s->stream_fd.fd);
>>         s->stream_fd.fd = -2;
>>  }
>>
>> +static void close_server_fd(struct tdremus_state *s)
>> +{
>> +       if (s->server_fd.fd < 0)
>> +               return;
>> +
>> +       tapdisk_server_unregister_event(s->server_fd.id);
>> +       s->server_fd.id = -1;
>> +       close(s->server_fd.fd);
>> +       s->server_fd.fd = -1;
>> +}
>> +
>>  /* primary functions */
>>  static void remus_client_event(event_id_t, char mode, void *private);
>>  static void remus_connect_event(event_id_t id, char mode, void *private);
>> @@ -1347,12 +1358,7 @@ static int unprotected_start(td_driver_t *driver)
>>         /* close the server socket */
>>         close_stream_fd(s);
>>
>> -       /* unregister the replication stream */
>> -       tapdisk_server_unregister_event(s->server_fd.id);
>> -
>> -       /* close the replication stream */
>> -       close(s->server_fd.fd);
>> -       s->server_fd.fd = -1;
>> +       close_server_fd(s);
>>
>>         /* install the unprotected read/write handlers */
>>         tapdisk_remus.td_queue_read = unprotected_queue_read;
>> @@ -1553,27 +1559,27 @@ static int ctl_open(td_driver_t *driver, const
> char* name)
>>                         s->ctl_path[i] = '_';
>>         }
>>         if (asprintf(&s->msg_path, "%s.msg", s->ctl_path) < 0)
>> -               goto err_ctlfifo;
>> +               goto err_setmsgfifo;
>>
>>         if (mkfifo(s->ctl_path, S_IRWXU|S_IRWXG|S_IRWXO) && errno !=
> EEXIST) {
>>                 RPRINTF("error creating control FIFO %s: %d\n",
> s->ctl_path, errno);
>> -               goto err_msgfifo;
>> +               goto err_mkctlfifo;
>>         }
>>
>>         if (mkfifo(s->msg_path, S_IRWXU|S_IRWXG|S_IRWXO) && errno !=
> EEXIST) {
>>                 RPRINTF("error creating message FIFO %s: %d\n",
> s->msg_path, errno);
>> -               goto err_msgfifo;
>> +               goto err_mkmsgfifo;
>>         }
>>
>>         /* RDWR so that fd doesn't block select when no writer is present
> */
>>         if ((s->ctl_fd.fd = open(s->ctl_path, O_RDWR)) < 0) {
>>                 RPRINTF("error opening control FIFO %s: %d\n",
> s->ctl_path, errno);
>> -               goto err_msgfifo;
>> +               goto err_openctlfifo;
>>         }
>>
>>         if ((s->msg_fd.fd = open(s->msg_path, O_RDWR)) < 0) {
>>                 RPRINTF("error opening message FIFO %s: %d\n",
> s->msg_path, errno);
>> -               goto err_openctlfifo;
>> +               goto err_openmsgfifo;
>>         }
>>
>>         RPRINTF("control FIFO %s\n", s->ctl_path);
>> @@ -1581,36 +1587,45 @@ static int ctl_open(td_driver_t *driver, const
> char* name)
>>
>>         return 0;
>>
>> - err_openctlfifo:
>> +err_openmsgfifo:
>>         close(s->ctl_fd.fd);
>> - err_msgfifo:
>> +       s->ctl_fd.fd = -1;
>> +err_openctlfifo:
>> +       unlink(s->ctl_path);
>> +err_mkmsgfifo:
>> +       unlink(s->msg_path);
>> +err_mkctlfifo:
>>         free(s->msg_path);
>>         s->msg_path = NULL;
>> - err_ctlfifo:
>> +err_setmsgfifo:
>>         free(s->ctl_path);
>>         s->ctl_path = NULL;
>>         return -1;
>>  }
>>
>> -static void ctl_close(td_driver_t *driver)
>> +static void ctl_close(struct tdremus_state *s)
>>  {
>> -       struct tdremus_state *s = (struct tdremus_state *)driver->data;
>> -
>> -       /* TODO: close *all* connections */
>> -
>> -       if(s->ctl_fd.fd)
>> +       if(s->ctl_fd.fd) {
>>                 close(s->ctl_fd.fd);
>> +               s->ctl_fd.fd = -1;
>> +       }
>>
>>         if (s->ctl_path) {
>>                 unlink(s->ctl_path);
>>                 free(s->ctl_path);
>>                 s->ctl_path = NULL;
>>         }
>> +
>>         if (s->msg_path) {
>>                 unlink(s->msg_path);
>>                 free(s->msg_path);
>>                 s->msg_path = NULL;
>>         }
>> +
>> +       if (s->msg_fd.fd) {
>> +               close(s->msg_fd.fd);
>> +               s->msg_fd.fd = -1;
>> +       }
>>  }
>>
>>  static int ctl_register(struct tdremus_state *s)
>> @@ -1628,6 +1643,16 @@ static int ctl_register(struct tdremus_state *s)
>>         return 0;
>>  }
>>
>> +static void ctl_unregister(struct tdremus_state *s)
>> +{
>> +       RPRINTF("unregistering ctl fifo\n");
>> +
>> +       if (s->ctl_fd.id >= 0) {
>> +               tapdisk_server_unregister_event(s->ctl_fd.id);
>> +               s->ctl_fd.id = -1;
>> +       }
>> +}
>> +
>>  /* interface */
>>
>>  static int tdremus_open(td_driver_t *driver, td_image_t *image,
> td_uuid_t uuid)
>> @@ -1658,13 +1683,12 @@ static int tdremus_open(td_driver_t *driver,
> td_image_t *image, td_uuid_t uuid)
>>
>>         if ((rc = ctl_open(driver, name))) {
>>                 RPRINTF("error setting up control channel\n");
>> -               free(s->driver_data);
>>                 return rc;
>>         }
>>
>>         if ((rc = ctl_register(s))) {
>>                 RPRINTF("error registering control channel\n");
>> -               free(s->driver_data);
>> +               ctl_close(s);
>>                 return rc;
>>         }
>>
>> @@ -1687,19 +1711,11 @@ static int tdremus_close(td_driver_t *driver)
>>         RPRINTF("closing\n");
>>         if (s->ramdisk.inprogress)
>>                 hashtable_destroy(s->ramdisk.inprogress, 0);
>> -
>> -       if (s->driver_data) {
>> -               free(s->driver_data);
>> -               s->driver_data = NULL;
>> -       }
>> -       if (s->server_fd.fd >= 0) {
>> -               close(s->server_fd.fd);
>> -               s->server_fd.fd = -1;
>> -       }
>> -       if (s->stream_fd.fd >= 0)
>> -               close_stream_fd(s);
>>
>> -       ctl_close(driver);
>> +       close_server_fd(s);
>> +       close_stream_fd(s);
>> +       ctl_unregister(s);
>> +       ctl_close(s);
>>
>>         return 0;
>>  }
>> --
>> 1.9.3
>>
>>
>>
> 
> Acked-by: Shriram Rajagopalan <rshriram@cs.ubc.ca>

Hmm, you have already acked patches 1-4...

Thanks
Wen Congyang
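
A minimal sketch of the pattern this patch applies to each fd/event pair,
under stated assumptions: poll_fd_sketch_t, unregister_event_stub() and
close_poll_fd() are hypothetical names, and the stub only counts calls where
the real code would call tapdisk_server_unregister_event(). The helper
unregisters the event before closing the descriptor and resets both fields to
-1, so calling it twice is harmless. It tests "fd >= 0" rather than "if (fd)",
because -1 is non-zero and 0 is a valid descriptor.

```c
#include <unistd.h>

/* Hypothetical mirror of poll_fd_t: a descriptor plus its event id. */
typedef struct poll_fd_sketch {
    int fd;
    int id;
} poll_fd_sketch_t;

/* Counts calls; stands in for the real unregister function. */
static int unregistered_events;

static void unregister_event_stub(int id)
{
    (void)id;
    unregistered_events++;
}

/* Unregister first so the scheduler never polls a closed descriptor,
 * then close, and mark both fields invalid so a second call is a no-op. */
static void close_poll_fd(poll_fd_sketch_t *p)
{
    if (p->id >= 0) {
        unregister_event_stub(p->id);
        p->id = -1;
    }
    if (p->fd >= 0) {
        close(p->fd);
        p->fd = -1;
    }
}
```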

> _______________________________________________
>> Xen-devel mailing list
>> Xen-devel@lists.xen.org
>> http://lists.xen.org/xen-devel
> 

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 17/17] HACK: libxl/remus: setup and control disk replication for blktap2 backends
  2014-10-20  3:00   ` Shriram Rajagopalan
@ 2014-10-20  3:09     ` Wen Congyang
  0 siblings, 0 replies; 50+ messages in thread
From: Wen Congyang @ 2014-10-20  3:09 UTC (permalink / raw)
  To: rshriram
  Cc: Lai Jiangshan, Ian Jackson, Jiang Yunhong, Dong Eddie, xen devel,
	Yang Hongyang, Ian Campbell

On 10/20/2014 11:00 AM, Shriram Rajagopalan wrote:
> On Oct 13, 2014 10:15 PM, "Wen Congyang" <wency@cn.fujitsu.com> wrote:
>>
>> Just for testing
> 
> What do you mean? You would like these to be reviewed but not committed?

No, you can apply this patch and use remus+blktap2 to test this patchset.

How to support blktap2 in libxl is still under discussion, and this patch
contains some hacky code...

> 
>>
>> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
>> ---
>>  tools/libxl/Makefile                  |   2 +-
>>  tools/libxl/libxl_create.c            |   8 ++
>>  tools/libxl/libxl_internal.h          |   2 +
>>  tools/libxl/libxl_remus_device.c      |   6 +
>>  tools/libxl/libxl_remus_disk_blktap.c | 209
> ++++++++++++++++++++++++++++++++++
>>  5 files changed, 226 insertions(+), 1 deletion(-)
>>  create mode 100644 tools/libxl/libxl_remus_disk_blktap.c
>>
>> diff --git a/tools/libxl/Makefile b/tools/libxl/Makefile
>> index 0bf666f..b58c2ff 100644
>> --- a/tools/libxl/Makefile
>> +++ b/tools/libxl/Makefile
>> @@ -56,7 +56,7 @@ else
>>  LIBXL_OBJS-y += libxl_nonetbuffer.o
>>  endif
>>
>> -LIBXL_OBJS-y += libxl_remus_device.o libxl_remus_disk_drbd.o
>> +LIBXL_OBJS-y += libxl_remus_device.o libxl_remus_disk_drbd.o
> libxl_remus_disk_blktap.o
>>
>>  LIBXL_OBJS-$(CONFIG_X86) += libxl_cpuid.o libxl_x86.o
>>  LIBXL_OBJS-$(CONFIG_ARM) += libxl_nocpuid.o libxl_arm.o
>> diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c
>> index 8b82584..e634694 100644
>> --- a/tools/libxl/libxl_create.c
>> +++ b/tools/libxl/libxl_create.c
>> @@ -853,6 +853,14 @@ static void initiate_domain_create(libxl__egc *egc,
>>      for (i = 0; i < d_config->num_disks; i++) {
>>          ret = libxl__device_disk_setdefault(gc, &d_config->disks[i]);
>>          if (ret) goto error_out;
>> +
>> +        /* TODO: cleanup it when destroying the domain */
>> +        if (d_config->disks[i].backend == LIBXL_DISK_BACKEND_TAP &&
>> +            d_config->disks[i].filter)
>> +            libxl__blktap_devpath(gc, d_config->disks[i].pdev_path,
>> +                                  d_config->disks[i].format,
>> +                                  d_config->disks[i].filter,
>> +                                  d_config->disks[i].filter_params);
>>      }

This code is not very clean...


>>
>>      dcs->bl.ao = ao;
>> diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
>> index 282b03f..a7c2334 100644
>> --- a/tools/libxl/libxl_internal.h
>> +++ b/tools/libxl/libxl_internal.h
>> @@ -2672,6 +2672,8 @@ int init_subkind_nic(libxl__remus_devices_state
> *rds);
>>  void cleanup_subkind_nic(libxl__remus_devices_state *rds);
>>  int init_subkind_drbd_disk(libxl__remus_devices_state *rds);
>>  void cleanup_subkind_drbd_disk(libxl__remus_devices_state *rds);
>> +int init_subkind_blktap_disk(libxl__remus_devices_state *rds);
>> +void cleanup_subkind_blktap_disk(libxl__remus_devices_state *rds);
>>
>>  typedef void libxl__remus_callback(libxl__egc *,
>>                                     libxl__remus_devices_state *, int rc);
>> diff --git a/tools/libxl/libxl_remus_device.c
> b/tools/libxl/libxl_remus_device.c
>> index a6cb7f6..ef272ac 100644
>> --- a/tools/libxl/libxl_remus_device.c
>> +++ b/tools/libxl/libxl_remus_device.c
>> @@ -19,9 +19,11 @@
>>
>>  extern const libxl__remus_device_instance_ops remus_device_nic;
>>  extern const libxl__remus_device_instance_ops remus_device_drbd_disk;
>> +extern const libxl__remus_device_instance_ops remus_device_blktap2_disk;
>>  static const libxl__remus_device_instance_ops *remus_ops[] = {
>>      &remus_device_nic,
>>      &remus_device_drbd_disk,
>> +    &remus_device_blktap2_disk,
>>      NULL,
>>  };
>>
>> @@ -41,6 +43,9 @@ static int
> init_device_subkind(libxl__remus_devices_state *rds)
>>      rc = init_subkind_drbd_disk(rds);
>>      if (rc) goto out;
>>
>> +    rc = init_subkind_blktap_disk(rds);
>> +    if (rc) goto out;
>> +
>>      rc = 0;
>>  out:
>>      return rc;
>> @@ -55,6 +60,7 @@ static void
> cleanup_device_subkind(libxl__remus_devices_state *rds)
>>          cleanup_subkind_nic(rds);
>>
>>      cleanup_subkind_drbd_disk(rds);
>> +    cleanup_subkind_blktap_disk(rds);
>>  }
>>
>>  /*----- setup() and teardown() -----*/
>> diff --git a/tools/libxl/libxl_remus_disk_blktap.c
> b/tools/libxl/libxl_remus_disk_blktap.c
>> new file mode 100644
>> index 0000000..3ae77d6
>> --- /dev/null
>> +++ b/tools/libxl/libxl_remus_disk_blktap.c
>> @@ -0,0 +1,209 @@
>> +/*
>> + * Copyright (C) 2014 FUJITSU LIMITED
>> + * Author Wen Congyang <wency@cn.fujitsu.com>
>> + *
>> + * This program is free software; you can redistribute it and/or modify
>> + * it under the terms of the GNU Lesser General Public License as
> published
>> + * by the Free Software Foundation; version 2.1 only. with the special
>> + * exception on linking described in file LICENSE.
>> + *
>> + * This program is distributed in the hope that it will be useful,
>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>> + * GNU Lesser General Public License for more details.
>> + */
>> +
>> +#include "libxl_osdeps.h" /* must come before any other headers */
>> +
>> +#include "libxl_internal.h"
>> +
>> +#include <string.h>
>> +#include <sys/un.h>
>> +
>> +#define     BLKTAP2_REQUEST     "flush"
>> +#define     BLKTAP2_RESPONSE    "done"
>> +#define     BLKTAP_CTRL_DIR     "/var/run/tap"
>> +
>> +typedef struct libxl__remus_blktap2_disk {
>> +    char *name;
>> +    char *ctl_fifo_path;
>> +    char *msg_fifo_path;
>> +    int ctl_fd;
>> +    int msg_fd;
>> +    libxl__ev_fd ev;
>> +    libxl__remus_device *dev;
>> +}libxl__remus_blktap2_disk;
>> +
>> +int init_subkind_blktap_disk(libxl__remus_devices_state *rds)
>> +{
>> +    return 0;
>> +}
>> +
>> +void cleanup_subkind_blktap_disk(libxl__remus_devices_state *rds)
>> +{
>> +    return;
>> +}
>> +/* ========== setup() and teardown() ========== */
>> +static void blktap2_remus_setup(libxl__egc *egc, libxl__remus_device
> *dev)
>> +{
>> +    const libxl_device_disk *disk = dev->backend_dev;
>> +    libxl__remus_blktap2_disk *blktap2_disk;
>> +    int rc;
>> +    int i, l;
>> +
>> +    STATE_AO_GC(dev->rds->ao);
>> +
>> +    if (disk->backend != LIBXL_DISK_BACKEND_TAP ||
>> +        !disk->filter ||
>> +        strcmp(disk->filter, "remus")) {
>> +        rc = ERROR_REMUS_DEVOPS_DOES_NOT_MATCH;
>> +        goto out;
>> +    }
>> +
>> +    dev->matched = 1;
>> +    GCNEW(blktap2_disk);
>> +    dev->concrete_data = blktap2_disk;
>> +    blktap2_disk->ctl_fd = -1;
>> +    blktap2_disk->msg_fd = -1;
>> +    blktap2_disk->dev = dev;
>> +
>> +    blktap2_disk->name = libxl__strdup(gc, disk->filter_params);
>> +    blktap2_disk->ctl_fifo_path = GCSPRINTF("%s/remus_%s",
>> +                                            BLKTAP_CTRL_DIR,
>> +                                            blktap2_disk->name);
>> +    /* scrub fifo pathname */
>> +    l = strlen(blktap2_disk->ctl_fifo_path);
>> +    for (i = strlen(BLKTAP_CTRL_DIR) + 1; i < l; i++) {
>> +        if (strchr(":/", blktap2_disk->ctl_fifo_path[i]))
>> +            blktap2_disk->ctl_fifo_path[i] = '_';
>> +    }
>> +    blktap2_disk->msg_fifo_path = GCSPRINTF("%s.msg",
>> +                                            blktap2_disk->ctl_fifo_path);
>> +
>> +    blktap2_disk->ctl_fd = open(blktap2_disk->ctl_fifo_path, O_WRONLY);
>> +    blktap2_disk->msg_fd = open(blktap2_disk->msg_fifo_path, O_RDONLY);
>> +    if (blktap2_disk->ctl_fd < 0 || blktap2_disk->msg_fd < 0) {
>> +        rc = ERROR_FAIL;
>> +        goto out;
>> +    }
>> +
>> +    libxl__ev_fd_init(&blktap2_disk->ev);
>> +
>> +    rc = 0;
>> +
>> +out:
>> +    dev->aodev.rc = rc;
>> +    dev->aodev.callback(egc, &dev->aodev);
>> +}
>> +
>> +static void blktap2_remus_teardown(libxl__egc *egc,
>> +                                   libxl__remus_device *dev)
>> +{
>> +    libxl__remus_blktap2_disk *blktap2_disk = dev->concrete_data;
>> +
>> +    if (blktap2_disk->ctl_fd > 0) {
>> +        close(blktap2_disk->ctl_fd);
>> +        blktap2_disk->ctl_fd = -1;
>> +    }
>> +
>> +    if (blktap2_disk->msg_fd > 0) {
>> +        close(blktap2_disk->msg_fd);
>> +        blktap2_disk->msg_fd = -1;
>> +    }
>> +
>> +    dev->aodev.rc = 0;
>> +    dev->aodev.callback(egc, &dev->aodev);
>> +}
>> +
>> +/* ========== checkpointing APIs ========== */
>> +/*
>> + * When a new checkpoint is triggered, we do the following thing:
>> + *  1. send BLKTAP2_REQUEST to tapdisk2
>> + *  2. tapdisk2 send "creq"
>> + *  3. secondary vm's tapdisk2 reply "done"
>> + *  4. tapdisk2 writes BLKTAP2_RESPONSE to the socket
>> + *  5. read BLKTAP2_RESPONSE from the socket
>> + * Step1 and 5 are implemented here.
>> + */
>> +static void blktap2_control_readable(libxl__egc *egc, libxl__ev_fd *ev,
>> +                                     int fd, short events, short
> revents);
>> +
>> +static void blktap2_remus_postsuspend(libxl__egc *egc,
>> +                                      libxl__remus_device *dev)
>> +{
>> +    int ret;
>> +    libxl__remus_blktap2_disk *blktap2_disk = dev->concrete_data;
>> +    int rc = 0;
>> +
>> +    /* fifo fd, and not block */
>> +    ret = write(blktap2_disk->ctl_fd, BLKTAP2_REQUEST,
> strlen(BLKTAP2_REQUEST));
>> +    if (ret < strlen(BLKTAP2_REQUEST))
>> +        rc = ERROR_FAIL;
>> +
>> +    dev->aodev.rc = rc;
>> +    dev->aodev.callback(egc, &dev->aodev);
>> +}
>> +
>> +static void blktap2_remus_commit(libxl__egc *egc,
>> +                                 libxl__remus_device *dev)
>> +{
>> +    libxl__remus_blktap2_disk *blktap2_disk = dev->concrete_data;
>> +    int rc;
>> +
>> +    /* Convenience aliases */
>> +    const int fd = blktap2_disk->msg_fd;
>> +    libxl__ev_fd *const ev = &blktap2_disk->ev;
>> +
>> +    STATE_AO_GC(dev->rds->ao);
>> +
>> +    rc = libxl__ev_fd_register(gc, ev, blktap2_control_readable, fd,
> POLLIN);
>> +    if (rc) {
>> +        dev->aodev.rc = rc;
>> +        dev->aodev.callback(egc, &dev->aodev);
>> +    }
>> +}
>> +
>> +static void blktap2_control_readable(libxl__egc *egc, libxl__ev_fd *ev,
>> +                                     int fd, short events, short revents)
>> +{
>> +    libxl__remus_blktap2_disk *blktap2_disk =
>> +                CONTAINER_OF(ev, *blktap2_disk, ev);
>> +    int rc = 0, ret;
>> +    char response[5];
>> +
>> +    /* Convenience aliases */
>> +    libxl__remus_device *const dev = blktap2_disk->dev;
>> +
>> +    EGC_GC;
>> +
>> +    libxl__ev_fd_deregister(gc, ev);
>> +
>> +    if (revents & ~POLLIN) {
>> +        LOG(ERROR, "unexpected poll event 0x%x (should be POLLIN)",
> revents);
>> +        rc = ERROR_FAIL;
>> +        goto out;
>> +    }
>> +
>> +    ret = read(fd, response, sizeof(response) - 1);
>> +    if (ret < sizeof(response) - 1) {
>> +        rc = ERROR_FAIL;
>> +        goto out;
>> +    }
>> +
>> +    response[4] = '\0';
>> +    if (strcmp(response, BLKTAP2_RESPONSE))
>> +        rc = ERROR_FAIL;
>> +
>> +out:
>> +    dev->aodev.rc = rc;
>> +    dev->aodev.callback(egc, &dev->aodev);
>> +}
>> +
>> +
>> +const libxl__remus_device_instance_ops remus_device_blktap2_disk = {
>> +    .kind = LIBXL__DEVICE_KIND_VBD,
>> +    .setup = blktap2_remus_setup,
>> +    .teardown = blktap2_remus_teardown,
>> +    .postsuspend = blktap2_remus_postsuspend,
>> +    .commit = blktap2_remus_commit,
>> +};
>> --
>> 1.9.3
>>
>>
>> _______________________________________________
>> Xen-devel mailing list
>> Xen-devel@lists.xen.org
>> http://lists.xen.org/xen-devel
> 
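
Steps 1 and 5 of the checkpoint sequence described in the comment above can
be sketched as follows. This is only an illustration under stated
assumptions: send_flush() and read_done() are hypothetical helpers, and any
pair of descriptors (for example pipes) can stand in for the control and
message FIFOs. The sketch loops on short reads and compares lengths in a
signed type, since in C an int of -1 compared against a size_t is converted
to a large unsigned value and the error would go unnoticed.

```c
#include <string.h>
#include <unistd.h>

#define SKETCH_REQUEST  "flush"   /* same text as BLKTAP2_REQUEST */
#define SKETCH_RESPONSE "done"    /* same text as BLKTAP2_RESPONSE */

/* Step 1: ask tapdisk2 to flush by writing the request to the control fd. */
static int send_flush(int ctl_fd)
{
    size_t  len = strlen(SKETCH_REQUEST);
    ssize_t n   = write(ctl_fd, SKETCH_REQUEST, len);

    return (n == (ssize_t)len) ? 0 : -1;
}

/* Step 5: read exactly strlen("done") bytes from the message fd and
 * check that they spell the expected response. */
static int read_done(int msg_fd)
{
    char   buf[sizeof(SKETCH_RESPONSE)];
    size_t want = sizeof(SKETCH_RESPONSE) - 1;
    size_t got  = 0;

    while (got < want) {
        ssize_t n = read(msg_fd, buf + got, want - got);
        if (n <= 0)
            return -1;            /* error or unexpected end of stream */
        got += (size_t)n;
    }
    buf[want] = '\0';

    return strcmp(buf, SKETCH_RESPONSE) == 0 ? 0 : -1;
}
```

In libxl the read side would still be driven from the fd event callback; the
blocking loop here just keeps the example short.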

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 13/17] tools: block-remus: connect to backup asynchronously
  2014-10-20  3:00     ` Wen Congyang
@ 2014-10-20  3:11       ` Shriram Rajagopalan
  0 siblings, 0 replies; 50+ messages in thread
From: Shriram Rajagopalan @ 2014-10-20  3:11 UTC (permalink / raw)
  To: Wen Congyang
  Cc: Lai Jiangshan, Dong Eddie, Jiang Yunhong, Ian Jackson, xen devel,
	FNST-Yang Hongyang, Ian Campbell


[-- Attachment #1.1: Type: text/plain, Size: 33589 bytes --]

On Oct 19, 2014 10:59 PM, "Wen Congyang" <wency@cn.fujitsu.com> wrote:
>
> On 10/20/2014 10:50 AM, Shriram Rajagopalan wrote:
> > On Oct 13, 2014 10:13 PM, "Wen Congyang" <wency@cn.fujitsu.com> wrote:
> >>
> >> Use the API to connect to backup asynchronously.
> >> Before the connection is established, we queue
> >> all I/O requests, and handle them when the connection
> >> is established.
> >>
> >> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> >> Cc: Shriram Rajagopalan <rshriram@cs.ubc.ca>
> >> ---
> >>  tools/blktap2/drivers/block-remus.c       | 508
> > +++++++++++++-----------------
> >>  tools/blktap2/drivers/block-replication.h |   1 +
> >>  2 files changed, 221 insertions(+), 288 deletions(-)
> >>
> >> diff --git a/tools/blktap2/drivers/block-remus.c
> > b/tools/blktap2/drivers/block-remus.c
> >> index e5ad782..a2b9f62 100644
> >> --- a/tools/blktap2/drivers/block-remus.c
> >> +++ b/tools/blktap2/drivers/block-remus.c
> >> @@ -40,6 +40,7 @@
> >>  #include "hashtable.h"
> >>  #include "hashtable_itr.h"
> >>  #include "hashtable_utility.h"
> >> +#include "block-replication.h"
> >>
> >>  #include <errno.h>
> >>  #include <inttypes.h>
> >> @@ -49,10 +50,7 @@
> >>  #include <string.h>
> >>  #include <sys/time.h>
> >>  #include <sys/types.h>
> >> -#include <sys/socket.h>
> >> -#include <netdb.h>
> >>  #include <netinet/in.h>
> >> -#include <arpa/inet.h>
> >>  #include <sys/param.h>
> >>  #include <sys/sysctl.h>
> >>  #include <unistd.h>
> >> @@ -63,10 +61,12 @@
> >>  #define RAMDISK_HASHSIZE 128
> >>
> >>  /* connect retry timeout (seconds) */
> >> -#define REMUS_CONNRETRY_TIMEOUT 10
> >> +#define REMUS_CONNRETRY_TIMEOUT 1
> >>
> >>  #define RPRINTF(_f, _a...) syslog (LOG_DEBUG, "remus: " _f, ## _a)
> >>
> >> +#define MAX_REMUS_REQUESTS      TAPDISK_DATA_REQUESTS
> >> +
> >>  enum tdremus_mode {
> >>         mode_invalid = 0,
> >>         mode_unprotected,
> >> @@ -75,16 +75,14 @@ enum tdremus_mode {
> >>  };
> >>
> >>  struct tdremus_req {
> >> -       uint64_t sector;
> >> -       int nb_sectors;
> >> -       char buf[4096];
> >> +       td_request_t treq;
> >>  };
> >>
> >>  struct req_ring {
> >>         /* waste one slot to distinguish between empty and full */
> >> -       struct tdremus_req requests[MAX_REQUESTS * 2 + 1];
> >> -       unsigned int head;
> >> -       unsigned int tail;
> >> +       struct tdremus_req pending_requests[MAX_REMUS_REQUESTS + 1];
> >> +       unsigned int prod;
> >> +       unsigned int cons;
> >>  };
> >>
> >>  /* TODO: This isn't very pretty, but to properly generate our own treqs (needed
> >> @@ -161,13 +159,14 @@ struct tdremus_state {
> >>         char*     msg_path; /* output completion message here */
> >>         poll_fd_t msg_fd;
> >>
> >> -  /* replication host */
> >> -       struct sockaddr_in sa;
> >> -       poll_fd_t server_fd;    /* server listen port */
> >> +       td_replication_connect_t t;
> >>         poll_fd_t stream_fd;     /* replication channel */
> >>
> >> -       /* queue write requests, batch-replicate at submit */
> >> -       struct req_ring write_ring;
> >> +       /*
> >> +        * queue I/O requests, batch-replicate when
> >> +        * the connection is established.
> >> +        */
> >> +       struct req_ring queued_io;
> >>
> >>         /* ramdisk data*/
> >>         struct ramdisk ramdisk;
> >> @@ -206,11 +205,13 @@ static int tdremus_close(td_driver_t *driver);
> >>
> >>  static int switch_mode(td_driver_t *driver, enum tdremus_mode mode);
> >>  static int ctl_respond(struct tdremus_state *s, const char *response);
> >> +static int ctl_register(struct tdremus_state *s);
> >> +static void ctl_unregister(struct tdremus_state *s);
> >>
> >>  /* ring functions */
> >> -static inline unsigned int ring_next(struct req_ring* ring, unsigned int pos)
> >> +static inline unsigned int ring_next(unsigned int pos)
> >>  {
> >> -       if (++pos >= MAX_REQUESTS * 2 + 1)
> >> +       if (++pos >= MAX_REMUS_REQUESTS + 1)
> >>                 return 0;
> >>
> >>         return pos;
> >> @@ -218,13 +219,26 @@ static inline unsigned int ring_next(struct req_ring* ring, unsigned int pos)
> >>
> >>  static inline int ring_isempty(struct req_ring* ring)
> >>  {
> >> -       return ring->head == ring->tail;
> >> +       return ring->cons == ring->prod;
> >>  }
> >>
> >>  static inline int ring_isfull(struct req_ring* ring)
> >>  {
> >> -       return ring_next(ring, ring->tail) == ring->head;
> >> +       return ring_next(ring->prod) == ring->cons;
> >>  }
> >> +
> >> +static void ring_add_request(struct req_ring *ring, const td_request_t *treq)
> >> +{
> >> +       /* If ring is full, it means that tapdisk2 has some bug */
> >> +       if (ring_isfull(ring)) {
> >> +               RPRINTF("OOPS, ring is full\n");
> >> +               exit(1);
> >> +       }
> >> +
> >> +       ring->pending_requests[ring->prod].treq = *treq;
> >> +       ring->prod = ring_next(ring->prod);
> >> +}
> >> +
> >>  /* Prototype declarations */
> >>  static int ramdisk_flush(td_driver_t *driver, struct tdremus_state* s);
> >>
> >> @@ -724,89 +738,113 @@ static int mwrite(int fd, void* buf, size_t len)
> >>         select(fd + 1, NULL, &wfds, NULL, &tv);
> >>  }
> >>
> >> -
> >> -static void inline close_stream_fd(struct tdremus_state *s)
> >> -{
> >> -       if (s->stream_fd.fd < 0)
> >> -               return;
> >> -
> >> -       /* XXX: -2 is magic. replace with macro perhaps? */
> >> -       tapdisk_server_unregister_event(s->stream_fd.id);
> >> -       close(s->stream_fd.fd);
> >> -       s->stream_fd.fd = -2;
> >> -}
> >> -
> >> -static void close_server_fd(struct tdremus_state *s)
> >> -{
> >> -       if (s->server_fd.fd < 0)
> >> -               return;
> >> -
> >> -       tapdisk_server_unregister_event(s->server_fd.id);
> >> -       s->server_fd.id = -1;
> >> -       close(s->stream_fd.fd);
> >> -       s->stream_fd.fd = -1;
> >> -}
> >> -
> >>  /* primary functions */
> >>  static void remus_client_event(event_id_t, char mode, void *private);
> >> +static int primary_forward_request(struct tdremus_state *s,
> >> +                                  const td_request_t *treq);
> >>
> >> -static int primary_blocking_connect(struct tdremus_state *state)
> >> +/*
> >> + * It is called when we cannot connect to backup, or find I/O error when
> >> + * reading/writing.
> >> + */
> >> +static void primary_failed(struct tdremus_state *s, int rc)
> >>  {
> >> -       int fd;
> >> -       int id;
> >> +       td_replication_connect_kill(&s->t);
> >> +       if (rc == ERROR_INTERNAL)
> >> +               RPRINTF("switch to unprotected mode due to internal error");
> >> +       UNREGISTER_EVENT(s->stream_fd.id);
> >> +       switch_mode(s->tdremus_driver, mode_unprotected);
> >> +}
> >> +
> >> +static int remus_handle_queued_io(struct tdremus_state *s)
> >> +{
> >> +       struct req_ring *queued_io = &s->queued_io;
> >> +       unsigned int cons;
> >> +       td_request_t *treq;
> >>         int rc;
> >> -       int flags;
> >>
> >> -       RPRINTF("client connecting to %s:%d...\n", inet_ntoa(state->sa.sin_addr), ntohs(state->sa.sin_port));
> >> +       while (!ring_isempty(queued_io)) {
> >> +               cons = queued_io->cons;
> >> +               treq = &queued_io->pending_requests[cons].treq;
> >>
> >> -       if ((fd = socket(PF_INET, SOCK_STREAM, 0)) < 0) {
> >> -               RPRINTF("could not create client socket: %d\n", errno);
> >> -               return -1;
> >> -       }
> >> -
> >> -       do {
> >> -               if ((rc = connect(fd, (struct sockaddr *)&state->sa,
> >> -                   sizeof(state->sa))) < 0)
> >> -               {
> >> -                       if (errno == ECONNREFUSED) {
> >> -                               RPRINTF("connection refused -- retrying in 1 second\n");
> >> -                               sleep(1);
> >> -                       } else {
> >> -                               RPRINTF("connection failed: %d\n", errno);
> >> -                               close(fd);
> >> -                               return -1;
> >> -                       }
> >> +               if (treq->op == TD_OP_WRITE) {
> >> +                       rc = primary_forward_request(s, treq);
> >> +                       if (rc)
> >> +                               return rc;
> >>                 }
> >> -       } while (rc < 0);
> >>
> >> -       RPRINTF("client connected\n");
> >> -
> >> -       /* make socket nonblocking */
> >> -       if ((flags = fcntl(fd, F_GETFL, 0)) == -1)
> >> -               flags = 0;
> >> -       if (fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1)
> >> -       {
> >> -               RPRINTF("error making socket nonblocking\n");
> >> -               close(fd);
> >> -               return -1;
> >> +               td_forward_request(*treq);
> >> +               queued_io->cons = ring_next(cons);
> >>         }
> >>
> >> -       if((id = tapdisk_server_register_event(SCHEDULER_POLL_READ_FD, fd, 0, remus_client_event, state)) < 0) {
> >> -               RPRINTF("error registering client event handler: %s\n", strerror(id));
> >> -               close(fd);
> >> -               return -1;
> >> -       }
> >> -
> >> -       state->stream_fd.fd = fd;
> >> -       state->stream_fd.id = id;
> >>         return 0;
> >>  }
> >>
> >> -/* on read, just pass request through */
> >> +static void remus_client_established(td_replication_connect_t *t, int rc)
> >> +{
> >> +       struct tdremus_state *s = CONTAINER_OF(t, *s, t);
> >> +       event_id_t id;
> >> +
> >> +       if (rc) {
> >> +               primary_failed(s, rc);
> >> +               return;
> >> +       }
> >> +
> >> +       /* the connect succeeded */
> >> +       id = tapdisk_server_register_event(SCHEDULER_POLL_READ_FD, t->fd,
> >> +                                          0, remus_client_event, s);
> >> +       if(id < 0) {
> >> +               RPRINTF("error registering client event handler: %s\n",
> >> +                       strerror(id));
> >> +               primary_failed(s, ERROR_INTERNAL);
> >> +               return;
> >> +       }
> >> +
> >> +       s->stream_fd.fd = t->fd;
> >> +       s->stream_fd.id = id;
> >> +
> >> +       /* handle the queued requests */
> >> +       rc = remus_handle_queued_io(s);
> >> +       if (rc)
> >> +               primary_failed(s, rc);
> >> +}
> >> +
> >>  static void primary_queue_read(td_driver_t *driver, td_request_t treq)
> >>  {
> >> -       /* just pass read through */
> >> -       td_forward_request(treq);
> >> +       struct tdremus_state *s = (struct tdremus_state *)driver->data;
> >> +       struct req_ring *ring = &s->queued_io;
> >> +
> >> +       if (ring_isempty(ring)) {
> >> +               /* just pass read through */
> >> +               td_forward_request(treq);
> >> +               return;
> >> +       }
> >> +
> >> +       ring_add_request(ring, &treq);
> >> +}
> >> +
> >> +static int primary_forward_request(struct tdremus_state *s,
> >> +                                  const td_request_t *treq)
> >> +{
> >> +       char header[sizeof(uint32_t) + sizeof(uint64_t)];
> >> +       uint32_t *sectors = (uint32_t *)header;
> >> +       uint64_t *sector = (uint64_t *)(header + sizeof(uint32_t));
> >> +       td_driver_t *driver = s->tdremus_driver;
> >> +
> >> +       *sectors = treq->secs;
> >> +       *sector = treq->sec;
> >> +
> >> +       if (mwrite(s->stream_fd.fd, TDREMUS_WRITE, strlen(TDREMUS_WRITE)) < 0)
> >> +               return ERROR_IO;
> >> +
> >> +       if (mwrite(s->stream_fd.fd, header, sizeof(header)) < 0)
> >> +               return ERROR_IO;
> >> +
> >> +       if (mwrite(s->stream_fd.fd, treq->buf,
> >> +           treq->secs * driver->info.sector_size) < 0)
> >> +               return ERROR_IO;
> >> +
> >> +       return 0;
> >>  }
> >>
> >>  /* TODO:
> >> @@ -819,28 +857,28 @@ static void primary_queue_read(td_driver_t *driver, td_request_t treq)
> >>  static void primary_queue_write(td_driver_t *driver, td_request_t treq)
> >>  {
> >>         struct tdremus_state *s = (struct tdremus_state *)driver->data;
> >> -
> >> -       char header[sizeof(uint32_t) + sizeof(uint64_t)];
> >> -       uint32_t *sectors = (uint32_t *)header;
> >> -       uint64_t *sector = (uint64_t *)(header + sizeof(uint32_t));
> >> +       int rc, ret;
> >>
> >>         // RPRINTF("write: stream_fd.fd: %d\n", s->stream_fd.fd);
> >>
> >> -       /* -1 means we haven't connected yet, -2 means the connection was lost */
> >> -       if(s->stream_fd.fd == -1) {
> >> +       ret = td_replication_connect_status(&s->t);
> >> +       if(ret == -1) {
> >>                 RPRINTF("connecting to backup...\n");
> >> -               primary_blocking_connect(s);
> >> +               s->t.callback = remus_client_established;
> >> +               rc = td_replication_client_start(&s->t);
> >> +               if (rc)
> >> +                       goto fail;
> >>         }
> >>
> >> -       *sectors = treq.secs;
> >> -       *sector = treq.sec;
> >> +       /* The connection is not established, just queue the request */
> >> +       if (ret != 1) {
> >> +               ring_add_request(&s->queued_io, &treq);
> >> +               return;
> >> +       }
> >>
> >> -       if (mwrite(s->stream_fd.fd, TDREMUS_WRITE, strlen(TDREMUS_WRITE)) < 0)
> >> -               goto fail;
> >> -       if (mwrite(s->stream_fd.fd, header, sizeof(header)) < 0)
> >> -               goto fail;
> >> -
> >> -       if (mwrite(s->stream_fd.fd, treq.buf, treq.secs * driver->info.sector_size) < 0)
> >> +       /* The connection is established */
> >> +       rc = primary_forward_request(s, &treq);
> >> +       if (rc)
> >>                 goto fail;
> >>
> >>         td_forward_request(treq);
> >> @@ -850,7 +888,7 @@ static void primary_queue_write(td_driver_t *driver, td_request_t treq)
> >>   fail:
> >>         /* switch to unprotected mode and tell tapdisk to retry */
> >>         RPRINTF("write request replication failed, switching to unprotected mode");
> >> -       switch_mode(s->tdremus_driver, mode_unprotected);
> >> +       primary_failed(s, rc);
> >>         td_complete_request(treq, -EBUSY);
> >>  }
> >>
> >> @@ -867,7 +905,7 @@ static int client_flush(td_driver_t *driver)
> >>
> >>         if (mwrite(s->stream_fd.fd, TDREMUS_COMMIT, strlen(TDREMUS_COMMIT)) < 0) {
> >>                 RPRINTF("error flushing output");
> >> -               close_stream_fd(s);
> >> +               primary_failed(s, ERROR_IO);
> >>                 return -1;
> >>         }
> >>
> >> @@ -886,6 +924,26 @@ static int server_flush(td_driver_t *driver)
> >>         return ramdisk_flush(driver, s);
> >>  }
> >>
> >> +/* It is called when switching the mode from primary to unprotected */
> >> +static int primary_flush(td_driver_t *driver)
> >> +{
> >> +       struct tdremus_state *s = driver->data;
> >> +       struct req_ring *ring = &s->queued_io;
> >> +       unsigned int cons;
> >> +
> >> +       if (ring_isempty(ring))
> >> +               return 0;
> >> +
> >> +       while (!ring_isempty(ring)) {
> >> +               cons = ring->cons;
> >> +               ring->cons = ring_next(cons);
> >> +
> >> +               td_forward_request(ring->pending_requests[cons].treq);
> >> +       }
> >> +
> >> +       return client_flush(driver);
> >> +}
> >> +
> >>  static int primary_start(td_driver_t *driver)
> >>  {
> >>         struct tdremus_state *s = (struct tdremus_state *)driver->data;
> >> @@ -894,7 +952,7 @@ static int primary_start(td_driver_t *driver)
> >>
> >>         tapdisk_remus.td_queue_read = primary_queue_read;
> >>         tapdisk_remus.td_queue_write = primary_queue_write;
> >> -       s->queue_flush = client_flush;
> >> +       s->queue_flush = primary_flush;
> >>
> >>         s->stream_fd.fd = -1;
> >>         s->stream_fd.id = -1;
> >> @@ -913,7 +971,7 @@ static void remus_client_event(event_id_t id, char mode, void *private)
> >>         if (mread(s->stream_fd.fd, req, sizeof(req) - 1) < 0) {
> >>                 /* replication stream closed or otherwise broken (timeout, reset, &c) */
> >>                 RPRINTF("error reading from backup\n");
> >> -               close_stream_fd(s);
> >> +               primary_failed(s, ERROR_IO);
> >>                 return;
> >>         }
> >>
> >> @@ -924,7 +982,7 @@ static void remus_client_event(event_id_t id, char mode, void *private)
> >>                 ctl_respond(s, TDREMUS_DONE);
> >>         else {
> >>                 RPRINTF("received unknown message: %s\n", req);
> >> -               close_stream_fd(s);
> >> +               primary_failed(s, ERROR_IO);
> >>         }
> >>
> >>         return;
> >> @@ -933,84 +991,36 @@ static void remus_client_event(event_id_t id, char mode, void *private)
> >>  /* backup functions */
> >>  static void remus_server_event(event_id_t id, char mode, void *private);
> >>
> >> -/* returns the socket that receives write requests */
> >> -static void remus_server_accept(event_id_t id, char mode, void* private)
> >> +/* It is called when we find some I/O error */
> >> +static void backup_failed(struct tdremus_state *s, int rc)
> >>  {
> >> -       struct tdremus_state* s = (struct tdremus_state *) private;
> >> +       UNREGISTER_EVENT(s->stream_fd.id);
> >> +       td_replication_connect_kill(&s->t);
> >> +       /* We will switch to unprotected mode in backup_queue_write() */
> >> +}
> >>
> >> -       int stream_fd;
> >> -       event_id_t cid;
> >> +/* returns the socket that receives write requests */
> >> +static void remus_server_established(td_replication_connect_t *t, int rc)
> >> +{
> >> +       struct tdremus_state *s = CONTAINER_OF(t, *s, t);
> >> +       event_id_t id;
> >>
> >> -       /* XXX: add address-based black/white list */
> >> -       if ((stream_fd = accept(s->server_fd.fd, NULL, NULL)) < 0) {
> >> -               RPRINTF("error accepting connection: %d\n", errno);
> >> -               return;
> >> -       }
> >> -
> >> -       /* TODO: check to see if we are already replicating. if so just close the
> >> -        * connection (or do something smarter) */
> >> -       RPRINTF("server accepted connection\n");
> >> +       /* rc is always 0 */
> >>
> >>         /* add tapdisk event for replication stream */
> >> -       cid = tapdisk_server_register_event(SCHEDULER_POLL_READ_FD, stream_fd, 0,
> >> -                                           remus_server_event, s);
> >> +       id = tapdisk_server_register_event(SCHEDULER_POLL_READ_FD, t->fd, 0,
> >> +                                          remus_server_event, s);
> >>
> >> -       if(cid < 0) {
> >> -               RPRINTF("error registering connection event handler: %s\n", strerror(errno));
> >> -               close(stream_fd);
> >> +       if (id < 0) {
> >> +               RPRINTF("error registering connection event handler: %s\n",
> >> +                       strerror(errno));
> >> +               td_replication_server_restart(t);
> >>                 return;
> >>         }
> >>
> >>         /* store replication file descriptor */
> >> -       s->stream_fd.fd = stream_fd;
> >> -       s->stream_fd.id = cid;
> >> -}
> >> -
> >> -/* returns -2 if EADDRNOTAVAIL */
> >> -static int remus_bind(struct tdremus_state* s)
> >> -{
> >> -//  struct sockaddr_in sa;
> >> -       int opt;
> >> -       int rc = -1;
> >> -
> >> -       if ((s->server_fd.fd = socket(AF_INET, SOCK_STREAM, 0)) < 0) {
> >> -               RPRINTF("could not create server socket: %d\n", errno);
> >> -               return rc;
> >> -       }
> >> -       opt = 1;
> >> -       if (setsockopt(s->server_fd.fd, SOL_SOCKET, SO_REUSEADDR, &opt, sizeof(opt)) < 0)
> >> -               RPRINTF("Error setting REUSEADDR on %d: %d\n", s->server_fd.fd, errno);
> >> -
> >> -       if (bind(s->server_fd.fd, (struct sockaddr *)&s->sa, sizeof(s->sa)) < 0) {
> >> -               RPRINTF("could not bind server socket %d to %s:%d: %d %s\n", s->server_fd.fd,
> >> -                       inet_ntoa(s->sa.sin_addr), ntohs(s->sa.sin_port), errno, strerror(errno));
> >> -               if (errno != EADDRINUSE)
> >> -                       rc = -2;
> >> -               goto err_sfd;
> >> -       }
> >> -       if (listen(s->server_fd.fd, 10)) {
> >> -               RPRINTF("could not listen on socket: %d\n", errno);
> >> -               goto err_sfd;
> >> -       }
> >> -
> >> -       /* The socket s now bound to the address and listening so we may now register
> >> -   * the fd with tapdisk */
> >> -
> >> -       if((s->server_fd.id = tapdisk_server_register_event(SCHEDULER_POLL_READ_FD,
> >> -                                                            s->server_fd.fd, 0,
> >> -                                                            remus_server_accept, s)) < 0) {
> >> -               RPRINTF("error registering server connection event handler: %s",
> >> -                       strerror(s->server_fd.id));
> >> -               goto err_sfd;
> >> -       }
> >> -
> >> -       return 0;
> >> -
> >> - err_sfd:
> >> -       close(s->server_fd.fd);
> >> -       s->server_fd.fd = -1;
> >> -
> >> -       return rc;
> >> +       s->stream_fd.fd = t->fd;
> >> +       s->stream_fd.id = id;
> >>  }
> >>
> >>  /* wait for latest checkpoint to be applied */
> >> @@ -1053,6 +1063,8 @@ void backup_queue_write(td_driver_t *driver, td_request_t treq)
> >>          * handle the write
> >>          */
> >>
> >> +       /* If we have called backup_failed, calling it again is harmless */
> >> +       backup_failed(s, ERROR_INTERNAL);
> >>         switch_mode(driver, mode_unprotected);
> >>         /* TODO: call the appropriate write function rather than return EBUSY */
> >>         td_complete_request(treq, -EBUSY);
> >> @@ -1061,7 +1073,6 @@ void backup_queue_write(td_driver_t *driver, td_request_t treq)
> >>  static int backup_start(td_driver_t *driver)
> >>  {
> >>         struct tdremus_state *s = (struct tdremus_state *)driver->data;
> >> -       int fd;
> >>
> >>         if (ramdisk_start(driver) < 0)
> >>                 return -1;
> >> @@ -1073,12 +1084,12 @@ static int backup_start(td_driver_t *driver)
> >>         return 0;
> >>  }
> >>
> >> -static int server_do_wreq(td_driver_t *driver)
> >> +static void server_do_wreq(td_driver_t *driver)
> >>  {
> >>         struct tdremus_state *s = (struct tdremus_state *)driver->data;
> >>         static tdremus_wire_t twreq;
> >>         char buf[4096];
> >> -       int len, rc;
> >> +       int len, rc = ERROR_IO;
> >>
> >>         char header[sizeof(uint32_t) + sizeof(uint64_t)];
> >>         uint32_t *sectors = (uint32_t *) header;
> >> @@ -1097,28 +1108,28 @@ static int server_do_wreq(td_driver_t *driver)
> >>         if (len > sizeof(buf)) {
> >>                 /* freak out! */
> >>                 RPRINTF("write request too large: %d/%u\n", len, (unsigned)sizeof(buf));
> >> -               return -1;
> >> +               goto err;
> >>         }
> >>
> >>         if (mread(s->stream_fd.fd, buf, len) < 0)
> >>                 goto err;
> >>
> >> -       if (ramdisk_write(&s->ramdisk, *sector, *sectors, buf) < 0)
> >> +       if (ramdisk_write(&s->ramdisk, *sector, *sectors, buf) < 0) {
> >> +               rc = ERROR_INTERNAL;
> >>                 goto err;
> >> +       }
> >>
> >> -       return 0;
> >> +       return;
> >>
> >>   err:
> >>         /* should start failover */
> >>         RPRINTF("backup write request error\n");
> >> -       close_stream_fd(s);
> >> -
> >> -       return -1;
> >> +       backup_failed(s, rc);
> >>  }
> >>
> >>  /* at this point, the server can start applying the most recent
> >>   * ramdisk. */
> >> -static int server_do_creq(td_driver_t *driver)
> >> +static void server_do_creq(td_driver_t *driver)
> >>  {
> >>         struct tdremus_state *s = (struct tdremus_state *)driver->data;
> >>
> >> @@ -1128,9 +1139,7 @@ static int server_do_creq(td_driver_t *driver)
> >>
> >>         /* XXX this message should not be sent until flush completes! */
> >>         if (write(s->stream_fd.fd, TDREMUS_DONE, strlen(TDREMUS_DONE)) != 4)
> >> -               return -1;
> >> -
> >> -       return 0;
> >> +               backup_failed(s, ERROR_IO);
> >>  }
> >>
> >>
> >> @@ -1213,11 +1222,6 @@ static int unprotected_start(td_driver_t *driver)
> >>
> >>         RPRINTF("failure detected, activating passthrough\n");
> >>
> >> -       /* close the server socket */
> >> -       close_stream_fd(s);
> >> -
> >> -       close_server_fd(s);
> >> -
> >>         /* install the unprotected read/write handlers */
> >>         tapdisk_remus.td_queue_read = unprotected_queue_read;
> >>         tapdisk_remus.td_queue_write = unprotected_queue_write;
> >> @@ -1227,90 +1231,6 @@ static int unprotected_start(td_driver_t *driver)
> >>
> >>
> >>  /* control */
> >> -
> >> -static inline int resolve_address(const char* addr, struct in_addr* ia)
> >> -{
> >> -       struct hostent* he;
> >> -       uint32_t ip;
> >> -
> >> -       if (!(he = gethostbyname(addr))) {
> >> -               RPRINTF("error resolving %s: %d\n", addr, h_errno);
> >> -               return -1;
> >> -       }
> >> -
> >> -       if (!he->h_addr_list[0]) {
> >> -               RPRINTF("no address found for %s\n", addr);
> >> -               return -1;
> >> -       }
> >> -
> >> -       /* network byte order */
> >> -       ip = *((uint32_t**)he->h_addr_list)[0];
> >> -       ia->s_addr = ip;
> >> -
> >> -       return 0;
> >> -}
> >> -
> >> -static int get_args(td_driver_t *driver, const char* name)
> >> -{
> >> -       struct tdremus_state *state = (struct tdremus_state *)driver->data;
> >> -       char* host;
> >> -       char* port;
> >> -//  char* driver_str;
> >> -//  char* parent;
> >> -//  int type;
> >> -//  char* path;
> >> -//  unsigned long ulport;
> >> -//  int i;
> >> -//  struct sockaddr_in server_addr_in;
> >> -
> >> -       int gai_status;
> >> -       int valid_addr;
> >> -       struct addrinfo gai_hints;
> >> -       struct addrinfo *servinfo, *servinfo_itr;
> >> -
> >> -       memset(&gai_hints, 0, sizeof gai_hints);
> >> -       gai_hints.ai_family = AF_UNSPEC;
> >> -       gai_hints.ai_socktype = SOCK_STREAM;
> >> -
> >> -       port = strchr(name, ':');
> >> -       if (!port) {
> >> -               RPRINTF("missing host in %s\n", name);
> >> -               return -ENOENT;
> >> -       }
> >> -       if (!(host = strndup(name, port - name))) {
> >> -               RPRINTF("unable to allocate host\n");
> >> -               return -ENOMEM;
> >> -       }
> >> -       port++;
> >> -
> >> -       if ((gai_status = getaddrinfo(host, port, &gai_hints, &servinfo)) != 0) {
> >> -               RPRINTF("getaddrinfo error: %s\n", gai_strerror(gai_status));
> >> -               return -ENOENT;
> >> -       }
> >> -
> >> -       /* TODO: do something smarter here */
> >> -       valid_addr = 0;
> >> -       for(servinfo_itr = servinfo; servinfo_itr != NULL; servinfo_itr = servinfo_itr->ai_next) {
> >> -               void *addr;
> >> -               char *ipver;
> >> -
> >> -               if (servinfo_itr->ai_family == AF_INET) {
> >> -                       valid_addr = 1;
> >> -                       memset(&state->sa, 0, sizeof(state->sa));
> >> -                       state->sa = *(struct sockaddr_in *)servinfo_itr->ai_addr;
> >> -                       break;
> >> -               }
> >> -       }
> >> -       freeaddrinfo(servinfo);
> >> -
> >> -       if (!valid_addr)
> >> -               return -ENOENT;
> >> -
> >> -       RPRINTF("host: %s, port: %d\n", inet_ntoa(state->sa.sin_addr), ntohs(state->sa.sin_port));
> >> -
> >> -       return 0;
> >> -}
> >> -
> >>  static int switch_mode(td_driver_t *driver, enum tdremus_mode mode)
> >>  {
> >>         struct tdremus_state *s = (struct tdremus_state *)driver->data;
> >> @@ -1343,6 +1263,20 @@ static int switch_mode(td_driver_t *driver, enum tdremus_mode mode)
> >>         return rc;
> >>  }
> >>
> >> +static void ctl_reopen(struct tdremus_state *s)
> >> +{
> >> +       ctl_unregister(s);
> >> +       CLOSE_FD(s->ctl_fd.fd);
> >> +       RPRINTF("FIFO closed\n");
> >> +
> >> +       if ((s->ctl_fd.fd = open(s->ctl_path, O_RDWR)) < 0) {
> >> +               RPRINTF("error reopening FIFO: %d\n", errno);
> >> +               return;
> >> +       }
> >> +
> >> +       ctl_register(s);
> >> +}
> >> +
> >>  static void ctl_request(event_id_t id, char mode, void *private)
> >>  {
> >>         struct tdremus_state *s = (struct tdremus_state *)private;
> >> @@ -1355,11 +1289,7 @@ static void ctl_request(event_id_t id, char mode, void *private)
> >>         if (!(rc = read(s->ctl_fd.fd, msg, sizeof(msg) - 1 /* append nul */))) {
> >>                 RPRINTF("0-byte read received, reopening FIFO\n");
> >>                 /*TODO: we may have to unregister/re-register with tapdisk_server */
> >> -               close(s->ctl_fd.fd);
> >> -               RPRINTF("FIFO closed\n");
> >> -               if ((s->ctl_fd.fd = open(s->ctl_path, O_RDWR)) < 0) {
> >> -                       RPRINTF("error reopening FIFO: %d\n", errno);
> >> -               }
> >> +               ctl_reopen(s);
> >>                 return;
> >>         }
> >>
> >> @@ -1372,7 +1302,7 @@ static void ctl_request(event_id_t id, char mode, void *private)
> >>         msg[rc] = '\0';
> >>         if (!strncmp(msg, "flush", 5)) {
> >>                 if (s->mode == mode_primary) {
> >> -                       if ((rc = s->queue_flush(driver))) {
> >> +                       if ((rc = client_flush(driver))) {
> >>                                 RPRINTF("error passing flush request to backup");
> >>                                 ctl_respond(s, TDREMUS_FAIL);
> >>                         }
> >> @@ -1521,6 +1451,7 @@ static void ctl_unregister(struct tdremus_state *s)
> >>  static int tdremus_open(td_driver_t *driver, td_image_t *image, td_uuid_t uuid)
> >>  {
> >>         struct tdremus_state *s = (struct tdremus_state *)driver->data;
> >> +       td_replication_connect_t *t = &s->t;
> >>         int rc;
> >>         const char *name = image->name;
> >>         td_flag_t flags = image->flags;
> >> @@ -1531,7 +1462,6 @@ static int tdremus_open(td_driver_t *driver, td_image_t *image, td_uuid_t uuid)
> >>         remus_image = image;
> >>
> >>         memset(s, 0, sizeof(*s));
> >> -       s->server_fd.fd = -1;
> >>         s->stream_fd.fd = -1;
> >>         s->ctl_fd.fd = -1;
> >>         s->msg_fd.fd = -1;
> >> @@ -1540,8 +1470,11 @@ static int tdremus_open(td_driver_t *driver, td_image_t *image, td_uuid_t uuid)
> >>          * the driver stack from the stream_fd event handler */
> >>         s->tdremus_driver = driver;
> >>
> >> -       /* parse name to get info etc */
> >> -       if ((rc = get_args(driver, name)))
> >> +       t->log_prefix = "remus";
> >> +       t->retry_timeout_s = REMUS_CONNRETRY_TIMEOUT;
> >> +       t->max_connections = 10;
> >> +       t->callback = remus_server_established;
> >> +       if ((rc = td_replication_connect_init(t, name)))
> >>                 return rc;
> >>
> >>         if ((rc = ctl_open(driver, name))) {
> >> @@ -1555,7 +1488,7 @@ static int tdremus_open(td_driver_t *driver, td_image_t *image, td_uuid_t uuid)
> >>                 return rc;
> >>         }
> >>
> >> -       if (!(rc = remus_bind(s)))
> >> +       if (!(rc = td_replication_server_start(t)))
> >>                 rc = switch_mode(driver, mode_backup);
> >>         else if (rc == -2)
> >>                 rc = switch_mode(driver, mode_primary);
> >> @@ -1575,8 +1508,7 @@ static int tdremus_close(td_driver_t *driver)
> >>         if (s->ramdisk.inprogress)
> >>                 hashtable_destroy(s->ramdisk.inprogress, 0);
> >>
> >> -       close_server_fd(s);
> >> -       close_stream_fd(s);
> >> +       td_replication_connect_kill(&s->t);
> >>         ctl_unregister(s);
> >>         ctl_close(s);
> >>
> >> diff --git a/tools/blktap2/drivers/block-replication.h b/tools/blktap2/drivers/block-replication.h
> >> index 9e051cc..07fd630 100644
> >> --- a/tools/blktap2/drivers/block-replication.h
> >> +++ b/tools/blktap2/drivers/block-replication.h
> >> @@ -48,6 +48,7 @@
> >>  enum {
> >>         ERROR_INTERNAL = -1,
> >>         ERROR_CONNECTION = -2,
> >> +       ERROR_IO = -3,
> >>  };
> >>
> >>  typedef struct td_replication_connect td_replication_connect_t;
> >> --
> >> 1.9.3
> >>
> >
> > The code looks ok. Have you tested this with some read/write workload
> > inside the guest? Especially read-after-write style sanity checks, to
> > ensure that there is no data corruption (caused by stale ramdisk data
> > flushed to disk or served to the guest) before a connection to the
> > backup has been established.
>
> Which existing test tool can check this?
> Before the connection to the backup has been established, the guest is
> blocked when the first write operation happens, so you cannot log in
> and run a test program.
>

That is how Remus behaves with current blktap2. I thought this patch was
trying to allow the guest to run normally before starting Remus while
buffering writes in a ramdisk.
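
For reference, the patch queues requests in a fixed-size ring that wastes one
slot to tell empty from full, and drains it once the connection is up. A
minimal standalone sketch of that ring idea follows. The names (ring_t,
RING_SLOTS, ring_push, ring_pop) and the int payload are hypothetical
stand-ins for illustration, not the blktap2 types:

```c
#include <assert.h>
#include <stddef.h>

#define RING_SLOTS 4   /* usable capacity; hypothetical value */

typedef struct {
    int slots[RING_SLOTS + 1];   /* one slot is wasted to tell empty from full */
    unsigned int prod;           /* next slot to write */
    unsigned int cons;           /* next slot to read */
} ring_t;

/* Advance an index by one, wrapping at the end of the array. */
static unsigned int ring_next(unsigned int pos)
{
    return (pos + 1) % (RING_SLOTS + 1);
}

static int ring_isempty(const ring_t *r)
{
    return r->cons == r->prod;
}

static int ring_isfull(const ring_t *r)
{
    return ring_next(r->prod) == r->cons;
}

/* Queue one value; returns 0 on success, -1 if the ring is full. */
static int ring_push(ring_t *r, int value)
{
    assert(r != NULL);
    if (ring_isfull(r))
        return -1;
    r->slots[r->prod] = value;
    r->prod = ring_next(r->prod);
    return 0;
}

/* Dequeue the oldest value into *out; returns 0 on success, -1 if empty. */
static int ring_pop(ring_t *r, int *out)
{
    assert(r != NULL && out != NULL);
    if (ring_isempty(r))
        return -1;
    *out = r->slots[r->cons];
    r->cons = ring_next(r->cons);
    return 0;
}
```

Unlike ring_add_request() in the patch, which exits when the ring is full,
this sketch returns -1 and leaves the decision to the caller.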

> > I am acking this piece in good faith that you have tested all these
> > cases.
>
> Yes. If you apply the hack patch 17, you can run Remus with blktap2.
>
> I have tested it with pgbench. IIRC, in that test I found only one problem:
> select() times out in xc_domain_restore.c.
>

Pgbench is too heavy for this test. You are better off running your own
simple C code that does these basic sanity checks.
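
A minimal sketch of the kind of read-after-write sanity check described above, offered as an illustration and not as the test anyone in this thread actually ran. The function name `read_after_write_check`, the 4096-byte block size, and the pattern scheme are all assumptions made for the example; inside a guest it would be pointed at a scratch file on the replicated disk.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

#define BLOCK_SIZE 4096  /* assumed block size, not taken from the patches */

/* Fill a block with a pattern derived from its index, so that stale or
 * misplaced data shows up as a mismatch. */
static void fill_pattern(unsigned char *buf, uint32_t block)
{
    for (size_t i = 0; i < BLOCK_SIZE; i++)
        buf[i] = (unsigned char)((block * 31u + i) & 0xff);
}

/*
 * Write `nblocks` patterned blocks to `path`, flush them, then read each
 * block back and compare it with the expected pattern.
 * Returns 0 on success, -1 on an I/O error, or (index + 1) of the first
 * block whose contents do not match.
 */
int read_after_write_check(const char *path, uint32_t nblocks)
{
    unsigned char wbuf[BLOCK_SIZE], rbuf[BLOCK_SIZE];
    int fd;

    assert(path != NULL);

    fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    for (uint32_t b = 0; b < nblocks; b++) {
        fill_pattern(wbuf, b);
        if (pwrite(fd, wbuf, BLOCK_SIZE, (off_t)b * BLOCK_SIZE) != BLOCK_SIZE) {
            close(fd);
            return -1;
        }
    }

    /* Push the writes out of the page cache towards the block device. */
    if (fsync(fd) != 0) {
        close(fd);
        return -1;
    }

    for (uint32_t b = 0; b < nblocks; b++) {
        fill_pattern(wbuf, b);
        if (pread(fd, rbuf, BLOCK_SIZE, (off_t)b * BLOCK_SIZE) != BLOCK_SIZE) {
            close(fd);
            return -1;
        }
        if (memcmp(wbuf, rbuf, BLOCK_SIZE) != 0) {
            close(fd);
            return (int)b + 1;
        }
    }

    close(fd);
    return 0;
}
```

As written, the reads may be served from the guest's page cache. To exercise the virtual disk itself, the file would probably need to be reopened with O_DIRECT (with suitably aligned buffers) or the caches dropped between the write and read phases.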

> Thanks
> Wen Congyang
>
> >
> > Acked-by: Shriram Rajagopalan <rshriram@cs.ubc.ca>
> >
>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 00/17] blktap2 related bugfix patches
  2014-10-15  1:05   ` Wen Congyang
  2014-10-19 20:34     ` Shriram Rajagopalan
@ 2014-10-20 14:25     ` George Dunlap
  2014-10-21  2:28       ` Wen Congyang
                         ` (2 more replies)
  1 sibling, 3 replies; 50+ messages in thread
From: George Dunlap @ 2014-10-20 14:25 UTC (permalink / raw)
  To: Wen Congyang
  Cc: Lai Jiangshan, Ian Jackson, Jiang Yunhong, Dong Eddie, xen devel,
	Yang Hongyang, Ian Campbell

On Wed, Oct 15, 2014 at 2:05 AM, Wen Congyang <wency@cn.fujitsu.com> wrote:
> On 10/14/2014 11:48 PM, Ian Jackson wrote:
>> Wen Congyang writes ("[PATCH 00/17] blktap2 related bugfix patches"):
>>> These bugs are found when we implement COLO, or rebase
>>> COLO to upstream xen. They are independent patches, so
>>> post them in separate series.
>>
>> blktap2 is unmaintained AFAICT.
>>
>> In the last year there has been only one commit which shows evidence
>> of someone caring even slightly about tools/blktap2/.  The last
>> substantial attention was in March 2013.
>>
>> (I'm disregarding commits which touch tools/blktap2/ to fix up compile
>> problems with new compilers, sort out build system and file
>> rearrangements, etc.)
>>
>> The file you are touching in your 01/17 was last edited (by anyone, at
>> all) in January 2010.
>>
>> Under the circumstances, we should probably take all these changes
>> without looking for anyone to ack them.
>>
>> Perhaps you would like to become the maintainers of blktap2 ? :-)
>
> Hmm, I don't have any knowledge about disk format, but blktap2 have
> such codes(For example: block-vhd.c, block-qcow.c...). I think I can
> maintain the rest codes.

Congyang, were you aware that XenServer has a fork of blktap that is
actually still under active development and maintainership outside of
the main Xen tree?

git://github.com/xen-org/blktap.git

Both CentOS and Fedora are actually using snapshots of the "blktap2"
branch in that tree for their Xen RPMs.  (I'm sure CentOS is, I
believe Fedora is.)  It's not unlikely that the bugs you're fixing
here have already been fixed in the XenServer fork.

I think we could consider taking these patches for the 4.5 release, as
it's obviously too late to do anything more drastic at this point.
But I think long-term we need to sort out a better solution.  I'll
write up an e-mail here to talk about a longer-term plan shortly...

 -George

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 00/17] blktap2 related bugfix patches
  2014-10-20 14:25     ` George Dunlap
@ 2014-10-21  2:28       ` Wen Congyang
  2014-10-21  2:56       ` Wen Congyang
  2014-10-29  5:49       ` Wen Congyang
  2 siblings, 0 replies; 50+ messages in thread
From: Wen Congyang @ 2014-10-21  2:28 UTC (permalink / raw)
  To: George Dunlap
  Cc: Lai Jiangshan, Ian Jackson, Jiang Yunhong, Dong Eddie, xen devel,
	Yang Hongyang, Ian Campbell

On 10/20/2014 10:25 PM, George Dunlap wrote:
> On Wed, Oct 15, 2014 at 2:05 AM, Wen Congyang <wency@cn.fujitsu.com> wrote:
>> On 10/14/2014 11:48 PM, Ian Jackson wrote:
>>> Wen Congyang writes ("[PATCH 00/17] blktap2 related bugfix patches"):
>>>> These bugs are found when we implement COLO, or rebase
>>>> COLO to upstream xen. They are independent patches, so
>>>> post them in separate series.
>>>
>>> blktap2 is unmaintained AFAICT.
>>>
>>> In the last year there has been only one commit which shows evidence
>>> of someone caring even slightly about tools/blktap2/.  The last
>>> substantial attention was in March 2013.
>>>
>>> (I'm disregarding commits which touch tools/blktap2/ to fix up compile
>>> problems with new compilers, sort out build system and file
>>> rearrangements, etc.)
>>>
>>> The file you are touching in your 01/17 was last edited (by anyone, at
>>> all) in January 2010.
>>>
>>> Under the circumstances, we should probably take all these changes
>>> without looking for anyone to ack them.
>>>
>>> Perhaps you would like to become the maintainers of blktap2 ? :-)
>>
>> Hmm, I don't have any knowledge about disk format, but blktap2 have
>> such codes(For example: block-vhd.c, block-qcow.c...). I think I can
>> maintain the rest codes.
> 
> Congyang, were you aware that XenServer has a fork of blktap that is
> actually still under active development and maintainership outside of
> the main Xen tree?
> 
> git://github.com/xen-org/blktap.git
> 
> Both CentOS and Fedora are actually using snapshots of the "blktap2"
> branch in that tree for their Xen RPMs.  (I'm sure CentOS is, I
> believe Fedora is.)  It's not unlikely that the bugs you're fixing
> here have already been fixed in the XenServer fork.

OK, I will check that tree.

Thanks
Wen Congyang

> 
> I think we could consider taking these patches for the 4.5 release, as
> it's obviously too late to do anything more drastic at this point.
> But I think long-term we need to sort out a better solution.  I'll
> write up an e-mail here to talk about a longer-term plan shortly...
> 
>  -George
> .
> 

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 00/17] blktap2 related bugfix patches
  2014-10-20 14:25     ` George Dunlap
  2014-10-21  2:28       ` Wen Congyang
@ 2014-10-21  2:56       ` Wen Congyang
  2014-10-21  9:55         ` George Dunlap
  2014-10-29  5:49       ` Wen Congyang
  2 siblings, 1 reply; 50+ messages in thread
From: Wen Congyang @ 2014-10-21  2:56 UTC (permalink / raw)
  To: George Dunlap
  Cc: Lai Jiangshan, Ian Jackson, Jiang Yunhong, Dong Eddie, xen devel,
	Yang Hongyang, Ian Campbell

On 10/20/2014 10:25 PM, George Dunlap wrote:
> On Wed, Oct 15, 2014 at 2:05 AM, Wen Congyang <wency@cn.fujitsu.com> wrote:
>> On 10/14/2014 11:48 PM, Ian Jackson wrote:
>>> Wen Congyang writes ("[PATCH 00/17] blktap2 related bugfix patches"):
>>>> These bugs are found when we implement COLO, or rebase
>>>> COLO to upstream xen. They are independent patches, so
>>>> post them in separate series.
>>>
>>> blktap2 is unmaintained AFAICT.
>>>
>>> In the last year there has been only one commit which shows evidence
>>> of someone caring even slightly about tools/blktap2/.  The last
>>> substantial attention was in March 2013.
>>>
>>> (I'm disregarding commits which touch tools/blktap2/ to fix up compile
>>> problems with new compilers, sort out build system and file
>>> rearrangements, etc.)
>>>
>>> The file you are touching in your 01/17 was last edited (by anyone, at
>>> all) in January 2010.
>>>
>>> Under the circumstances, we should probably take all these changes
>>> without looking for anyone to ack them.
>>>
>>> Perhaps you would like to become the maintainers of blktap2 ? :-)
>>
>> Hmm, I don't have any knowledge about disk format, but blktap2 have
>> such codes(For example: block-vhd.c, block-qcow.c...). I think I can
>> maintain the rest codes.
> 
> Congyang, were you aware that XenServer has a fork of blktap that is
> actually still under active development and maintainership outside of
> the main Xen tree?
> 
> git://github.com/xen-org/blktap.git
> 
> Both CentOS and Fedora are actually using snapshots of the "blktap2"
> branch in that tree for their Xen RPMs.  (I'm sure CentOS is, I
> believe Fedora is.)  It's not unlikely that the bugs you're fixing
> here have already been fixed in the XenServer fork.

How do I build upstream Xen with this blktap2? Should I copy the code over
blktap2 in xen.git?

Thanks
Wen Congyang

> 
> I think we could consider taking these patches for the 4.5 release, as
> it's obviously too late to do anything more drastic at this point.
> But I think long-term we need to sort out a better solution.  I'll
> write up an e-mail here to talk about a longer-term plan shortly...
> 
>  -George
> .
> 

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 00/17] blktap2 related bugfix patches
  2014-10-21  2:56       ` Wen Congyang
@ 2014-10-21  9:55         ` George Dunlap
  2014-10-21 10:07           ` M A Young
  2014-10-21 10:45           ` Bob Ball
  0 siblings, 2 replies; 50+ messages in thread
From: George Dunlap @ 2014-10-21  9:55 UTC (permalink / raw)
  To: Wen Congyang
  Cc: Lai Jiangshan, Ian Jackson, Jiang Yunhong, Dong Eddie, xen devel,
	Bob Ball, Yang Hongyang, Ian Campbell

[-- Attachment #1: Type: text/plain, Size: 1047 bytes --]

On Tue, Oct 21, 2014 at 3:56 AM, Wen Congyang <wency@cn.fujitsu.com> wrote:

>> Congyang, were you aware that XenServer has a fork of blktap that is
>> actually still under active development and maintainership outside of
>> the main Xen tree?
>>
>> git://github.com/xen-org/blktap.git
>>
>> Both CentOS and Fedora are actually using snapshots of the "blktap2"
>> branch in that tree for their Xen RPMs.  (I'm sure CentOS is, I
>> believe Fedora is.)  It's not unlikely that the bugs you're fixing
>> here have already been fixed in the XenServer fork.
>
> How to build upstream xen with this blktap2? Copy the codes to overwrite
> blktap2 in xen.git?

Well what CentOS does at the moment is rm -rf tools/blktap2, then copy
the above repo into tools/blktap2.  It requires a couple of patches to
make things build / work (attached).

CentOS is using a snapshot from 2012 -- I haven't tried a more recent
version; but I *think* Fedora may be using a more recent one.  I think
Bob Ball might know more about the best way to link in this branch.

 -George

[-- Attachment #2: xen-centos-blktap25-ctl-ipc-restart.patch --]
[-- Type: application/octet-stream, Size: 1134 bytes --]

diff -uNrp xen-4.2.1.orig/tools/blktap2/control/tap-ctl-ipc.c xen-4.2.1/tools/blktap2/control/tap-ctl-ipc.c
--- xen-4.2.1.orig/tools/blktap2/control/tap-ctl-ipc.c	2013-01-22 11:43:54.000000000 -0600
+++ xen-4.2.1/tools/blktap2/control/tap-ctl-ipc.c	2013-03-27 00:07:21.521475110 -0500
@@ -58,8 +58,11 @@ tap_ctl_read_raw(int fd, void *buf, size
 		FD_SET(fd, &readfds);
 
 		ret = select(fd + 1, &readfds, NULL, NULL, timeout);
-		if (ret == -1)
-			break;
+                if (ret == -1) {
+                        if (errno == EINTR)
+                                continue;
+                        break;
+                }
 		else if (FD_ISSET(fd, &readfds)) {
 			ret = read(fd, buf + offset, size - offset);
 			if (ret <= 0)
@@ -114,6 +117,11 @@ tap_ctl_write_message(int fd, tapdisk_me
 		 * bit more time than expected. */
 
 		ret = select(fd + 1, NULL, &writefds, NULL, timeout);
+               if (ret == -1) {
+                        if (errno == EINTR)
+                                continue;
+                        break;
+                }
 		if (ret == -1)
 			break;
 		else if (FD_ISSET(fd, &writefds)) {
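
The retry-on-EINTR pattern that the patch above adds around select() can be sketched in isolation as follows. This is an illustrative helper, not code from blktap: the name `wait_readable` and the millisecond timeout parameter are assumptions, and for simplicity the sketch restarts the full timeout after each interruption instead of tracking the time already spent.

```c
#include <assert.h>
#include <errno.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/*
 * Wait until `fd` is readable or `timeout_ms` elapses, retrying when
 * select() is interrupted by a signal (EINTR).
 * Returns 1 if readable, 0 on timeout, -1 on any other error.
 */
int wait_readable(int fd, int timeout_ms)
{
    assert(fd >= 0 && fd < FD_SETSIZE);

    for (;;) {
        fd_set readfds;
        struct timeval tv;
        int ret;

        /* Rebuild the fd_set and the timeval on every pass, because
         * select() may modify both. */
        FD_ZERO(&readfds);
        FD_SET(fd, &readfds);
        tv.tv_sec = timeout_ms / 1000;
        tv.tv_usec = (timeout_ms % 1000) * 1000;

        ret = select(fd + 1, &readfds, NULL, NULL, &tv);
        if (ret == -1) {
            if (errno == EINTR)
                continue;   /* interrupted by a signal: try again */
            return -1;      /* genuine failure */
        }
        return (ret > 0 && FD_ISSET(fd, &readfds)) ? 1 : 0;
    }
}
```

Without the EINTR check, a signal delivered while the caller is blocked in select() is treated as a hard failure, which is the behaviour the patch is working around.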

[-- Attachment #3: xen-centos-disableWerror-blktap25.patch --]
[-- Type: application/octet-stream, Size: 1489 bytes --]

diff -uNr xen-4.2.1__orig/tools/blktap2/drivers/Makefile xen-4.2.1/tools/blktap2/drivers/Makefile
--- xen-4.2.1__orig/tools/blktap2/drivers/Makefile	2013-01-22 14:21:13.643741669 -0500
+++ xen-4.2.1/tools/blktap2/drivers/Makefile	2013-01-22 14:21:44.347092274 -0500
@@ -207,7 +207,7 @@
 top_build_prefix = ../
 top_builddir = ..
 top_srcdir = ..
-AM_CFLAGS = -Wall -Werror
+AM_CFLAGS = -Wall 
 AM_CPPFLAGS = -D_GNU_SOURCE -I$(top_srcdir)/include
 tapdisk_SOURCES = tapdisk2.c
 tapdisk_LDADD = libtapdisk.la
diff -uNr xen-4.2.1__orig/tools/blktap2/drivers/Makefile.am xen-4.2.1/tools/blktap2/drivers/Makefile.am
--- xen-4.2.1__orig/tools/blktap2/drivers/Makefile.am	2013-01-22 12:43:54.000000000 -0500
+++ xen-4.2.1/tools/blktap2/drivers/Makefile.am	2013-01-22 14:21:44.347732663 -0500
@@ -1,6 +1,6 @@
 
 AM_CFLAGS  = -Wall
-AM_CFLAGS += -Werror
+AM_CFLAGS += 
 
 AM_CPPFLAGS  = -D_GNU_SOURCE
 AM_CPPFLAGS += -I$(top_srcdir)/include
diff -uNr xen-4.2.1__orig/tools/blktap2/drivers/Makefile.in xen-4.2.1/tools/blktap2/drivers/Makefile.in
--- xen-4.2.1__orig/tools/blktap2/drivers/Makefile.in	2013-01-22 14:21:09.878842722 -0500
+++ xen-4.2.1/tools/blktap2/drivers/Makefile.in	2013-01-22 14:21:44.349092631 -0500
@@ -207,7 +207,7 @@
 top_build_prefix = @top_build_prefix@
 top_builddir = @top_builddir@
 top_srcdir = @top_srcdir@
-AM_CFLAGS = -Wall -Werror
+AM_CFLAGS = -Wall 
 AM_CPPFLAGS = -D_GNU_SOURCE -I$(top_srcdir)/include
 tapdisk_SOURCES = tapdisk2.c
 tapdisk_LDADD = libtapdisk.la

[-- Attachment #4: xen-centos-libxl-with-blktap25.patch --]
[-- Type: application/octet-stream, Size: 3848 bytes --]

diff --git a/tools/Rules.mk b/tools/Rules.mk
index f4e84c1..45f782d 100644
--- a/tools/Rules.mk
+++ b/tools/Rules.mk
@@ -46,9 +46,9 @@ LIBXL_BLKTAP ?= n
 endif
 
 ifeq ($(LIBXL_BLKTAP),y)
-CFLAGS_libblktapctl = -I$(XEN_BLKTAP2)/control -I$(XEN_BLKTAP2)/include $(CFLAGS_xeninclude)
-LDLIBS_libblktapctl = -L$(XEN_BLKTAP2)/control -lblktapctl
-SHLIB_libblktapctl  = -Wl,-rpath-link=$(XEN_BLKTAP2)/control
+CFLAGS_libblktapctl = -I$(XEN_BLKTAP2)/include $(CFLAGS_xeninclude)
+LDLIBS_libblktapctl = -L$(XEN_BLKTAP2)/control/.libs -lblktapctl
+SHLIB_libblktapctl  = -Wl,-rpath-link=$(XEN_BLKTAP2)/control/.libs
 else
 CFLAGS_libblktapctl =
 LDLIBS_libblktapctl =
diff --git a/tools/libxl/Makefile b/tools/libxl/Makefile
index 0b780c0..90cfd0d 100644
--- a/tools/libxl/Makefile
+++ b/tools/libxl/Makefile
@@ -71,7 +71,7 @@ LIBXL_OBJS = flexarray.o libxl.o libxl_create.o libxl_dm.o libxl_pci.o \
 			libxl_qmp.o libxl_event.o libxl_fork.o $(LIBXL_OBJS-y)
 LIBXL_OBJS += _libxl_types.o libxl_flask.o _libxl_types_internal.o
 
-$(LIBXL_OBJS): CFLAGS += $(CFLAGS_libxenctrl) $(CFLAGS_libxenguest) $(CFLAGS_libxenstore) $(CFLAGS_libblktapctl) -include $(XEN_ROOT)/tools/config.h
+$(LIBXL_OBJS): CFLAGS += $(CFLAGS_libxenctrl) $(CFLAGS_libxenguest) $(CFLAGS_libxenstore) -I$(XEN_ROOT)/tools/blktap2/include -include $(XEN_ROOT)/tools/config.h
 
 AUTOINCS= libxlu_cfg_y.h libxlu_cfg_l.h _libxl_list.h _paths.h \
 	_libxl_save_msgs_callout.h _libxl_save_msgs_helper.h
diff --git a/tools/libxl/libxl_blktap2.c b/tools/libxl/libxl_blktap2.c
index 2053403..c85b182 100644
--- a/tools/libxl/libxl_blktap2.c
+++ b/tools/libxl/libxl_blktap2.c
@@ -29,20 +29,15 @@ char *libxl__blktap_devpath(libxl__gc *gc,
 {
     const char *type;
     char *params, *devname = NULL;
-    tap_list_t tap;
     int err;
 
     type = libxl__device_disk_string_of_format(format);
-    err = tap_ctl_find(type, disk, &tap);
-    if (err == 0) {
-        devname = libxl__sprintf(gc, "/dev/xen/blktap-2/tapdev%d", tap.minor);
-        if (devname)
-            return devname;
-    }
 
     params = libxl__sprintf(gc, "%s:%s", type, disk);
-    err = tap_ctl_create(params, &devname);
+    fprintf(stderr, "DEBUG %s %d %s\n",__func__,__LINE__,params);
+    err = tap_ctl_create(params, &devname, 0, -1, 0);
     if (!err) {
+        fprintf(stderr, "DEBUG %s %d %s\n",__func__,__LINE__,devname);
         libxl__ptr_add(gc, devname);
         return devname;
     }
@@ -55,7 +50,10 @@ int libxl__device_destroy_tapdisk(libxl__gc *gc, const char *params)
 {
     char *type, *disk;
     int err;
-    tap_list_t tap;
+	struct list_head list = LIST_HEAD_INIT(list);
+	tap_list_t *entry;
+    int minor = -1;
+    pid_t pid = -1;
 
     type = libxl__strdup(gc, params);
 
@@ -65,19 +63,34 @@ int libxl__device_destroy_tapdisk(libxl__gc *gc, const char *params)
         return ERROR_INVAL;
     }
 
+    fprintf(stderr, "DEBUG %s %d type=%s disk=%s\n",__func__,__LINE__,type,disk);
     *disk++ = '\0';
 
-    err = tap_ctl_find(type, disk, &tap);
-    if (err < 0) {
-        /* returns -errno */
+    err = tap_ctl_list(&list);
+    if (err)
+        return err;
+    tap_list_for_each_entry(entry, &list) {
+		if (type && (!entry->type || strcmp(entry->type, type)))
+			continue;
+
+		if (disk && (!entry->path || strcmp(entry->path, disk)))
+			continue;
+
+        minor = entry->minor;
+        pid = entry->pid;
+		break;
+	}
+	tap_ctl_list_free(&list);
+
+    if (minor < 0) {
         LOGEV(ERROR, -err, "Unable to find type %s disk %s", type, disk);
         return ERROR_FAIL;
     }
 
-    err = tap_ctl_destroy(tap.id, tap.minor);
+    err = tap_ctl_destroy(pid, minor, 1, NULL);
     if (err < 0) {
         LOGEV(ERROR, -err, "Failed to destroy tap device id %d minor %d",
-              tap.id, tap.minor);
+              pid, minor);
         return ERROR_FAIL;
     }
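
The lookup that the libxl patch above performs (walk the tap list, skip entries whose type or path does not match, take the first hit) can be sketched on its own as follows. `struct tap_entry` and `find_tap` are hypothetical stand-ins invented for this example; they are not the real `tap_list_t` or any blktap API, and an array replaces the linked list to keep the sketch short.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical stand-in for a tap list entry; not the real tap_list_t. */
struct tap_entry {
    const char *type;
    const char *path;
    int minor;
    pid_t pid;
};

/*
 * Return the first entry whose type and path both match, or NULL if none
 * does. A NULL `type` or `path` argument acts as a wildcard, mirroring the
 * two filter conditions in the patch's loop.
 */
const struct tap_entry *find_tap(const struct tap_entry *entries, size_t n,
                                 const char *type, const char *path)
{
    assert(entries != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        const struct tap_entry *e = &entries[i];

        if (type && (!e->type || strcmp(e->type, type)))
            continue;   /* type requested but missing or different */
        if (path && (!e->path || strcmp(e->path, path)))
            continue;   /* path requested but missing or different */
        return e;
    }
    return NULL;
}
```

Returning NULL gives the caller an explicit "not found" result. In the patch as posted, `err` is 0 by the time the "Unable to find" message is logged, so the errno value passed to LOGEV carries no information in that case.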

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [PATCH 00/17] blktap2 related bugfix patches
  2014-10-21  9:55         ` George Dunlap
@ 2014-10-21 10:07           ` M A Young
  2014-10-21 10:45           ` Bob Ball
  1 sibling, 0 replies; 50+ messages in thread
From: M A Young @ 2014-10-21 10:07 UTC (permalink / raw)
  To: George Dunlap
  Cc: Lai Jiangshan, Wen Congyang, Ian Jackson, Jiang Yunhong,
	Dong Eddie, xen devel, Bob Ball, Yang Hongyang, Ian Campbell

On Tue, 21 Oct 2014, George Dunlap wrote:

> On Tue, Oct 21, 2014 at 3:56 AM, Wen Congyang <wency@cn.fujitsu.com> wrote:
>
>>> Congyang, were you aware that XenServer has a fork of blktap that is
>>> actually still under active development and maintainership outside of
>>> the main Xen tree?
>>>
>>> git://github.com/xen-org/blktap.git
>>>
>>> Both CentOS and Fedora are actually using snapshots of the "blktap2"
>>> branch in that tree for their Xen RPMs.  (I'm sure CentOS is, I
>>> believe Fedora is.)  It's not unlikely that the bugs you're fixing
>>> here have already been fixed in the XenServer fork.
>>
>> How to build upstream xen with this blktap2? Copy the codes to overwrite
>> blktap2 in xen.git?
>
> Well what CentOS does at the moment is rm -rf tools/blktap2, then copy
> the above repo into tools/blktap2.  It requires a couple of patches to
> make things build / work (attached).
>
> CentOS is using a snapshot from 2012 -- I haven't tried a more recent
> version; but I *think* Fedora may be using a more recent one.  I think
> Bob Ball might know more about the best way to link in this branch.

Those patches need a slight modification if you are building against (what
I think is) the latest release at
https://github.com/xapi-project/blktap/releases/tag/0.9.2
because that release adds an extra argument (timeout) to the
tap_ctl_create function, which the libxl patch calls. You also lose a few
qcow utilities when compared to the blktap2 in the xen source.

 	Michael Young

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 00/17] blktap2 related bugfix patches
  2014-10-21  9:55         ` George Dunlap
  2014-10-21 10:07           ` M A Young
@ 2014-10-21 10:45           ` Bob Ball
  1 sibling, 0 replies; 50+ messages in thread
From: Bob Ball @ 2014-10-21 10:45 UTC (permalink / raw)
  To: George Dunlap, Wen Congyang
  Cc: Thanos Makatos, Lai Jiangshan, Jiang Yunhong, Eddie Dong,
	xen devel, Ian Jackson, Yang Hongyang, Ian Campbell

> -----Original Message-----
> From: dunlapg@gmail.com [mailto:dunlapg@gmail.com] On Behalf Of
> George Dunlap
> Sent: 21 October 2014 10:55

> CentOS is using a snapshot from 2012 -- I haven't tried a more recent
> version; but I *think* Fedora may be using a more recent one.  I think
> Bob Ball might know more about the best way to link in this branch.

Fedora does not currently replace blktap in their xen-runtime but they provide a separate package based on XenServer's blktap 0.9.2 to generate their blktap-3.0.0 package.  I raised a bug in this area because the two packages cannot co-exist: https://bugzilla.redhat.com/show_bug.cgi?id=1150546

Note also that there is some work ongoing at the moment for which I will push for the creation of a new release; that's to enable compilation of the latest blktap (actually on branch 'xs64bit' not master) without performance-improving grant copy kernel patches which have not yet been upstreamed (https://github.com/tmakatos/blktap/tree/CA-147743).

Bob

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 00/17] blktap2 related bugfix patches
  2014-10-14  2:13 [PATCH 00/17] blktap2 related bugfix patches Wen Congyang
                   ` (17 preceding siblings ...)
  2014-10-14 15:48 ` [PATCH 00/17] blktap2 related bugfix patches Ian Jackson
@ 2014-10-27 18:32 ` Konrad Rzeszutek Wilk
  18 siblings, 0 replies; 50+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-10-27 18:32 UTC (permalink / raw)
  To: Wen Congyang
  Cc: Ian Campbell, Ian Jackson, Jiang Yunhong, Dong Eddie, xen devel,
	Yang Hongyang, Lai Jiangshan

On Tue, Oct 14, 2014 at 10:13:48AM +0800, Wen Congyang wrote:
> These bugs are found when we implement COLO, or rebase
> COLO to upstream xen. They are independent patches, so
> post them in separate series.

There is no maintainer for blktap in the Xen code-base.
As such there is nobody to actually review the patches.

Now there is a 'userspace' version of blktap (v3? v4?)
that I have been hearing about, but I do not know much
about it.

As such - I was wondering what should be done
about your changes to blktap?

Are you considering being the maintainer of this
version of blktap code?

> 
> The codes are also hosted on github:
> https://github.com/wencongyang/xen/commits/bugfix-v4
> 
> Lai Jiangshan (1):
>   tools: blktap2: dynamic allocate aio_requests to avoid -EBUSY error
> 
> Wen Congyang (16):
>   tools: block-remus: pass uuid to the callback td_open
>   tools: block-remus: use correct way to get remus_image
>   tools: block-remus: fix bug in tdremus_close()
>   tools: block-remus: fix memory leak
>   tools: blktap2: return the correct dev path
>   tools: blktap2: use correct way to get free event id
>   tools: blktap2: don't return negative event id
>   tools: blktap2: use correct way to define array.
>   tools: block-remus: fix bug in ctl_request()
>   tools: block-remus: clean unused functions
>   tools: blktap2: implement an API to create a connection asynchronously
>   tools: block-remus: connect to backup asynchronously
>   block-remus: switch to unprotected mode before closing
>   tools: blktap2: move ramdisk related codes to block-replication.c
>   support blktap remus in xl
>   HACK: libxl/remus: setup and control disk replication for blktap2
>     backends
> 
>  tools/blktap2/drivers/Makefile            |    1 +
>  tools/blktap2/drivers/block-aio.c         |   41 +-
>  tools/blktap2/drivers/block-cache.c       |    4 +-
>  tools/blktap2/drivers/block-log.c         |    4 +-
>  tools/blktap2/drivers/block-qcow.c        |    5 +-
>  tools/blktap2/drivers/block-ram.c         |    5 +-
>  tools/blktap2/drivers/block-remus.c       | 1201 +++++++----------------------
>  tools/blktap2/drivers/block-replication.c |  928 ++++++++++++++++++++++
>  tools/blktap2/drivers/block-replication.h |  178 +++++
>  tools/blktap2/drivers/block-vhd.c         |    5 +-
>  tools/blktap2/drivers/scheduler.c         |   33 +-
>  tools/blktap2/drivers/tapdisk-control.c   |   17 +-
>  tools/blktap2/drivers/tapdisk-disktype.c  |   12 +-
>  tools/blktap2/drivers/tapdisk-disktype.h  |    2 +-
>  tools/blktap2/drivers/tapdisk-interface.c |   21 +-
>  tools/blktap2/drivers/tapdisk-interface.h |    1 +
>  tools/blktap2/drivers/tapdisk-vbd.c       |    9 +
>  tools/blktap2/drivers/tapdisk-vbd.h       |    1 +
>  tools/blktap2/drivers/tapdisk.h           |    3 +-
>  tools/libxl/Makefile                      |    2 +-
>  tools/libxl/libxl.c                       |   25 +-
>  tools/libxl/libxl_blktap2.c               |   38 +-
>  tools/libxl/libxl_create.c                |    8 +
>  tools/libxl/libxl_device.c                |   35 +-
>  tools/libxl/libxl_dm.c                    |    4 +-
>  tools/libxl/libxl_internal.h              |   10 +-
>  tools/libxl/libxl_noblktap2.c             |    8 +-
>  tools/libxl/libxl_remus_device.c          |    6 +
>  tools/libxl/libxl_remus_disk_blktap.c     |  209 +++++
>  tools/libxl/libxl_types.idl               |    2 +
>  tools/libxl/libxlu_disk_l.l               |    2 +
>  31 files changed, 1857 insertions(+), 963 deletions(-)
>  create mode 100644 tools/blktap2/drivers/block-replication.c
>  create mode 100644 tools/blktap2/drivers/block-replication.h
>  create mode 100644 tools/libxl/libxl_remus_disk_blktap.c
> 
> -- 
> 1.9.3
> 

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 00/17] blktap2 related bugfix patches
  2014-10-20 14:25     ` George Dunlap
  2014-10-21  2:28       ` Wen Congyang
  2014-10-21  2:56       ` Wen Congyang
@ 2014-10-29  5:49       ` Wen Congyang
  2014-11-03  9:58         ` George Dunlap
  2 siblings, 1 reply; 50+ messages in thread
From: Wen Congyang @ 2014-10-29  5:49 UTC (permalink / raw)
  To: George Dunlap
  Cc: Lai Jiangshan, Ian Jackson, Jiang Yunhong, Dong Eddie, xen devel,
	Yang Hongyang, Ian Campbell

On 10/20/2014 10:25 PM, George Dunlap wrote:
> On Wed, Oct 15, 2014 at 2:05 AM, Wen Congyang <wency@cn.fujitsu.com> wrote:
>> On 10/14/2014 11:48 PM, Ian Jackson wrote:
>>> Wen Congyang writes ("[PATCH 00/17] blktap2 related bugfix patches"):
>>>> These bugs are found when we implement COLO, or rebase
>>>> COLO to upstream xen. They are independent patches, so
>>>> post them in separate series.
>>>
>>> blktap2 is unmaintained AFAICT.
>>>
>>> In the last year there has been only one commit which shows evidence
>>> of someone caring even slightly about tools/blktap2/.  The last
>>> substantial attention was in March 2013.
>>>
>>> (I'm disregarding commits which touch tools/blktap2/ to fix up compile
>>> problems with new compilers, sort out build system and file
>>> rearrangements, etc.)
>>>
>>> The file you are touching in your 01/17 was last edited (by anyone, at
>>> all) in January 2010.
>>>
>>> Under the circumstances, we should probably take all these changes
>>> without looking for anyone to ack them.
>>>
>>> Perhaps you would like to become the maintainers of blktap2 ? :-)
>>
>> Hmm, I don't have any knowledge about disk format, but blktap2 have
>> such codes(For example: block-vhd.c, block-qcow.c...). I think I can
>> maintain the rest codes.
> 
> Congyang, were you aware that XenServer has a fork of blktap that is
> actually still under active development and maintainership outside of
> the main Xen tree?
> 
> git://github.com/xen-org/blktap.git
> 
> Both CentOS and Fedora are actually using snapshots of the "blktap2"
> branch in that tree for their Xen RPMs.  (I'm sure CentOS is, I
> believe Fedora is.)  It's not unlikely that the bugs you're fixing
> here have already been fixed in the XenServer fork.

I have another question:
Why don't we merge the "blktap2" branch into xen upstream periodically?

Thanks
Wen Congyang

> 
> I think we could consider taking these patches for the 4.5 release, as
> it's obviously too late to do anything more drastic at this point.
> But I think long-term we need to sort out a better solution.  I'll
> write up an e-mail here to talk about a longer-term plan shortly...
> 
>  -George
> .
> 

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 00/17] blktap2 related bugfix patches
  2014-10-29  5:49       ` Wen Congyang
@ 2014-11-03  9:58         ` George Dunlap
  2014-11-03 10:07           ` Wen Congyang
  2015-02-13  6:56           ` Hongyang Yang
  0 siblings, 2 replies; 50+ messages in thread
From: George Dunlap @ 2014-11-03  9:58 UTC (permalink / raw)
  To: Wen Congyang
  Cc: Lai Jiangshan, Ian Jackson, Jiang Yunhong, Dong Eddie, xen devel,
	Yang Hongyang, Ian Campbell

On 10/29/2014 05:49 AM, Wen Congyang wrote:
> On 10/20/2014 10:25 PM, George Dunlap wrote:
>> On Wed, Oct 15, 2014 at 2:05 AM, Wen Congyang <wency@cn.fujitsu.com> wrote:
>>> On 10/14/2014 11:48 PM, Ian Jackson wrote:
>>>> Wen Congyang writes ("[PATCH 00/17] blktap2 related bugfix patches"):
>>>>> These bugs are found when we implement COLO, or rebase
>>>>> COLO to upstream xen. They are independent patches, so
>>>>> post them in separate series.
>>>> blktap2 is unmaintained AFAICT.
>>>>
>>>> In the last year there has been only one commit which shows evidence
>>>> of someone caring even slightly about tools/blktap2/.  The last
>>>> substantial attention was in March 2013.
>>>>
>>>> (I'm disregarding commits which touch tools/blktap2/ to fix up compile
>>>> problems with new compilers, sort out build system and file
>>>> rearrangements, etc.)
>>>>
>>>> The file you are touching in your 01/17 was last edited (by anyone, at
>>>> all) in January 2010.
>>>>
>>>> Under the circumstances, we should probably take all these changes
>>>> without looking for anyone to ack them.
>>>>
>>>> Perhaps you would like to become the maintainers of blktap2 ? :-)
>>> Hmm, I don't have any knowledge about disk format, but blktap2 have
>>> such codes(For example: block-vhd.c, block-qcow.c...). I think I can
>>> maintain the rest codes.
>> Congyang, were you aware that XenServer has a fork of blktap that is
>> actually still under active development and maintainership outside of
>> the main Xen tree?
>>
>> git://github.com/xen-org/blktap.git
>>
>> Both CentOS and Fedora are actually using snapshots of the "blktap2"
>> branch in that tree for their Xen RPMs.  (I'm sure CentOS is, I
>> believe Fedora is.)  It's not unlikely that the bugs you're fixing
>> here have already been fixed in the XenServer fork.
> I have another question:
> Why don't we merge the "blktap2" branch into xen upstream periodically?

I take it you've found blktap "2.5" useful? :-)

I've been meaning to write an e-mail about this.

The basic reason is that it's normally up to the people doing the 
development to submit changes upstream.  Some years ago XenServer forked 
the blktap2 codebase but got behind in upstreaming things; at this point 
there are far too many changes to simply push them upstream.  
Furthermore, even XenServer isn't 100% sure what they're going to do in 
the future; as of a year ago they were planning to get rid of blktap 
entirely in favor of another solution.

One of the ideas I'm going to discuss in my e-mail is the idea of 
treating blktap2.5 (and/or blktap3) as an external upstream project, 
similar to the way that we treat qemu, seabios, ipxe, and ovmf. That 
would have a similar effect to what you describe.

  -George

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 00/17] blktap2 related bugfix patches
  2014-11-03  9:58         ` George Dunlap
@ 2014-11-03 10:07           ` Wen Congyang
  2014-11-05 19:25             ` Konrad Rzeszutek Wilk
  2015-02-13  6:56           ` Hongyang Yang
  1 sibling, 1 reply; 50+ messages in thread
From: Wen Congyang @ 2014-11-03 10:07 UTC (permalink / raw)
  To: George Dunlap
  Cc: Lai Jiangshan, Ian Jackson, Jiang Yunhong, Dong Eddie, xen devel,
	Yang Hongyang, Ian Campbell

On 11/03/2014 05:58 PM, George Dunlap wrote:
> On 10/29/2014 05:49 AM, Wen Congyang wrote:
>> On 10/20/2014 10:25 PM, George Dunlap wrote:
>>> On Wed, Oct 15, 2014 at 2:05 AM, Wen Congyang <wency@cn.fujitsu.com> wrote:
>>>> On 10/14/2014 11:48 PM, Ian Jackson wrote:
>>>>> Wen Congyang writes ("[PATCH 00/17] blktap2 related bugfix patches"):
>>>>>> These bugs are found when we implement COLO, or rebase
>>>>>> COLO to upstream xen. They are independent patches, so
>>>>>> post them in separate series.
>>>>> blktap2 is unmaintained AFAICT.
>>>>>
>>>>> In the last year there has been only one commit which shows evidence
>>>>> of someone caring even slightly about tools/blktap2/.  The last
>>>>> substantial attention was in March 2013.
>>>>>
>>>>> (I'm disregarding commits which touch tools/blktap2/ to fix up compile
>>>>> problems with new compilers, sort out build system and file
>>>>> rearrangements, etc.)
>>>>>
>>>>> The file you are touching in your 01/17 was last edited (by anyone, at
>>>>> all) in January 2010.
>>>>>
>>>>> Under the circumstances, we should probably take all these changes
>>>>> without looking for anyone to ack them.
>>>>>
>>>>> Perhaps you would like to become the maintainers of blktap2 ? :-)
>>>> Hmm, I don't have any knowledge about disk format, but blktap2 have
>>>> such codes(For example: block-vhd.c, block-qcow.c...). I think I can
>>>> maintain the rest codes.
>>> Congyang, were you aware that XenServer has a fork of blktap is
>>> actually still under active development and maintainership outside of
>>> the main Xen tree?
>>>
>>> git://github.com/xen-org/blktap.git
>>>
>>> Both CentOS and Fedora are actually using snapshots of the "blktap2"
>>> branch in that tree for their Xen RPMs.  (I'm sure CentOS is, I
>>> believe Fedora is.)  It's not unlikely that the bugs you're fixing
>>> here have already been fixed in the XenServer fork.
>> I have another question:
>> Why we don't merge the "blktap2' branch into xen upstream periodically?
> 
> I take it you've found blktap "2.5" useful? :-)
> 
> I've been meaning to write an e-mail about this.
> 
> The basic reason is that it's normally up to the people doing the development to submit changes upstream.  Some years ago XenServer forked the blktap2 codebase but got behind in upstreaming things; at this point there are far too many changes to simply push them upstream.  Furthermore, even XenServer isn't 100% sure what they're going to do in the future; as of a year ago they were planning to get rid of blktap entirely in favor of another solution.
> 
> One of the ideas I'm going to discuss in my e-mail is the idea of treating blktap2.5 (and/or blktap3) as an external upstream project, similar to the way that we treat qemu, seabios, ipxe, and ovmf. That would have a similar effect to what you describe.

I agree with this. Currently, we have both blktap2 and blktap2.5, and I don't know which
version my work should target...

Thanks
Wen Congyang

> 
>  -George
> .
> 

* Re: [PATCH 00/17] blktap2 related bugfix patches
  2014-11-03 10:07           ` Wen Congyang
@ 2014-11-05 19:25             ` Konrad Rzeszutek Wilk
  0 siblings, 0 replies; 50+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-11-05 19:25 UTC (permalink / raw)
  To: Wen Congyang
  Cc: Lai Jiangshan, George Dunlap, Ian Jackson, Jiang Yunhong,
	Dong Eddie, xen devel, Yang Hongyang, Ian Campbell

On Mon, Nov 03, 2014 at 06:07:06PM +0800, Wen Congyang wrote:
> On 11/03/2014 05:58 PM, George Dunlap wrote:
> > On 10/29/2014 05:49 AM, Wen Congyang wrote:
> >> On 10/20/2014 10:25 PM, George Dunlap wrote:
> >>> On Wed, Oct 15, 2014 at 2:05 AM, Wen Congyang <wency@cn.fujitsu.com> wrote:
> >>>> On 10/14/2014 11:48 PM, Ian Jackson wrote:
> >>>>> Wen Congyang writes ("[PATCH 00/17] blktap2 related bugfix patches"):
> >>>>>> These bugs are found when we implement COLO, or rebase
> >>>>>> COLO to upstream xen. They are independent patches, so
> >>>>>> post them in separate series.
> >>>>> blktap2 is unmaintained AFAICT.
> >>>>>
> >>>>> In the last year there has been only one commit which shows evidence
> >>>>> of someone caring even slightly about tools/blktap2/.  The last
> >>>>> substantial attention was in March 2013.
> >>>>>
> >>>>> (I'm disregarding commits which touch tools/blktap2/ to fix up compile
> >>>>> problems with new compilers, sort out build system and file
> >>>>> rearrangements, etc.)
> >>>>>
> >>>>> The file you are touching in your 01/17 was last edited (by anyone, at
> >>>>> all) in January 2010.
> >>>>>
> >>>>> Under the circumstances, we should probably take all these changes
> >>>>> without looking for anyone to ack them.
> >>>>>
> >>>>> Perhaps you would like to become the maintainers of blktap2 ? :-)
> >>>> Hmm, I don't have any knowledge about disk format, but blktap2 have
> >>>> such codes(For example: block-vhd.c, block-qcow.c...). I think I can
> >>>> maintain the rest codes.
> >>> Congyang, were you aware that XenServer has a fork of blktap is
> >>> actually still under active development and maintainership outside of
> >>> the main Xen tree?
> >>>
> >>> git://github.com/xen-org/blktap.git
> >>>
> >>> Both CentOS and Fedora are actually using snapshots of the "blktap2"
> >>> branch in that tree for their Xen RPMs.  (I'm sure CentOS is, I
> >>> believe Fedora is.)  It's not unlikely that the bugs you're fixing
> >>> here have already been fixed in the XenServer fork.
> >> I have another question:
> >> Why we don't merge the "blktap2' branch into xen upstream periodically?
> > 
> > I take it you've found blktap "2.5" useful? :-)
> > 
> > I've been meaning to write an e-mail about this.
> > 
> > The basic reason is that it's normally up to the people doing the development to submit changes upstream.  Some years ago XenServer forked the blktap2 codebase but got behind in upstreaming things; at this point there are far too many changes to simply push them upstream.  Furthermore, even XenServer isn't 100% sure what they're going to do in the future; as of a year ago they were planning to get rid of blktap entirely in favor of another solution.
> > 
> > One of the ideas I'm going to discuss in my e-mail is the idea of treating blktap2.5 (and/or blktap3) as an external upstream project, similar to the way that we treat qemu, seabios, ipxe, and ovmf. That would have a similar effect to what you describe.
> 
> I agree with this. Currently, we have blktap2 and blktap2.5. I don't know my work should be for which
> version...

The one that has an active maintainer!

I presume 'blktap2.5' fits in that category? But we would need to sync
the checkout of blktap2.5 in the Xen git tree when building - so that is
definite Xen 4.6 material.

> 
> Thanks
> Wen Congyang
> 
> > 
> >  -George
> > .
> > 
> 
> 

* Re: [PATCH 00/17] blktap2 related bugfix patches
  2014-11-03  9:58         ` George Dunlap
  2014-11-03 10:07           ` Wen Congyang
@ 2015-02-13  6:56           ` Hongyang Yang
  2015-02-14 18:40             ` George Dunlap
  1 sibling, 1 reply; 50+ messages in thread
From: Hongyang Yang @ 2015-02-13  6:56 UTC (permalink / raw)
  To: George Dunlap, Wen Congyang
  Cc: Ian Campbell, Ian Jackson, Jiang Yunhong, Dong Eddie, xen devel,
	Lai Jiangshan

Hi George,

On 11/03/2014 05:58 PM, George Dunlap wrote:
> On 10/29/2014 05:49 AM, Wen Congyang wrote:
>> On 10/20/2014 10:25 PM, George Dunlap wrote:
>>> On Wed, Oct 15, 2014 at 2:05 AM, Wen Congyang <wency@cn.fujitsu.com> wrote:
>>>> On 10/14/2014 11:48 PM, Ian Jackson wrote:
>>>>> Wen Congyang writes ("[PATCH 00/17] blktap2 related bugfix patches"):
>>>>>> These bugs are found when we implement COLO, or rebase
>>>>>> COLO to upstream xen. They are independent patches, so
>>>>>> post them in separate series.
>>>>> blktap2 is unmaintained AFAICT.
>>>>>
>>>>> In the last year there has been only one commit which shows evidence
>>>>> of someone caring even slightly about tools/blktap2/.  The last
>>>>> substantial attention was in March 2013.
>>>>>
>>>>> (I'm disregarding commits which touch tools/blktap2/ to fix up compile
>>>>> problems with new compilers, sort out build system and file
>>>>> rearrangements, etc.)
>>>>>
>>>>> The file you are touching in your 01/17 was last edited (by anyone, at
>>>>> all) in January 2010.
>>>>>
>>>>> Under the circumstances, we should probably take all these changes
>>>>> without looking for anyone to ack them.
>>>>>
>>>>> Perhaps you would like to become the maintainers of blktap2 ? :-)
>>>> Hmm, I don't have any knowledge about disk format, but blktap2 have
>>>> such codes(For example: block-vhd.c, block-qcow.c...). I think I can
>>>> maintain the rest codes.
>>> Congyang, were you aware that XenServer has a fork of blktap is
>>> actually still under active development and maintainership outside of
>>> the main Xen tree?
>>>
>>> git://github.com/xen-org/blktap.git
>>>
>>> Both CentOS and Fedora are actually using snapshots of the "blktap2"
>>> branch in that tree for their Xen RPMs.  (I'm sure CentOS is, I
>>> believe Fedora is.)  It's not unlikely that the bugs you're fixing
>>> here have already been fixed in the XenServer fork.
>> I have another question:
>> Why we don't merge the "blktap2' branch into xen upstream periodically?
>
> I take it you've found blktap "2.5" useful? :-)
>
> I've been meaning to write an e-mail about this.
>
> The basic reason is that it's normally up to the people doing the development to
> submit changes upstream.  Some years ago XenServer forked the blktap2 codebase
> but got behind in upstreaming things; at this point there are far too many
> changes to simply push them upstream. Furthermore, even XenServer isn't 100%
> sure what they're going to do in the future; as of a year ago they were planning
> to get rid of blktap entirely in favor of another solution.
>
> One of the ideas I'm going to discuss in my e-mail is the idea of treating
> blktap2.5 (and/or blktap3) as an external upstream project, similar to the way
> that we treat qemu, seabios, ipxe, and ovmf. That would have a similar effect to
> what you describe.

How is this going?

>
>   -George
> .
>

-- 
Thanks,
Yang.

* Re: [PATCH 00/17] blktap2 related bugfix patches
  2015-02-13  6:56           ` Hongyang Yang
@ 2015-02-14 18:40             ` George Dunlap
  0 siblings, 0 replies; 50+ messages in thread
From: George Dunlap @ 2015-02-14 18:40 UTC (permalink / raw)
  To: Hongyang Yang, Wen Congyang
  Cc: Ian Campbell, Jiang Yunhong, Eddie Dong, xen devel, Ian Jackson,
	Lai Jiangshan

I'm working on a talk for the Linux Collab Summit next week; and after that I'm on holiday for about a week.  (Actually I'm in Hong Kong for the tail end of Chinese New Year!)

At any rate, I won't get a chance to look at this until March at the earliest.

 -George
________________________________________
From: Hongyang Yang [yanghy@cn.fujitsu.com]
Sent: 13 February 2015 06:56
To: George Dunlap; Wen Congyang
Cc: Ian Jackson; Lai Jiangshan; Jiang Yunhong; Eddie Dong; xen devel; Ian Campbell
Subject: Re: [Xen-devel] [PATCH 00/17] blktap2 related bugfix patches

Hi George,

On 11/03/2014 05:58 PM, George Dunlap wrote:
> On 10/29/2014 05:49 AM, Wen Congyang wrote:
>> On 10/20/2014 10:25 PM, George Dunlap wrote:
>>> On Wed, Oct 15, 2014 at 2:05 AM, Wen Congyang <wency@cn.fujitsu.com> wrote:
>>>> On 10/14/2014 11:48 PM, Ian Jackson wrote:
>>>>> Wen Congyang writes ("[PATCH 00/17] blktap2 related bugfix patches"):
>>>>>> These bugs are found when we implement COLO, or rebase
>>>>>> COLO to upstream xen. They are independent patches, so
>>>>>> post them in separate series.
>>>>> blktap2 is unmaintained AFAICT.
>>>>>
>>>>> In the last year there has been only one commit which shows evidence
>>>>> of someone caring even slightly about tools/blktap2/.  The last
>>>>> substantial attention was in March 2013.
>>>>>
>>>>> (I'm disregarding commits which touch tools/blktap2/ to fix up compile
>>>>> problems with new compilers, sort out build system and file
>>>>> rearrangements, etc.)
>>>>>
>>>>> The file you are touching in your 01/17 was last edited (by anyone, at
>>>>> all) in January 2010.
>>>>>
>>>>> Under the circumstances, we should probably take all these changes
>>>>> without looking for anyone to ack them.
>>>>>
>>>>> Perhaps you would like to become the maintainers of blktap2 ? :-)
>>>> Hmm, I don't have any knowledge about disk format, but blktap2 have
>>>> such codes(For example: block-vhd.c, block-qcow.c...). I think I can
>>>> maintain the rest codes.
>>> Congyang, were you aware that XenServer has a fork of blktap is
>>> actually still under active development and maintainership outside of
>>> the main Xen tree?
>>>
>>> git://github.com/xen-org/blktap.git
>>>
>>> Both CentOS and Fedora are actually using snapshots of the "blktap2"
>>> branch in that tree for their Xen RPMs.  (I'm sure CentOS is, I
>>> believe Fedora is.)  It's not unlikely that the bugs you're fixing
>>> here have already been fixed in the XenServer fork.
>> I have another question:
>> Why we don't merge the "blktap2' branch into xen upstream periodically?
>
> I take it you've found blktap "2.5" useful? :-)
>
> I've been meaning to write an e-mail about this.
>
> The basic reason is that it's normally up to the people doing the development to
> submit changes upstream.  Some years ago XenServer forked the blktap2 codebase
> but got behind in upstreaming things; at this point there are far too many
> changes to simply push them upstream. Furthermore, even XenServer isn't 100%
> sure what they're going to do in the future; as of a year ago they were planning
> to get rid of blktap entirely in favor of another solution.
>
> One of the ideas I'm going to discuss in my e-mail is the idea of treating
> blktap2.5 (and/or blktap3) as an external upstream project, similar to the way
> that we treat qemu, seabios, ipxe, and ovmf. That would have a similar effect to
> what you describe.

How is this going?

>
>   -George
> .
>

--
Thanks,
Yang.

Thread overview: 50+ messages
2014-10-14  2:13 [PATCH 00/17] blktap2 related bugfix patches Wen Congyang
2014-10-14  2:13 ` [PATCH 01/17] tools: blktap2: dynamic allocate aio_requests to avoid -EBUSY error Wen Congyang
2014-10-14  2:13 ` [PATCH 02/17] tools: block-remus: pass uuid to the callback td_open Wen Congyang
2014-10-20  2:58   ` Shriram Rajagopalan
2014-10-14  2:13 ` [PATCH 03/17] tools: block-remus: use correct way to get remus_image Wen Congyang
2014-10-20  3:02   ` Shriram Rajagopalan
2014-10-14  2:13 ` [PATCH 04/17] tools: block-remus: fix bug in tdremus_close() Wen Congyang
2014-10-20  3:01   ` Shriram Rajagopalan
2014-10-20  3:05     ` Wen Congyang
2014-10-14  2:13 ` [PATCH 05/17] tools: block-remus: fix memory leak Wen Congyang
2014-10-20  2:33   ` Shriram Rajagopalan
2014-10-14  2:13 ` [PATCH 06/17] tools: blktap2: return the correct dev path Wen Congyang
2014-10-14  2:13 ` [PATCH 07/17] tools: blktap2: use correct way to get free event id Wen Congyang
2014-10-14  2:13 ` [PATCH 08/17] tools: blktap2: don't return negative " Wen Congyang
2014-10-14  2:13 ` [PATCH 09/17] tools: blktap2: use correct way to define array Wen Congyang
2014-10-20  2:37   ` Shriram Rajagopalan
2014-10-20  2:52     ` Wen Congyang
2014-10-14  2:13 ` [PATCH 10/17] tools: block-remus: fix bug in ctl_request() Wen Congyang
2014-10-20  2:38   ` Shriram Rajagopalan
2014-10-14  2:13 ` [PATCH 11/17] tools: block-remus: clean unused functions Wen Congyang
2014-10-20  3:01   ` Shriram Rajagopalan
2014-10-14  2:14 ` [PATCH 12/17] tools: blktap2: implement an API to create a connection asynchronously Wen Congyang
2014-10-14  2:14 ` [PATCH 13/17] tools: block-remus: connect to backup asynchronously Wen Congyang
2014-10-20  2:50   ` Shriram Rajagopalan
2014-10-20  3:00     ` Wen Congyang
2014-10-20  3:11       ` Shriram Rajagopalan
2014-10-14  2:14 ` [PATCH 14/17] block-remus: switch to unprotected mode before closing Wen Congyang
2014-10-20  2:51   ` Shriram Rajagopalan
2014-10-14  2:14 ` [PATCH 15/17] tools: blktap2: move ramdisk related codes to block-replication.c Wen Congyang
2014-10-20  2:52   ` Shriram Rajagopalan
2014-10-14  2:14 ` [PATCH 16/17] support blktap remus in xl Wen Congyang
2014-10-14  2:14 ` [PATCH 17/17] HACK: libxl/remus: setup and control disk replication for blktap2 backends Wen Congyang
2014-10-20  3:00   ` Shriram Rajagopalan
2014-10-20  3:09     ` Wen Congyang
2014-10-14 15:48 ` [PATCH 00/17] blktap2 related bugfix patches Ian Jackson
2014-10-15  1:05   ` Wen Congyang
2014-10-19 20:34     ` Shriram Rajagopalan
2014-10-20 14:25     ` George Dunlap
2014-10-21  2:28       ` Wen Congyang
2014-10-21  2:56       ` Wen Congyang
2014-10-21  9:55         ` George Dunlap
2014-10-21 10:07           ` M A Young
2014-10-21 10:45           ` Bob Ball
2014-10-29  5:49       ` Wen Congyang
2014-11-03  9:58         ` George Dunlap
2014-11-03 10:07           ` Wen Congyang
2014-11-05 19:25             ` Konrad Rzeszutek Wilk
2015-02-13  6:56           ` Hongyang Yang
2015-02-14 18:40             ` George Dunlap
2014-10-27 18:32 ` Konrad Rzeszutek Wilk
