qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
* [Qemu-devel] [PATCH v4 0/2] add read-pattern for block qourum
@ 2014-07-17  5:18 Liu Yuan
  2014-07-17  5:18 ` [Qemu-devel] [PATCH v4 1/2] qapi: add read-pattern enum for quorum Liu Yuan
  2014-07-17  5:18 ` [Qemu-devel] [PATCH v4 2/2] block/quorum: add simple read pattern support Liu Yuan
  0 siblings, 2 replies; 10+ messages in thread
From: Liu Yuan @ 2014-07-17  5:18 UTC (permalink / raw)
  To: qemu-devel

v4:
 - swap the patch order
 - update comment for fifo pattern in qaip
 - use qapi enumeration in quorum driver instead of manual parsing

v3:
 - separate patch into two, one for quorum and one for qapi for easier review
 - add enumeration for quorum read pattern
 - remove unrelated blank line fix from this patch set

v2:
 - rename single as 'fifo'
 - rename read_all_children as read_quorum_children
 - fix quorum_aio_finalize() for fifo pattern

This patch adds single read pattern to quorum driver and quorum vote is default
pattern.

For now we do a quorum vote on all the reads, it is designed for unreliable
underlying storage such as non-redundant NFS to make sure data integrity at the
cost of the read performance.

For some use cases as following:

        VM
  --------------
  |            |
  v            v
  A            B

Both A and B has hardware raid storage to justify the data integrity on its own.
So it would help performance if we do a single read instead of on all the nodes.
Further, if we run VM on either of the storage node, we can make a local read
request for better performance.

This patch generalize the above 2 nodes case in the N nodes. That is,

vm -> write to all the N nodes, read just one of them. If single read fails, we
try to read next node in FIFO order specified by the startup command.

The 2 nodes case is very similar to DRBD[1] though lack of auto-sync
functionality in the single device/node failure for now. But compared with DRBD
we still have some advantages over it:

- Suppose we have 20 VMs running on one(assume A) of 2 nodes' DRBD backed
storage. And if A crashes, we need to restart all the VMs on node B. But for
practice case, we can't because B might not have enough resources to setup 20 VMs
at once. So if we run our 20 VMs with quorum driver, and scatter the replicated
images over the data center, we can very likely restart 20 VMs without any
resource problem.

After all, I think we can build a more powerful replicated image functionality
on quorum and block jobs(block mirror) to meet various High Availibility needs.

E.g, Enable single read pattern on 2 children,

-drive driver=quorum,children.0.file.filename=0.qcow2,\
children.1.file.filename=1.qcow2,read-pattern=fifo,vote-threshold=1

[1] http://en.wikipedia.org/wiki/Distributed_Replicated_Block_Device

Liu Yuan (2):
  qapi: add read-pattern support for quorum
  block/quorum: add simple read pattern support

 block/quorum.c       | 181 +++++++++++++++++++++++++++++++++++++--------------
 qapi/block-core.json |  19 +++++-
 2 files changed, 149 insertions(+), 51 deletions(-)

-- 
1.9.1

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Qemu-devel] [PATCH v4 1/2] qapi: add read-pattern enum for quorum
  2014-07-17  5:18 [Qemu-devel] [PATCH v4 0/2] add read-pattern for block qourum Liu Yuan
@ 2014-07-17  5:18 ` Liu Yuan
  2014-08-11 16:42   ` Eric Blake
  2014-07-17  5:18 ` [Qemu-devel] [PATCH v4 2/2] block/quorum: add simple read pattern support Liu Yuan
  1 sibling, 1 reply; 10+ messages in thread
From: Liu Yuan @ 2014-07-17  5:18 UTC (permalink / raw)
  To: qemu-devel

Cc: Eric Blake <eblake@redhat.com>
Signed-off-by: Liu Yuan <namei.unix@gmail.com>
---
 qapi/block-core.json | 20 +++++++++++++++++++-
 1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/qapi/block-core.json b/qapi/block-core.json
index e378653..42033d9 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -1384,6 +1384,19 @@
             'raw': 'BlockdevRef' } }
 
 ##
+# @QuorumReadPattern
+#
+# An enumeration of quorum read patterns.
+#
+# @quorum: read all the children and do a quorum vote on reads
+#
+# @fifo: read only from the first child that has not failed
+#
+# Since: 2.2
+##
+{ 'enum': 'QuorumReadPattern', 'data': [ 'quorum', 'fifo' ] }
+
+##
 # @BlockdevOptionsQuorum
 #
 # Driver specific block device options for Quorum
@@ -1398,12 +1411,17 @@
 # @rewrite-corrupted: #optional rewrite corrupted data when quorum is reached
 #                     (Since 2.1)
 #
+# @read-pattern: #optional choose read pattern and set to quorum by default
+#                (Since 2.2)
+#
 # Since: 2.0
 ##
 { 'type': 'BlockdevOptionsQuorum',
   'data': { '*blkverify': 'bool',
             'children': [ 'BlockdevRef' ],
-            'vote-threshold': 'int', '*rewrite-corrupted': 'bool' } }
+            'vote-threshold': 'int',
+            '*rewrite-corrupted': 'bool',
+            '*read-pattern': 'QuorumReadPattern' } }
 
 ##
 # @BlockdevOptions
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [Qemu-devel] [PATCH v4 2/2] block/quorum: add simple read pattern support
  2014-07-17  5:18 [Qemu-devel] [PATCH v4 0/2] add read-pattern for block qourum Liu Yuan
  2014-07-17  5:18 ` [Qemu-devel] [PATCH v4 1/2] qapi: add read-pattern enum for quorum Liu Yuan
@ 2014-07-17  5:18 ` Liu Yuan
  2014-08-11  7:18   ` Liu Yuan
                     ` (2 more replies)
  1 sibling, 3 replies; 10+ messages in thread
From: Liu Yuan @ 2014-07-17  5:18 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, Benoit Canet, Stefan Hajnoczi

This patch adds single read pattern to quorum driver and quorum vote is default
pattern.

For now we do a quorum vote on all the reads, it is designed for unreliable
underlying storage such as non-redundant NFS to make sure data integrity at the
cost of the read performance.

For some use cases as following:

        VM
  --------------
  |            |
  v            v
  A            B

Both A and B has hardware raid storage to justify the data integrity on its own.
So it would help performance if we do a single read instead of on all the nodes.
Further, if we run VM on either of the storage node, we can make a local read
request for better performance.

This patch generalize the above 2 nodes case in the N nodes. That is,

vm -> write to all the N nodes, read just one of them. If single read fails, we
try to read next node in FIFO order specified by the startup command.

The 2 nodes case is very similar to DRBD[1] though lack of auto-sync
functionality in the single device/node failure for now. But compared with DRBD
we still have some advantages over it:

- Suppose we have 20 VMs running on one(assume A) of 2 nodes' DRBD backed
storage. And if A crashes, we need to restart all the VMs on node B. But for
practice case, we can't because B might not have enough resources to setup 20 VMs
at once. So if we run our 20 VMs with quorum driver, and scatter the replicated
images over the data center, we can very likely restart 20 VMs without any
resource problem.

After all, I think we can build a more powerful replicated image functionality
on quorum and block jobs(block mirror) to meet various High Availibility needs.

E.g, Enable single read pattern on 2 children,

-drive driver=quorum,children.0.file.filename=0.qcow2,\
children.1.file.filename=1.qcow2,read-pattern=fifo,vote-threshold=1

[1] http://en.wikipedia.org/wiki/Distributed_Replicated_Block_Device

Cc: Benoit Canet <benoit@irqsave.net>
Cc: Eric Blake <eblake@redhat.com>
Cc: Kevin Wolf <kwolf@redhat.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Liu Yuan <namei.unix@gmail.com>
---
 block/quorum.c | 179 +++++++++++++++++++++++++++++++++++++++++----------------
 1 file changed, 131 insertions(+), 48 deletions(-)

diff --git a/block/quorum.c b/block/quorum.c
index d5ee9c0..ebf5c71 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -24,6 +24,7 @@
 #define QUORUM_OPT_VOTE_THRESHOLD "vote-threshold"
 #define QUORUM_OPT_BLKVERIFY      "blkverify"
 #define QUORUM_OPT_REWRITE        "rewrite-corrupted"
+#define QUORUM_OPT_READ_PATTERN   "read-pattern"
 
 /* This union holds a vote hash value */
 typedef union QuorumVoteValue {
@@ -74,6 +75,8 @@ typedef struct BDRVQuorumState {
     bool rewrite_corrupted;/* true if the driver must rewrite-on-read corrupted
                             * block if Quorum is reached.
                             */
+
+    QuorumReadPattern read_pattern;
 } BDRVQuorumState;
 
 typedef struct QuorumAIOCB QuorumAIOCB;
@@ -117,6 +120,7 @@ struct QuorumAIOCB {
 
     bool is_read;
     int vote_ret;
+    int child_iter;             /* which child to read in fifo pattern */
 };
 
 static bool quorum_vote(QuorumAIOCB *acb);
@@ -154,8 +158,10 @@ static void quorum_aio_finalize(QuorumAIOCB *acb)
 
     if (acb->is_read) {
         for (i = 0; i < s->num_children; i++) {
-            qemu_vfree(acb->qcrs[i].buf);
-            qemu_iovec_destroy(&acb->qcrs[i].qiov);
+            if (i <= acb->child_iter) {
+                qemu_vfree(acb->qcrs[i].buf);
+                qemu_iovec_destroy(&acb->qcrs[i].qiov);
+            }
         }
     }
 
@@ -256,6 +262,21 @@ static void quorum_rewrite_aio_cb(void *opaque, int ret)
     quorum_aio_finalize(acb);
 }
 
+static BlockDriverAIOCB *read_fifo_child(QuorumAIOCB *acb);
+
+static void quorum_copy_qiov(QEMUIOVector *dest, QEMUIOVector *source)
+{
+    int i;
+    assert(dest->niov == source->niov);
+    assert(dest->size == source->size);
+    for (i = 0; i < source->niov; i++) {
+        assert(dest->iov[i].iov_len == source->iov[i].iov_len);
+        memcpy(dest->iov[i].iov_base,
+               source->iov[i].iov_base,
+               source->iov[i].iov_len);
+    }
+}
+
 static void quorum_aio_cb(void *opaque, int ret)
 {
     QuorumChildRequest *sacb = opaque;
@@ -263,6 +284,21 @@ static void quorum_aio_cb(void *opaque, int ret)
     BDRVQuorumState *s = acb->common.bs->opaque;
     bool rewrite = false;
 
+    if (acb->is_read && s->read_pattern == QUORUM_READ_PATTERN_FIFO) {
+        /* We try to read next child in FIFO order if we fail to read */
+        if (ret < 0 && ++acb->child_iter < s->num_children) {
+            read_fifo_child(acb);
+            return;
+        }
+
+        if (ret == 0) {
+            quorum_copy_qiov(acb->qiov, &acb->qcrs[acb->child_iter].qiov);
+        }
+        acb->vote_ret = ret;
+        quorum_aio_finalize(acb);
+        return;
+    }
+
     sacb->ret = ret;
     acb->count++;
     if (ret == 0) {
@@ -343,19 +379,6 @@ static bool quorum_rewrite_bad_versions(BDRVQuorumState *s, QuorumAIOCB *acb,
     return count;
 }
 
-static void quorum_copy_qiov(QEMUIOVector *dest, QEMUIOVector *source)
-{
-    int i;
-    assert(dest->niov == source->niov);
-    assert(dest->size == source->size);
-    for (i = 0; i < source->niov; i++) {
-        assert(dest->iov[i].iov_len == source->iov[i].iov_len);
-        memcpy(dest->iov[i].iov_base,
-               source->iov[i].iov_base,
-               source->iov[i].iov_len);
-    }
-}
-
 static void quorum_count_vote(QuorumVotes *votes,
                               QuorumVoteValue *value,
                               int index)
@@ -615,34 +638,62 @@ free_exit:
     return rewrite;
 }
 
-static BlockDriverAIOCB *quorum_aio_readv(BlockDriverState *bs,
-                                         int64_t sector_num,
-                                         QEMUIOVector *qiov,
-                                         int nb_sectors,
-                                         BlockDriverCompletionFunc *cb,
-                                         void *opaque)
+static BlockDriverAIOCB *read_quorum_children(QuorumAIOCB *acb)
 {
-    BDRVQuorumState *s = bs->opaque;
-    QuorumAIOCB *acb = quorum_aio_get(s, bs, qiov, sector_num,
-                                      nb_sectors, cb, opaque);
+    BDRVQuorumState *s = acb->common.bs->opaque;
     int i;
 
-    acb->is_read = true;
-
     for (i = 0; i < s->num_children; i++) {
-        acb->qcrs[i].buf = qemu_blockalign(s->bs[i], qiov->size);
-        qemu_iovec_init(&acb->qcrs[i].qiov, qiov->niov);
-        qemu_iovec_clone(&acb->qcrs[i].qiov, qiov, acb->qcrs[i].buf);
+        acb->qcrs[i].buf = qemu_blockalign(s->bs[i], acb->qiov->size);
+        qemu_iovec_init(&acb->qcrs[i].qiov, acb->qiov->niov);
+        qemu_iovec_clone(&acb->qcrs[i].qiov, acb->qiov, acb->qcrs[i].buf);
     }
 
     for (i = 0; i < s->num_children; i++) {
-        bdrv_aio_readv(s->bs[i], sector_num, &acb->qcrs[i].qiov, nb_sectors,
-                       quorum_aio_cb, &acb->qcrs[i]);
+        bdrv_aio_readv(s->bs[i], acb->sector_num, &acb->qcrs[i].qiov,
+                       acb->nb_sectors, quorum_aio_cb, &acb->qcrs[i]);
     }
 
     return &acb->common;
 }
 
+static BlockDriverAIOCB *read_fifo_child(QuorumAIOCB *acb)
+{
+    BDRVQuorumState *s = acb->common.bs->opaque;
+
+    acb->qcrs[acb->child_iter].buf = qemu_blockalign(s->bs[acb->child_iter],
+                                                     acb->qiov->size);
+    qemu_iovec_init(&acb->qcrs[acb->child_iter].qiov, acb->qiov->niov);
+    qemu_iovec_clone(&acb->qcrs[acb->child_iter].qiov, acb->qiov,
+                     acb->qcrs[acb->child_iter].buf);
+    bdrv_aio_readv(s->bs[acb->child_iter], acb->sector_num,
+                   &acb->qcrs[acb->child_iter].qiov, acb->nb_sectors,
+                   quorum_aio_cb, &acb->qcrs[acb->child_iter]);
+
+    return &acb->common;
+}
+
+static BlockDriverAIOCB *quorum_aio_readv(BlockDriverState *bs,
+                                          int64_t sector_num,
+                                          QEMUIOVector *qiov,
+                                          int nb_sectors,
+                                          BlockDriverCompletionFunc *cb,
+                                          void *opaque)
+{
+    BDRVQuorumState *s = bs->opaque;
+    QuorumAIOCB *acb = quorum_aio_get(s, bs, qiov, sector_num,
+                                      nb_sectors, cb, opaque);
+    acb->is_read = true;
+
+    if (s->read_pattern == QUORUM_READ_PATTERN_QUORUM) {
+        acb->child_iter = s->num_children - 1;
+        return read_quorum_children(acb);
+    }
+
+    acb->child_iter = 0;
+    return read_fifo_child(acb);
+}
+
 static BlockDriverAIOCB *quorum_aio_writev(BlockDriverState *bs,
                                           int64_t sector_num,
                                           QEMUIOVector *qiov,
@@ -782,10 +833,33 @@ static QemuOptsList quorum_runtime_opts = {
             .type = QEMU_OPT_BOOL,
             .help = "Rewrite corrupted block on read quorum",
         },
+        {
+            .name = QUORUM_OPT_READ_PATTERN,
+            .type = QEMU_OPT_STRING,
+            .help = "Allowed pattern: quorum, fifo. Quorum is default",
+        },
         { /* end of list */ }
     },
 };
 
+static int parse_read_pattern(const char *opt)
+{
+    int i;
+
+    if (!opt) {
+        /* Set quorum as default */
+        return QUORUM_READ_PATTERN_QUORUM;
+    }
+
+    for (i = 0; i < QUORUM_READ_PATTERN_MAX; i++) {
+        if (!strcmp(opt, QuorumReadPattern_lookup[i])) {
+            return i;
+        }
+    }
+
+    return -EINVAL;
+}
+
 static int quorum_open(BlockDriverState *bs, QDict *options, int flags,
                        Error **errp)
 {
@@ -827,28 +901,37 @@ static int quorum_open(BlockDriverState *bs, QDict *options, int flags,
     }
 
     s->threshold = qemu_opt_get_number(opts, QUORUM_OPT_VOTE_THRESHOLD, 0);
-
-    /* and validate it against s->num_children */
-    ret = quorum_valid_threshold(s->threshold, s->num_children, &local_err);
+    ret = parse_read_pattern(qemu_opt_get(opts, QUORUM_OPT_READ_PATTERN));
     if (ret < 0) {
+        error_setg(&local_err, "Please set read-pattern as fifo or quorum\n");
         goto exit;
     }
+    s->read_pattern = ret;
 
-    /* is the driver in blkverify mode */
-    if (qemu_opt_get_bool(opts, QUORUM_OPT_BLKVERIFY, false) &&
-        s->num_children == 2 && s->threshold == 2) {
-        s->is_blkverify = true;
-    } else if (qemu_opt_get_bool(opts, QUORUM_OPT_BLKVERIFY, false)) {
-        fprintf(stderr, "blkverify mode is set by setting blkverify=on "
-                "and using two files with vote_threshold=2\n");
-    }
+    if (s->read_pattern == QUORUM_READ_PATTERN_QUORUM) {
+        /* and validate it against s->num_children */
+        ret = quorum_valid_threshold(s->threshold, s->num_children, &local_err);
+        if (ret < 0) {
+            goto exit;
+        }
 
-    s->rewrite_corrupted = qemu_opt_get_bool(opts, QUORUM_OPT_REWRITE, false);
-    if (s->rewrite_corrupted && s->is_blkverify) {
-        error_setg(&local_err,
-                   "rewrite-corrupted=on cannot be used with blkverify=on");
-        ret = -EINVAL;
-        goto exit;
+        /* is the driver in blkverify mode */
+        if (qemu_opt_get_bool(opts, QUORUM_OPT_BLKVERIFY, false) &&
+            s->num_children == 2 && s->threshold == 2) {
+            s->is_blkverify = true;
+        } else if (qemu_opt_get_bool(opts, QUORUM_OPT_BLKVERIFY, false)) {
+            fprintf(stderr, "blkverify mode is set by setting blkverify=on "
+                    "and using two files with vote_threshold=2\n");
+        }
+
+        s->rewrite_corrupted = qemu_opt_get_bool(opts, QUORUM_OPT_REWRITE,
+                                                 false);
+        if (s->rewrite_corrupted && s->is_blkverify) {
+            error_setg(&local_err,
+                       "rewrite-corrupted=on cannot be used with blkverify=on");
+            ret = -EINVAL;
+            goto exit;
+        }
     }
 
     /* allocate the children BlockDriverState array */
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [Qemu-devel] [PATCH v4 2/2] block/quorum: add simple read pattern support
  2014-07-17  5:18 ` [Qemu-devel] [PATCH v4 2/2] block/quorum: add simple read pattern support Liu Yuan
@ 2014-08-11  7:18   ` Liu Yuan
  2014-08-11 12:31   ` Benoît Canet
  2014-08-14 11:09   ` Benoît Canet
  2 siblings, 0 replies; 10+ messages in thread
From: Liu Yuan @ 2014-08-11  7:18 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, Benoit Canet, Stefan Hajnoczi

On Thu, Jul 17, 2014 at 01:18:56PM +0800, Liu Yuan wrote:
> This patch adds single read pattern to quorum driver and quorum vote is default
> pattern.
> 
> For now we do a quorum vote on all the reads, it is designed for unreliable
> underlying storage such as non-redundant NFS to make sure data integrity at the
> cost of the read performance.
> 
> For some use cases as following:
> 
>         VM
>   --------------
>   |            |
>   v            v
>   A            B
> 
> Both A and B has hardware raid storage to justify the data integrity on its own.
> So it would help performance if we do a single read instead of on all the nodes.
> Further, if we run VM on either of the storage node, we can make a local read
> request for better performance.
> 
> This patch generalize the above 2 nodes case in the N nodes. That is,
> 
> vm -> write to all the N nodes, read just one of them. If single read fails, we
> try to read next node in FIFO order specified by the startup command.
> 
> The 2 nodes case is very similar to DRBD[1] though lack of auto-sync
> functionality in the single device/node failure for now. But compared with DRBD
> we still have some advantages over it:
> 
> - Suppose we have 20 VMs running on one(assume A) of 2 nodes' DRBD backed
> storage. And if A crashes, we need to restart all the VMs on node B. But for
> practice case, we can't because B might not have enough resources to setup 20 VMs
> at once. So if we run our 20 VMs with quorum driver, and scatter the replicated
> images over the data center, we can very likely restart 20 VMs without any
> resource problem.
> 
> After all, I think we can build a more powerful replicated image functionality
> on quorum and block jobs(block mirror) to meet various High Availibility needs.
> 
> E.g, Enable single read pattern on 2 children,
> 
> -drive driver=quorum,children.0.file.filename=0.qcow2,\
> children.1.file.filename=1.qcow2,read-pattern=fifo,vote-threshold=1
> 
> [1] http://en.wikipedia.org/wiki/Distributed_Replicated_Block_Device
> 
> Cc: Benoit Canet <benoit@irqsave.net>
> Cc: Eric Blake <eblake@redhat.com>
> Cc: Kevin Wolf <kwolf@redhat.com>
> Cc: Stefan Hajnoczi <stefanha@redhat.com>
> Signed-off-by: Liu Yuan <namei.unix@gmail.com>
> ---
>  block/quorum.c | 179 +++++++++++++++++++++++++++++++++++++++++----------------
>  1 file changed, 131 insertions(+), 48 deletions(-)
> 
> diff --git a/block/quorum.c b/block/quorum.c
> index d5ee9c0..ebf5c71 100644
> --- a/block/quorum.c
> +++ b/block/quorum.c
> @@ -24,6 +24,7 @@
>  #define QUORUM_OPT_VOTE_THRESHOLD "vote-threshold"
>  #define QUORUM_OPT_BLKVERIFY      "blkverify"
>  #define QUORUM_OPT_REWRITE        "rewrite-corrupted"
> +#define QUORUM_OPT_READ_PATTERN   "read-pattern"
>  
>  /* This union holds a vote hash value */
>  typedef union QuorumVoteValue {
> @@ -74,6 +75,8 @@ typedef struct BDRVQuorumState {
>      bool rewrite_corrupted;/* true if the driver must rewrite-on-read corrupted
>                              * block if Quorum is reached.
>                              */
> +
> +    QuorumReadPattern read_pattern;
>  } BDRVQuorumState;
>  
>  typedef struct QuorumAIOCB QuorumAIOCB;
> @@ -117,6 +120,7 @@ struct QuorumAIOCB {
>  
>      bool is_read;
>      int vote_ret;
> +    int child_iter;             /* which child to read in fifo pattern */
>  };
>  
>  static bool quorum_vote(QuorumAIOCB *acb);
> @@ -154,8 +158,10 @@ static void quorum_aio_finalize(QuorumAIOCB *acb)
>  
>      if (acb->is_read) {
>          for (i = 0; i < s->num_children; i++) {
> -            qemu_vfree(acb->qcrs[i].buf);
> -            qemu_iovec_destroy(&acb->qcrs[i].qiov);
> +            if (i <= acb->child_iter) {
> +                qemu_vfree(acb->qcrs[i].buf);
> +                qemu_iovec_destroy(&acb->qcrs[i].qiov);
> +            }
>          }
>      }
>  
> @@ -256,6 +262,21 @@ static void quorum_rewrite_aio_cb(void *opaque, int ret)
>      quorum_aio_finalize(acb);
>  }
>  
> +static BlockDriverAIOCB *read_fifo_child(QuorumAIOCB *acb);
> +
> +static void quorum_copy_qiov(QEMUIOVector *dest, QEMUIOVector *source)
> +{
> +    int i;
> +    assert(dest->niov == source->niov);
> +    assert(dest->size == source->size);
> +    for (i = 0; i < source->niov; i++) {
> +        assert(dest->iov[i].iov_len == source->iov[i].iov_len);
> +        memcpy(dest->iov[i].iov_base,
> +               source->iov[i].iov_base,
> +               source->iov[i].iov_len);
> +    }
> +}
> +
>  static void quorum_aio_cb(void *opaque, int ret)
>  {
>      QuorumChildRequest *sacb = opaque;
> @@ -263,6 +284,21 @@ static void quorum_aio_cb(void *opaque, int ret)
>      BDRVQuorumState *s = acb->common.bs->opaque;
>      bool rewrite = false;
>  
> +    if (acb->is_read && s->read_pattern == QUORUM_READ_PATTERN_FIFO) {
> +        /* We try to read next child in FIFO order if we fail to read */
> +        if (ret < 0 && ++acb->child_iter < s->num_children) {
> +            read_fifo_child(acb);
> +            return;
> +        }
> +
> +        if (ret == 0) {
> +            quorum_copy_qiov(acb->qiov, &acb->qcrs[acb->child_iter].qiov);
> +        }
> +        acb->vote_ret = ret;
> +        quorum_aio_finalize(acb);
> +        return;
> +    }
> +
>      sacb->ret = ret;
>      acb->count++;
>      if (ret == 0) {
> @@ -343,19 +379,6 @@ static bool quorum_rewrite_bad_versions(BDRVQuorumState *s, QuorumAIOCB *acb,
>      return count;
>  }
>  
> -static void quorum_copy_qiov(QEMUIOVector *dest, QEMUIOVector *source)
> -{
> -    int i;
> -    assert(dest->niov == source->niov);
> -    assert(dest->size == source->size);
> -    for (i = 0; i < source->niov; i++) {
> -        assert(dest->iov[i].iov_len == source->iov[i].iov_len);
> -        memcpy(dest->iov[i].iov_base,
> -               source->iov[i].iov_base,
> -               source->iov[i].iov_len);
> -    }
> -}
> -
>  static void quorum_count_vote(QuorumVotes *votes,
>                                QuorumVoteValue *value,
>                                int index)
> @@ -615,34 +638,62 @@ free_exit:
>      return rewrite;
>  }
>  
> -static BlockDriverAIOCB *quorum_aio_readv(BlockDriverState *bs,
> -                                         int64_t sector_num,
> -                                         QEMUIOVector *qiov,
> -                                         int nb_sectors,
> -                                         BlockDriverCompletionFunc *cb,
> -                                         void *opaque)
> +static BlockDriverAIOCB *read_quorum_children(QuorumAIOCB *acb)
>  {
> -    BDRVQuorumState *s = bs->opaque;
> -    QuorumAIOCB *acb = quorum_aio_get(s, bs, qiov, sector_num,
> -                                      nb_sectors, cb, opaque);
> +    BDRVQuorumState *s = acb->common.bs->opaque;
>      int i;
>  
> -    acb->is_read = true;
> -
>      for (i = 0; i < s->num_children; i++) {
> -        acb->qcrs[i].buf = qemu_blockalign(s->bs[i], qiov->size);
> -        qemu_iovec_init(&acb->qcrs[i].qiov, qiov->niov);
> -        qemu_iovec_clone(&acb->qcrs[i].qiov, qiov, acb->qcrs[i].buf);
> +        acb->qcrs[i].buf = qemu_blockalign(s->bs[i], acb->qiov->size);
> +        qemu_iovec_init(&acb->qcrs[i].qiov, acb->qiov->niov);
> +        qemu_iovec_clone(&acb->qcrs[i].qiov, acb->qiov, acb->qcrs[i].buf);
>      }
>  
>      for (i = 0; i < s->num_children; i++) {
> -        bdrv_aio_readv(s->bs[i], sector_num, &acb->qcrs[i].qiov, nb_sectors,
> -                       quorum_aio_cb, &acb->qcrs[i]);
> +        bdrv_aio_readv(s->bs[i], acb->sector_num, &acb->qcrs[i].qiov,
> +                       acb->nb_sectors, quorum_aio_cb, &acb->qcrs[i]);
>      }
>  
>      return &acb->common;
>  }
>  
> +static BlockDriverAIOCB *read_fifo_child(QuorumAIOCB *acb)
> +{
> +    BDRVQuorumState *s = acb->common.bs->opaque;
> +
> +    acb->qcrs[acb->child_iter].buf = qemu_blockalign(s->bs[acb->child_iter],
> +                                                     acb->qiov->size);
> +    qemu_iovec_init(&acb->qcrs[acb->child_iter].qiov, acb->qiov->niov);
> +    qemu_iovec_clone(&acb->qcrs[acb->child_iter].qiov, acb->qiov,
> +                     acb->qcrs[acb->child_iter].buf);
> +    bdrv_aio_readv(s->bs[acb->child_iter], acb->sector_num,
> +                   &acb->qcrs[acb->child_iter].qiov, acb->nb_sectors,
> +                   quorum_aio_cb, &acb->qcrs[acb->child_iter]);
> +
> +    return &acb->common;
> +}
> +
> +static BlockDriverAIOCB *quorum_aio_readv(BlockDriverState *bs,
> +                                          int64_t sector_num,
> +                                          QEMUIOVector *qiov,
> +                                          int nb_sectors,
> +                                          BlockDriverCompletionFunc *cb,
> +                                          void *opaque)
> +{
> +    BDRVQuorumState *s = bs->opaque;
> +    QuorumAIOCB *acb = quorum_aio_get(s, bs, qiov, sector_num,
> +                                      nb_sectors, cb, opaque);
> +    acb->is_read = true;
> +
> +    if (s->read_pattern == QUORUM_READ_PATTERN_QUORUM) {
> +        acb->child_iter = s->num_children - 1;
> +        return read_quorum_children(acb);
> +    }
> +
> +    acb->child_iter = 0;
> +    return read_fifo_child(acb);
> +}
> +
>  static BlockDriverAIOCB *quorum_aio_writev(BlockDriverState *bs,
>                                            int64_t sector_num,
>                                            QEMUIOVector *qiov,
> @@ -782,10 +833,33 @@ static QemuOptsList quorum_runtime_opts = {
>              .type = QEMU_OPT_BOOL,
>              .help = "Rewrite corrupted block on read quorum",
>          },
> +        {
> +            .name = QUORUM_OPT_READ_PATTERN,
> +            .type = QEMU_OPT_STRING,
> +            .help = "Allowed pattern: quorum, fifo. Quorum is default",
> +        },
>          { /* end of list */ }
>      },
>  };
>  
> +static int parse_read_pattern(const char *opt)
> +{
> +    int i;
> +
> +    if (!opt) {
> +        /* Set quorum as default */
> +        return QUORUM_READ_PATTERN_QUORUM;
> +    }
> +
> +    for (i = 0; i < QUORUM_READ_PATTERN_MAX; i++) {
> +        if (!strcmp(opt, QuorumReadPattern_lookup[i])) {
> +            return i;
> +        }
> +    }
> +
> +    return -EINVAL;
> +}
> +
>  static int quorum_open(BlockDriverState *bs, QDict *options, int flags,
>                         Error **errp)
>  {
> @@ -827,28 +901,37 @@ static int quorum_open(BlockDriverState *bs, QDict *options, int flags,
>      }
>  
>      s->threshold = qemu_opt_get_number(opts, QUORUM_OPT_VOTE_THRESHOLD, 0);
> -
> -    /* and validate it against s->num_children */
> -    ret = quorum_valid_threshold(s->threshold, s->num_children, &local_err);
> +    ret = parse_read_pattern(qemu_opt_get(opts, QUORUM_OPT_READ_PATTERN));
>      if (ret < 0) {
> +        error_setg(&local_err, "Please set read-pattern as fifo or quorum\n");
>          goto exit;
>      }
> +    s->read_pattern = ret;
>  
> -    /* is the driver in blkverify mode */
> -    if (qemu_opt_get_bool(opts, QUORUM_OPT_BLKVERIFY, false) &&
> -        s->num_children == 2 && s->threshold == 2) {
> -        s->is_blkverify = true;
> -    } else if (qemu_opt_get_bool(opts, QUORUM_OPT_BLKVERIFY, false)) {
> -        fprintf(stderr, "blkverify mode is set by setting blkverify=on "
> -                "and using two files with vote_threshold=2\n");
> -    }
> +    if (s->read_pattern == QUORUM_READ_PATTERN_QUORUM) {
> +        /* and validate it against s->num_children */
> +        ret = quorum_valid_threshold(s->threshold, s->num_children, &local_err);
> +        if (ret < 0) {
> +            goto exit;
> +        }
>  
> -    s->rewrite_corrupted = qemu_opt_get_bool(opts, QUORUM_OPT_REWRITE, false);
> -    if (s->rewrite_corrupted && s->is_blkverify) {
> -        error_setg(&local_err,
> -                   "rewrite-corrupted=on cannot be used with blkverify=on");
> -        ret = -EINVAL;
> -        goto exit;
> +        /* is the driver in blkverify mode */
> +        if (qemu_opt_get_bool(opts, QUORUM_OPT_BLKVERIFY, false) &&
> +            s->num_children == 2 && s->threshold == 2) {
> +            s->is_blkverify = true;
> +        } else if (qemu_opt_get_bool(opts, QUORUM_OPT_BLKVERIFY, false)) {
> +            fprintf(stderr, "blkverify mode is set by setting blkverify=on "
> +                    "and using two files with vote_threshold=2\n");
> +        }
> +
> +        s->rewrite_corrupted = qemu_opt_get_bool(opts, QUORUM_OPT_REWRITE,
> +                                                 false);
> +        if (s->rewrite_corrupted && s->is_blkverify) {
> +            error_setg(&local_err,
> +                       "rewrite-corrupted=on cannot be used with blkverify=on");
> +            ret = -EINVAL;
> +            goto exit;
> +        }
>      }
>  
>      /* allocate the children BlockDriverState array */
> -- 
> 1.9.1
> 

Anybody? It might be got ignored :(...

Thanks
Yuan

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Qemu-devel] [PATCH v4 2/2] block/quorum: add simple read pattern support
  2014-07-17  5:18 ` [Qemu-devel] [PATCH v4 2/2] block/quorum: add simple read pattern support Liu Yuan
  2014-08-11  7:18   ` Liu Yuan
@ 2014-08-11 12:31   ` Benoît Canet
  2014-08-12  2:41     ` Liu Yuan
  2014-08-14 11:09   ` Benoît Canet
  2 siblings, 1 reply; 10+ messages in thread
From: Benoît Canet @ 2014-08-11 12:31 UTC (permalink / raw)
  To: Liu Yuan; +Cc: Kevin Wolf, qemu-devel, Stefan Hajnoczi

The Thursday 17 Jul 2014 à 13:18:56 (+0800), Liu Yuan wrote :
> This patch adds single read pattern to quorum driver and quorum vote is default
> pattern.
> 
> For now we do a quorum vote on all the reads, it is designed for unreliable
> underlying storage such as non-redundant NFS to make sure data integrity at the
> cost of the read performance.
> 
> For some use cases as following:
> 
>         VM
>   --------------
>   |            |
>   v            v
>   A            B
> 
> Both A and B has hardware raid storage to justify the data integrity on its own.
> So it would help performance if we do a single read instead of on all the nodes.
> Further, if we run VM on either of the storage node, we can make a local read
> request for better performance.
> 
> This patch generalize the above 2 nodes case in the N nodes. That is,
> 
> vm -> write to all the N nodes, read just one of them. If single read fails, we
> try to read next node in FIFO order specified by the startup command.
> 
> The 2 nodes case is very similar to DRBD[1] though lack of auto-sync
> functionality in the single device/node failure for now. But compared with DRBD
> we still have some advantages over it:
> 
> - Suppose we have 20 VMs running on one(assume A) of 2 nodes' DRBD backed
> storage. And if A crashes, we need to restart all the VMs on node B. But for
> practice case, we can't because B might not have enough resources to setup 20 VMs
> at once. So if we run our 20 VMs with quorum driver, and scatter the replicated
> images over the data center, we can very likely restart 20 VMs without any
> resource problem.
> 
> After all, I think we can build a more powerful replicated image functionality
> on quorum and block jobs(block mirror) to meet various High Availibility needs.
> 
> E.g, Enable single read pattern on 2 children,
> 
> -drive driver=quorum,children.0.file.filename=0.qcow2,\
> children.1.file.filename=1.qcow2,read-pattern=fifo,vote-threshold=1
> 
> [1] http://en.wikipedia.org/wiki/Distributed_Replicated_Block_Device
> 
> Cc: Benoit Canet <benoit@irqsave.net>
> Cc: Eric Blake <eblake@redhat.com>
> Cc: Kevin Wolf <kwolf@redhat.com>
> Cc: Stefan Hajnoczi <stefanha@redhat.com>
> Signed-off-by: Liu Yuan <namei.unix@gmail.com>
> ---
>  block/quorum.c | 179 +++++++++++++++++++++++++++++++++++++++++----------------
>  1 file changed, 131 insertions(+), 48 deletions(-)
> 
> diff --git a/block/quorum.c b/block/quorum.c
> index d5ee9c0..ebf5c71 100644
> --- a/block/quorum.c
> +++ b/block/quorum.c
> @@ -24,6 +24,7 @@
>  #define QUORUM_OPT_VOTE_THRESHOLD "vote-threshold"
>  #define QUORUM_OPT_BLKVERIFY      "blkverify"
>  #define QUORUM_OPT_REWRITE        "rewrite-corrupted"
> +#define QUORUM_OPT_READ_PATTERN   "read-pattern"
>  
>  /* This union holds a vote hash value */
>  typedef union QuorumVoteValue {
> @@ -74,6 +75,8 @@ typedef struct BDRVQuorumState {
>      bool rewrite_corrupted;/* true if the driver must rewrite-on-read corrupted
>                              * block if Quorum is reached.
>                              */
> +
> +    QuorumReadPattern read_pattern;
>  } BDRVQuorumState;
>  
>  typedef struct QuorumAIOCB QuorumAIOCB;
> @@ -117,6 +120,7 @@ struct QuorumAIOCB {
>  
>      bool is_read;
>      int vote_ret;
> +    int child_iter;             /* which child to read in fifo pattern */
>  };
>  
>  static bool quorum_vote(QuorumAIOCB *acb);
> @@ -154,8 +158,10 @@ static void quorum_aio_finalize(QuorumAIOCB *acb)
>  
>      if (acb->is_read) {
>          for (i = 0; i < s->num_children; i++) {
> -            qemu_vfree(acb->qcrs[i].buf);
> -            qemu_iovec_destroy(&acb->qcrs[i].qiov);
> +            if (i <= acb->child_iter) {
> +                qemu_vfree(acb->qcrs[i].buf);
> +                qemu_iovec_destroy(&acb->qcrs[i].qiov);
> +            }
>          }
>      }
>  
> @@ -256,6 +262,21 @@ static void quorum_rewrite_aio_cb(void *opaque, int ret)
>      quorum_aio_finalize(acb);
>  }
>  
> +static BlockDriverAIOCB *read_fifo_child(QuorumAIOCB *acb);
> +
> +static void quorum_copy_qiov(QEMUIOVector *dest, QEMUIOVector *source)
> +{
> +    int i;
> +    assert(dest->niov == source->niov);
> +    assert(dest->size == source->size);
> +    for (i = 0; i < source->niov; i++) {
> +        assert(dest->iov[i].iov_len == source->iov[i].iov_len);
> +        memcpy(dest->iov[i].iov_base,
> +               source->iov[i].iov_base,
> +               source->iov[i].iov_len);
> +    }
> +}
> +
>  static void quorum_aio_cb(void *opaque, int ret)
>  {
>      QuorumChildRequest *sacb = opaque;
> @@ -263,6 +284,21 @@ static void quorum_aio_cb(void *opaque, int ret)
>      BDRVQuorumState *s = acb->common.bs->opaque;
>      bool rewrite = false;
>  
> +    if (acb->is_read && s->read_pattern == QUORUM_READ_PATTERN_FIFO) {
> +        /* We try to read next child in FIFO order if we fail to read */
> +        if (ret < 0 && ++acb->child_iter < s->num_children) {
> +            read_fifo_child(acb);
> +            return;
> +        }
> +
> +        if (ret == 0) {
> +            quorum_copy_qiov(acb->qiov, &acb->qcrs[acb->child_iter].qiov);
> +        }
> +        acb->vote_ret = ret;
> +        quorum_aio_finalize(acb);
> +        return;
> +    }
> +
>      sacb->ret = ret;
>      acb->count++;
>      if (ret == 0) {
> @@ -343,19 +379,6 @@ static bool quorum_rewrite_bad_versions(BDRVQuorumState *s, QuorumAIOCB *acb,
>      return count;
>  }
>  
> -static void quorum_copy_qiov(QEMUIOVector *dest, QEMUIOVector *source)
> -{
> -    int i;
> -    assert(dest->niov == source->niov);
> -    assert(dest->size == source->size);
> -    for (i = 0; i < source->niov; i++) {
> -        assert(dest->iov[i].iov_len == source->iov[i].iov_len);
> -        memcpy(dest->iov[i].iov_base,
> -               source->iov[i].iov_base,
> -               source->iov[i].iov_len);
> -    }
> -}
> -
>  static void quorum_count_vote(QuorumVotes *votes,
>                                QuorumVoteValue *value,
>                                int index)
> @@ -615,34 +638,62 @@ free_exit:
>      return rewrite;
>  }
>  
> -static BlockDriverAIOCB *quorum_aio_readv(BlockDriverState *bs,
> -                                         int64_t sector_num,
> -                                         QEMUIOVector *qiov,
> -                                         int nb_sectors,
> -                                         BlockDriverCompletionFunc *cb,
> -                                         void *opaque)
> +static BlockDriverAIOCB *read_quorum_children(QuorumAIOCB *acb)
>  {
> -    BDRVQuorumState *s = bs->opaque;
> -    QuorumAIOCB *acb = quorum_aio_get(s, bs, qiov, sector_num,
> -                                      nb_sectors, cb, opaque);
> +    BDRVQuorumState *s = acb->common.bs->opaque;
>      int i;
>  
> -    acb->is_read = true;
> -
>      for (i = 0; i < s->num_children; i++) {
> -        acb->qcrs[i].buf = qemu_blockalign(s->bs[i], qiov->size);
> -        qemu_iovec_init(&acb->qcrs[i].qiov, qiov->niov);
> -        qemu_iovec_clone(&acb->qcrs[i].qiov, qiov, acb->qcrs[i].buf);
> +        acb->qcrs[i].buf = qemu_blockalign(s->bs[i], acb->qiov->size);
> +        qemu_iovec_init(&acb->qcrs[i].qiov, acb->qiov->niov);
> +        qemu_iovec_clone(&acb->qcrs[i].qiov, acb->qiov, acb->qcrs[i].buf);
>      }
>  
>      for (i = 0; i < s->num_children; i++) {
> -        bdrv_aio_readv(s->bs[i], sector_num, &acb->qcrs[i].qiov, nb_sectors,
> -                       quorum_aio_cb, &acb->qcrs[i]);
> +        bdrv_aio_readv(s->bs[i], acb->sector_num, &acb->qcrs[i].qiov,
> +                       acb->nb_sectors, quorum_aio_cb, &acb->qcrs[i]);
>      }
>  
>      return &acb->common;
>  }
>  
> +static BlockDriverAIOCB *read_fifo_child(QuorumAIOCB *acb)
> +{
> +    BDRVQuorumState *s = acb->common.bs->opaque;
> +
> +    acb->qcrs[acb->child_iter].buf = qemu_blockalign(s->bs[acb->child_iter],
> +                                                     acb->qiov->size);
> +    qemu_iovec_init(&acb->qcrs[acb->child_iter].qiov, acb->qiov->niov);
> +    qemu_iovec_clone(&acb->qcrs[acb->child_iter].qiov, acb->qiov,
> +                     acb->qcrs[acb->child_iter].buf);
> +    bdrv_aio_readv(s->bs[acb->child_iter], acb->sector_num,
> +                   &acb->qcrs[acb->child_iter].qiov, acb->nb_sectors,
> +                   quorum_aio_cb, &acb->qcrs[acb->child_iter]);
> +
> +    return &acb->common;
> +}
> +
> +static BlockDriverAIOCB *quorum_aio_readv(BlockDriverState *bs,
> +                                          int64_t sector_num,
> +                                          QEMUIOVector *qiov,
> +                                          int nb_sectors,
> +                                          BlockDriverCompletionFunc *cb,
> +                                          void *opaque)
> +{
> +    BDRVQuorumState *s = bs->opaque;
> +    QuorumAIOCB *acb = quorum_aio_get(s, bs, qiov, sector_num,
> +                                      nb_sectors, cb, opaque);
> +    acb->is_read = true;
> +
> +    if (s->read_pattern == QUORUM_READ_PATTERN_QUORUM) {
> +        acb->child_iter = s->num_children - 1;
> +        return read_quorum_children(acb);
> +    }
> +
> +    acb->child_iter = 0;
> +    return read_fifo_child(acb);
> +}
> +
>  static BlockDriverAIOCB *quorum_aio_writev(BlockDriverState *bs,
>                                            int64_t sector_num,
>                                            QEMUIOVector *qiov,
> @@ -782,10 +833,33 @@ static QemuOptsList quorum_runtime_opts = {
>              .type = QEMU_OPT_BOOL,
>              .help = "Rewrite corrupted block on read quorum",
>          },
> +        {
> +            .name = QUORUM_OPT_READ_PATTERN,
> +            .type = QEMU_OPT_STRING,
> +            .help = "Allowed pattern: quorum, fifo. Quorum is default",
> +        },
>          { /* end of list */ }
>      },
>  };
>  
> +static int parse_read_pattern(const char *opt)
> +{
> +    int i;
> +
> +    if (!opt) {
> +        /* Set quorum as default */
> +        return QUORUM_READ_PATTERN_QUORUM;
> +    }
> +
> +    for (i = 0; i < QUORUM_READ_PATTERN_MAX; i++) {
> +        if (!strcmp(opt, QuorumReadPattern_lookup[i])) {
> +            return i;
> +        }
> +    }
> +
> +    return -EINVAL;
> +}
> +
>  static int quorum_open(BlockDriverState *bs, QDict *options, int flags,
>                         Error **errp)
>  {
> @@ -827,28 +901,37 @@ static int quorum_open(BlockDriverState *bs, QDict *options, int flags,
>      }
>  
>      s->threshold = qemu_opt_get_number(opts, QUORUM_OPT_VOTE_THRESHOLD, 0);
> -
> -    /* and validate it against s->num_children */
> -    ret = quorum_valid_threshold(s->threshold, s->num_children, &local_err);
> +    ret = parse_read_pattern(qemu_opt_get(opts, QUORUM_OPT_READ_PATTERN));
>      if (ret < 0) {
> +        error_setg(&local_err, "Please set read-pattern as fifo or quorum\n");
>          goto exit;
>      }
> +    s->read_pattern = ret;
>  
> -    /* is the driver in blkverify mode */
> -    if (qemu_opt_get_bool(opts, QUORUM_OPT_BLKVERIFY, false) &&
> -        s->num_children == 2 && s->threshold == 2) {
> -        s->is_blkverify = true;
> -    } else if (qemu_opt_get_bool(opts, QUORUM_OPT_BLKVERIFY, false)) {
> -        fprintf(stderr, "blkverify mode is set by setting blkverify=on "
> -                "and using two files with vote_threshold=2\n");
> -    }
> +    if (s->read_pattern == QUORUM_READ_PATTERN_QUORUM) {
> +        /* and validate it against s->num_children */
> +        ret = quorum_valid_threshold(s->threshold, s->num_children, &local_err);
> +        if (ret < 0) {
> +            goto exit;
> +        }
>  
> -    s->rewrite_corrupted = qemu_opt_get_bool(opts, QUORUM_OPT_REWRITE, false);
> -    if (s->rewrite_corrupted && s->is_blkverify) {
> -        error_setg(&local_err,
> -                   "rewrite-corrupted=on cannot be used with blkverify=on");
> -        ret = -EINVAL;
> -        goto exit;
> +        /* is the driver in blkverify mode */
> +        if (qemu_opt_get_bool(opts, QUORUM_OPT_BLKVERIFY, false) &&
> +            s->num_children == 2 && s->threshold == 2) {
> +            s->is_blkverify = true;
> +        } else if (qemu_opt_get_bool(opts, QUORUM_OPT_BLKVERIFY, false)) {
> +            fprintf(stderr, "blkverify mode is set by setting blkverify=on "
> +                    "and using two files with vote_threshold=2\n");
> +        }
> +
> +        s->rewrite_corrupted = qemu_opt_get_bool(opts, QUORUM_OPT_REWRITE,
> +                                                 false);
> +        if (s->rewrite_corrupted && s->is_blkverify) {
> +            error_setg(&local_err,
> +                       "rewrite-corrupted=on cannot be used with blkverify=on");
> +            ret = -EINVAL;
> +            goto exit;
> +        }
>      }
>  
>      /* allocate the children BlockDriverState array */
> -- 
> 1.9.1
> 

I think I responded you with comments signaling a potential memory leak but you missed it.

Best regards

Benoît

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Qemu-devel] [PATCH v4 1/2] qapi: add read-pattern enum for quorum
  2014-07-17  5:18 ` [Qemu-devel] [PATCH v4 1/2] qapi: add read-pattern enum for quorum Liu Yuan
@ 2014-08-11 16:42   ` Eric Blake
  0 siblings, 0 replies; 10+ messages in thread
From: Eric Blake @ 2014-08-11 16:42 UTC (permalink / raw)
  To: Liu Yuan, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 401 bytes --]

On 07/16/2014 11:18 PM, Liu Yuan wrote:
> Cc: Eric Blake <eblake@redhat.com>
> Signed-off-by: Liu Yuan <namei.unix@gmail.com>
> ---
>  qapi/block-core.json | 20 +++++++++++++++++++-
>  1 file changed, 19 insertions(+), 1 deletion(-)
> 

Reviewed-by: Eric Blake <eblake@redhat.com>

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 539 bytes --]

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Qemu-devel] [PATCH v4 2/2] block/quorum: add simple read pattern support
  2014-08-11 12:31   ` Benoît Canet
@ 2014-08-12  2:41     ` Liu Yuan
  2014-08-14  7:01       ` Liu Yuan
  0 siblings, 1 reply; 10+ messages in thread
From: Liu Yuan @ 2014-08-12  2:41 UTC (permalink / raw)
  To: Benoît Canet; +Cc: Kevin Wolf, qemu-devel, Stefan Hajnoczi

On Mon, Aug 11, 2014 at 02:31:43PM +0200, Benoît Canet wrote:
> The Thursday 17 Jul 2014 à 13:18:56 (+0800), Liu Yuan wrote :
> > This patch adds single read pattern to quorum driver and quorum vote is default
> > pattern.
> > 
> > For now we do a quorum vote on all the reads, it is designed for unreliable
> > underlying storage such as non-redundant NFS to make sure data integrity at the
> > cost of the read performance.
> > 
> > For some use cases as following:
> > 
> >         VM
> >   --------------
> >   |            |
> >   v            v
> >   A            B
> > 
> > Both A and B has hardware raid storage to justify the data integrity on its own.
> > So it would help performance if we do a single read instead of on all the nodes.
> > Further, if we run VM on either of the storage node, we can make a local read
> > request for better performance.
> > 
> > This patch generalize the above 2 nodes case in the N nodes. That is,
> > 
> > vm -> write to all the N nodes, read just one of them. If single read fails, we
> > try to read next node in FIFO order specified by the startup command.
> > 
> > The 2 nodes case is very similar to DRBD[1] though lack of auto-sync
> > functionality in the single device/node failure for now. But compared with DRBD
> > we still have some advantages over it:
> > 
> > - Suppose we have 20 VMs running on one(assume A) of 2 nodes' DRBD backed
> > storage. And if A crashes, we need to restart all the VMs on node B. But for
> > practice case, we can't because B might not have enough resources to setup 20 VMs
> > at once. So if we run our 20 VMs with quorum driver, and scatter the replicated
> > images over the data center, we can very likely restart 20 VMs without any
> > resource problem.
> > 
> > After all, I think we can build a more powerful replicated image functionality
> > on quorum and block jobs(block mirror) to meet various High Availibility needs.
> > 
> > E.g, Enable single read pattern on 2 children,
> > 
> > -drive driver=quorum,children.0.file.filename=0.qcow2,\
> > children.1.file.filename=1.qcow2,read-pattern=fifo,vote-threshold=1
> > 
> > [1] http://en.wikipedia.org/wiki/Distributed_Replicated_Block_Device
> > 
> > Cc: Benoit Canet <benoit@irqsave.net>
> > Cc: Eric Blake <eblake@redhat.com>
> > Cc: Kevin Wolf <kwolf@redhat.com>
> > Cc: Stefan Hajnoczi <stefanha@redhat.com>
> > Signed-off-by: Liu Yuan <namei.unix@gmail.com>
> > ---
> >  block/quorum.c | 179 +++++++++++++++++++++++++++++++++++++++++----------------
> >  1 file changed, 131 insertions(+), 48 deletions(-)
> > 
> > diff --git a/block/quorum.c b/block/quorum.c
> > index d5ee9c0..ebf5c71 100644
> > --- a/block/quorum.c
> > +++ b/block/quorum.c
> > @@ -24,6 +24,7 @@
> >  #define QUORUM_OPT_VOTE_THRESHOLD "vote-threshold"
> >  #define QUORUM_OPT_BLKVERIFY      "blkverify"
> >  #define QUORUM_OPT_REWRITE        "rewrite-corrupted"
> > +#define QUORUM_OPT_READ_PATTERN   "read-pattern"
> >  
> >  /* This union holds a vote hash value */
> >  typedef union QuorumVoteValue {
> > @@ -74,6 +75,8 @@ typedef struct BDRVQuorumState {
> >      bool rewrite_corrupted;/* true if the driver must rewrite-on-read corrupted
> >                              * block if Quorum is reached.
> >                              */
> > +
> > +    QuorumReadPattern read_pattern;
> >  } BDRVQuorumState;
> >  
> >  typedef struct QuorumAIOCB QuorumAIOCB;
> > @@ -117,6 +120,7 @@ struct QuorumAIOCB {
> >  
> >      bool is_read;
> >      int vote_ret;
> > +    int child_iter;             /* which child to read in fifo pattern */
> >  };
> >  
> >  static bool quorum_vote(QuorumAIOCB *acb);
> > @@ -154,8 +158,10 @@ static void quorum_aio_finalize(QuorumAIOCB *acb)
> >  
> >      if (acb->is_read) {
> >          for (i = 0; i < s->num_children; i++) {
> > -            qemu_vfree(acb->qcrs[i].buf);
> > -            qemu_iovec_destroy(&acb->qcrs[i].qiov);
> > +            if (i <= acb->child_iter) {
> > +                qemu_vfree(acb->qcrs[i].buf);
> > +                qemu_iovec_destroy(&acb->qcrs[i].qiov);
> > +            }
> >          }
> >      }
> >  
> > @@ -256,6 +262,21 @@ static void quorum_rewrite_aio_cb(void *opaque, int ret)
> >      quorum_aio_finalize(acb);
> >  }
> >  
> > +static BlockDriverAIOCB *read_fifo_child(QuorumAIOCB *acb);
> > +
> > +static void quorum_copy_qiov(QEMUIOVector *dest, QEMUIOVector *source)
> > +{
> > +    int i;
> > +    assert(dest->niov == source->niov);
> > +    assert(dest->size == source->size);
> > +    for (i = 0; i < source->niov; i++) {
> > +        assert(dest->iov[i].iov_len == source->iov[i].iov_len);
> > +        memcpy(dest->iov[i].iov_base,
> > +               source->iov[i].iov_base,
> > +               source->iov[i].iov_len);
> > +    }
> > +}
> > +
> >  static void quorum_aio_cb(void *opaque, int ret)
> >  {
> >      QuorumChildRequest *sacb = opaque;
> > @@ -263,6 +284,21 @@ static void quorum_aio_cb(void *opaque, int ret)
> >      BDRVQuorumState *s = acb->common.bs->opaque;
> >      bool rewrite = false;
> >  
> > +    if (acb->is_read && s->read_pattern == QUORUM_READ_PATTERN_FIFO) {
> > +        /* We try to read next child in FIFO order if we fail to read */
> > +        if (ret < 0 && ++acb->child_iter < s->num_children) {
> > +            read_fifo_child(acb);
> > +            return;
> > +        }
> > +
> > +        if (ret == 0) {
> > +            quorum_copy_qiov(acb->qiov, &acb->qcrs[acb->child_iter].qiov);
> > +        }
> > +        acb->vote_ret = ret;
> > +        quorum_aio_finalize(acb);
> > +        return;
> > +    }
> > +
> >      sacb->ret = ret;
> >      acb->count++;
> >      if (ret == 0) {
> > @@ -343,19 +379,6 @@ static bool quorum_rewrite_bad_versions(BDRVQuorumState *s, QuorumAIOCB *acb,
> >      return count;
> >  }
> >  
> > -static void quorum_copy_qiov(QEMUIOVector *dest, QEMUIOVector *source)
> > -{
> > -    int i;
> > -    assert(dest->niov == source->niov);
> > -    assert(dest->size == source->size);
> > -    for (i = 0; i < source->niov; i++) {
> > -        assert(dest->iov[i].iov_len == source->iov[i].iov_len);
> > -        memcpy(dest->iov[i].iov_base,
> > -               source->iov[i].iov_base,
> > -               source->iov[i].iov_len);
> > -    }
> > -}
> > -
> >  static void quorum_count_vote(QuorumVotes *votes,
> >                                QuorumVoteValue *value,
> >                                int index)
> > @@ -615,34 +638,62 @@ free_exit:
> >      return rewrite;
> >  }
> >  
> > -static BlockDriverAIOCB *quorum_aio_readv(BlockDriverState *bs,
> > -                                         int64_t sector_num,
> > -                                         QEMUIOVector *qiov,
> > -                                         int nb_sectors,
> > -                                         BlockDriverCompletionFunc *cb,
> > -                                         void *opaque)
> > +static BlockDriverAIOCB *read_quorum_children(QuorumAIOCB *acb)
> >  {
> > -    BDRVQuorumState *s = bs->opaque;
> > -    QuorumAIOCB *acb = quorum_aio_get(s, bs, qiov, sector_num,
> > -                                      nb_sectors, cb, opaque);
> > +    BDRVQuorumState *s = acb->common.bs->opaque;
> >      int i;
> >  
> > -    acb->is_read = true;
> > -
> >      for (i = 0; i < s->num_children; i++) {
> > -        acb->qcrs[i].buf = qemu_blockalign(s->bs[i], qiov->size);
> > -        qemu_iovec_init(&acb->qcrs[i].qiov, qiov->niov);
> > -        qemu_iovec_clone(&acb->qcrs[i].qiov, qiov, acb->qcrs[i].buf);
> > +        acb->qcrs[i].buf = qemu_blockalign(s->bs[i], acb->qiov->size);
> > +        qemu_iovec_init(&acb->qcrs[i].qiov, acb->qiov->niov);
> > +        qemu_iovec_clone(&acb->qcrs[i].qiov, acb->qiov, acb->qcrs[i].buf);
> >      }
> >  
> >      for (i = 0; i < s->num_children; i++) {
> > -        bdrv_aio_readv(s->bs[i], sector_num, &acb->qcrs[i].qiov, nb_sectors,
> > -                       quorum_aio_cb, &acb->qcrs[i]);
> > +        bdrv_aio_readv(s->bs[i], acb->sector_num, &acb->qcrs[i].qiov,
> > +                       acb->nb_sectors, quorum_aio_cb, &acb->qcrs[i]);
> >      }
> >  
> >      return &acb->common;
> >  }
> >  
> > +static BlockDriverAIOCB *read_fifo_child(QuorumAIOCB *acb)
> > +{
> > +    BDRVQuorumState *s = acb->common.bs->opaque;
> > +
> > +    acb->qcrs[acb->child_iter].buf = qemu_blockalign(s->bs[acb->child_iter],
> > +                                                     acb->qiov->size);
> > +    qemu_iovec_init(&acb->qcrs[acb->child_iter].qiov, acb->qiov->niov);
> > +    qemu_iovec_clone(&acb->qcrs[acb->child_iter].qiov, acb->qiov,
> > +                     acb->qcrs[acb->child_iter].buf);
> > +    bdrv_aio_readv(s->bs[acb->child_iter], acb->sector_num,
> > +                   &acb->qcrs[acb->child_iter].qiov, acb->nb_sectors,
> > +                   quorum_aio_cb, &acb->qcrs[acb->child_iter]);
> > +
> > +    return &acb->common;
> > +}
> > +
> > +static BlockDriverAIOCB *quorum_aio_readv(BlockDriverState *bs,
> > +                                          int64_t sector_num,
> > +                                          QEMUIOVector *qiov,
> > +                                          int nb_sectors,
> > +                                          BlockDriverCompletionFunc *cb,
> > +                                          void *opaque)
> > +{
> > +    BDRVQuorumState *s = bs->opaque;
> > +    QuorumAIOCB *acb = quorum_aio_get(s, bs, qiov, sector_num,
> > +                                      nb_sectors, cb, opaque);
> > +    acb->is_read = true;
> > +
> > +    if (s->read_pattern == QUORUM_READ_PATTERN_QUORUM) {
> > +        acb->child_iter = s->num_children - 1;
> > +        return read_quorum_children(acb);
> > +    }
> > +
> > +    acb->child_iter = 0;
> > +    return read_fifo_child(acb);
> > +}
> > +
> >  static BlockDriverAIOCB *quorum_aio_writev(BlockDriverState *bs,
> >                                            int64_t sector_num,
> >                                            QEMUIOVector *qiov,
> > @@ -782,10 +833,33 @@ static QemuOptsList quorum_runtime_opts = {
> >              .type = QEMU_OPT_BOOL,
> >              .help = "Rewrite corrupted block on read quorum",
> >          },
> > +        {
> > +            .name = QUORUM_OPT_READ_PATTERN,
> > +            .type = QEMU_OPT_STRING,
> > +            .help = "Allowed pattern: quorum, fifo. Quorum is default",
> > +        },
> >          { /* end of list */ }
> >      },
> >  };
> >  
> > +static int parse_read_pattern(const char *opt)
> > +{
> > +    int i;
> > +
> > +    if (!opt) {
> > +        /* Set quorum as default */
> > +        return QUORUM_READ_PATTERN_QUORUM;
> > +    }
> > +
> > +    for (i = 0; i < QUORUM_READ_PATTERN_MAX; i++) {
> > +        if (!strcmp(opt, QuorumReadPattern_lookup[i])) {
> > +            return i;
> > +        }
> > +    }
> > +
> > +    return -EINVAL;
> > +}
> > +
> >  static int quorum_open(BlockDriverState *bs, QDict *options, int flags,
> >                         Error **errp)
> >  {
> > @@ -827,28 +901,37 @@ static int quorum_open(BlockDriverState *bs, QDict *options, int flags,
> >      }
> >  
> >      s->threshold = qemu_opt_get_number(opts, QUORUM_OPT_VOTE_THRESHOLD, 0);
> > -
> > -    /* and validate it against s->num_children */
> > -    ret = quorum_valid_threshold(s->threshold, s->num_children, &local_err);
> > +    ret = parse_read_pattern(qemu_opt_get(opts, QUORUM_OPT_READ_PATTERN));
> >      if (ret < 0) {
> > +        error_setg(&local_err, "Please set read-pattern as fifo or quorum\n");
> >          goto exit;
> >      }
> > +    s->read_pattern = ret;
> >  
> > -    /* is the driver in blkverify mode */
> > -    if (qemu_opt_get_bool(opts, QUORUM_OPT_BLKVERIFY, false) &&
> > -        s->num_children == 2 && s->threshold == 2) {
> > -        s->is_blkverify = true;
> > -    } else if (qemu_opt_get_bool(opts, QUORUM_OPT_BLKVERIFY, false)) {
> > -        fprintf(stderr, "blkverify mode is set by setting blkverify=on "
> > -                "and using two files with vote_threshold=2\n");
> > -    }
> > +    if (s->read_pattern == QUORUM_READ_PATTERN_QUORUM) {
> > +        /* and validate it against s->num_children */
> > +        ret = quorum_valid_threshold(s->threshold, s->num_children, &local_err);
> > +        if (ret < 0) {
> > +            goto exit;
> > +        }
> >  
> > -    s->rewrite_corrupted = qemu_opt_get_bool(opts, QUORUM_OPT_REWRITE, false);
> > -    if (s->rewrite_corrupted && s->is_blkverify) {
> > -        error_setg(&local_err,
> > -                   "rewrite-corrupted=on cannot be used with blkverify=on");
> > -        ret = -EINVAL;
> > -        goto exit;
> > +        /* is the driver in blkverify mode */
> > +        if (qemu_opt_get_bool(opts, QUORUM_OPT_BLKVERIFY, false) &&
> > +            s->num_children == 2 && s->threshold == 2) {
> > +            s->is_blkverify = true;
> > +        } else if (qemu_opt_get_bool(opts, QUORUM_OPT_BLKVERIFY, false)) {
> > +            fprintf(stderr, "blkverify mode is set by setting blkverify=on "
> > +                    "and using two files with vote_threshold=2\n");
> > +        }
> > +
> > +        s->rewrite_corrupted = qemu_opt_get_bool(opts, QUORUM_OPT_REWRITE,
> > +                                                 false);
> > +        if (s->rewrite_corrupted && s->is_blkverify) {
> > +            error_setg(&local_err,
> > +                       "rewrite-corrupted=on cannot be used with blkverify=on");
> > +            ret = -EINVAL;
> > +            goto exit;
> > +        }
> >      }
> >  
> >      /* allocate the children BlockDriverState array */
> > -- 
> > 1.9.1
> > 
> 
> I think I responded you with comments signaling a potential memory leak but you missed it.
> 

Hi Benoit,

   Thanks for your time.

   I in fact responded you in
   
http://lists.gnu.org/archive/html/qemu-devel/2014-07/msg02103.html

For v4 I think there isn't memory leak problem since quorum_aio_finalize() will
be called in any case finally and will free the resrouces allocated by
all the read_fifo_child().

Or am I missing something?

Thanks
Yuan

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Qemu-devel] [PATCH v4 2/2] block/quorum: add simple read pattern support
  2014-08-12  2:41     ` Liu Yuan
@ 2014-08-14  7:01       ` Liu Yuan
  0 siblings, 0 replies; 10+ messages in thread
From: Liu Yuan @ 2014-08-14  7:01 UTC (permalink / raw)
  To: Benoît Canet; +Cc: Kevin Wolf, qemu-devel, Stefan Hajnoczi

On Tue, Aug 12, 2014 at 10:41:28AM +0800, Liu Yuan wrote:
> On Mon, Aug 11, 2014 at 02:31:43PM +0200, Benoît Canet wrote:
> > The Thursday 17 Jul 2014 à 13:18:56 (+0800), Liu Yuan wrote :
> > > This patch adds single read pattern to quorum driver and quorum vote is default
> > > pattern.
> > > 
> > > For now we do a quorum vote on all the reads, it is designed for unreliable
> > > underlying storage such as non-redundant NFS to make sure data integrity at the
> > > cost of the read performance.
> > > 
> > > For some use cases as following:
> > > 
> > >         VM
> > >   --------------
> > >   |            |
> > >   v            v
> > >   A            B
> > > 
> > > Both A and B has hardware raid storage to justify the data integrity on its own.
> > > So it would help performance if we do a single read instead of on all the nodes.
> > > Further, if we run VM on either of the storage node, we can make a local read
> > > request for better performance.
> > > 
> > > This patch generalize the above 2 nodes case in the N nodes. That is,
> > > 
> > > vm -> write to all the N nodes, read just one of them. If single read fails, we
> > > try to read next node in FIFO order specified by the startup command.
> > > 
> > > The 2 nodes case is very similar to DRBD[1] though lack of auto-sync
> > > functionality in the single device/node failure for now. But compared with DRBD
> > > we still have some advantages over it:
> > > 
> > > - Suppose we have 20 VMs running on one(assume A) of 2 nodes' DRBD backed
> > > storage. And if A crashes, we need to restart all the VMs on node B. But for
> > > practice case, we can't because B might not have enough resources to setup 20 VMs
> > > at once. So if we run our 20 VMs with quorum driver, and scatter the replicated
> > > images over the data center, we can very likely restart 20 VMs without any
> > > resource problem.
> > > 
> > > After all, I think we can build a more powerful replicated image functionality
> > > on quorum and block jobs(block mirror) to meet various High Availibility needs.
> > > 
> > > E.g, Enable single read pattern on 2 children,
> > > 
> > > -drive driver=quorum,children.0.file.filename=0.qcow2,\
> > > children.1.file.filename=1.qcow2,read-pattern=fifo,vote-threshold=1
> > > 
> > > [1] http://en.wikipedia.org/wiki/Distributed_Replicated_Block_Device
> > > 
> > > Cc: Benoit Canet <benoit@irqsave.net>
> > > Cc: Eric Blake <eblake@redhat.com>
> > > Cc: Kevin Wolf <kwolf@redhat.com>
> > > Cc: Stefan Hajnoczi <stefanha@redhat.com>
> > > Signed-off-by: Liu Yuan <namei.unix@gmail.com>
> > > ---
> > >  block/quorum.c | 179 +++++++++++++++++++++++++++++++++++++++++----------------
> > >  1 file changed, 131 insertions(+), 48 deletions(-)
> > > 
> > > diff --git a/block/quorum.c b/block/quorum.c
> > > index d5ee9c0..ebf5c71 100644
> > > --- a/block/quorum.c
> > > +++ b/block/quorum.c
> > > @@ -24,6 +24,7 @@
> > >  #define QUORUM_OPT_VOTE_THRESHOLD "vote-threshold"
> > >  #define QUORUM_OPT_BLKVERIFY      "blkverify"
> > >  #define QUORUM_OPT_REWRITE        "rewrite-corrupted"
> > > +#define QUORUM_OPT_READ_PATTERN   "read-pattern"
> > >  
> > >  /* This union holds a vote hash value */
> > >  typedef union QuorumVoteValue {
> > > @@ -74,6 +75,8 @@ typedef struct BDRVQuorumState {
> > >      bool rewrite_corrupted;/* true if the driver must rewrite-on-read corrupted
> > >                              * block if Quorum is reached.
> > >                              */
> > > +
> > > +    QuorumReadPattern read_pattern;
> > >  } BDRVQuorumState;
> > >  
> > >  typedef struct QuorumAIOCB QuorumAIOCB;
> > > @@ -117,6 +120,7 @@ struct QuorumAIOCB {
> > >  
> > >      bool is_read;
> > >      int vote_ret;
> > > +    int child_iter;             /* which child to read in fifo pattern */
> > >  };
> > >  
> > >  static bool quorum_vote(QuorumAIOCB *acb);
> > > @@ -154,8 +158,10 @@ static void quorum_aio_finalize(QuorumAIOCB *acb)
> > >  
> > >      if (acb->is_read) {
> > >          for (i = 0; i < s->num_children; i++) {
> > > -            qemu_vfree(acb->qcrs[i].buf);
> > > -            qemu_iovec_destroy(&acb->qcrs[i].qiov);
> > > +            if (i <= acb->child_iter) {
> > > +                qemu_vfree(acb->qcrs[i].buf);
> > > +                qemu_iovec_destroy(&acb->qcrs[i].qiov);
> > > +            }
> > >          }
> > >      }
> > >  
> > > @@ -256,6 +262,21 @@ static void quorum_rewrite_aio_cb(void *opaque, int ret)
> > >      quorum_aio_finalize(acb);
> > >  }
> > >  
> > > +static BlockDriverAIOCB *read_fifo_child(QuorumAIOCB *acb);
> > > +
> > > +static void quorum_copy_qiov(QEMUIOVector *dest, QEMUIOVector *source)
> > > +{
> > > +    int i;
> > > +    assert(dest->niov == source->niov);
> > > +    assert(dest->size == source->size);
> > > +    for (i = 0; i < source->niov; i++) {
> > > +        assert(dest->iov[i].iov_len == source->iov[i].iov_len);
> > > +        memcpy(dest->iov[i].iov_base,
> > > +               source->iov[i].iov_base,
> > > +               source->iov[i].iov_len);
> > > +    }
> > > +}
> > > +
> > >  static void quorum_aio_cb(void *opaque, int ret)
> > >  {
> > >      QuorumChildRequest *sacb = opaque;
> > > @@ -263,6 +284,21 @@ static void quorum_aio_cb(void *opaque, int ret)
> > >      BDRVQuorumState *s = acb->common.bs->opaque;
> > >      bool rewrite = false;
> > >  
> > > +    if (acb->is_read && s->read_pattern == QUORUM_READ_PATTERN_FIFO) {
> > > +        /* We try to read next child in FIFO order if we fail to read */
> > > +        if (ret < 0 && ++acb->child_iter < s->num_children) {
> > > +            read_fifo_child(acb);
> > > +            return;
> > > +        }
> > > +
> > > +        if (ret == 0) {
> > > +            quorum_copy_qiov(acb->qiov, &acb->qcrs[acb->child_iter].qiov);
> > > +        }
> > > +        acb->vote_ret = ret;
> > > +        quorum_aio_finalize(acb);
> > > +        return;
> > > +    }
> > > +
> > >      sacb->ret = ret;
> > >      acb->count++;
> > >      if (ret == 0) {
> > > @@ -343,19 +379,6 @@ static bool quorum_rewrite_bad_versions(BDRVQuorumState *s, QuorumAIOCB *acb,
> > >      return count;
> > >  }
> > >  
> > > -static void quorum_copy_qiov(QEMUIOVector *dest, QEMUIOVector *source)
> > > -{
> > > -    int i;
> > > -    assert(dest->niov == source->niov);
> > > -    assert(dest->size == source->size);
> > > -    for (i = 0; i < source->niov; i++) {
> > > -        assert(dest->iov[i].iov_len == source->iov[i].iov_len);
> > > -        memcpy(dest->iov[i].iov_base,
> > > -               source->iov[i].iov_base,
> > > -               source->iov[i].iov_len);
> > > -    }
> > > -}
> > > -
> > >  static void quorum_count_vote(QuorumVotes *votes,
> > >                                QuorumVoteValue *value,
> > >                                int index)
> > > @@ -615,34 +638,62 @@ free_exit:
> > >      return rewrite;
> > >  }
> > >  
> > > -static BlockDriverAIOCB *quorum_aio_readv(BlockDriverState *bs,
> > > -                                         int64_t sector_num,
> > > -                                         QEMUIOVector *qiov,
> > > -                                         int nb_sectors,
> > > -                                         BlockDriverCompletionFunc *cb,
> > > -                                         void *opaque)
> > > +static BlockDriverAIOCB *read_quorum_children(QuorumAIOCB *acb)
> > >  {
> > > -    BDRVQuorumState *s = bs->opaque;
> > > -    QuorumAIOCB *acb = quorum_aio_get(s, bs, qiov, sector_num,
> > > -                                      nb_sectors, cb, opaque);
> > > +    BDRVQuorumState *s = acb->common.bs->opaque;
> > >      int i;
> > >  
> > > -    acb->is_read = true;
> > > -
> > >      for (i = 0; i < s->num_children; i++) {
> > > -        acb->qcrs[i].buf = qemu_blockalign(s->bs[i], qiov->size);
> > > -        qemu_iovec_init(&acb->qcrs[i].qiov, qiov->niov);
> > > -        qemu_iovec_clone(&acb->qcrs[i].qiov, qiov, acb->qcrs[i].buf);
> > > +        acb->qcrs[i].buf = qemu_blockalign(s->bs[i], acb->qiov->size);
> > > +        qemu_iovec_init(&acb->qcrs[i].qiov, acb->qiov->niov);
> > > +        qemu_iovec_clone(&acb->qcrs[i].qiov, acb->qiov, acb->qcrs[i].buf);
> > >      }
> > >  
> > >      for (i = 0; i < s->num_children; i++) {
> > > -        bdrv_aio_readv(s->bs[i], sector_num, &acb->qcrs[i].qiov, nb_sectors,
> > > -                       quorum_aio_cb, &acb->qcrs[i]);
> > > +        bdrv_aio_readv(s->bs[i], acb->sector_num, &acb->qcrs[i].qiov,
> > > +                       acb->nb_sectors, quorum_aio_cb, &acb->qcrs[i]);
> > >      }
> > >  
> > >      return &acb->common;
> > >  }
> > >  
> > > +static BlockDriverAIOCB *read_fifo_child(QuorumAIOCB *acb)
> > > +{
> > > +    BDRVQuorumState *s = acb->common.bs->opaque;
> > > +
> > > +    acb->qcrs[acb->child_iter].buf = qemu_blockalign(s->bs[acb->child_iter],
> > > +                                                     acb->qiov->size);
> > > +    qemu_iovec_init(&acb->qcrs[acb->child_iter].qiov, acb->qiov->niov);
> > > +    qemu_iovec_clone(&acb->qcrs[acb->child_iter].qiov, acb->qiov,
> > > +                     acb->qcrs[acb->child_iter].buf);
> > > +    bdrv_aio_readv(s->bs[acb->child_iter], acb->sector_num,
> > > +                   &acb->qcrs[acb->child_iter].qiov, acb->nb_sectors,
> > > +                   quorum_aio_cb, &acb->qcrs[acb->child_iter]);
> > > +
> > > +    return &acb->common;
> > > +}
> > > +
> > > +static BlockDriverAIOCB *quorum_aio_readv(BlockDriverState *bs,
> > > +                                          int64_t sector_num,
> > > +                                          QEMUIOVector *qiov,
> > > +                                          int nb_sectors,
> > > +                                          BlockDriverCompletionFunc *cb,
> > > +                                          void *opaque)
> > > +{
> > > +    BDRVQuorumState *s = bs->opaque;
> > > +    QuorumAIOCB *acb = quorum_aio_get(s, bs, qiov, sector_num,
> > > +                                      nb_sectors, cb, opaque);
> > > +    acb->is_read = true;
> > > +
> > > +    if (s->read_pattern == QUORUM_READ_PATTERN_QUORUM) {
> > > +        acb->child_iter = s->num_children - 1;
> > > +        return read_quorum_children(acb);
> > > +    }
> > > +
> > > +    acb->child_iter = 0;
> > > +    return read_fifo_child(acb);
> > > +}
> > > +
> > >  static BlockDriverAIOCB *quorum_aio_writev(BlockDriverState *bs,
> > >                                            int64_t sector_num,
> > >                                            QEMUIOVector *qiov,
> > > @@ -782,10 +833,33 @@ static QemuOptsList quorum_runtime_opts = {
> > >              .type = QEMU_OPT_BOOL,
> > >              .help = "Rewrite corrupted block on read quorum",
> > >          },
> > > +        {
> > > +            .name = QUORUM_OPT_READ_PATTERN,
> > > +            .type = QEMU_OPT_STRING,
> > > +            .help = "Allowed pattern: quorum, fifo. Quorum is default",
> > > +        },
> > >          { /* end of list */ }
> > >      },
> > >  };
> > >  
> > > +static int parse_read_pattern(const char *opt)
> > > +{
> > > +    int i;
> > > +
> > > +    if (!opt) {
> > > +        /* Set quorum as default */
> > > +        return QUORUM_READ_PATTERN_QUORUM;
> > > +    }
> > > +
> > > +    for (i = 0; i < QUORUM_READ_PATTERN_MAX; i++) {
> > > +        if (!strcmp(opt, QuorumReadPattern_lookup[i])) {
> > > +            return i;
> > > +        }
> > > +    }
> > > +
> > > +    return -EINVAL;
> > > +}
> > > +
> > >  static int quorum_open(BlockDriverState *bs, QDict *options, int flags,
> > >                         Error **errp)
> > >  {
> > > @@ -827,28 +901,37 @@ static int quorum_open(BlockDriverState *bs, QDict *options, int flags,
> > >      }
> > >  
> > >      s->threshold = qemu_opt_get_number(opts, QUORUM_OPT_VOTE_THRESHOLD, 0);
> > > -
> > > -    /* and validate it against s->num_children */
> > > -    ret = quorum_valid_threshold(s->threshold, s->num_children, &local_err);
> > > +    ret = parse_read_pattern(qemu_opt_get(opts, QUORUM_OPT_READ_PATTERN));
> > >      if (ret < 0) {
> > > +        error_setg(&local_err, "Please set read-pattern as fifo or quorum\n");
> > >          goto exit;
> > >      }
> > > +    s->read_pattern = ret;
> > >  
> > > -    /* is the driver in blkverify mode */
> > > -    if (qemu_opt_get_bool(opts, QUORUM_OPT_BLKVERIFY, false) &&
> > > -        s->num_children == 2 && s->threshold == 2) {
> > > -        s->is_blkverify = true;
> > > -    } else if (qemu_opt_get_bool(opts, QUORUM_OPT_BLKVERIFY, false)) {
> > > -        fprintf(stderr, "blkverify mode is set by setting blkverify=on "
> > > -                "and using two files with vote_threshold=2\n");
> > > -    }
> > > +    if (s->read_pattern == QUORUM_READ_PATTERN_QUORUM) {
> > > +        /* and validate it against s->num_children */
> > > +        ret = quorum_valid_threshold(s->threshold, s->num_children, &local_err);
> > > +        if (ret < 0) {
> > > +            goto exit;
> > > +        }
> > >  
> > > -    s->rewrite_corrupted = qemu_opt_get_bool(opts, QUORUM_OPT_REWRITE, false);
> > > -    if (s->rewrite_corrupted && s->is_blkverify) {
> > > -        error_setg(&local_err,
> > > -                   "rewrite-corrupted=on cannot be used with blkverify=on");
> > > -        ret = -EINVAL;
> > > -        goto exit;
> > > +        /* is the driver in blkverify mode */
> > > +        if (qemu_opt_get_bool(opts, QUORUM_OPT_BLKVERIFY, false) &&
> > > +            s->num_children == 2 && s->threshold == 2) {
> > > +            s->is_blkverify = true;
> > > +        } else if (qemu_opt_get_bool(opts, QUORUM_OPT_BLKVERIFY, false)) {
> > > +            fprintf(stderr, "blkverify mode is set by setting blkverify=on "
> > > +                    "and using two files with vote_threshold=2\n");
> > > +        }
> > > +
> > > +        s->rewrite_corrupted = qemu_opt_get_bool(opts, QUORUM_OPT_REWRITE,
> > > +                                                 false);
> > > +        if (s->rewrite_corrupted && s->is_blkverify) {
> > > +            error_setg(&local_err,
> > > +                       "rewrite-corrupted=on cannot be used with blkverify=on");
> > > +            ret = -EINVAL;
> > > +            goto exit;
> > > +        }
> > >      }
> > >  
> > >      /* allocate the children BlockDriverState array */
> > > -- 
> > > 1.9.1
> > > 
> > 
> > I think I responded you with comments signaling a potential memory leak but you missed it.
> > 
> 
> Hi Benoit,
> 
>    Thanks for your time.
> 
>    I in fact responded you in
>    
> http://lists.gnu.org/archive/html/qemu-devel/2014-07/msg02103.html
> 
> For v4 I think there isn't memory leak problem since quorum_aio_finalize() will
> be called in any case finally and will free the resrouces allocated by
> all the read_fifo_child().
> 
> Or am I missing something?
> 

Ping...

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Qemu-devel] [PATCH v4 2/2] block/quorum: add simple read pattern support
  2014-07-17  5:18 ` [Qemu-devel] [PATCH v4 2/2] block/quorum: add simple read pattern support Liu Yuan
  2014-08-11  7:18   ` Liu Yuan
  2014-08-11 12:31   ` Benoît Canet
@ 2014-08-14 11:09   ` Benoît Canet
  2014-08-15  4:51     ` Liu Yuan
  2 siblings, 1 reply; 10+ messages in thread
From: Benoît Canet @ 2014-08-14 11:09 UTC (permalink / raw)
  To: Liu Yuan; +Cc: Kevin Wolf, qemu-devel, Stefan Hajnoczi

The Thursday 17 Jul 2014 à 13:18:56 (+0800), Liu Yuan wrote :
> This patch adds single read pattern to quorum driver and quorum vote is default
> pattern.
> 
> For now we do a quorum vote on all the reads, it is designed for unreliable
> underlying storage such as non-redundant NFS to make sure data integrity at the
> cost of the read performance.
> 
> For some use cases as following:
> 
>         VM
>   --------------
>   |            |
>   v            v
>   A            B
> 
> Both A and B has hardware raid storage to justify the data integrity on its own.
> So it would help performance if we do a single read instead of on all the nodes.
> Further, if we run VM on either of the storage node, we can make a local read
> request for better performance.
> 
> This patch generalize the above 2 nodes case in the N nodes. That is,
> 
> vm -> write to all the N nodes, read just one of them. If single read fails, we
> try to read next node in FIFO order specified by the startup command.
> 
> The 2 nodes case is very similar to DRBD[1] though lack of auto-sync
> functionality in the single device/node failure for now. But compared with DRBD
> we still have some advantages over it:
> 
> - Suppose we have 20 VMs running on one(assume A) of 2 nodes' DRBD backed
> storage. And if A crashes, we need to restart all the VMs on node B. But for
> practice case, we can't because B might not have enough resources to setup 20 VMs
> at once. So if we run our 20 VMs with quorum driver, and scatter the replicated
> images over the data center, we can very likely restart 20 VMs without any
> resource problem.
> 
> After all, I think we can build a more powerful replicated image functionality
> on quorum and block jobs(block mirror) to meet various High Availibility needs.
> 
> E.g, Enable single read pattern on 2 children,
> 
> -drive driver=quorum,children.0.file.filename=0.qcow2,\
> children.1.file.filename=1.qcow2,read-pattern=fifo,vote-threshold=1
> 
> [1] http://en.wikipedia.org/wiki/Distributed_Replicated_Block_Device
> 
> Cc: Benoit Canet <benoit@irqsave.net>
> Cc: Eric Blake <eblake@redhat.com>
> Cc: Kevin Wolf <kwolf@redhat.com>
> Cc: Stefan Hajnoczi <stefanha@redhat.com>
> Signed-off-by: Liu Yuan <namei.unix@gmail.com>
> ---
>  block/quorum.c | 179 +++++++++++++++++++++++++++++++++++++++++----------------
>  1 file changed, 131 insertions(+), 48 deletions(-)
> 
> diff --git a/block/quorum.c b/block/quorum.c
> index d5ee9c0..ebf5c71 100644
> --- a/block/quorum.c
> +++ b/block/quorum.c
> @@ -24,6 +24,7 @@
>  #define QUORUM_OPT_VOTE_THRESHOLD "vote-threshold"
>  #define QUORUM_OPT_BLKVERIFY      "blkverify"
>  #define QUORUM_OPT_REWRITE        "rewrite-corrupted"
> +#define QUORUM_OPT_READ_PATTERN   "read-pattern"
>  
>  /* This union holds a vote hash value */
>  typedef union QuorumVoteValue {
> @@ -74,6 +75,8 @@ typedef struct BDRVQuorumState {
>      bool rewrite_corrupted;/* true if the driver must rewrite-on-read corrupted
>                              * block if Quorum is reached.
>                              */
> +
> +    QuorumReadPattern read_pattern;
>  } BDRVQuorumState;
>  
>  typedef struct QuorumAIOCB QuorumAIOCB;
> @@ -117,6 +120,7 @@ struct QuorumAIOCB {
>  
>      bool is_read;
>      int vote_ret;
> +    int child_iter;             /* which child to read in fifo pattern */
>  };
>  
>  static bool quorum_vote(QuorumAIOCB *acb);
> @@ -154,8 +158,10 @@ static void quorum_aio_finalize(QuorumAIOCB *acb)
>  
>      if (acb->is_read) {
>          for (i = 0; i < s->num_children; i++) {
> -            qemu_vfree(acb->qcrs[i].buf);
> -            qemu_iovec_destroy(&acb->qcrs[i].qiov);
> +            if (i <= acb->child_iter) {
> +                qemu_vfree(acb->qcrs[i].buf);
> +                qemu_iovec_destroy(&acb->qcrs[i].qiov);
> +            }

This seems convoluted.

What about ?
          /* on the quorum case acb->child_iter == s->num_children - 1 */
          for (i = 0; i <= acb->child_iter; i++) {
              qemu_vfree(acb->qcrs[i].buf);
              qemu_iovec_destroy(&acb->qcrs[i].qiov);
          }
          
>      }
>  
> @@ -256,6 +262,21 @@ static void quorum_rewrite_aio_cb(void *opaque, int ret)
>      quorum_aio_finalize(acb);
>  }
>  
> +static BlockDriverAIOCB *read_fifo_child(QuorumAIOCB *acb);
> +
> +static void quorum_copy_qiov(QEMUIOVector *dest, QEMUIOVector *source)
> +{
> +    int i;
> +    assert(dest->niov == source->niov);
> +    assert(dest->size == source->size);
> +    for (i = 0; i < source->niov; i++) {
> +        assert(dest->iov[i].iov_len == source->iov[i].iov_len);
> +        memcpy(dest->iov[i].iov_base,
> +               source->iov[i].iov_base,
> +               source->iov[i].iov_len);
> +    }
> +}
> +
>  static void quorum_aio_cb(void *opaque, int ret)
>  {
>      QuorumChildRequest *sacb = opaque;
> @@ -263,6 +284,21 @@ static void quorum_aio_cb(void *opaque, int ret)
>      BDRVQuorumState *s = acb->common.bs->opaque;
>      bool rewrite = false;
>  
> +    if (acb->is_read && s->read_pattern == QUORUM_READ_PATTERN_FIFO) {
> +        /* We try to read next child in FIFO order if we fail to read */
> +        if (ret < 0 && ++acb->child_iter < s->num_children) {
> +            read_fifo_child(acb);
> +            return;
> +        }
> +
> +        if (ret == 0) {
> +            quorum_copy_qiov(acb->qiov, &acb->qcrs[acb->child_iter].qiov);
> +        }
> +        acb->vote_ret = ret;
> +        quorum_aio_finalize(acb);
> +        return;
> +    }
> +
>      sacb->ret = ret;
>      acb->count++;
>      if (ret == 0) {
> @@ -343,19 +379,6 @@ static bool quorum_rewrite_bad_versions(BDRVQuorumState *s, QuorumAIOCB *acb,
>      return count;
>  }
>  
> -static void quorum_copy_qiov(QEMUIOVector *dest, QEMUIOVector *source)
> -{
> -    int i;
> -    assert(dest->niov == source->niov);
> -    assert(dest->size == source->size);
> -    for (i = 0; i < source->niov; i++) {
> -        assert(dest->iov[i].iov_len == source->iov[i].iov_len);
> -        memcpy(dest->iov[i].iov_base,
> -               source->iov[i].iov_base,
> -               source->iov[i].iov_len);
> -    }
> -}
> -
>  static void quorum_count_vote(QuorumVotes *votes,
>                                QuorumVoteValue *value,
>                                int index)
> @@ -615,34 +638,62 @@ free_exit:
>      return rewrite;
>  }
>  
> -static BlockDriverAIOCB *quorum_aio_readv(BlockDriverState *bs,
> -                                         int64_t sector_num,
> -                                         QEMUIOVector *qiov,
> -                                         int nb_sectors,
> -                                         BlockDriverCompletionFunc *cb,
> -                                         void *opaque)
> +static BlockDriverAIOCB *read_quorum_children(QuorumAIOCB *acb)
>  {
> -    BDRVQuorumState *s = bs->opaque;
> -    QuorumAIOCB *acb = quorum_aio_get(s, bs, qiov, sector_num,
> -                                      nb_sectors, cb, opaque);
> +    BDRVQuorumState *s = acb->common.bs->opaque;
>      int i;
>  
> -    acb->is_read = true;
> -
>      for (i = 0; i < s->num_children; i++) {
> -        acb->qcrs[i].buf = qemu_blockalign(s->bs[i], qiov->size);
> -        qemu_iovec_init(&acb->qcrs[i].qiov, qiov->niov);
> -        qemu_iovec_clone(&acb->qcrs[i].qiov, qiov, acb->qcrs[i].buf);
> +        acb->qcrs[i].buf = qemu_blockalign(s->bs[i], acb->qiov->size);
> +        qemu_iovec_init(&acb->qcrs[i].qiov, acb->qiov->niov);
> +        qemu_iovec_clone(&acb->qcrs[i].qiov, acb->qiov, acb->qcrs[i].buf);
>      }
>  
>      for (i = 0; i < s->num_children; i++) {
> -        bdrv_aio_readv(s->bs[i], sector_num, &acb->qcrs[i].qiov, nb_sectors,
> -                       quorum_aio_cb, &acb->qcrs[i]);
> +        bdrv_aio_readv(s->bs[i], acb->sector_num, &acb->qcrs[i].qiov,
> +                       acb->nb_sectors, quorum_aio_cb, &acb->qcrs[i]);
>      }
>  
>      return &acb->common;
>  }
>  
> +static BlockDriverAIOCB *read_fifo_child(QuorumAIOCB *acb)
> +{
> +    BDRVQuorumState *s = acb->common.bs->opaque;
> +
> +    acb->qcrs[acb->child_iter].buf = qemu_blockalign(s->bs[acb->child_iter],
> +                                                     acb->qiov->size);
> +    qemu_iovec_init(&acb->qcrs[acb->child_iter].qiov, acb->qiov->niov);
> +    qemu_iovec_clone(&acb->qcrs[acb->child_iter].qiov, acb->qiov,
> +                     acb->qcrs[acb->child_iter].buf);
> +    bdrv_aio_readv(s->bs[acb->child_iter], acb->sector_num,
> +                   &acb->qcrs[acb->child_iter].qiov, acb->nb_sectors,
> +                   quorum_aio_cb, &acb->qcrs[acb->child_iter]);
> +
> +    return &acb->common;
> +}
> +
> +static BlockDriverAIOCB *quorum_aio_readv(BlockDriverState *bs,
> +                                          int64_t sector_num,
> +                                          QEMUIOVector *qiov,
> +                                          int nb_sectors,
> +                                          BlockDriverCompletionFunc *cb,
> +                                          void *opaque)
> +{
> +    BDRVQuorumState *s = bs->opaque;
> +    QuorumAIOCB *acb = quorum_aio_get(s, bs, qiov, sector_num,
> +                                      nb_sectors, cb, opaque);
> +    acb->is_read = true;
> +
> +    if (s->read_pattern == QUORUM_READ_PATTERN_QUORUM) {
> +        acb->child_iter = s->num_children - 1;
> +        return read_quorum_children(acb);
> +    }
> +
> +    acb->child_iter = 0;
> +    return read_fifo_child(acb);
> +}
> +
>  static BlockDriverAIOCB *quorum_aio_writev(BlockDriverState *bs,
>                                            int64_t sector_num,
>                                            QEMUIOVector *qiov,
> @@ -782,10 +833,33 @@ static QemuOptsList quorum_runtime_opts = {
>              .type = QEMU_OPT_BOOL,
>              .help = "Rewrite corrupted block on read quorum",
>          },
> +        {
> +            .name = QUORUM_OPT_READ_PATTERN,
> +            .type = QEMU_OPT_STRING,
> +            .help = "Allowed pattern: quorum, fifo. Quorum is default",
> +        },
>          { /* end of list */ }
>      },
>  };
>  
> +static int parse_read_pattern(const char *opt)
> +{
> +    int i;
> +
> +    if (!opt) {
> +        /* Set quorum as default */
> +        return QUORUM_READ_PATTERN_QUORUM;
> +    }
> +
> +    for (i = 0; i < QUORUM_READ_PATTERN_MAX; i++) {
> +        if (!strcmp(opt, QuorumReadPattern_lookup[i])) {
> +            return i;
> +        }
> +    }
> +
> +    return -EINVAL;
> +}
> +
>  static int quorum_open(BlockDriverState *bs, QDict *options, int flags,
>                         Error **errp)
>  {
> @@ -827,28 +901,37 @@ static int quorum_open(BlockDriverState *bs, QDict *options, int flags,
>      }
>  
>      s->threshold = qemu_opt_get_number(opts, QUORUM_OPT_VOTE_THRESHOLD, 0);
> -
> -    /* and validate it against s->num_children */
> -    ret = quorum_valid_threshold(s->threshold, s->num_children, &local_err);
> +    ret = parse_read_pattern(qemu_opt_get(opts, QUORUM_OPT_READ_PATTERN));
>      if (ret < 0) {
> +        error_setg(&local_err, "Please set read-pattern as fifo or quorum\n");
>          goto exit;
>      }
> +    s->read_pattern = ret;
>  
> -    /* is the driver in blkverify mode */
> -    if (qemu_opt_get_bool(opts, QUORUM_OPT_BLKVERIFY, false) &&
> -        s->num_children == 2 && s->threshold == 2) {
> -        s->is_blkverify = true;
> -    } else if (qemu_opt_get_bool(opts, QUORUM_OPT_BLKVERIFY, false)) {
> -        fprintf(stderr, "blkverify mode is set by setting blkverify=on "
> -                "and using two files with vote_threshold=2\n");
> -    }
> +    if (s->read_pattern == QUORUM_READ_PATTERN_QUORUM) {
> +        /* and validate it against s->num_children */
> +        ret = quorum_valid_threshold(s->threshold, s->num_children, &local_err);
> +        if (ret < 0) {
> +            goto exit;
> +        }
>  
> -    s->rewrite_corrupted = qemu_opt_get_bool(opts, QUORUM_OPT_REWRITE, false);
> -    if (s->rewrite_corrupted && s->is_blkverify) {
> -        error_setg(&local_err,
> -                   "rewrite-corrupted=on cannot be used with blkverify=on");
> -        ret = -EINVAL;
> -        goto exit;
> +        /* is the driver in blkverify mode */
> +        if (qemu_opt_get_bool(opts, QUORUM_OPT_BLKVERIFY, false) &&
> +            s->num_children == 2 && s->threshold == 2) {
> +            s->is_blkverify = true;
> +        } else if (qemu_opt_get_bool(opts, QUORUM_OPT_BLKVERIFY, false)) {
> +            fprintf(stderr, "blkverify mode is set by setting blkverify=on "
> +                    "and using two files with vote_threshold=2\n");
> +        }
> +
> +        s->rewrite_corrupted = qemu_opt_get_bool(opts, QUORUM_OPT_REWRITE,
> +                                                 false);
> +        if (s->rewrite_corrupted && s->is_blkverify) {
> +            error_setg(&local_err,
> +                       "rewrite-corrupted=on cannot be used with blkverify=on");
> +            ret = -EINVAL;
> +            goto exit;
> +        }
>      }
>  
>      /* allocate the children BlockDriverState array */
> -- 
> 1.9.1
> 

Seems good however I will need to read it again a couple of times.

Best regards

Benoît

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Qemu-devel] [PATCH v4 2/2] block/quorum: add simple read pattern support
  2014-08-14 11:09   ` Benoît Canet
@ 2014-08-15  4:51     ` Liu Yuan
  0 siblings, 0 replies; 10+ messages in thread
From: Liu Yuan @ 2014-08-15  4:51 UTC (permalink / raw)
  To: Benoît Canet; +Cc: Kevin Wolf, qemu-devel, Stefan Hajnoczi

On Thu, Aug 14, 2014 at 01:09:32PM +0200, Benoît Canet wrote:
> The Thursday 17 Jul 2014 à 13:18:56 (+0800), Liu Yuan wrote :
> > This patch adds single read pattern to quorum driver and quorum vote is default
> > pattern.
> > 
> > For now we do a quorum vote on all the reads, it is designed for unreliable
> > underlying storage such as non-redundant NFS to make sure data integrity at the
> > cost of the read performance.
> > 
> > For some use cases as following:
> > 
> >         VM
> >   --------------
> >   |            |
> >   v            v
> >   A            B
> > 
> > Both A and B has hardware raid storage to justify the data integrity on its own.
> > So it would help performance if we do a single read instead of on all the nodes.
> > Further, if we run VM on either of the storage node, we can make a local read
> > request for better performance.
> > 
> > This patch generalize the above 2 nodes case in the N nodes. That is,
> > 
> > vm -> write to all the N nodes, read just one of them. If single read fails, we
> > try to read next node in FIFO order specified by the startup command.
> > 
> > The 2 nodes case is very similar to DRBD[1] though lack of auto-sync
> > functionality in the single device/node failure for now. But compared with DRBD
> > we still have some advantages over it:
> > 
> > - Suppose we have 20 VMs running on one(assume A) of 2 nodes' DRBD backed
> > storage. And if A crashes, we need to restart all the VMs on node B. But for
> > practice case, we can't because B might not have enough resources to setup 20 VMs
> > at once. So if we run our 20 VMs with quorum driver, and scatter the replicated
> > images over the data center, we can very likely restart 20 VMs without any
> > resource problem.
> > 
> > After all, I think we can build a more powerful replicated image functionality
> > on quorum and block jobs(block mirror) to meet various High Availibility needs.
> > 
> > E.g, Enable single read pattern on 2 children,
> > 
> > -drive driver=quorum,children.0.file.filename=0.qcow2,\
> > children.1.file.filename=1.qcow2,read-pattern=fifo,vote-threshold=1
> > 
> > [1] http://en.wikipedia.org/wiki/Distributed_Replicated_Block_Device
> > 
> > Cc: Benoit Canet <benoit@irqsave.net>
> > Cc: Eric Blake <eblake@redhat.com>
> > Cc: Kevin Wolf <kwolf@redhat.com>
> > Cc: Stefan Hajnoczi <stefanha@redhat.com>
> > Signed-off-by: Liu Yuan <namei.unix@gmail.com>
> > ---
> >  block/quorum.c | 179 +++++++++++++++++++++++++++++++++++++++++----------------
> >  1 file changed, 131 insertions(+), 48 deletions(-)
> > 
> > diff --git a/block/quorum.c b/block/quorum.c
> > index d5ee9c0..ebf5c71 100644
> > --- a/block/quorum.c
> > +++ b/block/quorum.c
> > @@ -24,6 +24,7 @@
> >  #define QUORUM_OPT_VOTE_THRESHOLD "vote-threshold"
> >  #define QUORUM_OPT_BLKVERIFY      "blkverify"
> >  #define QUORUM_OPT_REWRITE        "rewrite-corrupted"
> > +#define QUORUM_OPT_READ_PATTERN   "read-pattern"
> >  
> >  /* This union holds a vote hash value */
> >  typedef union QuorumVoteValue {
> > @@ -74,6 +75,8 @@ typedef struct BDRVQuorumState {
> >      bool rewrite_corrupted;/* true if the driver must rewrite-on-read corrupted
> >                              * block if Quorum is reached.
> >                              */
> > +
> > +    QuorumReadPattern read_pattern;
> >  } BDRVQuorumState;
> >  
> >  typedef struct QuorumAIOCB QuorumAIOCB;
> > @@ -117,6 +120,7 @@ struct QuorumAIOCB {
> >  
> >      bool is_read;
> >      int vote_ret;
> > +    int child_iter;             /* which child to read in fifo pattern */
> >  };
> >  
> >  static bool quorum_vote(QuorumAIOCB *acb);
> > @@ -154,8 +158,10 @@ static void quorum_aio_finalize(QuorumAIOCB *acb)
> >  
> >      if (acb->is_read) {
> >          for (i = 0; i < s->num_children; i++) {
> > -            qemu_vfree(acb->qcrs[i].buf);
> > -            qemu_iovec_destroy(&acb->qcrs[i].qiov);
> > +            if (i <= acb->child_iter) {
> > +                qemu_vfree(acb->qcrs[i].buf);
> > +                qemu_iovec_destroy(&acb->qcrs[i].qiov);
> > +            }
> 
> This seems convoluted.
> 
> What about ?
>           /* on the quorum case acb->child_iter == s->num_children - 1 */
>           for (i = 0; i <= acb->child_iter; i++) {
>               qemu_vfree(acb->qcrs[i].buf);
>               qemu_iovec_destroy(&acb->qcrs[i].qiov);
>           }
>           
> >      }

Sounds good. I'll update the patch v5.

Thanks
Yuan

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2014-08-15  4:51 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-07-17  5:18 [Qemu-devel] [PATCH v4 0/2] add read-pattern for block qourum Liu Yuan
2014-07-17  5:18 ` [Qemu-devel] [PATCH v4 1/2] qapi: add read-pattern enum for quorum Liu Yuan
2014-08-11 16:42   ` Eric Blake
2014-07-17  5:18 ` [Qemu-devel] [PATCH v4 2/2] block/quorum: add simple read pattern support Liu Yuan
2014-08-11  7:18   ` Liu Yuan
2014-08-11 12:31   ` Benoît Canet
2014-08-12  2:41     ` Liu Yuan
2014-08-14  7:01       ` Liu Yuan
2014-08-14 11:09   ` Benoît Canet
2014-08-15  4:51     ` Liu Yuan

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).