* [Qemu-devel] [PATCH v2] block/quorum: add simple read pattern support
@ 2014-07-15 3:19 Liu Yuan
2014-07-15 3:33 ` Eric Blake
0 siblings, 1 reply; 4+ messages in thread
From: Liu Yuan @ 2014-07-15 3:19 UTC (permalink / raw)
To: qemu-devel; +Cc: Kevin Wolf, Benoit Canet, Stefan Hajnoczi
This patch adds single read pattern to quorum driver and quorum vote is default
pattern.
For now we do a quorum vote on all the reads, it is designed for unreliable
underlying storage such as non-redundant NFS to make sure data integrity at the
cost of the read performance.
For some use cases as following:
VM
--------------
| |
v v
A B
Both A and B has hardware raid storage to justify the data integrity on its own.
So it would help performance if we do a single read instead of on all the nodes.
Further, if we run VM on either of the storage node, we can make a local read
request for better performance.
This patch generalize the above 2 nodes case in the N nodes. That is,
vm -> write to all the N nodes, read just one of them. If single read fails, we
try to read next node in FIFO order specified by the startup command.
The 2 nodes case is very similar to DRBD[1] though lack of auto-sync
functionality in the single device/node failure for now. But compared with DRBD
we still have some advantages over it:
- Suppose we have 20 VMs running on one(assume A) of 2 nodes' DRBD backed
storage. And if A crashes, we need to restart all the VMs on node B. But for
practice case, we can't because B might not have enough resources to setup 20 VMs
at once. So if we run our 20 VMs with quorum driver, and scatter the replicated
images over the data center, we can very likely restart 20 VMs without any
resource problem.
After all, I think we can build a more powerful replicated image functionality
on quorum and block jobs(block mirror) to meet various High Availibility needs.
E.g, Enable single read pattern on 2 children,
-drive driver=quorum,children.0.file.filename=0.qcow2,\
children.1.file.filename=1.qcow2,read-pattern=single,vote-threshold=1
[1] http://en.wikipedia.org/wiki/Distributed_Replicated_Block_Device
Cc: Benoit Canet <benoit@irqsave.net>
Cc: Kevin Wolf <kwolf@redhat.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Liu Yuan <namei.unix@gmail.com>
---
v2:
- rename single as 'fifo'
- rename read_all_children as read_quorum_children
- fix quorum_aio_finalize() for fifo pattern
block.c | 2 +-
block/quorum.c | 181 +++++++++++++++++++++++++++++++++++++--------------
qapi/block-core.json | 7 +-
3 files changed, 138 insertions(+), 52 deletions(-)
diff --git a/block.c b/block.c
index 8800a6b..3dfa27c 100644
--- a/block.c
+++ b/block.c
@@ -2212,7 +2212,7 @@ int bdrv_commit(BlockDriverState *bs)
if (!drv)
return -ENOMEDIUM;
-
+
if (!bs->backing_hd) {
return -ENOTSUP;
}
diff --git a/block/quorum.c b/block/quorum.c
index d5ee9c0..1443fc5 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -24,6 +24,7 @@
#define QUORUM_OPT_VOTE_THRESHOLD "vote-threshold"
#define QUORUM_OPT_BLKVERIFY "blkverify"
#define QUORUM_OPT_REWRITE "rewrite-corrupted"
+#define QUORUM_OPT_READ_PATTERN "read-pattern"
/* This union holds a vote hash value */
typedef union QuorumVoteValue {
@@ -74,6 +75,16 @@ typedef struct BDRVQuorumState {
bool rewrite_corrupted;/* true if the driver must rewrite-on-read corrupted
* block if Quorum is reached.
*/
+
+#define READ_PATTERN_QUORUM 0 /* default */
+#define READ_PATTERN_FIFO 1
+ int read_pattern; /* fifo: read a single child and try first one
+ * first. If error, try next child in an
+ * FIFO order specifed by command line.
+ * Return error if no child read succeeds.
+ * quorum: read all the children and do a quorum
+ * vote on reads.
+ */
} BDRVQuorumState;
typedef struct QuorumAIOCB QuorumAIOCB;
@@ -117,6 +128,7 @@ struct QuorumAIOCB {
bool is_read;
int vote_ret;
+ int child_iter; /* which child to read in fifo pattern */
};
static bool quorum_vote(QuorumAIOCB *acb);
@@ -154,8 +166,10 @@ static void quorum_aio_finalize(QuorumAIOCB *acb)
if (acb->is_read) {
for (i = 0; i < s->num_children; i++) {
- qemu_vfree(acb->qcrs[i].buf);
- qemu_iovec_destroy(&acb->qcrs[i].qiov);
+ if (i <= acb->child_iter) {
+ qemu_vfree(acb->qcrs[i].buf);
+ qemu_iovec_destroy(&acb->qcrs[i].qiov);
+ }
}
}
@@ -256,6 +270,21 @@ static void quorum_rewrite_aio_cb(void *opaque, int ret)
quorum_aio_finalize(acb);
}
+static BlockDriverAIOCB *read_fifo_child(QuorumAIOCB *acb);
+
+static void quorum_copy_qiov(QEMUIOVector *dest, QEMUIOVector *source)
+{
+ int i;
+ assert(dest->niov == source->niov);
+ assert(dest->size == source->size);
+ for (i = 0; i < source->niov; i++) {
+ assert(dest->iov[i].iov_len == source->iov[i].iov_len);
+ memcpy(dest->iov[i].iov_base,
+ source->iov[i].iov_base,
+ source->iov[i].iov_len);
+ }
+}
+
static void quorum_aio_cb(void *opaque, int ret)
{
QuorumChildRequest *sacb = opaque;
@@ -263,6 +292,21 @@ static void quorum_aio_cb(void *opaque, int ret)
BDRVQuorumState *s = acb->common.bs->opaque;
bool rewrite = false;
+ if (acb->is_read && s->read_pattern == READ_PATTERN_FIFO) {
+ /* We try to read next child in FIFO order if we fail to read */
+ if (ret < 0 && ++acb->child_iter < s->num_children) {
+ read_fifo_child(acb);
+ return;
+ }
+
+ if (ret == 0) {
+ quorum_copy_qiov(acb->qiov, &acb->qcrs[acb->child_iter].qiov);
+ }
+ acb->vote_ret = ret;
+ quorum_aio_finalize(acb);
+ return;
+ }
+
sacb->ret = ret;
acb->count++;
if (ret == 0) {
@@ -343,19 +387,6 @@ static bool quorum_rewrite_bad_versions(BDRVQuorumState *s, QuorumAIOCB *acb,
return count;
}
-static void quorum_copy_qiov(QEMUIOVector *dest, QEMUIOVector *source)
-{
- int i;
- assert(dest->niov == source->niov);
- assert(dest->size == source->size);
- for (i = 0; i < source->niov; i++) {
- assert(dest->iov[i].iov_len == source->iov[i].iov_len);
- memcpy(dest->iov[i].iov_base,
- source->iov[i].iov_base,
- source->iov[i].iov_len);
- }
-}
-
static void quorum_count_vote(QuorumVotes *votes,
QuorumVoteValue *value,
int index)
@@ -615,34 +646,62 @@ free_exit:
return rewrite;
}
-static BlockDriverAIOCB *quorum_aio_readv(BlockDriverState *bs,
- int64_t sector_num,
- QEMUIOVector *qiov,
- int nb_sectors,
- BlockDriverCompletionFunc *cb,
- void *opaque)
+static BlockDriverAIOCB *read_quorum_children(QuorumAIOCB *acb)
{
- BDRVQuorumState *s = bs->opaque;
- QuorumAIOCB *acb = quorum_aio_get(s, bs, qiov, sector_num,
- nb_sectors, cb, opaque);
+ BDRVQuorumState *s = acb->common.bs->opaque;
int i;
- acb->is_read = true;
-
for (i = 0; i < s->num_children; i++) {
- acb->qcrs[i].buf = qemu_blockalign(s->bs[i], qiov->size);
- qemu_iovec_init(&acb->qcrs[i].qiov, qiov->niov);
- qemu_iovec_clone(&acb->qcrs[i].qiov, qiov, acb->qcrs[i].buf);
+ acb->qcrs[i].buf = qemu_blockalign(s->bs[i], acb->qiov->size);
+ qemu_iovec_init(&acb->qcrs[i].qiov, acb->qiov->niov);
+ qemu_iovec_clone(&acb->qcrs[i].qiov, acb->qiov, acb->qcrs[i].buf);
}
for (i = 0; i < s->num_children; i++) {
- bdrv_aio_readv(s->bs[i], sector_num, &acb->qcrs[i].qiov, nb_sectors,
- quorum_aio_cb, &acb->qcrs[i]);
+ bdrv_aio_readv(s->bs[i], acb->sector_num, &acb->qcrs[i].qiov,
+ acb->nb_sectors, quorum_aio_cb, &acb->qcrs[i]);
}
return &acb->common;
}
+static BlockDriverAIOCB *read_fifo_child(QuorumAIOCB *acb)
+{
+ BDRVQuorumState *s = acb->common.bs->opaque;
+
+ acb->qcrs[acb->child_iter].buf = qemu_blockalign(s->bs[acb->child_iter],
+ acb->qiov->size);
+ qemu_iovec_init(&acb->qcrs[acb->child_iter].qiov, acb->qiov->niov);
+ qemu_iovec_clone(&acb->qcrs[acb->child_iter].qiov, acb->qiov,
+ acb->qcrs[acb->child_iter].buf);
+ bdrv_aio_readv(s->bs[acb->child_iter], acb->sector_num,
+ &acb->qcrs[acb->child_iter].qiov, acb->nb_sectors,
+ quorum_aio_cb, &acb->qcrs[acb->child_iter]);
+
+ return &acb->common;
+}
+
+static BlockDriverAIOCB *quorum_aio_readv(BlockDriverState *bs,
+ int64_t sector_num,
+ QEMUIOVector *qiov,
+ int nb_sectors,
+ BlockDriverCompletionFunc *cb,
+ void *opaque)
+{
+ BDRVQuorumState *s = bs->opaque;
+ QuorumAIOCB *acb = quorum_aio_get(s, bs, qiov, sector_num,
+ nb_sectors, cb, opaque);
+ acb->is_read = true;
+
+ if (s->read_pattern == READ_PATTERN_QUORUM) {
+ acb->child_iter = s->num_children - 1;
+ return read_quorum_children(acb);
+ }
+
+ acb->child_iter = 0;
+ return read_fifo_child(acb);
+}
+
static BlockDriverAIOCB *quorum_aio_writev(BlockDriverState *bs,
int64_t sector_num,
QEMUIOVector *qiov,
@@ -782,6 +841,11 @@ static QemuOptsList quorum_runtime_opts = {
.type = QEMU_OPT_BOOL,
.help = "Rewrite corrupted block on read quorum",
},
+ {
+ .name = QUORUM_OPT_READ_PATTERN,
+ .type = QEMU_OPT_STRING,
+ .help = "Allowed pattern: quorum, fifo. Quorum is default",
+ },
{ /* end of list */ }
},
};
@@ -796,6 +860,7 @@ static int quorum_open(BlockDriverState *bs, QDict *options, int flags,
QDict *sub = NULL;
QList *list = NULL;
const QListEntry *lentry;
+ const char *read_pattern;
int i;
int ret = 0;
@@ -827,28 +892,44 @@ static int quorum_open(BlockDriverState *bs, QDict *options, int flags,
}
s->threshold = qemu_opt_get_number(opts, QUORUM_OPT_VOTE_THRESHOLD, 0);
-
- /* and validate it against s->num_children */
- ret = quorum_valid_threshold(s->threshold, s->num_children, &local_err);
- if (ret < 0) {
- goto exit;
+ read_pattern = qemu_opt_get(opts, QUORUM_OPT_READ_PATTERN);
+ if (read_pattern) {
+ if (strcmp(read_pattern, "fifo") == 0) {
+ s->read_pattern = READ_PATTERN_FIFO;
+ } else if (strcmp(read_pattern, "quorum") == 0) {
+ s->read_pattern = READ_PATTERN_QUORUM;
+ } else {
+ error_setg(&local_err,
+ "Please set read-pattern as fifo or quorum\n");
+ ret = -EINVAL;
+ goto exit;
+ }
}
- /* is the driver in blkverify mode */
- if (qemu_opt_get_bool(opts, QUORUM_OPT_BLKVERIFY, false) &&
- s->num_children == 2 && s->threshold == 2) {
- s->is_blkverify = true;
- } else if (qemu_opt_get_bool(opts, QUORUM_OPT_BLKVERIFY, false)) {
- fprintf(stderr, "blkverify mode is set by setting blkverify=on "
- "and using two files with vote_threshold=2\n");
- }
+ if (s->read_pattern == READ_PATTERN_QUORUM) {
+ /* and validate it against s->num_children */
+ ret = quorum_valid_threshold(s->threshold, s->num_children, &local_err);
+ if (ret < 0) {
+ goto exit;
+ }
- s->rewrite_corrupted = qemu_opt_get_bool(opts, QUORUM_OPT_REWRITE, false);
- if (s->rewrite_corrupted && s->is_blkverify) {
- error_setg(&local_err,
- "rewrite-corrupted=on cannot be used with blkverify=on");
- ret = -EINVAL;
- goto exit;
+ /* is the driver in blkverify mode */
+ if (qemu_opt_get_bool(opts, QUORUM_OPT_BLKVERIFY, false) &&
+ s->num_children == 2 && s->threshold == 2) {
+ s->is_blkverify = true;
+ } else if (qemu_opt_get_bool(opts, QUORUM_OPT_BLKVERIFY, false)) {
+ fprintf(stderr, "blkverify mode is set by setting blkverify=on "
+ "and using two files with vote_threshold=2\n");
+ }
+
+ s->rewrite_corrupted = qemu_opt_get_bool(opts, QUORUM_OPT_REWRITE,
+ false);
+ if (s->rewrite_corrupted && s->is_blkverify) {
+ error_setg(&local_err,
+ "rewrite-corrupted=on cannot be used with blkverify=on");
+ ret = -EINVAL;
+ goto exit;
+ }
}
/* allocate the children BlockDriverState array */
diff --git a/qapi/block-core.json b/qapi/block-core.json
index e378653..e3716bc 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -1398,12 +1398,17 @@
# @rewrite-corrupted: #optional rewrite corrupted data when quorum is reached
# (Since 2.1)
#
+# @read-pattern: #optional choose quorum or fifo pattern for read
+# set to quorum by default (Since 2.2)
+#
# Since: 2.0
##
{ 'type': 'BlockdevOptionsQuorum',
'data': { '*blkverify': 'bool',
'children': [ 'BlockdevRef' ],
- 'vote-threshold': 'int', '*rewrite-corrupted': 'bool' } }
+ 'vote-threshold': 'int',
+ '*rewrite-corrupted': 'bool',
+ '*read-pattern': 'str' } }
##
# @BlockdevOptions
--
1.9.1
^ permalink raw reply related [flat|nested] 4+ messages in thread
* Re: [Qemu-devel] [PATCH v2] block/quorum: add simple read pattern support
2014-07-15 3:19 [Qemu-devel] [PATCH v2] block/quorum: add simple read pattern support Liu Yuan
@ 2014-07-15 3:33 ` Eric Blake
2014-07-15 4:13 ` Liu Yuan
0 siblings, 1 reply; 4+ messages in thread
From: Eric Blake @ 2014-07-15 3:33 UTC (permalink / raw)
To: Liu Yuan, qemu-devel; +Cc: Kevin Wolf, Benoit Canet, Stefan Hajnoczi
[-- Attachment #1: Type: text/plain, Size: 2313 bytes --]
On 07/14/2014 09:19 PM, Liu Yuan wrote:
> This patch adds single read pattern to quorum driver and quorum vote is default
> pattern.
>
> This patch generalize the above 2 nodes case in the N nodes. That is,
>
> vm -> write to all the N nodes, read just one of them. If single read fails, we
> try to read next node in FIFO order specified by the startup command.
>
> The 2 nodes case is very similar to DRBD[1] though lack of auto-sync
> functionality in the single device/node failure for now. But compared with DRBD
> we still have some advantages over it:
>
> +++ b/block.c
> @@ -2212,7 +2212,7 @@ int bdrv_commit(BlockDriverState *bs)
>
> if (!drv)
> return -ENOMEDIUM;
> -
> +
> if (!bs->backing_hd) {
> return -ENOTSUP;
> }
While this whitespace cleanup is nice, it doesn't belong in this patch,
when there is no other change to this unrelated file.
> +++ b/qapi/block-core.json
> @@ -1398,12 +1398,17 @@
> # @rewrite-corrupted: #optional rewrite corrupted data when quorum is reached
> # (Since 2.1)
> #
> +# @read-pattern: #optional choose quorum or fifo pattern for read
> +# set to quorum by default (Since 2.2)
> +#
> # Since: 2.0
> ##
> { 'type': 'BlockdevOptionsQuorum',
> 'data': { '*blkverify': 'bool',
> 'children': [ 'BlockdevRef' ],
> - 'vote-threshold': 'int', '*rewrite-corrupted': 'bool' } }
> + 'vote-threshold': 'int',
> + '*rewrite-corrupted': 'bool',
> + '*read-pattern': 'str' } }
Raw strings that encode a finite set of values are bad for type-safety.
Please add an enum:
{ 'enum': 'QuorumReadPattern', 'data': [ 'quorum', 'fifo' ] }
then use '*read-pattern': 'QuorumReadPattern'
Should we offer multiple modes in addition to 'quorum'? For example, I
could see a difference between 'fifo' (favor read from the first quorum
member always, unless it fails, good when the first member is local and
other member is remote) and 'round-robin' (evenly distribute reads; each
read goes to the next available quorum member, good when all members are
equally distant).
--
Eric Blake eblake redhat com +1-919-301-3266
Libvirt virtualization library http://libvirt.org
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 604 bytes --]
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: [Qemu-devel] [PATCH v2] block/quorum: add simple read pattern support
2014-07-15 3:33 ` Eric Blake
@ 2014-07-15 4:13 ` Liu Yuan
2014-07-15 4:44 ` Eric Blake
0 siblings, 1 reply; 4+ messages in thread
From: Liu Yuan @ 2014-07-15 4:13 UTC (permalink / raw)
To: Eric Blake; +Cc: Kevin Wolf, qemu-devel, Stefan Hajnoczi, Benoit Canet
On Mon, Jul 14, 2014 at 09:33:59PM -0600, Eric Blake wrote:
> On 07/14/2014 09:19 PM, Liu Yuan wrote:
> > This patch adds single read pattern to quorum driver and quorum vote is default
> > pattern.
> >
>
> > This patch generalize the above 2 nodes case in the N nodes. That is,
> >
> > vm -> write to all the N nodes, read just one of them. If single read fails, we
> > try to read next node in FIFO order specified by the startup command.
> >
> > The 2 nodes case is very similar to DRBD[1] though lack of auto-sync
> > functionality in the single device/node failure for now. But compared with DRBD
> > we still have some advantages over it:
> >
>
> > +++ b/block.c
> > @@ -2212,7 +2212,7 @@ int bdrv_commit(BlockDriverState *bs)
> >
> > if (!drv)
> > return -ENOMEDIUM;
> > -
> > +
> > if (!bs->backing_hd) {
> > return -ENOTSUP;
> > }
>
> While this whitespace cleanup is nice, it doesn't belong in this patch,
> when there is no other change to this unrelated file.
>
>
> > +++ b/qapi/block-core.json
> > @@ -1398,12 +1398,17 @@
> > # @rewrite-corrupted: #optional rewrite corrupted data when quorum is reached
> > # (Since 2.1)
> > #
> > +# @read-pattern: #optional choose quorum or fifo pattern for read
> > +# set to quorum by default (Since 2.2)
> > +#
> > # Since: 2.0
> > ##
> > { 'type': 'BlockdevOptionsQuorum',
> > 'data': { '*blkverify': 'bool',
> > 'children': [ 'BlockdevRef' ],
> > - 'vote-threshold': 'int', '*rewrite-corrupted': 'bool' } }
> > + 'vote-threshold': 'int',
> > + '*rewrite-corrupted': 'bool',
> > + '*read-pattern': 'str' } }
>
> Raw strings that encode a finite set of values are bad for type-safety.
> Please add an enum:
>
> { 'enum': 'QuorumReadPattern', 'data': [ 'quorum', 'fifo' ] }
>
> then use '*read-pattern': 'QuorumReadPattern'
For startup command line, did it imply a change too? Sorry I'm not familiar with
block-core.json.
With your suggestion, how we pass read-pattern via qmp?
read-pattern: fifo
or
read-pattern: quorum
>
> Should we offer multiple modes in addition to 'quorum'? For example, I
> could see a difference between 'fifo' (favor read from the first quorum
> member always, unless it fails, good when the first member is local and
> other member is remote) and 'round-robin' (evenly distribute reads; each
> read goes to the next available quorum member, good when all members are
> equally distant).
I guess so. 'round-robin' in your context would benefit use case seeking for
better read load balance.
Thanks
Yuan
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: [Qemu-devel] [PATCH v2] block/quorum: add simple read pattern support
2014-07-15 4:13 ` Liu Yuan
@ 2014-07-15 4:44 ` Eric Blake
0 siblings, 0 replies; 4+ messages in thread
From: Eric Blake @ 2014-07-15 4:44 UTC (permalink / raw)
To: Liu Yuan; +Cc: Kevin Wolf, qemu-devel, Stefan Hajnoczi, Benoit Canet
[-- Attachment #1: Type: text/plain, Size: 1919 bytes --]
On 07/14/2014 10:13 PM, Liu Yuan wrote:
>>> + '*read-pattern': 'str' } }
>>
>> Raw strings that encode a finite set of values are bad for type-safety.
>> Please add an enum:
>>
>> { 'enum': 'QuorumReadPattern', 'data': [ 'quorum', 'fifo' ] }
>>
>> then use '*read-pattern': 'QuorumReadPattern'
>
> For startup command line, did it imply a change too? Sorry I'm not familiar with
> block-core.json.
Adding an enum in the .json file will generate an enum that you can
reuse in your parsing code (look at commit 82a402e99,
parse_enum_option). The command line will be the same, but instead of
open-coding things, you can now use the same enum as what the QMP code
will use.
>
> With your suggestion, how we pass read-pattern via qmp?
>
> read-pattern: fifo
> or
> read-pattern: quorum
The QMP wire format will be identical - the user still passes
"read-pattern":"fifo" as part of building up their JSON. It's just that
the string "fifo" is now validated for correctness against the enum of
valid strings, instead of being open-ended anything goes.
>
>
>>
>> Should we offer multiple modes in addition to 'quorum'? For example, I
>> could see a difference between 'fifo' (favor read from the first quorum
>> member always, unless it fails, good when the first member is local and
>> other member is remote) and 'round-robin' (evenly distribute reads; each
>> read goes to the next available quorum member, good when all members are
>> equally distant).
>
> I guess so. 'round-robin' in your context would benefit use case seeking for
> better read load balance.
Of course, if you do add 'round-robin', do it in a separate patch; and
that's also a case where having the enum list valid modes will make it
easier to add in a new mode.
--
Eric Blake eblake redhat com +1-919-301-3266
Libvirt virtualization library http://libvirt.org
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 604 bytes --]
^ permalink raw reply [flat|nested] 4+ messages in thread
end of thread, other threads:[~2014-07-15 4:45 UTC | newest]
Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-07-15 3:19 [Qemu-devel] [PATCH v2] block/quorum: add simple read pattern support Liu Yuan
2014-07-15 3:33 ` Eric Blake
2014-07-15 4:13 ` Liu Yuan
2014-07-15 4:44 ` Eric Blake
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.