From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([2001:4830:134:3::10]:45924) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1XONx1-00032O-7r for qemu-devel@nongnu.org; Mon, 01 Sep 2014 05:31:09 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1XONws-00067g-9G for qemu-devel@nongnu.org; Mon, 01 Sep 2014 05:31:03 -0400 Received: from mail-pa0-x22e.google.com ([2607:f8b0:400e:c03::22e]:60274) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1XONwr-00067B-TM for qemu-devel@nongnu.org; Mon, 01 Sep 2014 05:30:54 -0400 Received: by mail-pa0-f46.google.com with SMTP id eu11so11861507pac.19 for ; Mon, 01 Sep 2014 02:30:53 -0700 (PDT) Date: Mon, 1 Sep 2014 17:30:45 +0800 From: Liu Yuan Message-ID: <20140901093045.GF720@ubuntu-trusty> References: <1409557394-11853-1-git-send-email-namei.unix@gmail.com> <1409557394-11853-7-git-send-email-namei.unix@gmail.com> <20140901085743.GG15537@irqsave.net> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <20140901085743.GG15537@irqsave.net> Subject: Re: [Qemu-devel] [PATCH 6/8] block/quorum: add broken state to BlockDriverState List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: =?iso-8859-1?Q?Beno=EEt?= Canet Cc: Kevin Wolf , qemu-devel@nongnu.org, Stefan Hajnoczi On Mon, Sep 01, 2014 at 10:57:43AM +0200, Benoît Canet wrote: > The Monday 01 Sep 2014 à 15:43:12 (+0800), Liu Yuan wrote : > > This allow VM continues to process even if some devices are broken meanwhile > > with proper configuration. > > > > We mark the device broken when the protocol tier notify back some broken > > state(s) of the device, such as diconnection via driver operations. We could > > also reset the device as sound when the protocol tier is repaired. > > > > Origianlly .threshold controls how we should decide the success of read/write > > and return the failure only if the success count of read/write is less than > > .threshold specified by users. But it doesn't record the states of underlying > > states and will impact performance a bit in some cases. > > > > For example, we have 3 children and .threshold is set 2. If one of the devices > > broken, we should still return success and continue to run VM. But for every > > IO operations, we will blindly send the requests to the broken device. > > > > To store broken state into driver state we can save requests to borken devices > > and resend the requests to the repaired ones by setting broken as false. > > > > This is especially useful for network based protocol such as sheepdog, which > > has a auto-reconnection mechanism and will never report EIO if the connection > > is broken but just store the requests to its local queue and wait for resending. > > Without broken state, quorum request will not come back until the connection is > > re-established. So we have to skip the broken deivces to allow VM to continue > > running with networked backed child (glusterfs, nfs, sheepdog, etc). > > > > With the combination of read-pattern and threshold, we can easily mimic the DRVD > > behavior with following configuration: > > > > read-pattern=fifo,threshold=1 will two children. > > > > Cc: Eric Blake > > Cc: Benoit Canet > > Cc: Kevin Wolf > > Cc: Stefan Hajnoczi > > Signed-off-by: Liu Yuan > > --- > > block/quorum.c | 102 ++++++++++++++++++++++++++++++++++++++-------- > > include/block/block_int.h | 3 ++ > > 2 files changed, 87 insertions(+), 18 deletions(-) > > > > diff --git a/block/quorum.c b/block/quorum.c > > index b9eeda3..7b07e35 100644 > > --- a/block/quorum.c > > +++ b/block/quorum.c > > @@ -120,6 +120,7 @@ struct QuorumAIOCB { > > int rewrite_count; /* number of replica to rewrite: count down to > > * zero once writes are fired > > */ > > + int issued_count; /* actual read&write issued count */ > > > > QuorumVotes votes; > > > > @@ -170,8 +171,10 @@ static void quorum_aio_finalize(QuorumAIOCB *acb) > > if (acb->is_read) { > > /* on the quorum case acb->child_iter == s->num_children - 1 */ > > for (i = 0; i <= acb->child_iter; i++) { > > - qemu_vfree(acb->qcrs[i].buf); > > - qemu_iovec_destroy(&acb->qcrs[i].qiov); > > + if (acb->qcrs[i].buf) { > > + qemu_vfree(acb->qcrs[i].buf); > > + qemu_iovec_destroy(&acb->qcrs[i].qiov); > > + } > > } > > } > > > > @@ -207,6 +210,7 @@ static QuorumAIOCB *quorum_aio_get(BDRVQuorumState *s, > > acb->count = 0; > > acb->success_count = 0; > > acb->rewrite_count = 0; > > + acb->issued_count = 0; > > acb->votes.compare = quorum_sha256_compare; > > QLIST_INIT(&acb->votes.vote_list); > > acb->is_read = false; > > @@ -286,6 +290,22 @@ static void quorum_copy_qiov(QEMUIOVector *dest, QEMUIOVector *source) > > } > > } > > > > +static int next_fifo_child(QuorumAIOCB *acb) > > +{ > > + BDRVQuorumState *s = acb->common.bs->opaque; > > + int i; > > + > > + for (i = acb->child_iter; i < s->num_children; i++) { > > + if (!s->bs[i]->broken) { > > + break; > > + } > > + } > > + if (i == s->num_children) { > > + return -1; > > + } > > + return i; > > +} > > + > > static void quorum_aio_cb(void *opaque, int ret) > > { > > QuorumChildRequest *sacb = opaque; > > @@ -293,11 +313,18 @@ static void quorum_aio_cb(void *opaque, int ret) > > BDRVQuorumState *s = acb->common.bs->opaque; > > bool rewrite = false; > > > > + if (ret < 0) { > > + s->bs[acb->child_iter]->broken = true; > > + } > > child_iter is fifo mode stuff. > Do we need to write if (s->read_pattern == QUORUM_READ_PATTERN_FIFO && ret < 0) here ? Probably not. child_iter denotes which bs the QuorumChildRequest belongs to. > > > + > > if (acb->is_read && s->read_pattern == QUORUM_READ_PATTERN_FIFO) { > > /* We try to read next child in FIFO order if we fail to read */ > > - if (ret < 0 && ++acb->child_iter < s->num_children) { > > - read_fifo_child(acb); > > - return; > > + if (ret < 0) { > > + acb->child_iter = next_fifo_child(acb); > > You don't seem to increment child_iter anymore. No, I use a function next_fifo_child() to get the next_fifo_child because right now we have to skip the broken devices. > > > + if (acb->child_iter > 0) { > > + read_fifo_child(acb); > > + return; > > + } > > } > > > > if (ret == 0) { > > @@ -315,9 +342,9 @@ static void quorum_aio_cb(void *opaque, int ret) > > } else { > > quorum_report_bad(acb, sacb->aiocb->bs->node_name, ret); > > } > > - assert(acb->count <= s->num_children); > > - assert(acb->success_count <= s->num_children); > > - if (acb->count < s->num_children) { > > + assert(acb->count <= acb->issued_count); > > + assert(acb->success_count <= acb->issued_count); > > + if (acb->count < acb->issued_count) { > > return; > > } > > > > @@ -647,22 +674,46 @@ free_exit: > > return rewrite; > > } > > > > +static bool has_enough_children(BDRVQuorumState *s) > > +{ > > + int good = 0, i; > > + > > + for (i = 0; i < s->num_children; i++) { > > + if (s->bs[i]->broken) { > > + continue; > > + } > > + good++; > > + } > > + > > + if (good >= s->threshold) { > > + return true; > > + } else { > > + return false; > > + } > > +} > > + > > static BlockDriverAIOCB *read_quorum_children(QuorumAIOCB *acb) > > { > > BDRVQuorumState *s = acb->common.bs->opaque; > > int i; > > > > + if (!has_enough_children(s)) { > > + quorum_aio_release(acb); > > + return NULL; > > + } > > + > > for (i = 0; i < s->num_children; i++) { > > + if (s->bs[i]->broken) { > > + continue; > > + } > > acb->qcrs[i].buf = qemu_blockalign(s->bs[i], acb->qiov->size); > > qemu_iovec_init(&acb->qcrs[i].qiov, acb->qiov->niov); > > qemu_iovec_clone(&acb->qcrs[i].qiov, acb->qiov, acb->qcrs[i].buf); > > - } > > - > > - for (i = 0; i < s->num_children; i++) { > > acb->qcrs[i].aiocb = bdrv_aio_readv(s->bs[i], acb->sector_num, > > &acb->qcrs[i].qiov, > > acb->nb_sectors, quorum_aio_cb, > > &acb->qcrs[i]); > > + acb->issued_count++; > > } > > > > return &acb->common; > > @@ -679,7 +730,7 @@ static BlockDriverAIOCB *read_fifo_child(QuorumAIOCB *acb) > > acb->qcrs[i].aiocb = bdrv_aio_readv(s->bs[i], acb->sector_num, > > &acb->qcrs[i].qiov, acb->nb_sectors, > > quorum_aio_cb, &acb->qcrs[i]); > > - > > + acb->issued_count = 1; > > return &acb->common; > > } > > > > @@ -693,6 +744,7 @@ static BlockDriverAIOCB *quorum_aio_readv(BlockDriverState *bs, > > BDRVQuorumState *s = bs->opaque; > > QuorumAIOCB *acb = quorum_aio_get(s, bs, qiov, sector_num, > > nb_sectors, cb, opaque); > > > + > Spurious carriage return. Good catch, dude. Thanks! Yuan