From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([2001:4830:134:3::10]:34205) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1XONRc-0003mV-OP for qemu-devel@nongnu.org; Mon, 01 Sep 2014 04:58:41 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1XONRX-0000yT-JP for qemu-devel@nongnu.org; Mon, 01 Sep 2014 04:58:36 -0400 Received: from lputeaux-656-01-25-125.w80-12.abo.wanadoo.fr ([80.12.84.125]:37263 helo=paradis.irqsave.net) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1XONRX-0000yP-AH for qemu-devel@nongnu.org; Mon, 01 Sep 2014 04:58:31 -0400 Date: Mon, 1 Sep 2014 10:57:43 +0200 From: =?iso-8859-1?Q?Beno=EEt?= Canet Message-ID: <20140901085743.GG15537@irqsave.net> References: <1409557394-11853-1-git-send-email-namei.unix@gmail.com> <1409557394-11853-7-git-send-email-namei.unix@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline In-Reply-To: <1409557394-11853-7-git-send-email-namei.unix@gmail.com> Content-Transfer-Encoding: quoted-printable Subject: Re: [Qemu-devel] [PATCH 6/8] block/quorum: add broken state to BlockDriverState List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Liu Yuan Cc: Kevin Wolf , qemu-devel@nongnu.org, Stefan Hajnoczi The Monday 01 Sep 2014 =E0 15:43:12 (+0800), Liu Yuan wrote : > This allow VM continues to process even if some devices are broken mean= while > with proper configuration. >=20 > We mark the device broken when the protocol tier notify back some broke= n > state(s) of the device, such as diconnection via driver operations. We = could > also reset the device as sound when the protocol tier is repaired. >=20 > Origianlly .threshold controls how we should decide the success of read= /write > and return the failure only if the success count of read/write is less = than > .threshold specified by users. But it doesn't record the states of unde= rlying > states and will impact performance a bit in some cases. >=20 > For example, we have 3 children and .threshold is set 2. If one of the = devices > broken, we should still return success and continue to run VM. But for = every > IO operations, we will blindly send the requests to the broken device. >=20 > To store broken state into driver state we can save requests to borken = devices > and resend the requests to the repaired ones by setting broken as false= . >=20 > This is especially useful for network based protocol such as sheepdog, = which > has a auto-reconnection mechanism and will never report EIO if the conn= ection > is broken but just store the requests to its local queue and wait for r= esending. > Without broken state, quorum request will not come back until the conne= ction is > re-established. So we have to skip the broken deivces to allow VM to co= ntinue > running with networked backed child (glusterfs, nfs, sheepdog, etc). >=20 > With the combination of read-pattern and threshold, we can easily mimic= the DRVD > behavior with following configuration: >=20 > read-pattern=3Dfifo,threshold=3D1 will two children. >=20 > Cc: Eric Blake > Cc: Benoit Canet > Cc: Kevin Wolf > Cc: Stefan Hajnoczi > Signed-off-by: Liu Yuan > --- > block/quorum.c | 102 ++++++++++++++++++++++++++++++++++++++= -------- > include/block/block_int.h | 3 ++ > 2 files changed, 87 insertions(+), 18 deletions(-) >=20 > diff --git a/block/quorum.c b/block/quorum.c > index b9eeda3..7b07e35 100644 > --- a/block/quorum.c > +++ b/block/quorum.c > @@ -120,6 +120,7 @@ struct QuorumAIOCB { > int rewrite_count; /* number of replica to rewrite: count= down to > * zero once writes are fired > */ > + int issued_count; /* actual read&write issued count */ > =20 > QuorumVotes votes; > =20 > @@ -170,8 +171,10 @@ static void quorum_aio_finalize(QuorumAIOCB *acb) > if (acb->is_read) { > /* on the quorum case acb->child_iter =3D=3D s->num_children -= 1 */ > for (i =3D 0; i <=3D acb->child_iter; i++) { > - qemu_vfree(acb->qcrs[i].buf); > - qemu_iovec_destroy(&acb->qcrs[i].qiov); > + if (acb->qcrs[i].buf) { > + qemu_vfree(acb->qcrs[i].buf); > + qemu_iovec_destroy(&acb->qcrs[i].qiov); > + } > } > } > =20 > @@ -207,6 +210,7 @@ static QuorumAIOCB *quorum_aio_get(BDRVQuorumState = *s, > acb->count =3D 0; > acb->success_count =3D 0; > acb->rewrite_count =3D 0; > + acb->issued_count =3D 0; > acb->votes.compare =3D quorum_sha256_compare; > QLIST_INIT(&acb->votes.vote_list); > acb->is_read =3D false; > @@ -286,6 +290,22 @@ static void quorum_copy_qiov(QEMUIOVector *dest, Q= EMUIOVector *source) > } > } > =20 > +static int next_fifo_child(QuorumAIOCB *acb) > +{ > + BDRVQuorumState *s =3D acb->common.bs->opaque; > + int i; > + > + for (i =3D acb->child_iter; i < s->num_children; i++) { > + if (!s->bs[i]->broken) { > + break; > + } > + } > + if (i =3D=3D s->num_children) { > + return -1; > + } > + return i; > +} > + > static void quorum_aio_cb(void *opaque, int ret) > { > QuorumChildRequest *sacb =3D opaque; > @@ -293,11 +313,18 @@ static void quorum_aio_cb(void *opaque, int ret) > BDRVQuorumState *s =3D acb->common.bs->opaque; > bool rewrite =3D false; > =20 > + if (ret < 0) { > + s->bs[acb->child_iter]->broken =3D true; > + } child_iter is fifo mode stuff. Do we need to write if (s->read_pattern =3D=3D QUORUM_READ_PATTERN_FIFO &= & ret < 0) here ? > + > if (acb->is_read && s->read_pattern =3D=3D QUORUM_READ_PATTERN_FIF= O) { > /* We try to read next child in FIFO order if we fail to read = */ > - if (ret < 0 && ++acb->child_iter < s->num_children) { > - read_fifo_child(acb); > - return; > + if (ret < 0) { > + acb->child_iter =3D next_fifo_child(acb); You don't seem to increment child_iter anymore. > + if (acb->child_iter > 0) { > + read_fifo_child(acb); > + return; > + } > } > =20 > if (ret =3D=3D 0) { > @@ -315,9 +342,9 @@ static void quorum_aio_cb(void *opaque, int ret) > } else { > quorum_report_bad(acb, sacb->aiocb->bs->node_name, ret); > } > - assert(acb->count <=3D s->num_children); > - assert(acb->success_count <=3D s->num_children); > - if (acb->count < s->num_children) { > + assert(acb->count <=3D acb->issued_count); > + assert(acb->success_count <=3D acb->issued_count); > + if (acb->count < acb->issued_count) { > return; > } > =20 > @@ -647,22 +674,46 @@ free_exit: > return rewrite; > } > =20 > +static bool has_enough_children(BDRVQuorumState *s) > +{ > + int good =3D 0, i; > + > + for (i =3D 0; i < s->num_children; i++) { > + if (s->bs[i]->broken) { > + continue; > + } > + good++; > + } > + > + if (good >=3D s->threshold) { > + return true; > + } else { > + return false; > + } > +} > + > static BlockDriverAIOCB *read_quorum_children(QuorumAIOCB *acb) > { > BDRVQuorumState *s =3D acb->common.bs->opaque; > int i; > =20 > + if (!has_enough_children(s)) { > + quorum_aio_release(acb); > + return NULL; > + } > + > for (i =3D 0; i < s->num_children; i++) { > + if (s->bs[i]->broken) { > + continue; > + } > acb->qcrs[i].buf =3D qemu_blockalign(s->bs[i], acb->qiov->size= ); > qemu_iovec_init(&acb->qcrs[i].qiov, acb->qiov->niov); > qemu_iovec_clone(&acb->qcrs[i].qiov, acb->qiov, acb->qcrs[i].b= uf); > - } > - > - for (i =3D 0; i < s->num_children; i++) { > acb->qcrs[i].aiocb =3D bdrv_aio_readv(s->bs[i], acb->sector_nu= m, > &acb->qcrs[i].qiov, > acb->nb_sectors, quorum_ai= o_cb, > &acb->qcrs[i]); > + acb->issued_count++; > } > =20 > return &acb->common; > @@ -679,7 +730,7 @@ static BlockDriverAIOCB *read_fifo_child(QuorumAIOC= B *acb) > acb->qcrs[i].aiocb =3D bdrv_aio_readv(s->bs[i], acb->sector_num, > &acb->qcrs[i].qiov, acb->nb_se= ctors, > quorum_aio_cb, &acb->qcrs[i]); > - > + acb->issued_count =3D 1; > return &acb->common; > } > =20 > @@ -693,6 +744,7 @@ static BlockDriverAIOCB *quorum_aio_readv(BlockDriv= erState *bs, > BDRVQuorumState *s =3D bs->opaque; > QuorumAIOCB *acb =3D quorum_aio_get(s, bs, qiov, sector_num, > nb_sectors, cb, opaque); > + Spurious carriage return. > acb->is_read =3D true; > =20 > if (s->read_pattern =3D=3D QUORUM_READ_PATTERN_QUORUM) { > @@ -701,6 +753,11 @@ static BlockDriverAIOCB *quorum_aio_readv(BlockDri= verState *bs, > } > =20 > acb->child_iter =3D 0; > + acb->child_iter =3D next_fifo_child(acb); > + if (acb->child_iter < 0) { > + quorum_aio_release(acb); > + return NULL; > + } > return read_fifo_child(acb); > } > =20 > @@ -716,10 +773,19 @@ static BlockDriverAIOCB *quorum_aio_writev(BlockD= riverState *bs, > cb, opaque); > int i; > =20 > + if (!has_enough_children(s)) { > + quorum_aio_release(acb); > + return NULL; > + } > + > for (i =3D 0; i < s->num_children; i++) { > + if (s->bs[i]->broken) { > + continue; > + } > acb->qcrs[i].aiocb =3D bdrv_aio_writev(s->bs[i], sector_num, q= iov, > nb_sectors, &quorum_aio_c= b, > &acb->qcrs[i]); > + acb->issued_count++; > } > =20 > return &acb->common; > @@ -926,6 +992,12 @@ static int quorum_open(BlockDriverState *bs, QDict= *options, int flags, > } > =20 > s->threshold =3D qemu_opt_get_number(opts, QUORUM_OPT_VOTE_THRESHO= LD, 0); > + /* and validate it against s->num_children */ > + ret =3D quorum_valid_threshold(s->threshold, s->num_children, &loc= al_err); > + if (ret < 0) { > + goto exit; > + } > + > ret =3D parse_read_pattern(qemu_opt_get(opts, QUORUM_OPT_READ_PATT= ERN)); > if (ret < 0) { > error_setg(&local_err, "Please set read-pattern as fifo or quo= rum\n"); > @@ -934,12 +1006,6 @@ static int quorum_open(BlockDriverState *bs, QDic= t *options, int flags, > s->read_pattern =3D ret; > =20 > if (s->read_pattern =3D=3D QUORUM_READ_PATTERN_QUORUM) { > - /* and validate it against s->num_children */ > - ret =3D quorum_valid_threshold(s->threshold, s->num_children, = &local_err); > - if (ret < 0) { > - goto exit; > - } > - > /* is the driver in blkverify mode */ > if (qemu_opt_get_bool(opts, QUORUM_OPT_BLKVERIFY, false) && > s->num_children =3D=3D 2 && s->threshold =3D=3D 2) { > diff --git a/include/block/block_int.h b/include/block/block_int.h > index 9fdec7f..599110b 100644 > --- a/include/block/block_int.h > +++ b/include/block/block_int.h > @@ -402,6 +402,9 @@ struct BlockDriverState { > =20 > /* The error object in use for blocking operations on backing_hd *= / > Error *backing_blocker; > + > + /* True if the associated device is broken */ > + bool broken; > }; > =20 > int get_tmp_filename(char *filename, int size); > --=20 > 1.9.1 >=20 >=20