qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
From: Liu Yuan <namei.unix@gmail.com>
To: "Benoît Canet" <benoit.canet@irqsave.net>
Cc: Kevin Wolf <kwolf@redhat.com>,
	qemu-devel@nongnu.org, Stefan Hajnoczi <stefanha@redhat.com>
Subject: Re: [Qemu-devel] [PATCH 6/8] block/quorum: add broken state to BlockDriverState
Date: Mon, 1 Sep 2014 17:30:45 +0800	[thread overview]
Message-ID: <20140901093045.GF720@ubuntu-trusty> (raw)
In-Reply-To: <20140901085743.GG15537@irqsave.net>

On Mon, Sep 01, 2014 at 10:57:43AM +0200, Benoît Canet wrote:
> The Monday 01 Sep 2014 à 15:43:12 (+0800), Liu Yuan wrote :
> > This allow VM continues to process even if some devices are broken meanwhile
> > with proper configuration.
> > 
> > We mark the device broken when the protocol tier notify back some broken
> > state(s) of the device, such as diconnection via driver operations. We could
> > also reset the device as sound when the protocol tier is repaired.
> > 
> > Origianlly .threshold controls how we should decide the success of read/write
> > and return the failure only if the success count of read/write is less than
> > .threshold specified by users. But it doesn't record the states of underlying
> > states and will impact performance a bit in some cases.
> > 
> > For example, we have 3 children and .threshold is set 2. If one of the devices
> > broken, we should still return success and continue to run VM. But for every
> > IO operations, we will blindly send the requests to the broken device.
> > 
> > To store broken state into driver state we can save requests to borken devices
> > and resend the requests to the repaired ones by setting broken as false.
> > 
> > This is especially useful for network based protocol such as sheepdog, which
> > has a auto-reconnection mechanism and will never report EIO if the connection
> > is broken but just store the requests to its local queue and wait for resending.
> > Without broken state, quorum request will not come back until the connection is
> > re-established. So we have to skip the broken deivces to allow VM to continue
> > running with networked backed child (glusterfs, nfs, sheepdog, etc).
> > 
> > With the combination of read-pattern and threshold, we can easily mimic the DRVD
> > behavior with following configuration:
> > 
> >  read-pattern=fifo,threshold=1 will two children.
> > 
> > Cc: Eric Blake <eblake@redhat.com>
> > Cc: Benoit Canet <benoit@irqsave.net>
> > Cc: Kevin Wolf <kwolf@redhat.com>
> > Cc: Stefan Hajnoczi <stefanha@redhat.com>
> > Signed-off-by: Liu Yuan <namei.unix@gmail.com>
> > ---
> >  block/quorum.c            | 102 ++++++++++++++++++++++++++++++++++++++--------
> >  include/block/block_int.h |   3 ++
> >  2 files changed, 87 insertions(+), 18 deletions(-)
> > 
> > diff --git a/block/quorum.c b/block/quorum.c
> > index b9eeda3..7b07e35 100644
> > --- a/block/quorum.c
> > +++ b/block/quorum.c
> > @@ -120,6 +120,7 @@ struct QuorumAIOCB {
> >      int rewrite_count;          /* number of replica to rewrite: count down to
> >                                   * zero once writes are fired
> >                                   */
> > +    int issued_count;           /* actual read&write issued count */
> >  
> >      QuorumVotes votes;
> >  
> > @@ -170,8 +171,10 @@ static void quorum_aio_finalize(QuorumAIOCB *acb)
> >      if (acb->is_read) {
> >          /* on the quorum case acb->child_iter == s->num_children - 1 */
> >          for (i = 0; i <= acb->child_iter; i++) {
> > -            qemu_vfree(acb->qcrs[i].buf);
> > -            qemu_iovec_destroy(&acb->qcrs[i].qiov);
> > +            if (acb->qcrs[i].buf) {
> > +                qemu_vfree(acb->qcrs[i].buf);
> > +                qemu_iovec_destroy(&acb->qcrs[i].qiov);
> > +            }
> >          }
> >      }
> >  
> > @@ -207,6 +210,7 @@ static QuorumAIOCB *quorum_aio_get(BDRVQuorumState *s,
> >      acb->count = 0;
> >      acb->success_count = 0;
> >      acb->rewrite_count = 0;
> > +    acb->issued_count = 0;
> >      acb->votes.compare = quorum_sha256_compare;
> >      QLIST_INIT(&acb->votes.vote_list);
> >      acb->is_read = false;
> > @@ -286,6 +290,22 @@ static void quorum_copy_qiov(QEMUIOVector *dest, QEMUIOVector *source)
> >      }
> >  }
> >  
> > +static int next_fifo_child(QuorumAIOCB *acb)
> > +{
> > +    BDRVQuorumState *s = acb->common.bs->opaque;
> > +    int i;
> > +
> > +    for (i = acb->child_iter; i < s->num_children; i++) {
> > +        if (!s->bs[i]->broken) {
> > +            break;
> > +        }
> > +    }
> > +    if (i == s->num_children) {
> > +        return -1;
> > +    }
> > +    return i;
> > +}
> > +
> >  static void quorum_aio_cb(void *opaque, int ret)
> >  {
> >      QuorumChildRequest *sacb = opaque;
> > @@ -293,11 +313,18 @@ static void quorum_aio_cb(void *opaque, int ret)
> >      BDRVQuorumState *s = acb->common.bs->opaque;
> >      bool rewrite = false;
> >  
> > +    if (ret < 0) {
> > +        s->bs[acb->child_iter]->broken = true;
> > +    }
> 
> child_iter is fifo mode stuff.
> Do we need to write if (s->read_pattern == QUORUM_READ_PATTERN_FIFO && ret < 0) here ?

Probably not. child_iter denotes which bs the QuorumChildRequest belongs to.

> 
> > +
> >      if (acb->is_read && s->read_pattern == QUORUM_READ_PATTERN_FIFO) {
> >          /* We try to read next child in FIFO order if we fail to read */
> > -        if (ret < 0 && ++acb->child_iter < s->num_children) {
> > -            read_fifo_child(acb);
> > -            return;
> > +        if (ret < 0) {
> > +            acb->child_iter = next_fifo_child(acb);
> 
> You don't seem to increment child_iter anymore.

No, I use a function next_fifo_child() to get the next_fifo_child because right
now we have to skip the broken devices.


> 
> > +            if (acb->child_iter > 0) {
> > +                read_fifo_child(acb);
> > +                return;
> > +            }
> >          }
> >  
> >          if (ret == 0) {
> > @@ -315,9 +342,9 @@ static void quorum_aio_cb(void *opaque, int ret)
> >      } else {
> >          quorum_report_bad(acb, sacb->aiocb->bs->node_name, ret);
> >      }
> > -    assert(acb->count <= s->num_children);
> > -    assert(acb->success_count <= s->num_children);
> > -    if (acb->count < s->num_children) {
> > +    assert(acb->count <= acb->issued_count);
> > +    assert(acb->success_count <= acb->issued_count);
> > +    if (acb->count < acb->issued_count) {
> >          return;
> >      }
> >  
> > @@ -647,22 +674,46 @@ free_exit:
> >      return rewrite;
> >  }
> >  
> > +static bool has_enough_children(BDRVQuorumState *s)
> > +{
> > +    int good = 0, i;
> > +
> > +    for (i = 0; i < s->num_children; i++) {
> > +        if (s->bs[i]->broken) {
> > +            continue;
> > +        }
> > +        good++;
> > +    }
> > +
> > +    if (good >= s->threshold) {
> > +        return true;
> > +    } else {
> > +        return false;
> > +    }
> > +}
> > +
> >  static BlockDriverAIOCB *read_quorum_children(QuorumAIOCB *acb)
> >  {
> >      BDRVQuorumState *s = acb->common.bs->opaque;
> >      int i;
> >  
> > +    if (!has_enough_children(s)) {
> > +        quorum_aio_release(acb);
> > +        return NULL;
> > +    }
> > +
> >      for (i = 0; i < s->num_children; i++) {
> > +        if (s->bs[i]->broken) {
> > +            continue;
> > +        }
> >          acb->qcrs[i].buf = qemu_blockalign(s->bs[i], acb->qiov->size);
> >          qemu_iovec_init(&acb->qcrs[i].qiov, acb->qiov->niov);
> >          qemu_iovec_clone(&acb->qcrs[i].qiov, acb->qiov, acb->qcrs[i].buf);
> > -    }
> > -
> > -    for (i = 0; i < s->num_children; i++) {
> >          acb->qcrs[i].aiocb = bdrv_aio_readv(s->bs[i], acb->sector_num,
> >                                              &acb->qcrs[i].qiov,
> >                                              acb->nb_sectors, quorum_aio_cb,
> >                                              &acb->qcrs[i]);
> > +        acb->issued_count++;
> >      }
> >  
> >      return &acb->common;
> > @@ -679,7 +730,7 @@ static BlockDriverAIOCB *read_fifo_child(QuorumAIOCB *acb)
> >      acb->qcrs[i].aiocb = bdrv_aio_readv(s->bs[i], acb->sector_num,
> >                                          &acb->qcrs[i].qiov, acb->nb_sectors,
> >                                          quorum_aio_cb, &acb->qcrs[i]);
> > -
> > +    acb->issued_count = 1;
> >      return &acb->common;
> >  }
> >  
> > @@ -693,6 +744,7 @@ static BlockDriverAIOCB *quorum_aio_readv(BlockDriverState *bs,
> >      BDRVQuorumState *s = bs->opaque;
> >      QuorumAIOCB *acb = quorum_aio_get(s, bs, qiov, sector_num,
> >                                        nb_sectors, cb, opaque);
> 
> > +
> Spurious carriage return.

Good catch, dude. Thanks!

Yuan

  reply	other threads:[~2014-09-01  9:31 UTC|newest]

Thread overview: 31+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-09-01  7:43 [Qemu-devel] [PATCH 0/8] add basic recovery logic to quorum driver Liu Yuan
2014-09-01  7:43 ` [Qemu-devel] [PATCH 1/8] block/quorum: initialize qcrs.aiocb for read Liu Yuan
2014-09-01  7:43 ` [Qemu-devel] [PATCH 2/8] block: add driver operation callbacks Liu Yuan
2014-09-01  8:28   ` Benoît Canet
2014-09-01  9:19     ` Liu Yuan
2014-09-01  9:28       ` Benoît Canet
2014-09-01  9:40         ` Liu Yuan
2014-09-01  7:43 ` [Qemu-devel] [PATCH 3/8] block/sheepdog: propagate disconnect/reconnect events to upper driver Liu Yuan
2014-09-01  8:31   ` Benoît Canet
2014-09-01  9:22     ` Liu Yuan
2014-09-01  7:43 ` [Qemu-devel] [PATCH 4/8] block/quorum: add quorum_aio_release() helper Liu Yuan
2014-09-01  8:33   ` Benoît Canet
2014-09-01  7:43 ` [Qemu-devel] [PATCH 5/8] quorum: fix quorum_aio_cancel() Liu Yuan
2014-09-01  8:35   ` Benoît Canet
2014-09-01  9:26     ` Liu Yuan
2014-09-01  9:32       ` Benoît Canet
2014-09-01  9:46         ` Liu Yuan
2014-09-01  7:43 ` [Qemu-devel] [PATCH 6/8] block/quorum: add broken state to BlockDriverState Liu Yuan
2014-09-01  8:57   ` Benoît Canet
2014-09-01  9:30     ` Liu Yuan [this message]
2014-09-01  7:43 ` [Qemu-devel] [PATCH 7/8] block: add two helpers Liu Yuan
2014-09-01  8:59   ` Benoît Canet
2014-09-01  7:43 ` [Qemu-devel] [PATCH 8/8] quorum: add basic device recovery logic Liu Yuan
2014-09-01  9:37   ` Benoît Canet
2014-09-01  9:45     ` Liu Yuan
2014-09-01  8:19 ` [Qemu-devel] [PATCH 0/8] add basic recovery logic to quorum driver Benoît Canet
2014-09-02 22:19 ` Benoît Canet
2014-09-10  7:31   ` Liu Yuan
2014-09-07 15:12 ` Benoît Canet
2014-09-10  7:18   ` Liu Yuan
2014-09-10 13:12     ` Benoît Canet

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20140901093045.GF720@ubuntu-trusty \
    --to=namei.unix@gmail.com \
    --cc=benoit.canet@irqsave.net \
    --cc=kwolf@redhat.com \
    --cc=qemu-devel@nongnu.org \
    --cc=stefanha@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).