From: Stefan Hajnoczi <stefanha@redhat.com>
To: "Denis V. Lunev" <den@openvz.org>
Cc: qemu-devel@nongnu.org, qemu-block@nongnu.org,
	vsementsov@virtuozzo.com, Fam Zheng <famz@redhat.com>,
	Kevin Wolf <kwolf@redhat.com>, Max Reitz <mreitz@redhat.com>,
	Jeff Cody <jcody@redhat.com>, Eric Blake <eblake@redhat.com>
Subject: Re: [Qemu-devel] [PATCH 8/9] mirror: use synch scheme for drive mirror
Date: Wed, 15 Jun 2016 10:48:20 +0100
Message-ID: <20160615094820.GD26488@stefanha-x1.localdomain>
In-Reply-To: <1465917916-22348-9-git-send-email-den@openvz.org>

On Tue, Jun 14, 2016 at 06:25:15PM +0300, Denis V. Lunev wrote:
> Block commit of the active image to the backing store on a slow disk
> could never end. For example with the guest with the following loop
> inside
>     while true; do
>         dd bs=1k count=1 if=/dev/zero of=x
>     done
> running on top of slow storage, the operation could not complete within a
> reasonable amount of time:
>     virsh blockcommit rhel7 sda --active --shallow
>     virsh qemu-monitor-event
>     virsh qemu-monitor-command rhel7 \
>         '{"execute":"block-job-complete",\
>           "arguments":{"device":"drive-scsi0-0-0-0"} }'
>     virsh qemu-monitor-event
> Completion event is never received.
> 
> This problem could not be fixed easily with the current architecture. We
> should either prohibit guest writes (making dirty bitmap dirty) or switch
> to the synchronous scheme.
> 
> This patch implements the latter. It adds mirror_before_write_notify
> callback. In this case all data written from the guest is synchronously
> written to the mirror target. The problem is only partially solved, though.
> We should switch from bdrv_dirty_bitmap to simple hbitmap. This will be
> done in the next patch.
> 
> Signed-off-by: Denis V. Lunev <den@openvz.org>
> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> CC: Stefan Hajnoczi <stefanha@redhat.com>
> CC: Fam Zheng <famz@redhat.com>
> CC: Kevin Wolf <kwolf@redhat.com>
> CC: Max Reitz <mreitz@redhat.com>
> CC: Jeff Cody <jcody@redhat.com>
> CC: Eric Blake <eblake@redhat.com>
> ---
>  block/mirror.c | 78 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 78 insertions(+)
> 
> diff --git a/block/mirror.c b/block/mirror.c
> index 7471211..086256c 100644
> --- a/block/mirror.c
> +++ b/block/mirror.c
> @@ -58,6 +58,9 @@ typedef struct MirrorBlockJob {
>      QSIMPLEQ_HEAD(, MirrorBuffer) buf_free;
>      int buf_free_count;
>  
> +    NotifierWithReturn before_write;
> +    CoQueue dependent_writes;
> +
>      unsigned long *in_flight_bitmap;
>      int in_flight;
>      int sectors_in_flight;
> @@ -125,6 +128,7 @@ static void mirror_iteration_done(MirrorOp *op, int ret)
>      g_free(op->buf);

op->buf is allocated with qemu_try_blockalign() further down in this
patch, so it must be freed with qemu_vfree(), not g_free().

>      g_free(op);
>  
> +    qemu_co_queue_restart_all(&s->dependent_writes);
>      if (s->waiting_for_io) {
>          qemu_coroutine_enter(s->common.co, NULL);
>      }
> @@ -511,6 +515,74 @@ static void mirror_exit(BlockJob *job, void *opaque)
>      bdrv_unref(src);
>  }
>  
> +static int coroutine_fn mirror_before_write_notify(
> +        NotifierWithReturn *notifier, void *opaque)
> +{
> +    MirrorBlockJob *s = container_of(notifier, MirrorBlockJob, before_write);
> +    BdrvTrackedRequest *req = opaque;
> +    MirrorOp *op;
> +    int sectors_per_chunk = s->granularity >> BDRV_SECTOR_BITS;
> +    int64_t sector_num = req->offset >> BDRV_SECTOR_BITS;
> +    int nb_sectors = req->bytes >> BDRV_SECTOR_BITS;
> +    int64_t end_sector = sector_num + nb_sectors;
> +    int64_t aligned_start, aligned_end;
> +
> +    if (req->type != BDRV_TRACKED_DISCARD && req->type != BDRV_TRACKED_WRITE) {
> +        /* this is not discard and write, we do not care */
> +        return 0;
> +    }
> +
> +    while (1) {
> +        bool waited = false;
> +        int64_t sn;
> +
> +        for (sn = sector_num; sn < end_sector; sn += sectors_per_chunk) {
> +            int64_t chunk = sn / sectors_per_chunk;
> +            if (test_bit(chunk, s->in_flight_bitmap)) {
> +                trace_mirror_yield_in_flight(s, chunk, s->in_flight);
> +                qemu_co_queue_wait(&s->dependent_writes);
> +                waited = true;
> +            }
> +        }
> +
> +        if (!waited) {
> +            break;
> +        }
> +    }
> +
> +    aligned_start = QEMU_ALIGN_UP(sector_num, sectors_per_chunk);
> +    aligned_end = QEMU_ALIGN_DOWN(sector_num + nb_sectors, sectors_per_chunk);
> +    if (aligned_end > aligned_start) {
> +        bdrv_reset_dirty_bitmap(s->dirty_bitmap, aligned_start,
> +                                aligned_end - aligned_start);
> +    }
> +
> +    if (req->type == BDRV_TRACKED_DISCARD) {
> +        mirror_do_zero_or_discard(s, sector_num, nb_sectors, true);
> +        return 0;
> +    }
> +
> +    s->in_flight++;
> +    s->sectors_in_flight += nb_sectors;
> +
> +    /* Allocate a MirrorOp that is used as an AIO callback.  */
> +    op = g_new(MirrorOp, 1);
> +    op->s = s;
> +    op->sector_num = sector_num;
> +    op->nb_sectors = nb_sectors;
> +    op->buf = qemu_try_blockalign(blk_bs(s->target), req->qiov->size);
> +    if (op->buf == NULL) {
> +        g_free(op);
> +        return -ENOMEM;
> +    }
> +    qemu_iovec_init(&op->qiov, req->qiov->niov);
> +    qemu_iovec_clone(&op->qiov, req->qiov, op->buf);

Now op->qiov's iovec[] array is equivalent to req->qiov but points to
op->buf.  But you never copied the data from req->qiov to op->buf so
junk gets written to the target!
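
To make the missing step concrete, here is a rough standalone sketch in
plain C (not QEMU code).  It uses POSIX struct iovec in place of
QEMUIOVector, and the helper names iov_clone_layout() and iov_gather() are
made up for this illustration.  The first helper only lays out pointers
into the bounce buffer, which is all the clone step does; the second one is
the copy that the patch is missing.  In QEMU itself I would expect a call
such as qemu_iovec_to_buf() to do the copy, but treat that as a suggestion
rather than a tested fix.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/uio.h>

/*
 * Hypothetical stand-in for the clone step: build an iovec array with the
 * same segment lengths as src, but pointing into buf.  No data is copied.
 */
static void iov_clone_layout(struct iovec *dst, const struct iovec *src,
                             int niov, void *buf)
{
    unsigned char *p = buf;
    int i;

    assert(dst != NULL && src != NULL && buf != NULL);

    for (i = 0; i < niov; i++) {
        dst[i].iov_base = p;
        dst[i].iov_len = src[i].iov_len;
        p += src[i].iov_len;
    }
}

/*
 * Hypothetical stand-in for the missing copy: gather the bytes described
 * by iov[] into the contiguous buffer buf.  Copies at most buf_len bytes
 * and returns the number of bytes copied.
 */
static size_t iov_gather(const struct iovec *iov, int niov,
                         void *buf, size_t buf_len)
{
    unsigned char *dst = buf;
    size_t done = 0;
    int i;

    assert(iov != NULL && buf != NULL);

    for (i = 0; i < niov && done < buf_len; i++) {
        size_t len = iov[i].iov_len;

        if (len > buf_len - done) {
            len = buf_len - done;   /* never overrun the bounce buffer */
        }
        memcpy(dst + done, iov[i].iov_base, len);
        done += len;
    }
    return done;
}
```

After iov_clone_layout() alone the bounce buffer still holds whatever was
in it before; only after iov_gather() do the cloned segments describe the
guest's data.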

> +    blk_aio_pwritev(s->target, req->offset, &op->qiov, 0,
> +                    mirror_write_complete, op);
> +    return 0;
> +}

The commit message and description claim this is synchronous, but it is
not.  Each guest write generates an async request to the target, and there
is no rate limiting if s->target is slower than bs.  In that case the queue
of in-flight AIO requests (and their bounce buffers) keeps growing.  The
guest will eventually exhaust host memory, or the aio functions will fail
(e.g. when the Linux AIO maximum number of requests is reached).

If you want this to be synchronous you have to yield the coroutine until
the request completes.  Synchronous writes increase latency so this
cannot be the new default.
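
A toy model of the queue growth, in plain C and not QEMU code, may help.
Everything here is an assumption made for illustration: time advances in
ticks, the guest submits a fixed number of writes per tick, the target
completes a fixed number per tick, and peak_queue_depth() is a made-up
name.  A max_in_flight of 0 models the patch as posted (nothing ever makes
the guest wait), while 1 models a write that yields until the mirrored
request has completed.

```c
#include <assert.h>

/*
 * Toy model: returns the peak number of queued mirror writes, which is
 * also the peak number of live bounce buffers.
 *
 * max_in_flight == 0: submissions are never held back.
 * max_in_flight  > 0: the guest waits once that many writes are queued.
 */
static long peak_queue_depth(long ticks, long guest_per_tick,
                             long target_per_tick, long max_in_flight)
{
    long depth = 0;
    long peak = 0;
    long t;

    assert(ticks >= 0 && guest_per_tick >= 0 && target_per_tick >= 0);
    assert(max_in_flight >= 0);

    for (t = 0; t < ticks; t++) {
        long submit = guest_per_tick;
        long done;

        if (max_in_flight > 0 && depth + submit > max_in_flight) {
            /* The remaining guest writes wait for earlier ones. */
            submit = max_in_flight - depth;
            if (submit < 0) {
                submit = 0;
            }
        }
        depth += submit;
        if (depth > peak) {
            peak = depth;
        }

        /* The slow target completes what it can during this tick. */
        done = depth < target_per_tick ? depth : target_per_tick;
        depth -= done;
    }
    return peak;
}
```

With a guest that submits 4 writes per tick and a target that completes 1,
the unbounded case grows linearly with run time, while any non-zero limit
caps the queue at that limit, at the cost of making the guest wait.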

A different solution is to detect when the dirty bitmap reaches a
minimum threshold and then employ I/O throttling on bs.  That way the
guest experiences no vcpu/network downtime and the I/O performance only
drops during the convergence phase.
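
As a rough sketch of that idea, again a toy model in plain C rather than
QEMU code: assume the job tracks a count of dirty chunks, the guest dirties
guest_rate chunks per tick, the mirror cleans up to mirror_rate per tick,
and guest writes are limited to throttled_rate once the dirty count is at
or below threshold.  All names and the tick-based model are assumptions
made for illustration only.

```c
#include <assert.h>

/*
 * Toy model: returns the tick at which the dirty count reaches zero, or
 * -1 if that does not happen within max_ticks.
 *
 * Throttling applies whenever dirty <= threshold; a threshold of 0
 * therefore disables throttling in this model.
 */
static long ticks_to_converge(long dirty, long guest_rate, long mirror_rate,
                              long threshold, long throttled_rate,
                              long max_ticks)
{
    long t;

    assert(dirty >= 0 && guest_rate >= 0 && mirror_rate >= 0);
    assert(threshold >= 0 && throttled_rate >= 0 && max_ticks >= 0);

    for (t = 0; t < max_ticks; t++) {
        long incoming;
        long cleaned;

        if (dirty == 0) {
            return t;               /* bitmap is clean, job can complete */
        }

        /* Convergence phase: limit how fast the guest can dirty chunks. */
        incoming = dirty <= threshold ? throttled_rate : guest_rate;
        dirty += incoming;

        cleaned = dirty < mirror_rate ? dirty : mirror_rate;
        dirty -= cleaned;
    }
    return dirty == 0 ? max_ticks : -1;
}
```

In this model a guest that dirties chunks as fast as the mirror cleans them
never converges on its own, but converges once the throttled rate is below
the mirror rate.  The guest keeps running the whole time; only its write
rate drops while throttling is active.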

Stefan
