From: Paolo Bonzini <pbonzini@redhat.com>
To: Kevin Wolf <kwolf@redhat.com>
Cc: jcody@redhat.com, qemu-devel@nongnu.org
Subject: Re: [Qemu-devel] [PATCH v2 26/45] mirror: introduce mirror job
Date: Tue, 16 Oct 2012 08:36:57 +0200 [thread overview]
Message-ID: <507D0089.8000104@redhat.com> (raw)
In-Reply-To: <507C4065.2010508@redhat.com>
Il 15/10/2012 18:57, Kevin Wolf ha scritto:
> Am 26.09.2012 17:56, schrieb Paolo Bonzini:
>> This patch adds the implementation of a new job that mirrors a disk to
>> a new image while letting the guest continue using the old image.
>> The target is treated as a "black box" and data is copied from the
>> source to the target in the background. This can be used for several
>> purposes, including storage migration, continuous replication, and
>> observation of the guest I/O in an external program. It is also a
>> first step in replacing the inefficient block migration code that is
>> part of QEMU.
>>
>> The job is possibly never-ending, but it is logically structured into
>> two phases: 1) copy all data as fast as possible until the target
>> first gets in sync with the source; 2) keep target in sync and
>> ensure that reopening to the target gets a correct (full) copy
>> of the source data.
>>
>> The second phase is indicated by the progress in "info block-jobs"
>> reporting the current offset to be equal to the length of the file.
>> When the job is cancelled in the second phase, QEMU will run the
>> job until the source is clean and quiescent, then it will report
>> successful completion of the job.
>>
>> In other words, the BLOCK_JOB_CANCELLED event means that the target
>> may _not_ be consistent with a past state of the source; the
>> BLOCK_JOB_COMPLETED event means that the target is consistent with
>> a past state of the source. (Note that it could already happen
>> that management lost the race against QEMU and got a completion
>> event instead of cancellation).
>>
>> It is not yet possible to complete the job and switch over to the target
>> disk. The next patches will fix this and add many refinements to the
>> basic idea introduced here. These include improved error management,
>> some tunable knobs and performance optimizations.
>>
>> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
>> ---
>> v1->v2: Always "goto immediate_exit" and similar code cleanups.
>> Error checking for bdrv_flush. Call bdrv_set_enable_write_cache
>> on the target to make it always writeback.
>>
>> block/Makefile.objs | 2 +-
>> block/mirror.c | 234 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>> block_int.h | 20 +++++
>> qapi-schema.json | 17 ++++
>> trace-events | 7 ++
>> 5 file modificati, 279 inserzioni(+). 1 rimozione(-)
>> create mode 100644 block/mirror.c
>>
>> diff --git a/block/Makefile.objs b/block/Makefile.objs
>> index c45affc..f1a394a 100644
>> --- a/block/Makefile.objs
>> +++ b/block/Makefile.objs
>> @@ -9,4 +9,4 @@ block-obj-$(CONFIG_LIBISCSI) += iscsi.o
>> block-obj-$(CONFIG_CURL) += curl.o
>> block-obj-$(CONFIG_RBD) += rbd.o
>>
>> -common-obj-y += stream.o
>> +common-obj-y += stream.o mirror.o
>> diff --git a/block/mirror.c b/block/mirror.c
>> new file mode 100644
>> index 0000000..09ea020
>> --- /dev/null
>> +++ b/block/mirror.c
>> @@ -0,0 +1,234 @@
>> +/*
>> + * Image mirroring
>> + *
>> + * Copyright Red Hat, Inc. 2012
>> + *
>> + * Authors:
>> + * Paolo Bonzini <pbonzini@redhat.com>
>> + *
>> + * This work is licensed under the terms of the GNU LGPL, version 2 or later.
>> + * See the COPYING.LIB file in the top-level directory.
>> + *
>> + */
>> +
>> +#include "trace.h"
>> +#include "blockjob.h"
>> +#include "block_int.h"
>> +#include "qemu/ratelimit.h"
>> +
>> +enum {
>> + /*
>> + * Size of data buffer for populating the image file. This should be large
>> + * enough to process multiple clusters in a single call, so that populating
>> + * contiguous regions of the image is efficient.
>> + */
>> + BLOCK_SIZE = 512 * BDRV_SECTORS_PER_DIRTY_CHUNK, /* in bytes */
>> +};
>> +
>> +#define SLICE_TIME 100000000ULL /* ns */
>> +
>> +typedef struct MirrorBlockJob {
>> + BlockJob common;
>> + RateLimit limit;
>> + BlockDriverState *target;
>> + MirrorSyncMode mode;
>> + int64_t sector_num;
>> + uint8_t *buf;
>> +} MirrorBlockJob;
>> +
>> +static int coroutine_fn mirror_iteration(MirrorBlockJob *s)
>> +{
>> + BlockDriverState *source = s->common.bs;
>> + BlockDriverState *target = s->target;
>> + QEMUIOVector qiov;
>> + int ret, nb_sectors;
>> + int64_t end;
>> + struct iovec iov;
>> +
>> + end = s->common.len >> BDRV_SECTOR_BITS;
>> + s->sector_num = bdrv_get_next_dirty(source, s->sector_num);
>> + nb_sectors = MIN(BDRV_SECTORS_PER_DIRTY_CHUNK, end - s->sector_num);
>> + bdrv_reset_dirty(source, s->sector_num, nb_sectors);
>> +
>> + /* Copy the dirty cluster. */
>> + iov.iov_base = s->buf;
>> + iov.iov_len = nb_sectors * 512;
>> + qemu_iovec_init_external(&qiov, &iov, 1);
>> +
>> + trace_mirror_one_iteration(s, s->sector_num, nb_sectors);
>> + ret = bdrv_co_readv(source, s->sector_num, nb_sectors, &qiov);
>> + if (ret < 0) {
>> + return ret;
>> + }
>> + return bdrv_co_writev(target, s->sector_num, nb_sectors, &qiov);
>> +}
>> +
>> +static void coroutine_fn mirror_run(void *opaque)
>> +{
>> + MirrorBlockJob *s = opaque;
>> + BlockDriverState *bs = s->common.bs;
>> + int64_t sector_num, end;
>> + int ret = 0;
>> + int n;
>> + bool synced = false;
>> +
>> + if (block_job_is_cancelled(&s->common)) {
>> + goto immediate_exit;
>> + }
>> +
>> + s->common.len = bdrv_getlength(bs);
>> + if (s->common.len < 0) {
>> + block_job_completed(&s->common, s->common.len);
>> + return;
>> + }
>> +
>> + end = s->common.len >> BDRV_SECTOR_BITS;
>> + s->buf = qemu_blockalign(bs, BLOCK_SIZE);
>> +
>> + if (s->mode != MIRROR_SYNC_MODE_NONE) {
>> + /* First part, loop on the sectors and initialize the dirty bitmap. */
>> + BlockDriverState *base;
>> + base = s->mode == MIRROR_SYNC_MODE_FULL ? NULL : bs->backing_hd;
>> + for (sector_num = 0; sector_num < end; ) {
>> + int64_t next = (sector_num | (BDRV_SECTORS_PER_DIRTY_CHUNK - 1)) + 1;
>> + ret = bdrv_co_is_allocated_above(bs, base,
>> + sector_num, next - sector_num, &n);
>> +
>> + if (ret < 0) {
>> + goto immediate_exit;
>> + }
>> +
>> + assert(n > 0);
>> + if (ret == 1) {
>> + bdrv_set_dirty(bs, sector_num, n);
>> + sector_num = next;
>> + } else {
>> + sector_num += n;
>> + }
>> + }
>> + }
>> +
>> + s->sector_num = -1;
>> + for (;;) {
>> + uint64_t delay_ns;
>> + int64_t cnt;
>> + bool should_complete;
>> +
>> + cnt = bdrv_get_dirty_count(bs);
>> + if (cnt != 0) {
>> + ret = mirror_iteration(s);
>> + if (ret < 0) {
>> + goto immediate_exit;
>> + }
>> + cnt = bdrv_get_dirty_count(bs);
>> + }
>> +
>> + should_complete = false;
>> + if (cnt == 0) {
>> + trace_mirror_before_flush(s);
>> + if (bdrv_flush(s->target) < 0) {
>> + goto immediate_exit;
>> + }
>
> Are you sure that we should signal successful completion when
> bdrv_flush() fails?
Hmm, of course not.
>> +
>> + /* We're out of the streaming phase. From now on, if the job
>> + * is cancelled we will actually complete all pending I/O and
>> + * report completion. This way, block-job-cancel will leave
>> + * the target in a consistent state.
>> + */
>
> Don't we have block_job_complete() for that now? Then I think the job
> can be cancelled immediately, even in an inconsistent state.
The idea was that block-job-cancel will still leave the target in a
consistent state if executed during the second phase. Otherwise it is
impossible to take a consistent snapshot and keep running on the first
image.
>> + synced = true;
>> + s->common.offset = end * BDRV_SECTOR_SIZE;
>> + should_complete = block_job_is_cancelled(&s->common);
>> + cnt = bdrv_get_dirty_count(bs);
>> + }
>> +
>> + if (cnt == 0 && should_complete) {
>> + /* The dirty bitmap is not updated while operations are pending.
>> + * If we're about to exit, wait for pending operations before
>> + * calling bdrv_get_dirty_count(bs), or we may exit while the
>> + * source has dirty data to copy!
>> + *
>> + * Note that I/O can be submitted by the guest while
>> + * mirror_populate runs.
>> + */
>> + trace_mirror_before_drain(s, cnt);
>> + bdrv_drain_all();
>> + cnt = bdrv_get_dirty_count(bs);
>> + }
>> +
>> + ret = 0;
>> + trace_mirror_before_sleep(s, cnt, synced);
>> + if (!synced) {
>> + /* Publish progress */
>> + s->common.offset = end * BDRV_SECTOR_SIZE - cnt * BLOCK_SIZE;
>> +
>> + if (s->common.speed) {
>> + delay_ns = ratelimit_calculate_delay(&s->limit, BDRV_SECTORS_PER_DIRTY_CHUNK);
>> + } else {
>> + delay_ns = 0;
>> + }
>> +
>> + /* Note that even when no rate limit is applied we need to yield
>> + * with no pending I/O here so that qemu_aio_flush() returns.
>> + */
>> + block_job_sleep_ns(&s->common, rt_clock, delay_ns);
>> + if (block_job_is_cancelled(&s->common)) {
>> + break;
>> + }
>> + } else if (!should_complete) {
>> + delay_ns = (cnt == 0 ? SLICE_TIME : 0);
>> + block_job_sleep_ns(&s->common, rt_clock, delay_ns);
>
> Why don't we check block_job_is_cancelled() here? I can't see how
> cancellation works in the second phase, except when cnt becomes 0.
Indeed cancellation requires consistency (and hence cnt == 0) in the
second phase.
> But this isn't guaranteed, is it?
Not guaranteed, but in practice it works and you can always throttle
writes on the source to guarantee that it does.
>> + } else if (cnt == 0) {
>> + /* The two disks are in sync. Exit and report successful
>> + * completion.
>> + */
>> + assert(QLIST_EMPTY(&bs->tracked_requests));
>> + s->common.cancelled = false;
>> + break;
>> + }
>> + }
>> +
>> +immediate_exit:
>> + g_free(s->buf);
>> + bdrv_set_dirty_tracking(bs, false);
>> + bdrv_close(s->target);
>> + bdrv_delete(s->target);
>> + block_job_completed(&s->common, ret);
>> +}
>
> Kevin
>
next prev parent reply other threads:[~2012-10-16 6:37 UTC|newest]
Thread overview: 102+ messages / expand[flat|nested] mbox.gz Atom feed top
2012-09-26 15:56 [Qemu-devel] [PATCH v2 00/45] Block job improvements for 1.3 Paolo Bonzini
2012-09-26 15:56 ` [Qemu-devel] [PATCH v2 01/45] qerror/block: introduce QERR_BLOCK_JOB_NOT_ACTIVE Paolo Bonzini
2012-09-26 15:56 ` [Qemu-devel] [PATCH v2 02/45] blockdev: rename block_stream_cb to a generic block_job_cb Paolo Bonzini
2012-09-27 11:56 ` Kevin Wolf
2012-09-26 15:56 ` [Qemu-devel] [PATCH v2 03/45] block: fix documentation of block_job_cancel_sync Paolo Bonzini
2012-09-27 12:03 ` Kevin Wolf
2012-09-27 12:08 ` Paolo Bonzini
2012-09-27 12:13 ` Kevin Wolf
2012-09-26 15:56 ` [Qemu-devel] [PATCH v2 04/45] block: move job APIs to separate files Paolo Bonzini
2012-09-26 15:56 ` [Qemu-devel] [PATCH v2 05/45] block: add block_job_query Paolo Bonzini
2012-09-26 15:56 ` [Qemu-devel] [PATCH v2 06/45] block: add support for job pause/resume Paolo Bonzini
2012-09-26 17:31 ` Eric Blake
2012-09-27 12:18 ` Kevin Wolf
2012-09-27 12:27 ` Paolo Bonzini
2012-09-27 12:45 ` Kevin Wolf
2012-09-27 12:57 ` Paolo Bonzini
2012-09-27 13:51 ` Kevin Wolf
2012-09-26 15:56 ` [Qemu-devel] [PATCH v2 07/45] qmp: add block-job-pause and block-job-resume Paolo Bonzini
2012-09-26 17:45 ` Eric Blake
2012-09-27 9:23 ` Paolo Bonzini
2012-09-26 15:56 ` [Qemu-devel] [PATCH v2 08/45] qemu-iotests: add test for pausing a streaming operation Paolo Bonzini
2012-09-26 15:56 ` [Qemu-devel] [PATCH v2 09/45] block: rename block_job_complete to block_job_completed Paolo Bonzini
2012-09-27 12:30 ` Kevin Wolf
2012-09-27 20:31 ` Jeff Cody
2012-09-28 11:00 ` Paolo Bonzini
2012-09-26 15:56 ` [Qemu-devel] [PATCH v2 10/45] iostatus: rename BlockErrorAction, BlockQMPEventAction Paolo Bonzini
2012-09-26 15:56 ` [Qemu-devel] [PATCH v2 11/45] iostatus: move BlockdevOnError declaration to QAPI Paolo Bonzini
2012-09-26 17:54 ` Eric Blake
2012-09-27 9:23 ` Paolo Bonzini
2012-09-26 15:56 ` [Qemu-devel] [PATCH v2 12/45] iostatus: change is_read to a bool Paolo Bonzini
2012-09-26 15:56 ` [Qemu-devel] [PATCH v2 13/45] iostatus: reorganize io error code Paolo Bonzini
2012-09-26 15:56 ` [Qemu-devel] [PATCH v2 14/45] block: introduce block job error Paolo Bonzini
2012-09-26 19:10 ` Eric Blake
2012-09-26 19:27 ` Eric Blake
2012-09-27 9:24 ` Paolo Bonzini
2012-09-27 13:41 ` Kevin Wolf
2012-09-27 14:50 ` Paolo Bonzini
2012-09-26 15:56 ` [Qemu-devel] [PATCH v2 15/45] stream: add on-error argument Paolo Bonzini
2012-09-26 20:53 ` Eric Blake
2012-09-26 15:56 ` [Qemu-devel] [PATCH v2 16/45] blkdebug: process all set_state rules in the old state Paolo Bonzini
2012-09-26 15:56 ` [Qemu-devel] [PATCH v2 17/45] qemu-iotests: map underscore to dash in QMP argument names Paolo Bonzini
2012-09-26 15:56 ` [Qemu-devel] [PATCH v2 18/45] qemu-iotests: add tests for streaming error handling Paolo Bonzini
2012-09-26 15:56 ` [Qemu-devel] [PATCH v2 19/45] block: add bdrv_query_info Paolo Bonzini
2012-10-15 15:42 ` Kevin Wolf
2012-09-26 15:56 ` [Qemu-devel] [PATCH v2 20/45] block: add bdrv_query_stats Paolo Bonzini
2012-09-26 15:56 ` [Qemu-devel] [PATCH v2 21/45] block: add bdrv_open_backing_file Paolo Bonzini
2012-09-27 18:14 ` Jeff Cody
2012-09-26 15:56 ` [Qemu-devel] [PATCH v2 22/45] block: introduce new dirty bitmap functionality Paolo Bonzini
2012-09-26 15:56 ` [Qemu-devel] [PATCH v2 23/45] block: export dirty bitmap information in query-block Paolo Bonzini
2012-10-15 16:08 ` Kevin Wolf
2012-09-26 15:56 ` [Qemu-devel] [PATCH v2 24/45] block: add block-job-complete Paolo Bonzini
2012-09-26 15:56 ` [Qemu-devel] [PATCH v2 25/45] block: introduce BLOCK_JOB_READY event Paolo Bonzini
2012-09-27 0:01 ` Eric Blake
2012-09-27 9:25 ` Paolo Bonzini
2012-09-26 15:56 ` [Qemu-devel] [PATCH v2 26/45] mirror: introduce mirror job Paolo Bonzini
2012-10-15 16:57 ` Kevin Wolf
2012-10-16 6:36 ` Paolo Bonzini [this message]
2012-10-16 8:24 ` Kevin Wolf
2012-10-16 8:35 ` Paolo Bonzini
2012-09-26 15:56 ` [Qemu-devel] [PATCH v2 27/45] qmp: add drive-mirror command Paolo Bonzini
2012-09-27 0:14 ` Eric Blake
2012-09-27 19:49 ` Jeff Cody
2012-10-15 17:33 ` Kevin Wolf
2012-10-16 6:39 ` Paolo Bonzini
2012-10-18 13:13 ` Paolo Bonzini
2012-09-26 15:56 ` [Qemu-devel] [PATCH v2 28/45] mirror: implement completion Paolo Bonzini
2012-10-15 17:49 ` Kevin Wolf
2012-09-26 15:56 ` [Qemu-devel] [PATCH v2 29/45] qemu-iotests: add mirroring test case Paolo Bonzini
2012-09-27 0:26 ` Eric Blake
2012-10-18 12:43 ` Kevin Wolf
2012-10-18 12:50 ` Paolo Bonzini
2012-10-18 13:08 ` Kevin Wolf
2012-09-26 15:56 ` [Qemu-devel] [PATCH v2 30/45] iostatus: forward block_job_iostatus_reset to block job Paolo Bonzini
2012-09-26 15:56 ` [Qemu-devel] [PATCH v2 31/45] mirror: add support for on-source-error/on-target-error Paolo Bonzini
2012-10-18 13:07 ` Kevin Wolf
2012-10-18 13:10 ` Paolo Bonzini
2012-10-18 13:56 ` Kevin Wolf
2012-10-18 14:52 ` Paolo Bonzini
2012-10-19 8:04 ` Kevin Wolf
2012-10-19 9:30 ` Paolo Bonzini
2012-09-26 15:56 ` [Qemu-devel] [PATCH v2 32/45] qmp: add pull_event function Paolo Bonzini
2012-09-26 17:17 ` Luiz Capitulino
2012-09-26 15:56 ` [Qemu-devel] [PATCH v2 33/45] qemu-iotests: add testcases for mirroring on-source-error/on-target-error Paolo Bonzini
2012-09-26 15:56 ` [Qemu-devel] [PATCH v2 34/45] host-utils: add ffsl Paolo Bonzini
2012-09-27 1:14 ` Eric Blake
2012-09-26 15:56 ` [Qemu-devel] [PATCH v2 35/45] add hierarchical bitmap data type and test cases Paolo Bonzini
2012-09-27 2:53 ` Eric Blake
2012-09-27 9:27 ` Paolo Bonzini
2012-10-24 14:41 ` Kevin Wolf
2012-10-24 14:50 ` Paolo Bonzini
2012-09-26 15:56 ` [Qemu-devel] [PATCH v2 36/45] block: implement dirty bitmap using HBitmap Paolo Bonzini
2012-09-26 15:56 ` [Qemu-devel] [PATCH v2 37/45] block: make round_to_clusters public Paolo Bonzini
2012-09-26 15:56 ` [Qemu-devel] [PATCH v2 38/45] mirror: perform COW if the cluster size is bigger than the granularity Paolo Bonzini
2012-09-26 15:56 ` [Qemu-devel] [PATCH v2 39/45] block: return count of dirty sectors, not chunks Paolo Bonzini
2012-09-26 15:56 ` [Qemu-devel] [PATCH v2 40/45] block: allow customizing the granularity of the dirty bitmap Paolo Bonzini
2012-09-26 15:56 ` [Qemu-devel] [PATCH v2 41/45] mirror: allow customizing the granularity Paolo Bonzini
2012-09-26 15:56 ` [Qemu-devel] [PATCH v2 42/45] mirror: switch mirror_iteration to AIO Paolo Bonzini
2012-09-26 15:56 ` [Qemu-devel] [PATCH v2 43/45] mirror: add buf-size argument to drive-mirror Paolo Bonzini
2012-09-26 15:56 ` [Qemu-devel] [PATCH v2 44/45] mirror: support more than one in-flight AIO operation Paolo Bonzini
2012-09-26 15:56 ` [Qemu-devel] [PATCH v2 45/45] mirror: support arbitrarily-sized iterations Paolo Bonzini
2012-09-27 14:05 ` [Qemu-devel] [PATCH v2 00/45] Block job improvements for 1.3 Kevin Wolf
2012-09-27 14:57 ` Paolo Bonzini
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=507D0089.8000104@redhat.com \
--to=pbonzini@redhat.com \
--cc=jcody@redhat.com \
--cc=kwolf@redhat.com \
--cc=qemu-devel@nongnu.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.