From: Kevin Wolf <kwolf@redhat.com>
To: anthony@codemonkey.ws
Cc: kwolf@redhat.com, qemu-devel@nongnu.org
Subject: [Qemu-devel] [PATCH 24/32] mirror: introduce mirror job
Date: Wed, 24 Oct 2012 11:50:48 +0200
Message-ID: <1351072256-6112-25-git-send-email-kwolf@redhat.com>
In-Reply-To: <1351072256-6112-1-git-send-email-kwolf@redhat.com>
From: Paolo Bonzini <pbonzini@redhat.com>
This patch adds the implementation of a new job that mirrors a disk to
a new image while letting the guest continue using the old image.
The target is treated as a "black box" and data is copied from the
source to the target in the background. This can be used for several
purposes, including storage migration, continuous replication, and
observation of the guest I/O in an external program. It is also a
first step in replacing the inefficient block migration code that is
part of QEMU.
The job is possibly never-ending, but it is logically structured into
two phases: 1) copy all data as fast as possible until the target
first gets in sync with the source; 2) keep the target in sync and
ensure that reopening on the target yields a correct (full) copy
of the source data.
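The two phases can be illustrated with a toy model in plain C. This is only a sketch under stated assumptions, not the code in this patch: ToyMirror and every toy_* name are hypothetical, the "disks" are small in-memory arrays, and guest writes are simulated by a helper that marks a chunk dirty. Like mirror_iteration, the copy step clears the dirty bit before copying the chunk.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

enum { TOY_CHUNKS = 8, TOY_CHUNK_SIZE = 4 };

/* Hypothetical stand-in for a source/target pair plus a dirty bitmap. */
typedef struct ToyMirror {
    unsigned char source[TOY_CHUNKS * TOY_CHUNK_SIZE];
    unsigned char target[TOY_CHUNKS * TOY_CHUNK_SIZE];
    bool dirty[TOY_CHUNKS];
    bool synced;    /* set once the target has caught up (phase 2) */
} ToyMirror;

/* Number of chunks still waiting to be copied. */
static int toy_dirty_count(const ToyMirror *m)
{
    int cnt = 0;
    for (int i = 0; i < TOY_CHUNKS; i++) {
        cnt += m->dirty[i] ? 1 : 0;
    }
    return cnt;
}

/* Simulated guest write: update the source and mark the chunk dirty. */
static void toy_guest_write(ToyMirror *m, int chunk, unsigned char value)
{
    assert(chunk >= 0 && chunk < TOY_CHUNKS);
    memset(&m->source[chunk * TOY_CHUNK_SIZE], value, TOY_CHUNK_SIZE);
    m->dirty[chunk] = true;
}

/* Copy one dirty chunk; returns false if nothing was dirty. */
static bool toy_iteration(ToyMirror *m)
{
    for (int i = 0; i < TOY_CHUNKS; i++) {
        if (m->dirty[i]) {
            m->dirty[i] = false;    /* clear first, then copy */
            memcpy(&m->target[i * TOY_CHUNK_SIZE],
                   &m->source[i * TOY_CHUNK_SIZE], TOY_CHUNK_SIZE);
            return true;
        }
    }
    return false;
}

/* Copy until the bitmap is clean; returns the number of chunks copied. */
static int toy_run_until_clean(ToyMirror *m)
{
    int copied = 0;
    while (toy_iteration(m)) {
        copied++;
    }
    m->synced = true;
    return copied;
}
```

The first call to toy_run_until_clean() corresponds to phase 1. Any later toy_guest_write() dirties a chunk again, and calling toy_run_until_clean() once more brings the target back in sync, which is what phase 2 does continuously.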
The second phase is indicated by the progress in "info block-jobs"
reporting the current offset to be equal to the length of the file.
When the job is cancelled in the second phase, QEMU will run the
job until the source is clean and quiescent, then it will report
successful completion of the job.
In other words, the BLOCK_JOB_CANCELLED event means that the target
may _not_ be consistent with a past state of the source; the
BLOCK_JOB_COMPLETED event means that the target is consistent with
a past state of the source. (Note that management may lose the race
against QEMU and receive a completion event instead of a
cancellation event.)
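A management application could encode this rule along the following lines. This is a minimal sketch: the TargetState enum and target_state_for_event() are hypothetical names, only the two event names come from the text above, and the sketch ignores any error information an event might carry.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical classification of the mirror target after the job ends. */
typedef enum TargetState {
    TARGET_UNKNOWN,             /* event is not one of the two below */
    TARGET_MAYBE_INCONSISTENT,  /* job was cancelled before being in sync */
    TARGET_CONSISTENT           /* target matches a past state of the source */
} TargetState;

/* Map a block job event name to what it implies about the target. */
static TargetState target_state_for_event(const char *event)
{
    assert(event != NULL);
    if (strcmp(event, "BLOCK_JOB_COMPLETED") == 0) {
        return TARGET_CONSISTENT;
    }
    if (strcmp(event, "BLOCK_JOB_CANCELLED") == 0) {
        return TARGET_MAYBE_INCONSISTENT;
    }
    return TARGET_UNKNOWN;
}
```

Because of the race mentioned above, a caller that issued a cancel request should be prepared for either event and decide from the event it actually received.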
It is not yet possible to complete the job and switch over to the target
disk. The next patches will fix this and add many refinements to the
basic idea introduced here. These include improved error management,
some tunable knobs and performance optimizations.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
block/Makefile.objs | 1 +
block/mirror.c | 235 +++++++++++++++++++++++++++++++++++++++++++++++++++
block_int.h | 20 +++++
qapi-schema.json | 17 ++++
trace-events | 7 ++
5 files changed, 280 insertions(+), 0 deletions(-)
create mode 100644 block/mirror.c
diff --git a/block/Makefile.objs b/block/Makefile.objs
index 554f429..806e526 100644
--- a/block/Makefile.objs
+++ b/block/Makefile.objs
@@ -12,3 +12,4 @@ block-obj-$(CONFIG_GLUSTERFS) += gluster.o
common-obj-y += stream.o
common-obj-y += commit.o
+common-obj-y += mirror.o
diff --git a/block/mirror.c b/block/mirror.c
new file mode 100644
index 0000000..b353798
--- /dev/null
+++ b/block/mirror.c
@@ -0,0 +1,235 @@
+/*
+ * Image mirroring
+ *
+ * Copyright Red Hat, Inc. 2012
+ *
+ * Authors:
+ * Paolo Bonzini <pbonzini@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU LGPL, version 2 or later.
+ * See the COPYING.LIB file in the top-level directory.
+ *
+ */
+
+#include "trace.h"
+#include "blockjob.h"
+#include "block_int.h"
+#include "qemu/ratelimit.h"
+
+enum {
+ /*
+ * Size of data buffer for populating the image file. This should be large
+ * enough to process multiple clusters in a single call, so that populating
+ * contiguous regions of the image is efficient.
+ */
+ BLOCK_SIZE = 512 * BDRV_SECTORS_PER_DIRTY_CHUNK, /* in bytes */
+};
+
+#define SLICE_TIME 100000000ULL /* ns */
+
+typedef struct MirrorBlockJob {
+ BlockJob common;
+ RateLimit limit;
+ BlockDriverState *target;
+ MirrorSyncMode mode;
+ int64_t sector_num;
+ uint8_t *buf;
+} MirrorBlockJob;
+
+static int coroutine_fn mirror_iteration(MirrorBlockJob *s)
+{
+ BlockDriverState *source = s->common.bs;
+ BlockDriverState *target = s->target;
+ QEMUIOVector qiov;
+ int ret, nb_sectors;
+ int64_t end;
+ struct iovec iov;
+
+ end = s->common.len >> BDRV_SECTOR_BITS;
+ s->sector_num = bdrv_get_next_dirty(source, s->sector_num);
+ nb_sectors = MIN(BDRV_SECTORS_PER_DIRTY_CHUNK, end - s->sector_num);
+ bdrv_reset_dirty(source, s->sector_num, nb_sectors);
+
+ /* Copy the dirty cluster. */
+ iov.iov_base = s->buf;
+ iov.iov_len = nb_sectors * 512;
+ qemu_iovec_init_external(&qiov, &iov, 1);
+
+ trace_mirror_one_iteration(s, s->sector_num, nb_sectors);
+ ret = bdrv_co_readv(source, s->sector_num, nb_sectors, &qiov);
+ if (ret < 0) {
+ return ret;
+ }
+ return bdrv_co_writev(target, s->sector_num, nb_sectors, &qiov);
+}
+
+static void coroutine_fn mirror_run(void *opaque)
+{
+ MirrorBlockJob *s = opaque;
+ BlockDriverState *bs = s->common.bs;
+ int64_t sector_num, end;
+ int ret = 0;
+ int n;
+ bool synced = false;
+
+ if (block_job_is_cancelled(&s->common)) {
+ goto immediate_exit;
+ }
+
+ s->common.len = bdrv_getlength(bs);
+ if (s->common.len < 0) {
+ block_job_completed(&s->common, s->common.len);
+ return;
+ }
+
+ end = s->common.len >> BDRV_SECTOR_BITS;
+ s->buf = qemu_blockalign(bs, BLOCK_SIZE);
+
+ if (s->mode != MIRROR_SYNC_MODE_NONE) {
+ /* First part, loop on the sectors and initialize the dirty bitmap. */
+ BlockDriverState *base;
+ base = s->mode == MIRROR_SYNC_MODE_FULL ? NULL : bs->backing_hd;
+ for (sector_num = 0; sector_num < end; ) {
+ int64_t next = (sector_num | (BDRV_SECTORS_PER_DIRTY_CHUNK - 1)) + 1;
+ ret = bdrv_co_is_allocated_above(bs, base,
+ sector_num, next - sector_num, &n);
+
+ if (ret < 0) {
+ goto immediate_exit;
+ }
+
+ assert(n > 0);
+ if (ret == 1) {
+ bdrv_set_dirty(bs, sector_num, n);
+ sector_num = next;
+ } else {
+ sector_num += n;
+ }
+ }
+ }
+
+ s->sector_num = -1;
+ for (;;) {
+ uint64_t delay_ns;
+ int64_t cnt;
+ bool should_complete;
+
+ cnt = bdrv_get_dirty_count(bs);
+ if (cnt != 0) {
+ ret = mirror_iteration(s);
+ if (ret < 0) {
+ goto immediate_exit;
+ }
+ cnt = bdrv_get_dirty_count(bs);
+ }
+
+ should_complete = false;
+ if (cnt == 0) {
+ trace_mirror_before_flush(s);
+ ret = bdrv_flush(s->target);
+ if (ret < 0) {
+ goto immediate_exit;
+ }
+
+ /* We're out of the streaming phase. From now on, if the job
+ * is cancelled we will actually complete all pending I/O and
+ * report completion. This way, block-job-cancel will leave
+ * the target in a consistent state.
+ */
+ synced = true;
+ s->common.offset = end * BDRV_SECTOR_SIZE;
+ should_complete = block_job_is_cancelled(&s->common);
+ cnt = bdrv_get_dirty_count(bs);
+ }
+
+ if (cnt == 0 && should_complete) {
+ /* The dirty bitmap is not updated while operations are pending.
+ * If we're about to exit, wait for pending operations before
+ * calling bdrv_get_dirty_count(bs), or we may exit while the
+ * source has dirty data to copy!
+ *
+ * Note that I/O can be submitted by the guest while
+ * mirror_iteration runs.
+ */
+ trace_mirror_before_drain(s, cnt);
+ bdrv_drain_all();
+ cnt = bdrv_get_dirty_count(bs);
+ }
+
+ ret = 0;
+ trace_mirror_before_sleep(s, cnt, synced);
+ if (!synced) {
+ /* Publish progress */
+ s->common.offset = end * BDRV_SECTOR_SIZE - cnt * BLOCK_SIZE;
+
+ if (s->common.speed) {
+ delay_ns = ratelimit_calculate_delay(&s->limit, BDRV_SECTORS_PER_DIRTY_CHUNK);
+ } else {
+ delay_ns = 0;
+ }
+
+ /* Note that even when no rate limit is applied we need to yield
+ * with no pending I/O here so that qemu_aio_flush() returns.
+ */
+ block_job_sleep_ns(&s->common, rt_clock, delay_ns);
+ if (block_job_is_cancelled(&s->common)) {
+ break;
+ }
+ } else if (!should_complete) {
+ delay_ns = (cnt == 0 ? SLICE_TIME : 0);
+ block_job_sleep_ns(&s->common, rt_clock, delay_ns);
+ } else if (cnt == 0) {
+ /* The two disks are in sync. Exit and report successful
+ * completion.
+ */
+ assert(QLIST_EMPTY(&bs->tracked_requests));
+ s->common.cancelled = false;
+ break;
+ }
+ }
+
+immediate_exit:
+ g_free(s->buf);
+ bdrv_set_dirty_tracking(bs, false);
+ bdrv_close(s->target);
+ bdrv_delete(s->target);
+ block_job_completed(&s->common, ret);
+}
+
+static void mirror_set_speed(BlockJob *job, int64_t speed, Error **errp)
+{
+ MirrorBlockJob *s = container_of(job, MirrorBlockJob, common);
+
+ if (speed < 0) {
+ error_set(errp, QERR_INVALID_PARAMETER, "speed");
+ return;
+ }
+ ratelimit_set_speed(&s->limit, speed / BDRV_SECTOR_SIZE, SLICE_TIME);
+}
+
+static BlockJobType mirror_job_type = {
+ .instance_size = sizeof(MirrorBlockJob),
+ .job_type = "mirror",
+ .set_speed = mirror_set_speed,
+};
+
+void mirror_start(BlockDriverState *bs, BlockDriverState *target,
+ int64_t speed, MirrorSyncMode mode,
+ BlockDriverCompletionFunc *cb,
+ void *opaque, Error **errp)
+{
+ MirrorBlockJob *s;
+
+ s = block_job_create(&mirror_job_type, bs, speed, cb, opaque, errp);
+ if (!s) {
+ return;
+ }
+
+ s->target = target;
+ s->mode = mode;
+ bdrv_set_dirty_tracking(bs, true);
+ bdrv_set_enable_write_cache(s->target, true);
+ s->common.co = qemu_coroutine_create(mirror_run);
+ trace_mirror_start(bs, s, s->common.co, opaque);
+ qemu_coroutine_enter(s->common.co, s);
+}
diff --git a/block_int.h b/block_int.h
index f4bae04..aaa46a8 100644
--- a/block_int.h
+++ b/block_int.h
@@ -331,4 +331,24 @@ void commit_start(BlockDriverState *bs, BlockDriverState *base,
BlockdevOnError on_error, BlockDriverCompletionFunc *cb,
void *opaque, Error **errp);
+/*
+ * mirror_start:
+ * @bs: Block device to operate on.
+ * @target: Block device to write to.
+ * @speed: The maximum speed, in bytes per second, or 0 for unlimited.
+ * @mode: Whether to collapse all images in the chain to the target.
+ * @cb: Completion function for the job.
+ * @opaque: Opaque pointer value passed to @cb.
+ * @errp: Error object.
+ *
+ * Start a mirroring operation on @bs. Clusters that are allocated
+ * in @bs will be written to @target until the job is cancelled or
+ * manually completed. At the end of a successful mirroring job,
+ * @bs will be switched to read from @target.
+ */
+void mirror_start(BlockDriverState *bs, BlockDriverState *target,
+ int64_t speed, MirrorSyncMode mode,
+ BlockDriverCompletionFunc *cb,
+ void *opaque, Error **errp);
+
#endif /* BLOCK_INT_H */
diff --git a/qapi-schema.json b/qapi-schema.json
index 37bbeca..8c4b7c8 100644
--- a/qapi-schema.json
+++ b/qapi-schema.json
@@ -1166,6 +1166,23 @@
'data': ['report', 'ignore', 'enospc', 'stop'] }
##
+# @MirrorSyncMode:
+#
+# An enumeration of possible behaviors for the initial synchronization
+# phase of storage mirroring.
+#
+# @top: copies data in the topmost image to the destination
+#
+# @full: copies data from all images to the destination
+#
+# @none: only copy data written from now on
+#
+# Since: 1.3
+##
+{ 'enum': 'MirrorSyncMode',
+ 'data': ['top', 'full', 'none'] }
+
+##
# @BlockJobInfo:
#
# Information about a long-running block device operation.
diff --git a/trace-events b/trace-events
index 9ab8e27..09b5d55 100644
--- a/trace-events
+++ b/trace-events
@@ -77,6 +77,13 @@ stream_start(void *bs, void *base, void *s, void *co, void *opaque) "bs %p base
commit_one_iteration(void *s, int64_t sector_num, int nb_sectors, int is_allocated) "s %p sector_num %"PRId64" nb_sectors %d is_allocated %d"
commit_start(void *bs, void *base, void *top, void *s, void *co, void *opaque) "bs %p base %p top %p s %p co %p opaque %p"
+# block/mirror.c
+mirror_start(void *bs, void *s, void *co, void *opaque) "bs %p s %p co %p opaque %p"
+mirror_before_flush(void *s) "s %p"
+mirror_before_drain(void *s, int64_t cnt) "s %p dirty count %"PRId64
+mirror_before_sleep(void *s, int64_t cnt, int synced) "s %p dirty count %"PRId64" synced %d"
+mirror_one_iteration(void *s, int64_t sector_num, int nb_sectors) "s %p sector_num %"PRId64" nb_sectors %d"
+
# blockdev.c
qmp_block_job_cancel(void *job) "job %p"
qmp_block_job_pause(void *job) "job %p"
--
1.7.6.5
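The chunk arithmetic used in mirror_run and mirror_iteration above can be checked in isolation. The sketch below is not part of the patch: SECTORS_PER_CHUNK stands in for BDRV_SECTORS_PER_DIRTY_CHUNK with an assumed value of 2048 chosen for illustration (the expression only works for a power of two), and both function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative chunk size in sectors; must be a power of two. */
enum { SECTORS_PER_CHUNK = 2048 };

/*
 * First sector of the chunk following the one that contains sector_num,
 * computed as (sector_num | (chunk - 1)) + 1 like the loop in mirror_run.
 */
static int64_t next_chunk_boundary(int64_t sector_num)
{
    assert(sector_num >= 0);
    return (sector_num | (SECTORS_PER_CHUNK - 1)) + 1;
}

/*
 * Number of sectors to copy starting at sector_num, clamped to the end
 * of the image like the MIN() in mirror_iteration.
 */
static int64_t sectors_to_copy(int64_t sector_num, int64_t end)
{
    int64_t remaining = end - sector_num;

    assert(sector_num >= 0 && sector_num < end);
    return remaining < SECTORS_PER_CHUNK ? remaining : SECTORS_PER_CHUNK;
}
```

For example, with an image of 5000 sectors, the chunk that starts at sector 4096 is only 904 sectors long, so the final copy is shorter than a full chunk.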
Thread overview (34+ messages):
2012-10-24 9:50 [Qemu-devel] [PULL 00/32] Block patches Kevin Wolf
2012-10-24 9:50 ` [Qemu-devel] [PATCH 01/32] qemu-img: Fix division by zero for zero size images Kevin Wolf
2012-10-24 9:50 ` [Qemu-devel] [PATCH 02/32] qemu-iotests: Test qemu-img operation on zero size image Kevin Wolf
2012-10-24 9:50 ` [Qemu-devel] [PATCH 03/32] qmp: fix __accept() in qmp.py Kevin Wolf
2012-10-24 9:50 ` [Qemu-devel] [PATCH 04/32] qemu-img rebase: use empty string to rebase without backing file Kevin Wolf
2012-10-24 9:50 ` [Qemu-devel] [PATCH 05/32] block: make bdrv_find_backing_image compare canonical filenames Kevin Wolf
2012-10-24 9:50 ` [Qemu-devel] [PATCH 06/32] block: in commit, determine base image from the top image Kevin Wolf
2012-10-24 9:50 ` [Qemu-devel] [PATCH 07/32] qemu-iotests: add relative backing file tests for block-commit (040) Kevin Wolf
2012-10-24 9:50 ` [Qemu-devel] [PATCH 08/32] qemu-img: Add --backing-chain option to info command Kevin Wolf
2012-10-24 9:50 ` [Qemu-devel] [PATCH 09/32] qemu-iotests: Add 043 backing file chain infinite loop test Kevin Wolf
2012-10-24 9:50 ` [Qemu-devel] [PATCH 10/32] qemu-img: document 'info --backing-chain' Kevin Wolf
2012-10-24 9:50 ` [Qemu-devel] [PATCH 11/32] block: bdrv_create(): don't leak cco.filename on error Kevin Wolf
2012-10-24 9:50 ` [Qemu-devel] [PATCH 12/32] monitor: Allow add-fd to any specified fd set Kevin Wolf
2012-10-24 9:50 ` [Qemu-devel] [PATCH 13/32] monitor: Enable adding an inherited fd to an " Kevin Wolf
2012-10-24 9:50 ` [Qemu-devel] [PATCH 14/32] monitor: Prevent removing fd from set during init Kevin Wolf
2012-10-24 9:50 ` [Qemu-devel] [PATCH 15/32] qemu-config: Add new -add-fd command line option Kevin Wolf
2012-10-24 9:50 ` [Qemu-devel] [PATCH 16/32] block: add bdrv_query_info Kevin Wolf
2012-10-24 9:50 ` [Qemu-devel] [PATCH 17/32] block: add bdrv_query_stats Kevin Wolf
2012-10-24 9:50 ` [Qemu-devel] [PATCH 18/32] block: add bdrv_open_backing_file Kevin Wolf
2012-10-24 9:50 ` [Qemu-devel] [PATCH 19/32] block: introduce new dirty bitmap functionality Kevin Wolf
2012-10-24 9:50 ` [Qemu-devel] [PATCH 20/32] block: export dirty bitmap information in query-block Kevin Wolf
2012-10-24 9:50 ` [Qemu-devel] [PATCH 21/32] block: rename block_job_complete to block_job_completed Kevin Wolf
2012-10-24 9:50 ` [Qemu-devel] [PATCH 22/32] block: add block-job-complete Kevin Wolf
2012-10-24 9:50 ` [Qemu-devel] [PATCH 23/32] block: introduce BLOCK_JOB_READY event Kevin Wolf
2012-10-24 9:50 ` Kevin Wolf [this message]
2012-10-24 9:50 ` [Qemu-devel] [PATCH 25/32] qmp: add drive-mirror command Kevin Wolf
2012-10-24 9:50 ` [Qemu-devel] [PATCH 26/32] mirror: implement completion Kevin Wolf
2012-10-24 9:50 ` [Qemu-devel] [PATCH 27/32] qemu-iotests: add mirroring test case Kevin Wolf
2012-10-24 9:50 ` [Qemu-devel] [PATCH 28/32] iostatus: forward block_job_iostatus_reset to block job Kevin Wolf
2012-10-24 9:50 ` [Qemu-devel] [PATCH 29/32] mirror: add support for on-source-error/on-target-error Kevin Wolf
2012-10-24 9:50 ` [Qemu-devel] [PATCH 30/32] qmp: add pull_event function Kevin Wolf
2012-10-24 9:50 ` [Qemu-devel] [PATCH 31/32] qemu-iotests: add testcases for mirroring on-source-error/on-target-error Kevin Wolf
2012-10-24 9:50 ` [Qemu-devel] [PATCH 32/32] osdep: Less restrictive F_SEFL in qemu_dup_flags() Kevin Wolf
2012-10-29 19:24 ` [Qemu-devel] [PULL 00/32] Block patches Anthony Liguori