From: Kevin Wolf <kwolf@redhat.com>
To: qemu-block@nongnu.org
Cc: kwolf@redhat.com, qemu-devel@nongnu.org
Subject: [PULL 05/27] qemu-img: rebase: avoid unnecessary COW operations
Date: Tue, 31 Oct 2023 19:58:56 +0100 [thread overview]
Message-ID: <20231031185918.346940-6-kwolf@redhat.com> (raw)
In-Reply-To: <20231031185918.346940-1-kwolf@redhat.com>
From: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
When rebasing an image from one backing file to another, we need to
compare data from old and new backings. If the diff between that data
happens to be unaligned to the target cluster size, we might end up
doing partial writes, which would lead to copy-on-write and additional IO.
Consider the following simple case (virtual_size == cluster_size == 64K):
base <-- inc1 <-- inc2
qemu-io -c "write -P 0xaa 0 32K" base.qcow2
qemu-io -c "write -P 0xcc 32K 32K" base.qcow2
qemu-io -c "write -P 0xbb 0 32K" inc1.qcow2
qemu-io -c "write -P 0xcc 32K 32K" inc1.qcow2
qemu-img rebase -f qcow2 -b base.qcow2 -F qcow2 inc2.qcow2
While doing rebase, we'll write a half of the cluster to inc2, and block
layer will have to read the 2nd half of the same cluster from the base image
inc1 while doing this write operation, although the whole cluster is already
read earlier to perform data comparison.
In order to avoid these unnecessary IO cycles, let's make sure every
write request is aligned to the overlay subcluster boundaries. Using
subcluster size is universal as for the images which don't have them
this size equals to the cluster size. so in any case we end up aligning
to the smallest unit of allocation.
Signed-off-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
Message-ID: <20230919165804.439110-6-andrey.drobyshev@virtuozzo.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
qemu-img.c | 74 +++++++++++++++++++++++++++++++++++++++---------------
1 file changed, 54 insertions(+), 20 deletions(-)
diff --git a/qemu-img.c b/qemu-img.c
index a64a664a37..01de77295e 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -3530,6 +3530,7 @@ static int img_rebase(int argc, char **argv)
uint8_t *buf_new = NULL;
BlockDriverState *bs = NULL, *prefix_chain_bs = NULL;
BlockDriverState *unfiltered_bs;
+ BlockDriverInfo bdi = {0};
char *filename;
const char *fmt, *cache, *src_cache, *out_basefmt, *out_baseimg;
int c, flags, src_flags, ret;
@@ -3540,6 +3541,7 @@ static int img_rebase(int argc, char **argv)
bool quiet = false;
Error *local_err = NULL;
bool image_opts = false;
+ int64_t write_align;
/* Parse commandline parameters */
fmt = NULL;
@@ -3663,6 +3665,20 @@ static int img_rebase(int argc, char **argv)
}
}
+ /*
+ * We need overlay subcluster size to make sure write requests are
+ * aligned.
+ */
+ ret = bdrv_get_info(unfiltered_bs, &bdi);
+ if (ret < 0) {
+ error_report("could not get block driver info");
+ goto out;
+ } else if (bdi.subcluster_size == 0) {
+ bdi.subcluster_size = 1;
+ }
+
+ write_align = bdi.subcluster_size;
+
/* For safe rebasing we need to compare old and new backing file */
if (!unsafe) {
QDict *options = NULL;
@@ -3762,7 +3778,7 @@ static int img_rebase(int argc, char **argv)
int64_t old_backing_size = 0;
int64_t new_backing_size = 0;
uint64_t offset;
- int64_t n;
+ int64_t n, n_old = 0, n_new = 0;
float local_progress = 0;
if (blk_old_backing && bdrv_opt_mem_align(blk_bs(blk_old_backing)) >
@@ -3808,7 +3824,8 @@ static int img_rebase(int argc, char **argv)
}
for (offset = 0; offset < size; offset += n) {
- bool buf_old_is_zero = false;
+ bool old_backing_eof = false;
+ int64_t n_alloc;
/* How many bytes can we handle with the next read? */
n = MIN(IO_BUF_SIZE, size - offset);
@@ -3853,33 +3870,46 @@ static int img_rebase(int argc, char **argv)
}
}
+ /*
+ * At this point we know that the region [offset; offset + n)
+ * is unallocated within the target image. This region might be
+ * unaligned to the target image's (sub)cluster boundaries, as
+ * old backing may have smaller clusters (or have subclusters).
+ * We extend it to the aligned boundaries to avoid CoW on
+ * partial writes in blk_pwrite(),
+ */
+ n += offset - QEMU_ALIGN_DOWN(offset, write_align);
+ offset = QEMU_ALIGN_DOWN(offset, write_align);
+ n += QEMU_ALIGN_UP(offset + n, write_align) - (offset + n);
+ n = MIN(n, size - offset);
+ assert(!bdrv_is_allocated(unfiltered_bs, offset, n, &n_alloc) &&
+ n_alloc == n);
+
+ /*
+ * Much like with the target image, we'll try to read as much
+ * of the old and new backings as we can.
+ */
+ n_old = MIN(n, MAX(0, old_backing_size - (int64_t) offset));
+ n_new = MIN(n, MAX(0, new_backing_size - (int64_t) offset));
+
/*
* Read old and new backing file and take into consideration that
* backing files may be smaller than the COW image.
*/
- if (offset >= old_backing_size) {
- memset(buf_old, 0, n);
- buf_old_is_zero = true;
+ memset(buf_old + n_old, 0, n - n_old);
+ if (!n_old) {
+ old_backing_eof = true;
} else {
- if (offset + n > old_backing_size) {
- n = old_backing_size - offset;
- }
-
- ret = blk_pread(blk_old_backing, offset, n, buf_old, 0);
+ ret = blk_pread(blk_old_backing, offset, n_old, buf_old, 0);
if (ret < 0) {
error_report("error while reading from old backing file");
goto out;
}
}
- if (offset >= new_backing_size || !blk_new_backing) {
- memset(buf_new, 0, n);
- } else {
- if (offset + n > new_backing_size) {
- n = new_backing_size - offset;
- }
-
- ret = blk_pread(blk_new_backing, offset, n, buf_new, 0);
+ memset(buf_new + n_new, 0, n - n_new);
+ if (n_new) {
+ ret = blk_pread(blk_new_backing, offset, n_new, buf_new, 0);
if (ret < 0) {
error_report("error while reading from new backing file");
goto out;
@@ -3893,11 +3923,12 @@ static int img_rebase(int argc, char **argv)
int64_t pnum;
if (compare_buffers(buf_old + written, buf_new + written,
- n - written, 0, &pnum))
+ n - written, write_align, &pnum))
{
- if (buf_old_is_zero) {
+ if (old_backing_eof) {
ret = blk_pwrite_zeroes(blk, offset + written, pnum, 0);
} else {
+ assert(written + pnum <= IO_BUF_SIZE);
ret = blk_pwrite(blk, offset + written, pnum,
buf_old + written, 0);
}
@@ -3909,6 +3940,9 @@ static int img_rebase(int argc, char **argv)
}
written += pnum;
+ if (offset + written >= old_backing_size) {
+ old_backing_eof = true;
+ }
}
qemu_progress_print(local_progress, 100);
}
--
2.41.0
next prev parent reply other threads:[~2023-10-31 19:05 UTC|newest]
Thread overview: 31+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-10-31 18:58 [PULL 00/27] Block layer patches Kevin Wolf
2023-10-31 18:58 ` [PULL 01/27] qemu-img: rebase: stop when reaching EOF of old backing file Kevin Wolf
2023-10-31 18:58 ` [PULL 02/27] qemu-iotests: 024: add rebasing test case for overlay_size > backing_size Kevin Wolf
2023-10-31 18:58 ` [PULL 03/27] qemu-img: rebase: use backing files' BlockBackend for buffer alignment Kevin Wolf
2023-10-31 18:58 ` [PULL 04/27] qemu-img: add chunk size parameter to compare_buffers() Kevin Wolf
2023-10-31 18:58 ` Kevin Wolf [this message]
2023-10-31 18:58 ` [PULL 06/27] iotests/{024, 271}: add testcases for qemu-img rebase Kevin Wolf
2024-07-22 7:18 ` Thomas Huth
2024-07-30 9:43 ` Andrey Drobyshev
2023-10-31 18:58 ` [PULL 07/27] qemu-img: add compression option to rebase subcommand Kevin Wolf
2023-10-31 18:58 ` [PULL 08/27] iotests: add tests for "qemu-img rebase" with compression Kevin Wolf
2023-10-31 18:59 ` [PULL 09/27] block: Fix locking in media change monitor commands Kevin Wolf
2023-10-31 18:59 ` [PULL 10/27] iotests: Test media change with iothreads Kevin Wolf
2023-10-31 18:59 ` [PULL 11/27] blockjob: drop AioContext lock before calling bdrv_graph_wrlock() Kevin Wolf
2023-10-31 18:59 ` [PULL 12/27] block: avoid potential deadlock during bdrv_graph_wrlock() in bdrv_close() Kevin Wolf
2023-10-31 18:59 ` [PULL 13/27] blockdev: mirror: avoid potential deadlock when using iothread Kevin Wolf
2023-10-31 18:59 ` [PULL 14/27] block: rename blk_io_plug_call() API to defer_call() Kevin Wolf
2023-10-31 18:59 ` [PULL 15/27] util/defer-call: move defer_call() to util/ Kevin Wolf
2023-10-31 18:59 ` [PULL 16/27] virtio: use defer_call() in virtio_irqfd_notify() Kevin Wolf
2023-10-31 18:59 ` [PULL 17/27] virtio-blk: remove batch notification BH Kevin Wolf
2023-10-31 18:59 ` [PULL 18/27] blockjob: introduce block-job-change QMP command Kevin Wolf
2023-10-31 18:59 ` [PULL 19/27] block/mirror: set actively_synced even after the job is ready Kevin Wolf
2023-10-31 18:59 ` [PULL 20/27] block/mirror: move dirty bitmap to filter Kevin Wolf
2023-10-31 18:59 ` [PULL 21/27] block/mirror: determine copy_to_target only once Kevin Wolf
2023-10-31 18:59 ` [PULL 22/27] mirror: implement mirror_change method Kevin Wolf
2023-10-31 18:59 ` [PULL 23/27] qapi/block-core: use JobType for BlockJobInfo's type Kevin Wolf
2023-10-31 18:59 ` [PULL 24/27] qapi/block-core: turn BlockJobInfo into a union Kevin Wolf
2023-10-31 18:59 ` [PULL 25/27] blockjob: query driver-specific info via a new 'query' driver method Kevin Wolf
2023-10-31 18:59 ` [PULL 26/27] mirror: return mirror-specific information upon query Kevin Wolf
2023-10-31 18:59 ` [PULL 27/27] iotests: add test for changing mirror's copy_mode Kevin Wolf
2023-10-31 23:31 ` [PULL 00/27] Block layer patches Stefan Hajnoczi
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20231031185918.346940-6-kwolf@redhat.com \
--to=kwolf@redhat.com \
--cc=qemu-block@nongnu.org \
--cc=qemu-devel@nongnu.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).