* [Qemu-devel] [PATCH 0/2] qemu-img convert: Rewrite copying logic
@ 2015-02-18 15:19 Kevin Wolf
2015-02-18 15:19 ` [Qemu-devel] [PATCH 1/2] " Kevin Wolf
2015-02-18 15:19 ` [Qemu-devel] [PATCH 2/2] qemu-iotests: Some qemu-img convert tests Kevin Wolf
0 siblings, 2 replies; 9+ messages in thread
From: Kevin Wolf @ 2015-02-18 15:19 UTC (permalink / raw)
To: qemu-devel; +Cc: kwolf, stefanha, mreitz
See the commit message of patch 1 for the background.
Kevin Wolf (2):
qemu-img convert: Rewrite copying logic
qemu-iotests: Some qemu-img convert tests
qemu-img.c | 511 +++++++++++++++++++++++++++------------------
tests/qemu-iotests/122 | 83 ++++++++
tests/qemu-iotests/122.out | 30 +++
tests/qemu-iotests/group | 1 +
4 files changed, 419 insertions(+), 206 deletions(-)
create mode 100755 tests/qemu-iotests/122
create mode 100644 tests/qemu-iotests/122.out
--
1.8.3.1
^ permalink raw reply [flat|nested] 9+ messages in thread
* [Qemu-devel] [PATCH 1/2] qemu-img convert: Rewrite copying logic
2015-02-18 15:19 [Qemu-devel] [PATCH 0/2] qemu-img convert: Rewrite copying logic Kevin Wolf
@ 2015-02-18 15:19 ` Kevin Wolf
2015-02-18 17:18 ` Max Reitz
2015-02-18 15:19 ` [Qemu-devel] [PATCH 2/2] qemu-iotests: Some qemu-img convert tests Kevin Wolf
1 sibling, 1 reply; 9+ messages in thread
From: Kevin Wolf @ 2015-02-18 15:19 UTC (permalink / raw)
To: qemu-devel; +Cc: kwolf, stefanha, mreitz
The implementation of qemu-img convert is (a) messy, (b) buggy, and
(c) less efficient than possible. The changes required to beat some
sense into it are massive enough that incremental changes would only
make my and the reviewers' life harder. So throw it away and reimplement
it from scratch.
Let me give some examples what I mean by messy, buggy and inefficient:
(a) The copying logic of qemu-img convert has two separate branches for
compressed and normal target images, which roughly do the same -
except for a little code that handles actual differences between
compressed and uncompressed images, and much more code that
implements just a different set of optimisations and bugs. This is
unnecessary code duplication, and makes the code for compressed
output (unsurprisingly) suffer from bitrot.
The code for uncompressed ouput is run twice to count the the total
length for the progress bar. In the first run it just takes a
shortcut and runs only half the loop, and when it's done, it toggles
a boolean, jumps out of the loop with a backwards goto and starts
over. Works, but pretty is something different.
(b) Converting while keeping a backing file (-B option) is broken in
several ways. This includes not writing to the image file if the
input has zero clusters or data filled with zeros (ignoring that the
backing file will be visible instead).
It also doesn't correctly limit every iteration of the copy loop to
sectors of the same status so that too many sectors may be copied to
in the target image. For -B this gives an unexpected result, for
other images it just does more work than necessary.
Conversion with a compressed target completely ignores any target
backing file.
(c) qemu-img convert skips reading and writing an area if it knows from
metadata that copying isn't needed (except for the bug mentioned
above that ignores a status change in some cases). It does, however,
read from the source even if it knows that it will read zeros, and
then search for non-zero bytes in the read buffer, if it's possible
that a write might be needed.
This reimplementation of the copying core reorganises the code to remove
the duplication and have a much more obvious code flow, by essentially
splitting the copy iteration loop into three parts:
1. Find the number of contiguous sectors of the same status at the
current offset (This can also be called in a separate loop for the
copying loop in order to determine the total sectors for the progress
bar.)
2. Read sectors. If the status implies that there is no data there to
read (zero or unallocated cluster), don't do anything.
3. Write sectors depending on the status. If it's data, write it. If
we want the backing file to be visible (with -B), don't write it. If
it's zeroed, skip it if you can, otherwise use bdrv_write_zeroes() to
optimise the write at least where possible.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
qemu-img.c | 511 ++++++++++++++++++++++++++++++++++++-------------------------
1 file changed, 305 insertions(+), 206 deletions(-)
diff --git a/qemu-img.c b/qemu-img.c
index e148af8..5c8386e 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -1306,20 +1306,307 @@ out3:
return ret;
}
+enum ImgConvertBlockStatus {
+ BLK_DATA,
+ BLK_ZERO,
+ BLK_BACKING_FILE,
+};
+
+typedef struct ImgConvertState {
+ BlockDriverState **src;
+ int64_t *src_sectors;
+ int src_cur, src_num;
+ int64_t src_cur_offset;
+ int64_t total_sectors;
+ int64_t allocated_sectors;
+ enum ImgConvertBlockStatus status;
+ int64_t sector_next_status;
+ BlockDriverState *target;
+ bool has_zero_init;
+ bool compressed;
+ bool out_backing;
+ int min_sparse;
+ size_t cluster_sectors;
+ size_t buf_sectors;
+} ImgConvertState;
+
+static void convert_select_part(ImgConvertState *s, int64_t sector_num)
+{
+ while (sector_num - s->src_cur_offset >= s->src_sectors[s->src_cur]) {
+ s->src_cur_offset += s->src_sectors[s->src_cur];
+ s->src_cur++;
+ assert(s->src_cur < s->src_num);
+ }
+}
+
+static int convert_iteration_sectors(ImgConvertState *s, int64_t sector_num)
+{
+ int64_t ret;
+ int n;
+
+ convert_select_part(s, sector_num);
+
+ assert(s->total_sectors > sector_num);
+ n = MIN(s->total_sectors - sector_num, INT_MAX);
+
+ if (s->sector_next_status <= sector_num) {
+ ret = bdrv_get_block_status(s->src[s->src_cur],
+ sector_num - s->src_cur_offset,
+ n, &n);
+ if (ret < 0) {
+ return ret;
+ }
+
+ if (ret & BDRV_BLOCK_ZERO) {
+ s->status = BLK_ZERO;
+ } else if (ret & BDRV_BLOCK_DATA) {
+ s->status = BLK_DATA;
+ } else if (!s->out_backing) {
+ /* Without a target backing file we must copy over the contents of
+ * the backing file as well. */
+ /* TODO Check block status of the backing file chain to avoid
+ * needlessly reading zeroes and limiting the iteration to the
+ * buffer size */
+ s->status = BLK_DATA;
+ } else {
+ s->status = BLK_BACKING_FILE;
+ }
+
+ s->sector_next_status = sector_num + n;
+ }
+
+ n = MIN(n, s->sector_next_status - sector_num);
+ if (s->status == BLK_DATA) {
+ n = MIN(n, s->buf_sectors);
+ }
+
+ /* We need to write complete clusters for compressed images, so if an
+ * unallocated area is shorter than that, we must consider the whole
+ * cluster allocated. */
+ if (s->compressed) {
+ if (n < s->cluster_sectors) {
+ n = MIN(s->cluster_sectors, s->total_sectors - sector_num);
+ s->status = BLK_DATA;
+ } else {
+ n = QEMU_ALIGN_DOWN(n, s->cluster_sectors);
+ }
+ }
+
+ return n;
+}
+
+static int convert_read(ImgConvertState *s, int64_t sector_num, int nb_sectors,
+ uint8_t *buf)
+{
+ int n;
+ int ret;
+
+ if (s->status == BLK_ZERO || s->status == BLK_BACKING_FILE) {
+ return 0;
+ }
+
+ assert(nb_sectors <= s->buf_sectors);
+ while (nb_sectors > 0) {
+ BlockDriverState *bs;
+ int64_t bs_sectors;
+
+ /* In the case of compression with multiple source files, we can get a
+ * nb_sectors that spreads into the next part. So we must be able to
+ * read across multiple BDSes for one convert_read() call. */
+ convert_select_part(s, sector_num);
+ bs = s->src[s->src_cur];
+ bs_sectors = s->src_sectors[s->src_cur];
+
+ n = MIN(nb_sectors, bs_sectors - (sector_num - s->src_cur_offset));
+ ret = bdrv_read(bs, sector_num, buf, n);
+ if (ret < 0) {
+ return ret;
+ }
+
+ sector_num += n;
+ nb_sectors -= n;
+ buf += n * BDRV_SECTOR_SIZE;
+ }
+
+ return 0;
+}
+
+static int convert_write(ImgConvertState *s, int64_t sector_num, int nb_sectors,
+ const uint8_t *buf)
+{
+ int ret;
+
+ while (nb_sectors > 0) {
+
+ int n = nb_sectors;
+
+ switch (s->status) {
+ case BLK_BACKING_FILE:
+ /* If we have a backing file, leave clusters unallocated that are
+ * unallocated in the source image, so that the backing file is
+ * visible at the respective offset. */
+ assert(s->out_backing);
+ break;
+
+ case BLK_DATA:
+ /* We must always write compressed clusters as a whole, so don't
+ * try to find zeroed parts in the buffer. We can only save the
+ * write if the buffer is completely zeroed. */
+ if (s->compressed) {
+ if (s->has_zero_init &&
+ buffer_is_zero(buf, n * BDRV_SECTOR_SIZE))
+ {
+ assert(!s->out_backing);
+ break;
+ }
+
+ ret = bdrv_write_compressed(s->target, sector_num, buf, n);
+ if (ret < 0) {
+ return ret;
+ }
+ break;
+ }
+
+ /* If there is real non-zero data, we must write it. Otherwise we
+ * can treat it as zero sectors. */
+ if (is_allocated_sectors_min(buf, n, &n, s->min_sparse)) {
+ ret = bdrv_write(s->target, sector_num, buf, n);
+ if (ret < 0) {
+ return ret;
+ }
+ break;
+ }
+ /* fall-through */
+
+ case BLK_ZERO:
+ if (s->has_zero_init) {
+ break;
+ }
+ /* TODO Should this use BDRV_REQ_MAY_UNMAP? */
+ ret = bdrv_write_zeroes(s->target, sector_num, n, 0);
+ if (ret < 0) {
+ return ret;
+ }
+ break;
+ }
+
+ sector_num += n;
+ nb_sectors -= n;
+ buf += n * BDRV_SECTOR_SIZE;
+ }
+
+ return 0;
+}
+
+static int convert_do_copy(ImgConvertState *s)
+{
+ uint8_t *buf = NULL;
+ int64_t sector_num, allocated_done;
+ int ret;
+ int n;
+
+ /* Check whether we have zero initialisation or can get it efficiently */
+ s->has_zero_init = s->min_sparse && !s->out_backing
+ ? bdrv_has_zero_init(s->target)
+ : false;
+
+ if (!s->has_zero_init && !s->out_backing &&
+ bdrv_can_write_zeroes_with_unmap(s->target))
+ {
+ ret = bdrv_make_zero(s->target, BDRV_REQ_MAY_UNMAP);
+ if (ret < 0) {
+ goto fail;
+ }
+ s->has_zero_init = 1;
+ }
+
+ /* Allocate buffer for copied data. For compressed images, only one cluster
+ * can be copied at a time. */
+ if (s->compressed) {
+ if (s->cluster_sectors <= 0 || s->cluster_sectors > s->buf_sectors) {
+ error_report("invalid cluster size");
+ ret = -EINVAL;
+ goto fail;
+ }
+ s->buf_sectors = s->cluster_sectors;
+ }
+ buf = qemu_blockalign(s->target, s->buf_sectors * BDRV_SECTOR_SIZE);
+
+ /* Calculate allocated sectors for progress */
+ s->allocated_sectors = 0;
+ sector_num = 0;
+ while (sector_num < s->total_sectors) {
+ n = convert_iteration_sectors(s, sector_num);
+ if (n < 0) {
+ ret = n;
+ goto fail;
+ }
+ if (s->status == BLK_DATA) {
+ s->allocated_sectors += n;
+ }
+ sector_num += n;
+ }
+
+ /* Do the copy */
+ s->src_cur = 0;
+ s->src_cur_offset = 0;
+ s->sector_next_status = 0;
+
+ sector_num = 0;
+ allocated_done = 0;
+
+ while (sector_num < s->total_sectors) {
+ n = convert_iteration_sectors(s, sector_num);
+ if (n < 0) {
+ ret = n;
+ goto fail;
+ }
+ if (s->status == BLK_DATA) {
+ allocated_done += n;
+ qemu_progress_print(100.0 * allocated_done / s->allocated_sectors,
+ 0);
+ }
+
+ ret = convert_read(s, sector_num, n, buf);
+ if (ret < 0) {
+ error_report("error while reading sector %" PRId64
+ ": %s", sector_num, strerror(-ret));
+ goto fail;
+ }
+
+ ret = convert_write(s, sector_num, n, buf);
+ if (ret < 0) {
+ error_report("error while writing sector %" PRId64
+ ": %s", sector_num, strerror(-ret));
+ goto fail;
+ }
+
+ sector_num += n;
+ }
+
+ if (s->compressed) {
+ /* signal EOF to align */
+ bdrv_write_compressed(s->target, 0, NULL, 0);
+ }
+
+ ret = 0;
+fail:
+ qemu_vfree(buf);
+ return ret;
+}
+
static int img_convert(int argc, char **argv)
{
- int c, n, n1, bs_n, bs_i, compress, cluster_sectors, skip_create;
+ int c, bs_n, bs_i, compress, cluster_sectors, skip_create;
int64_t ret = 0;
int progress = 0, flags, src_flags;
const char *fmt, *out_fmt, *cache, *src_cache, *out_baseimg, *out_filename;
BlockDriver *drv, *proto_drv;
BlockBackend **blk = NULL, *out_blk = NULL;
BlockDriverState **bs = NULL, *out_bs = NULL;
- int64_t total_sectors, nb_sectors, sector_num, bs_offset;
+ int64_t total_sectors;
int64_t *bs_sectors = NULL;
- uint8_t * buf = NULL;
size_t bufsectors = IO_BUF_SIZE / BDRV_SECTOR_SIZE;
- const uint8_t *buf1;
BlockDriverInfo bdi;
QemuOpts *opts = NULL;
QemuOptsList *create_opts = NULL;
@@ -1330,6 +1617,7 @@ static int img_convert(int argc, char **argv)
bool quiet = false;
Error *local_err = NULL;
QemuOpts *sn_opts = NULL;
+ ImgConvertState state;
fmt = NULL;
out_fmt = "raw";
@@ -1622,9 +1910,6 @@ static int img_convert(int argc, char **argv)
}
out_bs = blk_bs(out_blk);
- bs_i = 0;
- bs_offset = 0;
-
/* increase bufsectors from the default 4096 (2M) if opt_transfer_length
* or discard_alignment of the out_bs is greater. Limit to 32768 (16MB)
* as maximum. */
@@ -1633,8 +1918,6 @@ static int img_convert(int argc, char **argv)
out_bs->bl.discard_alignment))
);
- buf = qemu_blockalign(out_bs, bufsectors * BDRV_SECTOR_SIZE);
-
if (skip_create) {
int64_t output_sectors = bdrv_nb_sectors(out_bs);
if (output_sectors < 0) {
@@ -1661,203 +1944,20 @@ static int img_convert(int argc, char **argv)
cluster_sectors = bdi.cluster_size / BDRV_SECTOR_SIZE;
}
- if (compress) {
- if (cluster_sectors <= 0 || cluster_sectors > bufsectors) {
- error_report("invalid cluster size");
- ret = -1;
- goto out;
- }
- sector_num = 0;
-
- nb_sectors = total_sectors;
-
- for(;;) {
- int64_t bs_num;
- int remainder;
- uint8_t *buf2;
-
- nb_sectors = total_sectors - sector_num;
- if (nb_sectors <= 0)
- break;
- if (nb_sectors >= cluster_sectors)
- n = cluster_sectors;
- else
- n = nb_sectors;
-
- bs_num = sector_num - bs_offset;
- assert (bs_num >= 0);
- remainder = n;
- buf2 = buf;
- while (remainder > 0) {
- int nlow;
- while (bs_num == bs_sectors[bs_i]) {
- bs_offset += bs_sectors[bs_i];
- bs_i++;
- assert (bs_i < bs_n);
- bs_num = 0;
- /* printf("changing part: sector_num=%" PRId64 ", "
- "bs_i=%d, bs_offset=%" PRId64 ", bs_sectors=%" PRId64
- "\n", sector_num, bs_i, bs_offset, bs_sectors[bs_i]); */
- }
- assert (bs_num < bs_sectors[bs_i]);
-
- nlow = remainder > bs_sectors[bs_i] - bs_num
- ? bs_sectors[bs_i] - bs_num : remainder;
-
- ret = bdrv_read(bs[bs_i], bs_num, buf2, nlow);
- if (ret < 0) {
- error_report("error while reading sector %" PRId64 ": %s",
- bs_num, strerror(-ret));
- goto out;
- }
-
- buf2 += nlow * 512;
- bs_num += nlow;
-
- remainder -= nlow;
- }
- assert (remainder == 0);
-
- if (!buffer_is_zero(buf, n * BDRV_SECTOR_SIZE)) {
- ret = bdrv_write_compressed(out_bs, sector_num, buf, n);
- if (ret != 0) {
- error_report("error while compressing sector %" PRId64
- ": %s", sector_num, strerror(-ret));
- goto out;
- }
- }
- sector_num += n;
- qemu_progress_print(100.0 * sector_num / total_sectors, 0);
- }
- /* signal EOF to align */
- bdrv_write_compressed(out_bs, 0, NULL, 0);
- } else {
- int64_t sectors_to_read, sectors_read, sector_num_next_status;
- bool count_allocated_sectors;
- int has_zero_init = min_sparse ? bdrv_has_zero_init(out_bs) : 0;
-
- if (!has_zero_init && bdrv_can_write_zeroes_with_unmap(out_bs)) {
- ret = bdrv_make_zero(out_bs, BDRV_REQ_MAY_UNMAP);
- if (ret < 0) {
- goto out;
- }
- has_zero_init = 1;
- }
-
- sectors_to_read = total_sectors;
- count_allocated_sectors = progress && (out_baseimg || has_zero_init);
-restart:
- sector_num = 0; // total number of sectors converted so far
- sectors_read = 0;
- sector_num_next_status = 0;
-
- for(;;) {
- nb_sectors = total_sectors - sector_num;
- if (nb_sectors <= 0) {
- if (count_allocated_sectors) {
- sectors_to_read = sectors_read;
- count_allocated_sectors = false;
- goto restart;
- }
- ret = 0;
- break;
- }
-
- while (sector_num - bs_offset >= bs_sectors[bs_i]) {
- bs_offset += bs_sectors[bs_i];
- bs_i ++;
- assert (bs_i < bs_n);
- /* printf("changing part: sector_num=%" PRId64 ", bs_i=%d, "
- "bs_offset=%" PRId64 ", bs_sectors=%" PRId64 "\n",
- sector_num, bs_i, bs_offset, bs_sectors[bs_i]); */
- }
-
- if ((out_baseimg || has_zero_init) &&
- sector_num >= sector_num_next_status) {
- n = nb_sectors > INT_MAX ? INT_MAX : nb_sectors;
- ret = bdrv_get_block_status(bs[bs_i], sector_num - bs_offset,
- n, &n1);
- if (ret < 0) {
- error_report("error while reading block status of sector %"
- PRId64 ": %s", sector_num - bs_offset,
- strerror(-ret));
- goto out;
- }
- /* If the output image is zero initialized, we are not working
- * on a shared base and the input is zero we can skip the next
- * n1 sectors */
- if (has_zero_init && !out_baseimg && (ret & BDRV_BLOCK_ZERO)) {
- sector_num += n1;
- continue;
- }
- /* If the output image is being created as a copy on write
- * image, assume that sectors which are unallocated in the
- * input image are present in both the output's and input's
- * base images (no need to copy them). */
- if (out_baseimg) {
- if (!(ret & BDRV_BLOCK_DATA)) {
- sector_num += n1;
- continue;
- }
- /* The next 'n1' sectors are allocated in the input image.
- * Copy only those as they may be followed by unallocated
- * sectors. */
- nb_sectors = n1;
- }
- /* avoid redundant callouts to get_block_status */
- sector_num_next_status = sector_num + n1;
- }
-
- n = MIN(nb_sectors, bufsectors);
-
- /* round down request length to an aligned sector, but
- * do not bother doing this on short requests. They happen
- * when we found an all-zero area, and the next sector to
- * write will not be sector_num + n. */
- if (cluster_sectors > 0 && n >= cluster_sectors) {
- int64_t next_aligned_sector = (sector_num + n);
- next_aligned_sector -= next_aligned_sector % cluster_sectors;
- if (sector_num + n > next_aligned_sector) {
- n = next_aligned_sector - sector_num;
- }
- }
-
- n = MIN(n, bs_sectors[bs_i] - (sector_num - bs_offset));
-
- sectors_read += n;
- if (count_allocated_sectors) {
- sector_num += n;
- continue;
- }
+ state = (ImgConvertState) {
+ .src = bs,
+ .src_sectors = bs_sectors,
+ .src_num = bs_n,
+ .total_sectors = total_sectors,
+ .target = out_bs,
+ .compressed = compress,
+ .out_backing = (bool) out_baseimg,
+ .min_sparse = min_sparse,
+ .cluster_sectors = cluster_sectors,
+ .buf_sectors = bufsectors,
+ };
+ ret = convert_do_copy(&state);
- n1 = n;
- ret = bdrv_read(bs[bs_i], sector_num - bs_offset, buf, n);
- if (ret < 0) {
- error_report("error while reading sector %" PRId64 ": %s",
- sector_num - bs_offset, strerror(-ret));
- goto out;
- }
- /* NOTE: at the same time we convert, we do not write zero
- sectors to have a chance to compress the image. Ideally, we
- should add a specific call to have the info to go faster */
- buf1 = buf;
- while (n > 0) {
- if (!has_zero_init ||
- is_allocated_sectors_min(buf1, n, &n1, min_sparse)) {
- ret = bdrv_write(out_bs, sector_num, buf1, n1);
- if (ret < 0) {
- error_report("error while writing sector %" PRId64
- ": %s", sector_num, strerror(-ret));
- goto out;
- }
- }
- sector_num += n1;
- n -= n1;
- buf1 += n1 * 512;
- }
- qemu_progress_print(100.0 * sectors_read / sectors_to_read, 0);
- }
- }
out:
if (!ret) {
qemu_progress_print(100, 0);
@@ -1865,7 +1965,6 @@ out:
qemu_progress_end();
qemu_opts_del(opts);
qemu_opts_free(create_opts);
- qemu_vfree(buf);
qemu_opts_del(sn_opts);
blk_unref(out_blk);
g_free(bs);
--
1.8.3.1
^ permalink raw reply related [flat|nested] 9+ messages in thread
* [Qemu-devel] [PATCH 2/2] qemu-iotests: Some qemu-img convert tests
2015-02-18 15:19 [Qemu-devel] [PATCH 0/2] qemu-img convert: Rewrite copying logic Kevin Wolf
2015-02-18 15:19 ` [Qemu-devel] [PATCH 1/2] " Kevin Wolf
@ 2015-02-18 15:19 ` Kevin Wolf
2015-02-18 17:24 ` Max Reitz
2015-03-19 14:41 ` Max Reitz
1 sibling, 2 replies; 9+ messages in thread
From: Kevin Wolf @ 2015-02-18 15:19 UTC (permalink / raw)
To: qemu-devel; +Cc: kwolf, stefanha, mreitz
This adds a regression test for some problems that the qemu-img convert
rewrite just fixed.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
tests/qemu-iotests/122 | 83 ++++++++++++++++++++++++++++++++++++++++++++++
tests/qemu-iotests/122.out | 30 +++++++++++++++++
tests/qemu-iotests/group | 1 +
3 files changed, 114 insertions(+)
create mode 100755 tests/qemu-iotests/122
create mode 100644 tests/qemu-iotests/122.out
diff --git a/tests/qemu-iotests/122 b/tests/qemu-iotests/122
new file mode 100755
index 0000000..e0c472f
--- /dev/null
+++ b/tests/qemu-iotests/122
@@ -0,0 +1,83 @@
+#!/bin/bash
+#
+# Test some qemu-img convert cases
+#
+# Copyright (C) 2015 Red Hat, Inc.
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program. If not, see <http://www.gnu.org/licenses/>.
+#
+
+# creator
+owner=kwolf@redhat.com
+
+seq="$(basename $0)"
+echo "QA output created by $seq"
+
+here="$PWD"
+tmp=/tmp/$$
+status=1 # failure is the default!
+
+_cleanup()
+{
+ _cleanup_test_img
+}
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+# get standard environment, filters and checks
+. ./common.rc
+. ./common.filter
+
+_supported_fmt qcow2
+_supported_proto generic
+_supported_os Linux
+
+
+TEST_IMG=$TEST_IMG.base _make_test_img 64M
+$QEMU_IO -c "write -P 0x11 0 64M" $TEST_IMG.base 2>&1 | _filter_qemu_io | _filter_testdir
+
+
+echo
+echo "=== Check allocation status regression with -B ==="
+echo
+
+_make_test_img -b $TEST_IMG.base
+$QEMU_IO -c "write -P 0x22 0 3M" $TEST_IMG 2>&1 | _filter_qemu_io | _filter_testdir
+$QEMU_IMG convert -O $IMGFMT -B $TEST_IMG.base $TEST_IMG $TEST_IMG.orig
+$QEMU_IMG map $TEST_IMG.orig | _filter_qemu_img_map
+
+
+echo
+echo "=== Check that zero clusters are kept in overlay ==="
+echo
+
+_make_test_img -b $TEST_IMG.base
+
+$QEMU_IO -c "write -P 0 0 3M" $TEST_IMG 2>&1 | _filter_qemu_io | _filter_testdir
+$QEMU_IMG convert -O $IMGFMT -B $TEST_IMG.base $TEST_IMG $TEST_IMG.orig
+$QEMU_IO -c "read -P 0 0 3M" $TEST_IMG.orig 2>&1 | _filter_qemu_io | _filter_testdir
+$QEMU_IMG convert -O $IMGFMT -c -B $TEST_IMG.base $TEST_IMG $TEST_IMG.orig
+$QEMU_IO -c "read -P 0 0 3M" $TEST_IMG.orig 2>&1 | _filter_qemu_io | _filter_testdir
+
+$QEMU_IO -c "write -z 0 3M" $TEST_IMG 2>&1 | _filter_qemu_io | _filter_testdir
+$QEMU_IMG convert -O $IMGFMT -B $TEST_IMG.base $TEST_IMG $TEST_IMG.orig
+$QEMU_IO -c "read -P 0 0 3M" $TEST_IMG.orig 2>&1 | _filter_qemu_io | _filter_testdir
+$QEMU_IMG convert -O $IMGFMT -c -B $TEST_IMG.base $TEST_IMG $TEST_IMG.orig
+$QEMU_IO -c "read -P 0 0 3M" $TEST_IMG.orig 2>&1 | _filter_qemu_io | _filter_testdir
+
+# TODO -B + -S 0 -> full allocation with no use of the backing file
+
+# success, all done
+echo '*** done'
+rm -f $seq.full
+status=0
diff --git a/tests/qemu-iotests/122.out b/tests/qemu-iotests/122.out
new file mode 100644
index 0000000..6124099
--- /dev/null
+++ b/tests/qemu-iotests/122.out
@@ -0,0 +1,30 @@
+QA output created by 122
+Formatting 'TEST_DIR/t.IMGFMT.base', fmt=IMGFMT size=67108864
+wrote 67108864/67108864 bytes at offset 0
+64 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+=== Check allocation status regression with -B ===
+
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=67108864 backing_file='TEST_DIR/t.IMGFMT.base'
+wrote 3145728/3145728 bytes at offset 0
+3 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+Offset Length File
+0 0x300000 TEST_DIR/t.IMGFMT.orig
+0x300000 0x3d00000 TEST_DIR/t.IMGFMT.base
+
+=== Check that zero clusters are kept in overlay ===
+
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=67108864 backing_file='TEST_DIR/t.IMGFMT.base'
+wrote 3145728/3145728 bytes at offset 0
+3 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 3145728/3145728 bytes at offset 0
+3 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 3145728/3145728 bytes at offset 0
+3 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 3145728/3145728 bytes at offset 0
+3 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 3145728/3145728 bytes at offset 0
+3 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 3145728/3145728 bytes at offset 0
+3 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+*** done
diff --git a/tests/qemu-iotests/group b/tests/qemu-iotests/group
index 4b2b93b..28c78e5 100644
--- a/tests/qemu-iotests/group
+++ b/tests/qemu-iotests/group
@@ -117,3 +117,4 @@
113 rw auto quick
114 rw auto quick
116 rw auto quick
+122 rw auto quick
--
1.8.3.1
^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: [Qemu-devel] [PATCH 1/2] qemu-img convert: Rewrite copying logic
2015-02-18 15:19 ` [Qemu-devel] [PATCH 1/2] " Kevin Wolf
@ 2015-02-18 17:18 ` Max Reitz
2015-02-19 15:47 ` Kevin Wolf
0 siblings, 1 reply; 9+ messages in thread
From: Max Reitz @ 2015-02-18 17:18 UTC (permalink / raw)
To: Kevin Wolf, qemu-devel; +Cc: stefanha
On 2015-02-18 at 10:19, Kevin Wolf wrote:
> The implementation of qemu-img convert is (a) messy, (b) buggy, and
> (c) less efficient than possible. The changes required to beat some
> sense into it are massive enough that incremental changes would only
> make my and the reviewers' life harder. So throw it away and reimplement
> it from scratch.
>
> Let me give some examples what I mean by messy, buggy and inefficient:
>
> (a) The copying logic of qemu-img convert has two separate branches for
> compressed and normal target images, which roughly do the same -
> except for a little code that handles actual differences between
> compressed and uncompressed images, and much more code that
> implements just a different set of optimisations and bugs. This is
> unnecessary code duplication, and makes the code for compressed
> output (unsurprisingly) suffer from bitrot.
>
> The code for uncompressed ouput is run twice to count the the total
> length for the progress bar. In the first run it just takes a
> shortcut and runs only half the loop, and when it's done, it toggles
> a boolean, jumps out of the loop with a backwards goto and starts
> over. Works, but pretty is something different.
>
> (b) Converting while keeping a backing file (-B option) is broken in
> several ways. This includes not writing to the image file if the
> input has zero clusters or data filled with zeros (ignoring that the
> backing file will be visible instead).
>
> It also doesn't correctly limit every iteration of the copy loop to
> sectors of the same status so that too many sectors may be copied to
> in the target image. For -B this gives an unexpected result, for
> other images it just does more work than necessary.
>
> Conversion with a compressed target completely ignores any target
> backing file.
>
> (c) qemu-img convert skips reading and writing an area if it knows from
> metadata that copying isn't needed (except for the bug mentioned
> above that ignores a status change in some cases). It does, however,
> read from the source even if it knows that it will read zeros, and
> then search for non-zero bytes in the read buffer, if it's possible
> that a write might be needed.
>
> This reimplementation of the copying core reorganises the code to remove
> the duplication and have a much more obvious code flow, by essentially
> splitting the copy iteration loop into three parts:
>
> 1. Find the number of contiguous sectors of the same status at the
> current offset (This can also be called in a separate loop for the
> copying loop in order to determine the total sectors for the progress
> bar.)
>
> 2. Read sectors. If the status implies that there is no data there to
> read (zero or unallocated cluster), don't do anything.
>
> 3. Write sectors depending on the status. If it's data, write it. If
> we want the backing file to be visible (with -B), don't write it. If
> it's zeroed, skip it if you can, otherwise use bdrv_write_zeroes() to
> optimise the write at least where possible.
>
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
> qemu-img.c | 511 ++++++++++++++++++++++++++++++++++++-------------------------
> 1 file changed, 305 insertions(+), 206 deletions(-)
You have been very careful not to write "Rewrite img_convert()" or
something like that in the subject, so I can't really complain that
there are still bugs in the function (which are not related to the
copying logic), for instance:
$ ./qemu-img create -f qcow2 i1.qcow2 64M
Formatting 'i1.qcow2', fmt=qcow2 size=67108864 encryption=off
cluster_size=65536 lazy_refcounts=off
$ ./qemu-img create -f qcow2 i2.qcow2 64M
Formatting 'i2.qcow2', fmt=qcow2 size=67108864 encryption=off
cluster_size=65536 lazy_refcounts=off
$ ./qemu-img snapshot -c foo i1.qcow2
$ ./qemu-img snapshot -c foo i2.qcow2
$ ./qemu-io -c 'write 0 64M' i2.qcow2
wrote 67108864/67108864 bytes at offset 0
64 MiB, 1 ops; 0:00:01.32 (48.152 MiB/sec and 0.7524 ops/sec)
$ ./qemu-img convert -l snapshot.name=foo -O qcow2 i{1,2}.qcow2 o.qcow2
qemu-img: error while reading sector 131072: Input/output error
("No support for concatenating multiple snapshot" should be enforced for
sn_opts != NULL)
> diff --git a/qemu-img.c b/qemu-img.c
> index e148af8..5c8386e 100644
> --- a/qemu-img.c
> +++ b/qemu-img.c
> @@ -1306,20 +1306,307 @@ out3:
> return ret;
> }
>
> +enum ImgConvertBlockStatus {
> + BLK_DATA,
> + BLK_ZERO,
> + BLK_BACKING_FILE,
> +};
> +
> +typedef struct ImgConvertState {
> + BlockDriverState **src;
> + int64_t *src_sectors;
> + int src_cur, src_num;
> + int64_t src_cur_offset;
> + int64_t total_sectors;
> + int64_t allocated_sectors;
> + enum ImgConvertBlockStatus status;
> + int64_t sector_next_status;
> + BlockDriverState *target;
> + bool has_zero_init;
> + bool compressed;
> + bool out_backing;
Maybe "out_backed" (or "out_is_backed" or "out_has_backing") would be
better; "out_backing" to me sounds like this should describe the backing
file.
> + int min_sparse;
> + size_t cluster_sectors;
> + size_t buf_sectors;
> +} ImgConvertState;
> +
> +static void convert_select_part(ImgConvertState *s, int64_t sector_num)
> +{
> + while (sector_num - s->src_cur_offset >= s->src_sectors[s->src_cur]) {
> + s->src_cur_offset += s->src_sectors[s->src_cur];
> + s->src_cur++;
> + assert(s->src_cur < s->src_num);
> + }
> +}
> +
> +static int convert_iteration_sectors(ImgConvertState *s, int64_t sector_num)
> +{
> + int64_t ret;
> + int n;
> +
> + convert_select_part(s, sector_num);
> +
> + assert(s->total_sectors > sector_num);
> + n = MIN(s->total_sectors - sector_num, INT_MAX);
Maybe it would be better to use INT_MAX / BDRV_SECTOR_SIZE here (and in
other places, too)... In practice, this would only be relevant to reads
and writes, though, and those are limited by s->buf_sectors.
> +
> + if (s->sector_next_status <= sector_num) {
> + ret = bdrv_get_block_status(s->src[s->src_cur],
> + sector_num - s->src_cur_offset,
> + n, &n);
> + if (ret < 0) {
> + return ret;
> + }
> +
> + if (ret & BDRV_BLOCK_ZERO) {
> + s->status = BLK_ZERO;
> + } else if (ret & BDRV_BLOCK_DATA) {
> + s->status = BLK_DATA;
> + } else if (!s->out_backing) {
> + /* Without a target backing file we must copy over the contents of
> + * the backing file as well. */
> + /* TODO Check block status of the backing file chain to avoid
> + * needlessly reading zeroes and limiting the iteration to the
> + * buffer size */
> + s->status = BLK_DATA;
> + } else {
> + s->status = BLK_BACKING_FILE;
> + }
> +
> + s->sector_next_status = sector_num + n;
> + }
> +
> + n = MIN(n, s->sector_next_status - sector_num);
> + if (s->status == BLK_DATA) {
> + n = MIN(n, s->buf_sectors);
> + }
> +
> + /* We need to write complete clusters for compressed images, so if an
> + * unallocated area is shorter than that, we must consider the whole
> + * cluster allocated. */
> + if (s->compressed) {
> + if (n < s->cluster_sectors) {
> + n = MIN(s->cluster_sectors, s->total_sectors - sector_num);
> + s->status = BLK_DATA;
> + } else {
> + n = QEMU_ALIGN_DOWN(n, s->cluster_sectors);
> + }
> + }
> +
> + return n;
> +}
> +
> +static int convert_read(ImgConvertState *s, int64_t sector_num, int nb_sectors,
> + uint8_t *buf)
> +{
> + int n;
> + int ret;
> +
> + if (s->status == BLK_ZERO || s->status == BLK_BACKING_FILE) {
> + return 0;
> + }
> +
> + assert(nb_sectors <= s->buf_sectors);
> + while (nb_sectors > 0) {
> + BlockDriverState *bs;
> + int64_t bs_sectors;
> +
> + /* In the case of compression with multiple source files, we can get a
> + * nb_sectors that spreads into the next part. So we must be able to
> + * read across multiple BDSes for one convert_read() call. */
> + convert_select_part(s, sector_num);
> + bs = s->src[s->src_cur];
> + bs_sectors = s->src_sectors[s->src_cur];
> +
> + n = MIN(nb_sectors, bs_sectors - (sector_num - s->src_cur_offset));
> + ret = bdrv_read(bs, sector_num, buf, n);
Shouldn't this be sector_num - s->src_cur_offset?
> + if (ret < 0) {
> + return ret;
> + }
> +
> + sector_num += n;
> + nb_sectors -= n;
> + buf += n * BDRV_SECTOR_SIZE;
> + }
> +
> + return 0;
> +}
> +
> +static int convert_write(ImgConvertState *s, int64_t sector_num, int nb_sectors,
> + const uint8_t *buf)
> +{
> + int ret;
> +
> + while (nb_sectors > 0) {
> +
This empty line looks a bit strange to me...
> + int n = nb_sectors;
> +
> + switch (s->status) {
> + case BLK_BACKING_FILE:
> + /* If we have a backing file, leave clusters unallocated that are
> + * unallocated in the source image, so that the backing file is
> + * visible at the respective offset. */
> + assert(s->out_backing);
> + break;
> +
> + case BLK_DATA:
> + /* We must always write compressed clusters as a whole, so don't
> + * try to find zeroed parts in the buffer. We can only save the
> + * write if the buffer is completely zeroed. */
But you shouldn't be doing that if n < s->min_sparse, I think (if people
specify -S 0, they don't want you to zero anything, and that's
guaranteed by the man page). Considering that is_allocated_sectors_min()
basically has a min = MIN(min, n) part, too, I'd be fine with making -S
0 a special case here ("if (s->has_zero_init && s->min_sparse &&
buffer_is_zero(...))").
Actually, now that I'm looking at is_allocated_sectors_min(), if the
first sectors is not allocated, it will return false and thus both the
current qemu-img convert and the new one after this patch won't write
data. That's a bug, I think (because it is guaranteed by the man page).
> + if (s->compressed) {
> + if (s->has_zero_init &&
> + buffer_is_zero(buf, n * BDRV_SECTOR_SIZE))
> + {
> + assert(!s->out_backing);
> + break;
> + }
> +
> + ret = bdrv_write_compressed(s->target, sector_num, buf, n);
> + if (ret < 0) {
> + return ret;
> + }
> + break;
> + }
> +
> + /* If there is real non-zero data, we must write it. Otherwise we
> + * can treat it as zero sectors. */
> + if (is_allocated_sectors_min(buf, n, &n, s->min_sparse)) {
So I think this should be "if (!s->min_sparse || ...)".
> + ret = bdrv_write(s->target, sector_num, buf, n);
> + if (ret < 0) {
> + return ret;
> + }
> + break;
> + }
> + /* fall-through */
> +
> + case BLK_ZERO:
> + if (s->has_zero_init) {
And I don't really know what to do about this. Should we write zeros if
!s->min_sparse? Or is this a special case and the user can't expect
qemu-img convert to really write zeros if the source image contained
unallocated/zero clusters? That would make sense, but once again, the
man page clearly states that "If sparse_size is 0, […] the destination
image will always be fully allocated." Maybe we could argue that "fully"
in this case means "as fully as the source image was allocated".
> + break;
> + }
> + /* TODO Should this use BDRV_REQ_MAY_UNMAP? */
Can't we remove this comment? If BDRV_REQ_MAP_UNMAP results in zeros
read, bdrv_make_zero() below will have been invoked and thus
s->has_zero_init = true (unless you take my suggestion and leave
s->has_zero_init false if bdrv_make_zero() failed, but I wouldn't really
mind that case here).
> + ret = bdrv_write_zeroes(s->target, sector_num, n, 0);
> + if (ret < 0) {
> + return ret;
> + }
> + break;
> + }
> +
> + sector_num += n;
> + nb_sectors -= n;
> + buf += n * BDRV_SECTOR_SIZE;
> + }
> +
> + return 0;
> +}
> +
> +static int convert_do_copy(ImgConvertState *s)
> +{
> + uint8_t *buf = NULL;
> + int64_t sector_num, allocated_done;
> + int ret;
> + int n;
> +
> + /* Check whether we have zero initialisation or can get it efficiently */
> + s->has_zero_init = s->min_sparse && !s->out_backing
> + ? bdrv_has_zero_init(s->target)
> + : false;
> +
> + if (!s->has_zero_init && !s->out_backing &&
> + bdrv_can_write_zeroes_with_unmap(s->target))
> + {
> + ret = bdrv_make_zero(s->target, BDRV_REQ_MAY_UNMAP);
> + if (ret < 0) {
> + goto fail;
I think just leaving s->has_zero_init false should be fine, too (instead
of failing).
> + }
> + s->has_zero_init = 1;
s/1/true/
> + }
> +
> + /* Allocate buffer for copied data. For compressed images, only one cluster
> + * can be copied at a time. */
> + if (s->compressed) {
> + if (s->cluster_sectors <= 0 || s->cluster_sectors > s->buf_sectors) {
> + error_report("invalid cluster size");
> + ret = -EINVAL;
> + goto fail;
> + }
> + s->buf_sectors = s->cluster_sectors;
> + }
> + buf = qemu_blockalign(s->target, s->buf_sectors * BDRV_SECTOR_SIZE);
> +
> + /* Calculate allocated sectors for progress */
> + s->allocated_sectors = 0;
> + sector_num = 0;
> + while (sector_num < s->total_sectors) {
> + n = convert_iteration_sectors(s, sector_num);
> + if (n < 0) {
> + ret = n;
> + goto fail;
> + }
> + if (s->status == BLK_DATA) {
> + s->allocated_sectors += n;
> + }
> + sector_num += n;
> + }
> +
> + /* Do the copy */
> + s->src_cur = 0;
> + s->src_cur_offset = 0;
> + s->sector_next_status = 0;
> +
> + sector_num = 0;
> + allocated_done = 0;
> +
> + while (sector_num < s->total_sectors) {
> + n = convert_iteration_sectors(s, sector_num);
> + if (n < 0) {
> + ret = n;
> + goto fail;
> + }
> + if (s->status == BLK_DATA) {
> + allocated_done += n;
> + qemu_progress_print(100.0 * allocated_done / s->allocated_sectors,
> + 0);
> + }
> +
> + ret = convert_read(s, sector_num, n, buf);
> + if (ret < 0) {
> + error_report("error while reading sector %" PRId64
> + ": %s", sector_num, strerror(-ret));
> + goto fail;
> + }
> +
> + ret = convert_write(s, sector_num, n, buf);
> + if (ret < 0) {
> + error_report("error while writing sector %" PRId64
> + ": %s", sector_num, strerror(-ret));
> + goto fail;
> + }
> +
> + sector_num += n;
> + }
> +
> + if (s->compressed) {
> + /* signal EOF to align */
> + bdrv_write_compressed(s->target, 0, NULL, 0);
Is there a reason for ignoring the return value other than "That's how
img_convert() used to do it"?
All in all, this patch looks pretty good to me and certainly makes the
function much more clear. But as the bdrv_read() call probably needs to
be fixed, and because I think the -S 0 behavior is buggy, I cannot give
an R-b at this point (but I will if those are fixed, or if you tell me
why they don't need to be).
Max
> + }
> +
> + ret = 0;
> +fail:
> + qemu_vfree(buf);
> + return ret;
> +}
> +
> static int img_convert(int argc, char **argv)
> {
> - int c, n, n1, bs_n, bs_i, compress, cluster_sectors, skip_create;
> + int c, bs_n, bs_i, compress, cluster_sectors, skip_create;
> int64_t ret = 0;
> int progress = 0, flags, src_flags;
> const char *fmt, *out_fmt, *cache, *src_cache, *out_baseimg, *out_filename;
> BlockDriver *drv, *proto_drv;
> BlockBackend **blk = NULL, *out_blk = NULL;
> BlockDriverState **bs = NULL, *out_bs = NULL;
> - int64_t total_sectors, nb_sectors, sector_num, bs_offset;
> + int64_t total_sectors;
> int64_t *bs_sectors = NULL;
> - uint8_t * buf = NULL;
> size_t bufsectors = IO_BUF_SIZE / BDRV_SECTOR_SIZE;
> - const uint8_t *buf1;
> BlockDriverInfo bdi;
> QemuOpts *opts = NULL;
> QemuOptsList *create_opts = NULL;
> @@ -1330,6 +1617,7 @@ static int img_convert(int argc, char **argv)
> bool quiet = false;
> Error *local_err = NULL;
> QemuOpts *sn_opts = NULL;
> + ImgConvertState state;
>
> fmt = NULL;
> out_fmt = "raw";
> @@ -1622,9 +1910,6 @@ static int img_convert(int argc, char **argv)
> }
> out_bs = blk_bs(out_blk);
>
> - bs_i = 0;
> - bs_offset = 0;
> -
> /* increase bufsectors from the default 4096 (2M) if opt_transfer_length
> * or discard_alignment of the out_bs is greater. Limit to 32768 (16MB)
> * as maximum. */
> @@ -1633,8 +1918,6 @@ static int img_convert(int argc, char **argv)
> out_bs->bl.discard_alignment))
> );
>
> - buf = qemu_blockalign(out_bs, bufsectors * BDRV_SECTOR_SIZE);
> -
> if (skip_create) {
> int64_t output_sectors = bdrv_nb_sectors(out_bs);
> if (output_sectors < 0) {
> @@ -1661,203 +1944,20 @@ static int img_convert(int argc, char **argv)
> cluster_sectors = bdi.cluster_size / BDRV_SECTOR_SIZE;
> }
>
> - if (compress) {
> - if (cluster_sectors <= 0 || cluster_sectors > bufsectors) {
> - error_report("invalid cluster size");
> - ret = -1;
> - goto out;
> - }
> - sector_num = 0;
> -
> - nb_sectors = total_sectors;
> -
> - for(;;) {
> - int64_t bs_num;
> - int remainder;
> - uint8_t *buf2;
> -
> - nb_sectors = total_sectors - sector_num;
> - if (nb_sectors <= 0)
> - break;
> - if (nb_sectors >= cluster_sectors)
> - n = cluster_sectors;
> - else
> - n = nb_sectors;
> -
> - bs_num = sector_num - bs_offset;
> - assert (bs_num >= 0);
> - remainder = n;
> - buf2 = buf;
> - while (remainder > 0) {
> - int nlow;
> - while (bs_num == bs_sectors[bs_i]) {
> - bs_offset += bs_sectors[bs_i];
> - bs_i++;
> - assert (bs_i < bs_n);
> - bs_num = 0;
> - /* printf("changing part: sector_num=%" PRId64 ", "
> - "bs_i=%d, bs_offset=%" PRId64 ", bs_sectors=%" PRId64
> - "\n", sector_num, bs_i, bs_offset, bs_sectors[bs_i]); */
> - }
> - assert (bs_num < bs_sectors[bs_i]);
> -
> - nlow = remainder > bs_sectors[bs_i] - bs_num
> - ? bs_sectors[bs_i] - bs_num : remainder;
> -
> - ret = bdrv_read(bs[bs_i], bs_num, buf2, nlow);
> - if (ret < 0) {
> - error_report("error while reading sector %" PRId64 ": %s",
> - bs_num, strerror(-ret));
> - goto out;
> - }
> -
> - buf2 += nlow * 512;
> - bs_num += nlow;
> -
> - remainder -= nlow;
> - }
> - assert (remainder == 0);
> -
> - if (!buffer_is_zero(buf, n * BDRV_SECTOR_SIZE)) {
> - ret = bdrv_write_compressed(out_bs, sector_num, buf, n);
> - if (ret != 0) {
> - error_report("error while compressing sector %" PRId64
> - ": %s", sector_num, strerror(-ret));
> - goto out;
> - }
> - }
> - sector_num += n;
> - qemu_progress_print(100.0 * sector_num / total_sectors, 0);
> - }
> - /* signal EOF to align */
> - bdrv_write_compressed(out_bs, 0, NULL, 0);
> - } else {
> - int64_t sectors_to_read, sectors_read, sector_num_next_status;
> - bool count_allocated_sectors;
> - int has_zero_init = min_sparse ? bdrv_has_zero_init(out_bs) : 0;
> -
> - if (!has_zero_init && bdrv_can_write_zeroes_with_unmap(out_bs)) {
> - ret = bdrv_make_zero(out_bs, BDRV_REQ_MAY_UNMAP);
> - if (ret < 0) {
> - goto out;
> - }
> - has_zero_init = 1;
> - }
> -
> - sectors_to_read = total_sectors;
> - count_allocated_sectors = progress && (out_baseimg || has_zero_init);
> -restart:
> - sector_num = 0; // total number of sectors converted so far
> - sectors_read = 0;
> - sector_num_next_status = 0;
> -
> - for(;;) {
> - nb_sectors = total_sectors - sector_num;
> - if (nb_sectors <= 0) {
> - if (count_allocated_sectors) {
> - sectors_to_read = sectors_read;
> - count_allocated_sectors = false;
> - goto restart;
> - }
> - ret = 0;
> - break;
> - }
> -
> - while (sector_num - bs_offset >= bs_sectors[bs_i]) {
> - bs_offset += bs_sectors[bs_i];
> - bs_i ++;
> - assert (bs_i < bs_n);
> - /* printf("changing part: sector_num=%" PRId64 ", bs_i=%d, "
> - "bs_offset=%" PRId64 ", bs_sectors=%" PRId64 "\n",
> - sector_num, bs_i, bs_offset, bs_sectors[bs_i]); */
> - }
> -
> - if ((out_baseimg || has_zero_init) &&
> - sector_num >= sector_num_next_status) {
> - n = nb_sectors > INT_MAX ? INT_MAX : nb_sectors;
> - ret = bdrv_get_block_status(bs[bs_i], sector_num - bs_offset,
> - n, &n1);
> - if (ret < 0) {
> - error_report("error while reading block status of sector %"
> - PRId64 ": %s", sector_num - bs_offset,
> - strerror(-ret));
> - goto out;
> - }
> - /* If the output image is zero initialized, we are not working
> - * on a shared base and the input is zero we can skip the next
> - * n1 sectors */
> - if (has_zero_init && !out_baseimg && (ret & BDRV_BLOCK_ZERO)) {
> - sector_num += n1;
> - continue;
> - }
> - /* If the output image is being created as a copy on write
> - * image, assume that sectors which are unallocated in the
> - * input image are present in both the output's and input's
> - * base images (no need to copy them). */
> - if (out_baseimg) {
> - if (!(ret & BDRV_BLOCK_DATA)) {
> - sector_num += n1;
> - continue;
> - }
> - /* The next 'n1' sectors are allocated in the input image.
> - * Copy only those as they may be followed by unallocated
> - * sectors. */
> - nb_sectors = n1;
> - }
> - /* avoid redundant callouts to get_block_status */
> - sector_num_next_status = sector_num + n1;
> - }
> -
> - n = MIN(nb_sectors, bufsectors);
> -
> - /* round down request length to an aligned sector, but
> - * do not bother doing this on short requests. They happen
> - * when we found an all-zero area, and the next sector to
> - * write will not be sector_num + n. */
> - if (cluster_sectors > 0 && n >= cluster_sectors) {
> - int64_t next_aligned_sector = (sector_num + n);
> - next_aligned_sector -= next_aligned_sector % cluster_sectors;
> - if (sector_num + n > next_aligned_sector) {
> - n = next_aligned_sector - sector_num;
> - }
> - }
> -
> - n = MIN(n, bs_sectors[bs_i] - (sector_num - bs_offset));
> -
> - sectors_read += n;
> - if (count_allocated_sectors) {
> - sector_num += n;
> - continue;
> - }
> + state = (ImgConvertState) {
> + .src = bs,
> + .src_sectors = bs_sectors,
> + .src_num = bs_n,
> + .total_sectors = total_sectors,
> + .target = out_bs,
> + .compressed = compress,
> + .out_backing = (bool) out_baseimg,
> + .min_sparse = min_sparse,
> + .cluster_sectors = cluster_sectors,
> + .buf_sectors = bufsectors,
> + };
> + ret = convert_do_copy(&state);
>
> - n1 = n;
> - ret = bdrv_read(bs[bs_i], sector_num - bs_offset, buf, n);
> - if (ret < 0) {
> - error_report("error while reading sector %" PRId64 ": %s",
> - sector_num - bs_offset, strerror(-ret));
> - goto out;
> - }
> - /* NOTE: at the same time we convert, we do not write zero
> - sectors to have a chance to compress the image. Ideally, we
> - should add a specific call to have the info to go faster */
> - buf1 = buf;
> - while (n > 0) {
> - if (!has_zero_init ||
> - is_allocated_sectors_min(buf1, n, &n1, min_sparse)) {
> - ret = bdrv_write(out_bs, sector_num, buf1, n1);
> - if (ret < 0) {
> - error_report("error while writing sector %" PRId64
> - ": %s", sector_num, strerror(-ret));
> - goto out;
> - }
> - }
> - sector_num += n1;
> - n -= n1;
> - buf1 += n1 * 512;
> - }
> - qemu_progress_print(100.0 * sectors_read / sectors_to_read, 0);
> - }
> - }
> out:
> if (!ret) {
> qemu_progress_print(100, 0);
> @@ -1865,7 +1965,6 @@ out:
> qemu_progress_end();
> qemu_opts_del(opts);
> qemu_opts_free(create_opts);
> - qemu_vfree(buf);
> qemu_opts_del(sn_opts);
> blk_unref(out_blk);
> g_free(bs);
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [Qemu-devel] [PATCH 2/2] qemu-iotests: Some qemu-img convert tests
2015-02-18 15:19 ` [Qemu-devel] [PATCH 2/2] qemu-iotests: Some qemu-img convert tests Kevin Wolf
@ 2015-02-18 17:24 ` Max Reitz
2015-03-19 14:41 ` Max Reitz
1 sibling, 0 replies; 9+ messages in thread
From: Max Reitz @ 2015-02-18 17:24 UTC (permalink / raw)
To: Kevin Wolf, qemu-devel; +Cc: stefanha
On 2015-02-18 at 10:19, Kevin Wolf wrote:
> This adds a regression test for some problems that the qemu-img convert
> rewrite just fixed.
>
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
> tests/qemu-iotests/122 | 83 ++++++++++++++++++++++++++++++++++++++++++++++
> tests/qemu-iotests/122.out | 30 +++++++++++++++++
> tests/qemu-iotests/group | 1 +
> 3 files changed, 114 insertions(+)
> create mode 100755 tests/qemu-iotests/122
> create mode 100644 tests/qemu-iotests/122.out
>
> diff --git a/tests/qemu-iotests/122 b/tests/qemu-iotests/122
> new file mode 100755
> index 0000000..e0c472f
> --- /dev/null
> +++ b/tests/qemu-iotests/122
> @@ -0,0 +1,83 @@
> +#!/bin/bash
> +#
> +# Test some qemu-img convert cases
> +#
> +# Copyright (C) 2015 Red Hat, Inc.
> +#
> +# This program is free software; you can redistribute it and/or modify
> +# it under the terms of the GNU General Public License as published by
> +# the Free Software Foundation; either version 2 of the License, or
> +# (at your option) any later version.
> +#
> +# This program is distributed in the hope that it will be useful,
> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> +# GNU General Public License for more details.
> +#
> +# You should have received a copy of the GNU General Public License
> +# along with this program. If not, see <http://www.gnu.org/licenses/>.
> +#
> +
> +# creator
> +owner=kwolf@redhat.com
> +
> +seq="$(basename $0)"
> +echo "QA output created by $seq"
> +
> +here="$PWD"
> +tmp=/tmp/$$
> +status=1 # failure is the default!
> +
> +_cleanup()
> +{
> + _cleanup_test_img
> +}
> +trap "_cleanup; exit \$status" 0 1 2 3 15
> +
> +# get standard environment, filters and checks
> +. ./common.rc
> +. ./common.filter
> +
> +_supported_fmt qcow2
> +_supported_proto generic
> +_supported_os Linux
> +
> +
> +TEST_IMG=$TEST_IMG.base _make_test_img 64M
> +$QEMU_IO -c "write -P 0x11 0 64M" $TEST_IMG.base 2>&1 | _filter_qemu_io | _filter_testdir
> +
> +
> +echo
> +echo "=== Check allocation status regression with -B ==="
> +echo
> +
> +_make_test_img -b $TEST_IMG.base
> +$QEMU_IO -c "write -P 0x22 0 3M" $TEST_IMG 2>&1 | _filter_qemu_io | _filter_testdir
> +$QEMU_IMG convert -O $IMGFMT -B $TEST_IMG.base $TEST_IMG $TEST_IMG.orig
> +$QEMU_IMG map $TEST_IMG.orig | _filter_qemu_img_map
> +
> +
> +echo
> +echo "=== Check that zero clusters are kept in overlay ==="
> +echo
> +
> +_make_test_img -b $TEST_IMG.base
> +
> +$QEMU_IO -c "write -P 0 0 3M" $TEST_IMG 2>&1 | _filter_qemu_io | _filter_testdir
> +$QEMU_IMG convert -O $IMGFMT -B $TEST_IMG.base $TEST_IMG $TEST_IMG.orig
> +$QEMU_IO -c "read -P 0 0 3M" $TEST_IMG.orig 2>&1 | _filter_qemu_io | _filter_testdir
> +$QEMU_IMG convert -O $IMGFMT -c -B $TEST_IMG.base $TEST_IMG $TEST_IMG.orig
> +$QEMU_IO -c "read -P 0 0 3M" $TEST_IMG.orig 2>&1 | _filter_qemu_io | _filter_testdir
> +
> +$QEMU_IO -c "write -z 0 3M" $TEST_IMG 2>&1 | _filter_qemu_io | _filter_testdir
> +$QEMU_IMG convert -O $IMGFMT -B $TEST_IMG.base $TEST_IMG $TEST_IMG.orig
> +$QEMU_IO -c "read -P 0 0 3M" $TEST_IMG.orig 2>&1 | _filter_qemu_io | _filter_testdir
> +$QEMU_IMG convert -O $IMGFMT -c -B $TEST_IMG.base $TEST_IMG $TEST_IMG.orig
> +$QEMU_IO -c "read -P 0 0 3M" $TEST_IMG.orig 2>&1 | _filter_qemu_io | _filter_testdir
> +
> +# TODO -B + -S 0 -> full allocation with no use of the backing file
Also, -S 0 with -B (that is, with has_zero_init) and concatenation, please.
But even without those:
Reviewed-by: Max Reitz <mreitz@redhat.com>
> +
> +# success, all done
> +echo '*** done'
> +rm -f $seq.full
> +status=0
> diff --git a/tests/qemu-iotests/122.out b/tests/qemu-iotests/122.out
> new file mode 100644
> index 0000000..6124099
> --- /dev/null
> +++ b/tests/qemu-iotests/122.out
> @@ -0,0 +1,30 @@
> +QA output created by 122
> +Formatting 'TEST_DIR/t.IMGFMT.base', fmt=IMGFMT size=67108864
> +wrote 67108864/67108864 bytes at offset 0
> +64 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +
> +=== Check allocation status regression with -B ===
> +
> +Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=67108864 backing_file='TEST_DIR/t.IMGFMT.base'
> +wrote 3145728/3145728 bytes at offset 0
> +3 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +Offset Length File
> +0 0x300000 TEST_DIR/t.IMGFMT.orig
> +0x300000 0x3d00000 TEST_DIR/t.IMGFMT.base
> +
> +=== Check that zero clusters are kept in overlay ===
> +
> +Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=67108864 backing_file='TEST_DIR/t.IMGFMT.base'
> +wrote 3145728/3145728 bytes at offset 0
> +3 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +read 3145728/3145728 bytes at offset 0
> +3 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +read 3145728/3145728 bytes at offset 0
> +3 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +wrote 3145728/3145728 bytes at offset 0
> +3 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +read 3145728/3145728 bytes at offset 0
> +3 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +read 3145728/3145728 bytes at offset 0
> +3 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +*** done
> diff --git a/tests/qemu-iotests/group b/tests/qemu-iotests/group
> index 4b2b93b..28c78e5 100644
> --- a/tests/qemu-iotests/group
> +++ b/tests/qemu-iotests/group
> @@ -117,3 +117,4 @@
> 113 rw auto quick
> 114 rw auto quick
> 116 rw auto quick
> +122 rw auto quick
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [Qemu-devel] [PATCH 1/2] qemu-img convert: Rewrite copying logic
2015-02-18 17:18 ` Max Reitz
@ 2015-02-19 15:47 ` Kevin Wolf
2015-02-19 16:01 ` Max Reitz
0 siblings, 1 reply; 9+ messages in thread
From: Kevin Wolf @ 2015-02-19 15:47 UTC (permalink / raw)
To: Max Reitz; +Cc: qemu-devel, stefanha
Am 18.02.2015 um 18:18 hat Max Reitz geschrieben:
> On 2015-02-18 at 10:19, Kevin Wolf wrote:
> >The implementation of qemu-img convert is (a) messy, (b) buggy, and
> >(c) less efficient than possible. The changes required to beat some
> >sense into it are massive enough that incremental changes would only
> >make my and the reviewers' life harder. So throw it away and reimplement
> >it from scratch.
> >
> >Let me give some examples what I mean by messy, buggy and inefficient:
> >
> >(a) The copying logic of qemu-img convert has two separate branches for
> > compressed and normal target images, which roughly do the same -
> > except for a little code that handles actual differences between
> > compressed and uncompressed images, and much more code that
> > implements just a different set of optimisations and bugs. This is
> > unnecessary code duplication, and makes the code for compressed
> > output (unsurprisingly) suffer from bitrot.
> >
> > The code for uncompressed ouput is run twice to count the the total
> > length for the progress bar. In the first run it just takes a
> > shortcut and runs only half the loop, and when it's done, it toggles
> > a boolean, jumps out of the loop with a backwards goto and starts
> > over. Works, but pretty is something different.
> >
> >(b) Converting while keeping a backing file (-B option) is broken in
> > several ways. This includes not writing to the image file if the
> > input has zero clusters or data filled with zeros (ignoring that the
> > backing file will be visible instead).
> >
> > It also doesn't correctly limit every iteration of the copy loop to
> > sectors of the same status so that too many sectors may be copied to
> > in the target image. For -B this gives an unexpected result, for
> > other images it just does more work than necessary.
> >
> > Conversion with a compressed target completely ignores any target
> > backing file.
> >
> >(c) qemu-img convert skips reading and writing an area if it knows from
> > metadata that copying isn't needed (except for the bug mentioned
> > above that ignores a status change in some cases). It does, however,
> > read from the source even if it knows that it will read zeros, and
> > then search for non-zero bytes in the read buffer, if it's possible
> > that a write might be needed.
> >
> >This reimplementation of the copying core reorganises the code to remove
> >the duplication and have a much more obvious code flow, by essentially
> >splitting the copy iteration loop into three parts:
> >
> >1. Find the number of contiguous sectors of the same status at the
> > current offset (This can also be called in a separate loop for the
> > copying loop in order to determine the total sectors for the progress
> > bar.)
> >
> >2. Read sectors. If the status implies that there is no data there to
> > read (zero or unallocated cluster), don't do anything.
> >
> >3. Write sectors depending on the status. If it's data, write it. If
> > we want the backing file to be visible (with -B), don't write it. If
> > it's zeroed, skip it if you can, otherwise use bdrv_write_zeroes() to
> > optimise the write at least where possible.
> >
> >Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> >---
> > qemu-img.c | 511 ++++++++++++++++++++++++++++++++++++-------------------------
> > 1 file changed, 305 insertions(+), 206 deletions(-)
>
> You have been very careful not to write "Rewrite img_convert()" or
> something like that in the subject, so I can't really complain that
> there are still bugs in the function (which are not related to the
> copying logic), for instance:
>
> $ ./qemu-img create -f qcow2 i1.qcow2 64M
> Formatting 'i1.qcow2', fmt=qcow2 size=67108864 encryption=off
> cluster_size=65536 lazy_refcounts=off
> $ ./qemu-img create -f qcow2 i2.qcow2 64M
> Formatting 'i2.qcow2', fmt=qcow2 size=67108864 encryption=off
> cluster_size=65536 lazy_refcounts=off
> $ ./qemu-img snapshot -c foo i1.qcow2
> $ ./qemu-img snapshot -c foo i2.qcow2
> $ ./qemu-io -c 'write 0 64M' i2.qcow2
> wrote 67108864/67108864 bytes at offset 0
> 64 MiB, 1 ops; 0:00:01.32 (48.152 MiB/sec and 0.7524 ops/sec)
> $ ./qemu-img convert -l snapshot.name=foo -O qcow2 i{1,2}.qcow2 o.qcow2
> qemu-img: error while reading sector 131072: Input/output error
>
> ("No support for concatenating multiple snapshot" should be enforced
> for sn_opts != NULL)
Probably worth addressing, though not in this patch.
> >diff --git a/qemu-img.c b/qemu-img.c
> >index e148af8..5c8386e 100644
> >--- a/qemu-img.c
> >+++ b/qemu-img.c
> >@@ -1306,20 +1306,307 @@ out3:
> > return ret;
> > }
> >+enum ImgConvertBlockStatus {
> >+ BLK_DATA,
> >+ BLK_ZERO,
> >+ BLK_BACKING_FILE,
> >+};
> >+
> >+typedef struct ImgConvertState {
> >+ BlockDriverState **src;
> >+ int64_t *src_sectors;
> >+ int src_cur, src_num;
> >+ int64_t src_cur_offset;
> >+ int64_t total_sectors;
> >+ int64_t allocated_sectors;
> >+ enum ImgConvertBlockStatus status;
> >+ int64_t sector_next_status;
> >+ BlockDriverState *target;
> >+ bool has_zero_init;
> >+ bool compressed;
> >+ bool out_backing;
>
> Maybe "out_backed" (or "out_is_backed" or "out_has_backing") would
> be better; "out_backing" to me sounds like this should describe the
> backing file.
Okay, will rename as target_has_backing.
> >+ int min_sparse;
> >+ size_t cluster_sectors;
> >+ size_t buf_sectors;
> >+} ImgConvertState;
> >+
> >+static void convert_select_part(ImgConvertState *s, int64_t sector_num)
> >+{
> >+ while (sector_num - s->src_cur_offset >= s->src_sectors[s->src_cur]) {
> >+ s->src_cur_offset += s->src_sectors[s->src_cur];
> >+ s->src_cur++;
> >+ assert(s->src_cur < s->src_num);
> >+ }
> >+}
> >+
> >+static int convert_iteration_sectors(ImgConvertState *s, int64_t sector_num)
> >+{
> >+ int64_t ret;
> >+ int n;
> >+
> >+ convert_select_part(s, sector_num);
> >+
> >+ assert(s->total_sectors > sector_num);
> >+ n = MIN(s->total_sectors - sector_num, INT_MAX);
>
> Maybe it would be better to use INT_MAX / BDRV_SECTOR_SIZE here (and
> in other places, too)... In practice, this would only be relevant to
> reads and writes, though, and those are limited by s->buf_sectors.
I'll make that BDRV_REQUEST_MAX_SECTORS.
> >+
> >+ if (s->sector_next_status <= sector_num) {
> >+ ret = bdrv_get_block_status(s->src[s->src_cur],
> >+ sector_num - s->src_cur_offset,
> >+ n, &n);
> >+ if (ret < 0) {
> >+ return ret;
> >+ }
> >+
> >+ if (ret & BDRV_BLOCK_ZERO) {
> >+ s->status = BLK_ZERO;
> >+ } else if (ret & BDRV_BLOCK_DATA) {
> >+ s->status = BLK_DATA;
> >+ } else if (!s->out_backing) {
> >+ /* Without a target backing file we must copy over the contents of
> >+ * the backing file as well. */
> >+ /* TODO Check block status of the backing file chain to avoid
> >+ * needlessly reading zeroes and limiting the iteration to the
> >+ * buffer size */
> >+ s->status = BLK_DATA;
> >+ } else {
> >+ s->status = BLK_BACKING_FILE;
> >+ }
> >+
> >+ s->sector_next_status = sector_num + n;
> >+ }
> >+
> >+ n = MIN(n, s->sector_next_status - sector_num);
> >+ if (s->status == BLK_DATA) {
> >+ n = MIN(n, s->buf_sectors);
> >+ }
> >+
> >+ /* We need to write complete clusters for compressed images, so if an
> >+ * unallocated area is shorter than that, we must consider the whole
> >+ * cluster allocated. */
> >+ if (s->compressed) {
> >+ if (n < s->cluster_sectors) {
> >+ n = MIN(s->cluster_sectors, s->total_sectors - sector_num);
> >+ s->status = BLK_DATA;
> >+ } else {
> >+ n = QEMU_ALIGN_DOWN(n, s->cluster_sectors);
> >+ }
> >+ }
> >+
> >+ return n;
> >+}
> >+
> >+static int convert_read(ImgConvertState *s, int64_t sector_num, int nb_sectors,
> >+ uint8_t *buf)
> >+{
> >+ int n;
> >+ int ret;
> >+
> >+ if (s->status == BLK_ZERO || s->status == BLK_BACKING_FILE) {
> >+ return 0;
> >+ }
> >+
> >+ assert(nb_sectors <= s->buf_sectors);
> >+ while (nb_sectors > 0) {
> >+ BlockDriverState *bs;
> >+ int64_t bs_sectors;
> >+
> >+ /* In the case of compression with multiple source files, we can get a
> >+ * nb_sectors that spreads into the next part. So we must be able to
> >+ * read across multiple BDSes for one convert_read() call. */
> >+ convert_select_part(s, sector_num);
> >+ bs = s->src[s->src_cur];
> >+ bs_sectors = s->src_sectors[s->src_cur];
> >+
> >+ n = MIN(nb_sectors, bs_sectors - (sector_num - s->src_cur_offset));
> >+ ret = bdrv_read(bs, sector_num, buf, n);
>
> Shouldn't this be sector_num - s->src_cur_offset?
I suppose it should be, thanks for catching this. Not a good sign with
respect to our qemu-iotests coverage.
> >+ if (ret < 0) {
> >+ return ret;
> >+ }
> >+
> >+ sector_num += n;
> >+ nb_sectors -= n;
> >+ buf += n * BDRV_SECTOR_SIZE;
> >+ }
> >+
> >+ return 0;
> >+}
> >+
> >+static int convert_write(ImgConvertState *s, int64_t sector_num, int nb_sectors,
> >+ const uint8_t *buf)
> >+{
> >+ int ret;
> >+
> >+ while (nb_sectors > 0) {
> >+
>
> This empty line looks a bit strange to me...
Removed.
> >+ int n = nb_sectors;
> >+
> >+ switch (s->status) {
> >+ case BLK_BACKING_FILE:
> >+ /* If we have a backing file, leave clusters unallocated that are
> >+ * unallocated in the source image, so that the backing file is
> >+ * visible at the respective offset. */
> >+ assert(s->out_backing);
> >+ break;
> >+
> >+ case BLK_DATA:
> >+ /* We must always write compressed clusters as a whole, so don't
> >+ * try to find zeroed parts in the buffer. We can only save the
> >+ * write if the buffer is completely zeroed. */
>
> But you shouldn't be doing that if n < s->min_sparse, I think (if
> people specify -S 0, they don't want you to zero anything, and
> that's guaranteed by the man page). Considering that
> is_allocated_sectors_min() basically has a min = MIN(min, n) part,
> too, I'd be fine with making -S 0 a special case here ("if
> (s->has_zero_init && s->min_sparse && buffer_is_zero(...))").
Hm, okay. With a compressed target n isn't variable, so there's only
always or never, depending on min_sparse < cluster_size. But I can do
it.
I'm actually not really sure what the use case for -S with a compressed
target would be.
> Actually, now that I'm looking at is_allocated_sectors_min(), if the
> first sectors is not allocated, it will return false and thus both
> the current qemu-img convert and the new one after this patch won't
> write data. That's a bug, I think (because it is guaranteed by the
> man page).
Or the description in the man page is wrong.
The intention with -S was that we avoid splitting up writes into too
many small chunks because that costs a lot of time. If you look at it
from that angle, it's doing exactly the right thing because skipping
zeros at the start doesn't increase the number of write requests.
> >+ if (s->compressed) {
> >+ if (s->has_zero_init &&
> >+ buffer_is_zero(buf, n * BDRV_SECTOR_SIZE))
> >+ {
> >+ assert(!s->out_backing);
> >+ break;
> >+ }
I wonder whether !s->has_zero_init shouldn't actually treat it as
BLK_ZERO. That should "compress" even better than writing a gzip
compressed cluster of zeros.
> >+ ret = bdrv_write_compressed(s->target, sector_num, buf, n);
> >+ if (ret < 0) {
> >+ return ret;
> >+ }
> >+ break;
> >+ }
> >+
> >+ /* If there is real non-zero data, we must write it. Otherwise we
> >+ * can treat it as zero sectors. */
> >+ if (is_allocated_sectors_min(buf, n, &n, s->min_sparse)) {
>
> So I think this should be "if (!s->min_sparse || ...)".
I'd rather change is_allocated_sectors_min(). But only if there is a use
case, otherwise we should fix the description in the man page.
> >+ ret = bdrv_write(s->target, sector_num, buf, n);
> >+ if (ret < 0) {
> >+ return ret;
> >+ }
> >+ break;
> >+ }
> >+ /* fall-through */
> >+
> >+ case BLK_ZERO:
> >+ if (s->has_zero_init) {
>
> And I don't really know what to do about this. Should we write zeros
> if !s->min_sparse? Or is this a special case and the user can't
> expect qemu-img convert to really write zeros if the source image
> contained unallocated/zero clusters? That would make sense, but once
> again, the man page clearly states that "If sparse_size is 0, […]
> the destination image will always be fully allocated." Maybe we
> could argue that "fully" in this case means "as fully as the source
> image was allocated".
FWIW, bdrv_is_allocated() considers zero clusters allocated. ;-)
I don't think that -S 0 should mean full preallocation. If you want
that, I think you can get it with -o (at the cost of writing some parts
of the image twice).
> >+ break;
> >+ }
> >+ /* TODO Should this use BDRV_REQ_MAY_UNMAP? */
>
> Can't we remove this comment? If BDRV_REQ_MAP_UNMAP results in zeros
> read, bdrv_make_zero() below will have been invoked and thus
> s->has_zero_init = true (unless you take my suggestion and leave
> s->has_zero_init false if bdrv_make_zero() failed, but I wouldn't
> really mind that case here).
Good point.
(I wanted to remove the comment either way, it was just there to get
input during the review.)
> >+ ret = bdrv_write_zeroes(s->target, sector_num, n, 0);
> >+ if (ret < 0) {
> >+ return ret;
> >+ }
> >+ break;
> >+ }
> >+
> >+ sector_num += n;
> >+ nb_sectors -= n;
> >+ buf += n * BDRV_SECTOR_SIZE;
> >+ }
> >+
> >+ return 0;
> >+}
> >+
> >+static int convert_do_copy(ImgConvertState *s)
> >+{
> >+ uint8_t *buf = NULL;
> >+ int64_t sector_num, allocated_done;
> >+ int ret;
> >+ int n;
> >+
> >+ /* Check whether we have zero initialisation or can get it efficiently */
> >+ s->has_zero_init = s->min_sparse && !s->out_backing
> >+ ? bdrv_has_zero_init(s->target)
> >+ : false;
> >+
> >+ if (!s->has_zero_init && !s->out_backing &&
> >+ bdrv_can_write_zeroes_with_unmap(s->target))
> >+ {
> >+ ret = bdrv_make_zero(s->target, BDRV_REQ_MAY_UNMAP);
> >+ if (ret < 0) {
> >+ goto fail;
>
> I think just leaving s->has_zero_init false should be fine, too
> (instead of failing).
>
> >+ }
> >+ s->has_zero_init = 1;
>
> s/1/true/
Yes.
> >+ }
> >+
> >+ /* Allocate buffer for copied data. For compressed images, only one cluster
> >+ * can be copied at a time. */
> >+ if (s->compressed) {
> >+ if (s->cluster_sectors <= 0 || s->cluster_sectors > s->buf_sectors) {
> >+ error_report("invalid cluster size");
> >+ ret = -EINVAL;
> >+ goto fail;
> >+ }
> >+ s->buf_sectors = s->cluster_sectors;
> >+ }
> >+ buf = qemu_blockalign(s->target, s->buf_sectors * BDRV_SECTOR_SIZE);
> >+
> >+ /* Calculate allocated sectors for progress */
> >+ s->allocated_sectors = 0;
> >+ sector_num = 0;
> >+ while (sector_num < s->total_sectors) {
> >+ n = convert_iteration_sectors(s, sector_num);
> >+ if (n < 0) {
> >+ ret = n;
> >+ goto fail;
> >+ }
> >+ if (s->status == BLK_DATA) {
> >+ s->allocated_sectors += n;
> >+ }
> >+ sector_num += n;
> >+ }
> >+
> >+ /* Do the copy */
> >+ s->src_cur = 0;
> >+ s->src_cur_offset = 0;
> >+ s->sector_next_status = 0;
> >+
> >+ sector_num = 0;
> >+ allocated_done = 0;
> >+
> >+ while (sector_num < s->total_sectors) {
> >+ n = convert_iteration_sectors(s, sector_num);
> >+ if (n < 0) {
> >+ ret = n;
> >+ goto fail;
> >+ }
> >+ if (s->status == BLK_DATA) {
> >+ allocated_done += n;
> >+ qemu_progress_print(100.0 * allocated_done / s->allocated_sectors,
> >+ 0);
> >+ }
> >+
> >+ ret = convert_read(s, sector_num, n, buf);
> >+ if (ret < 0) {
> >+ error_report("error while reading sector %" PRId64
> >+ ": %s", sector_num, strerror(-ret));
> >+ goto fail;
> >+ }
> >+
> >+ ret = convert_write(s, sector_num, n, buf);
> >+ if (ret < 0) {
> >+ error_report("error while writing sector %" PRId64
> >+ ": %s", sector_num, strerror(-ret));
> >+ goto fail;
> >+ }
> >+
> >+ sector_num += n;
> >+ }
> >+
> >+ if (s->compressed) {
> >+ /* signal EOF to align */
> >+ bdrv_write_compressed(s->target, 0, NULL, 0);
>
> Is there a reason for ignoring the return value other than "That's
> how img_convert() used to do it"?
No. Isn't that one good enough? ;-)
So the code in qcow2 says this:
if (nb_sectors == 0) {
/* align end of file to a sector boundary to ease reading with
sector based I/Os */
cluster_offset = bdrv_getlength(bs->file);
return bdrv_truncate(bs->file, cluster_offset);
}
I don't think we have any such restrictions any more, so it's mostly
useless. Perhaps ancient qemu versions would fail to read such an image,
but recent ones shouldn't.
In fact, our bdrv_pwrite() currently maps to sector-aligned functions in
the protocol driver, so I think at the moment we already get the
alignment automatically. This might change again once we convert block
drivers to byte offsets.
Kevin
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [Qemu-devel] [PATCH 1/2] qemu-img convert: Rewrite copying logic
2015-02-19 15:47 ` Kevin Wolf
@ 2015-02-19 16:01 ` Max Reitz
0 siblings, 0 replies; 9+ messages in thread
From: Max Reitz @ 2015-02-19 16:01 UTC (permalink / raw)
To: Kevin Wolf; +Cc: qemu-devel, stefanha
On 2015-02-19 at 10:47, Kevin Wolf wrote:
> Am 18.02.2015 um 18:18 hat Max Reitz geschrieben:
[snip]
>> Actually, now that I'm looking at is_allocated_sectors_min(), if the
>> first sectors is not allocated, it will return false and thus both
>> the current qemu-img convert and the new one after this patch won't
>> write data. That's a bug, I think (because it is guaranteed by the
>> man page).
> Or the description in the man page is wrong.
>
> The intention with -S was that we avoid splitting up writes into too
> many small chunks because that costs a lot of time. If you look at it
> from that angle, it's doing exactly the right thing because skipping
> zeros at the start doesn't increase the number of write requests.
Feel free to fix it. But I remember someone recently asking about
preallocation for qemu-img convert in the #qemu IRC channel, and "-S 0"
was one of the valid answers.
The nice thing about -S 0 is that it works with all image formats; I
know that qcow2 is always our main concern (especially when it comes to
the target of qemu-img convert), but I think using -S 0 for
preallocation is justified.
Considering that in this case the man page was lying (because the code
did not allocate everything), I'd be fine with fixing up the man page
and leaving the behavior as-is, though, too.
[snip]
>>> +
>>> + if (s->compressed) {
>>> + /* signal EOF to align */
>>> + bdrv_write_compressed(s->target, 0, NULL, 0);
>> Is there a reason for ignoring the return value other than "That's
>> how img_convert() used to do it"?
> No. Isn't that one good enough? ;-)
I don't know, your commit message states that the old implementation is
buggy, so I don't think that's enough. :-)
> So the code in qcow2 says this:
>
> if (nb_sectors == 0) {
> /* align end of file to a sector boundary to ease reading with
> sector based I/Os */
> cluster_offset = bdrv_getlength(bs->file);
> return bdrv_truncate(bs->file, cluster_offset);
> }
>
> I don't think we have any such restrictions any more, so it's mostly
> useless. Perhaps ancient qemu versions would fail to read such an image,
> but recent ones shouldn't.
>
> In fact, our bdrv_pwrite() currently maps to sector-aligned functions in
> the protocol driver, so I think at the moment we already get the
> alignment automatically. This might change again once we convert block
> drivers to byte offsets.
So you're intending to drop the bdrv_write_compressed() completely? I'd
be fine with that as well.
Max
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [Qemu-devel] [PATCH 2/2] qemu-iotests: Some qemu-img convert tests
2015-02-18 15:19 ` [Qemu-devel] [PATCH 2/2] qemu-iotests: Some qemu-img convert tests Kevin Wolf
2015-02-18 17:24 ` Max Reitz
@ 2015-03-19 14:41 ` Max Reitz
2015-03-19 14:43 ` Max Reitz
1 sibling, 1 reply; 9+ messages in thread
From: Max Reitz @ 2015-03-19 14:41 UTC (permalink / raw)
To: Kevin Wolf, qemu-devel; +Cc: stefanha
On 2015-02-18 at 10:19, Kevin Wolf wrote:
> This adds a regression test for some problems that the qemu-img convert
> rewrite just fixed.
>
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
> tests/qemu-iotests/122 | 83 ++++++++++++++++++++++++++++++++++++++++++++++
> tests/qemu-iotests/122.out | 30 +++++++++++++++++
> tests/qemu-iotests/group | 1 +
> 3 files changed, 114 insertions(+)
> create mode 100755 tests/qemu-iotests/122
> create mode 100644 tests/qemu-iotests/122.out
[snip]
> diff --git a/tests/qemu-iotests/group b/tests/qemu-iotests/group
> index 4b2b93b..28c78e5 100644
> --- a/tests/qemu-iotests/group
> +++ b/tests/qemu-iotests/group
> @@ -117,3 +117,4 @@
> 113 rw auto quick
> 114 rw auto quick
> 116 rw auto quick
> +122 rw auto quick
This test takes 17 s on my HDD, and 7 s in tmpfs. Are you sure it should
be added to the "quick" group?
Other than that, looks good.
Max
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [Qemu-devel] [PATCH 2/2] qemu-iotests: Some qemu-img convert tests
2015-03-19 14:41 ` Max Reitz
@ 2015-03-19 14:43 ` Max Reitz
0 siblings, 0 replies; 9+ messages in thread
From: Max Reitz @ 2015-03-19 14:43 UTC (permalink / raw)
To: Kevin Wolf, qemu-devel; +Cc: stefanha
On 2015-03-19 at 10:41, Max Reitz wrote:
> On 2015-02-18 at 10:19, Kevin Wolf wrote:
>> This adds a regression test for some problems that the qemu-img convert
>> rewrite just fixed.
>>
>> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
>> ---
>> tests/qemu-iotests/122 | 83
>> ++++++++++++++++++++++++++++++++++++++++++++++
>> tests/qemu-iotests/122.out | 30 +++++++++++++++++
>> tests/qemu-iotests/group | 1 +
>> 3 files changed, 114 insertions(+)
>> create mode 100755 tests/qemu-iotests/122
>> create mode 100644 tests/qemu-iotests/122.out
>
> [snip]
>
>> diff --git a/tests/qemu-iotests/group b/tests/qemu-iotests/group
>> index 4b2b93b..28c78e5 100644
>> --- a/tests/qemu-iotests/group
>> +++ b/tests/qemu-iotests/group
>> @@ -117,3 +117,4 @@
>> 113 rw auto quick
>> 114 rw auto quick
>> 116 rw auto quick
>> +122 rw auto quick
>
> This test takes 17 s on my HDD, and 7 s in tmpfs. Are you sure it
> should be added to the "quick" group?
>
> Other than that, looks good.
>
> Max
Argh, disregard that, I had the wrong version open. Sorry.
Max
^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2015-03-19 14:43 UTC | newest]
Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2015-02-18 15:19 [Qemu-devel] [PATCH 0/2] qemu-img convert: Rewrite copying logic Kevin Wolf
2015-02-18 15:19 ` [Qemu-devel] [PATCH 1/2] " Kevin Wolf
2015-02-18 17:18 ` Max Reitz
2015-02-19 15:47 ` Kevin Wolf
2015-02-19 16:01 ` Max Reitz
2015-02-18 15:19 ` [Qemu-devel] [PATCH 2/2] qemu-iotests: Some qemu-img convert tests Kevin Wolf
2015-02-18 17:24 ` Max Reitz
2015-03-19 14:41 ` Max Reitz
2015-03-19 14:43 ` Max Reitz
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).