From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:34975)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <mreitz@redhat.com>) id 1YO8GH-0004kr-6M
	for qemu-devel@nongnu.org; Wed, 18 Feb 2015 12:18:11 -0500
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <mreitz@redhat.com>) id 1YO8GD-0004n7-Mc
	for qemu-devel@nongnu.org; Wed, 18 Feb 2015 12:18:09 -0500
Received: from mx1.redhat.com ([209.132.183.28]:36007)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <mreitz@redhat.com>) id 1YO8GD-0004n0-DA
	for qemu-devel@nongnu.org; Wed, 18 Feb 2015 12:18:05 -0500
Received: from int-mx09.intmail.prod.int.phx2.redhat.com
	(int-mx09.intmail.prod.int.phx2.redhat.com [10.5.11.22])
	by mx1.redhat.com (8.14.4/8.14.4) with ESMTP id t1IHI3W7015969
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256
	verify=FAIL)
	for <qemu-devel@nongnu.org>; Wed, 18 Feb 2015 12:18:04 -0500
Message-ID: <54E4C94A.2020404@redhat.com>
Date: Wed, 18 Feb 2015 12:18:02 -0500
From: Max Reitz <mreitz@redhat.com>
MIME-Version: 1.0
References: <1424272761-27554-1-git-send-email-kwolf@redhat.com>
	<1424272761-27554-2-git-send-email-kwolf@redhat.com>
In-Reply-To: <1424272761-27554-2-git-send-email-kwolf@redhat.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: quoted-printable
Subject: Re: [Qemu-devel] [PATCH 1/2] qemu-img convert: Rewrite copying logic
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: Kevin Wolf <kwolf@redhat.com>, qemu-devel@nongnu.org
Cc: stefanha@redhat.com

On 2015-02-18 at 10:19, Kevin Wolf wrote:
> The implementation of qemu-img convert is (a) messy, (b) buggy, and
> (c) less efficient than possible. The changes required to beat some
> sense into it are massive enough that incremental changes would only
> make my and the reviewers' life harder. So throw it away and reimplemen=
t
> it from scratch.
>
> Let me give some examples what I mean by messy, buggy and inefficient:
>
> (a) The copying logic of qemu-img convert has two separate branches for
>      compressed and normal target images, which roughly do the same -
>      except for a little code that handles actual differences between
>      compressed and uncompressed images, and much more code that
>      implements just a different set of optimisations and bugs. This is
>      unnecessary code duplication, and makes the code for compressed
>      output (unsurprisingly) suffer from bitrot.
>
>      The code for uncompressed ouput is run twice to count the the tota=
l
>      length for the progress bar. In the first run it just takes a
>      shortcut and runs only half the loop, and when it's done, it toggl=
es
>      a boolean, jumps out of the loop with a backwards goto and starts
>      over. Works, but pretty is something different.
>
> (b) Converting while keeping a backing file (-B option) is broken in
>      several ways. This includes not writing to the image file if the
>      input has zero clusters or data filled with zeros (ignoring that t=
he
>      backing file will be visible instead).
>
>      It also doesn't correctly limit every iteration of the copy loop t=
o
>      sectors of the same status so that too many sectors may be copied =
to
>      in the target image. For -B this gives an unexpected result, for
>      other images it just does more work than necessary.
>
>      Conversion with a compressed target completely ignores any target
>      backing file.
>
> (c) qemu-img convert skips reading and writing an area if it knows from
>      metadata that copying isn't needed (except for the bug mentioned
>      above that ignores a status change in some cases). It does, howeve=
r,
>      read from the source even if it knows that it will read zeros, and
>      then search for non-zero bytes in the read buffer, if it's possibl=
e
>      that a write might be needed.
>
> This reimplementation of the copying core reorganises the code to remov=
e
> the duplication and have a much more obvious code flow, by essentially
> splitting the copy iteration loop into three parts:
>
> 1. Find the number of contiguous sectors of the same status at the
>     current offset (This can also be called in a separate loop for the
>     copying loop in order to determine the total sectors for the progre=
ss
>     bar.)
>
> 2. Read sectors. If the status implies that there is no data there to
>     read (zero or unallocated cluster), don't do anything.
>
> 3. Write sectors depending on the status. If it's data, write it. If
>     we want the backing file to be visible (with -B), don't write it. I=
f
>     it's zeroed, skip it if you can, otherwise use bdrv_write_zeroes() =
to
>     optimise the write at least where possible.
>
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
>   qemu-img.c | 511 ++++++++++++++++++++++++++++++++++++----------------=
---------
>   1 file changed, 305 insertions(+), 206 deletions(-)

You have been very careful not to write "Rewrite img_convert()" or=20
something like that in the subject, so I can't really complain that=20
there are still bugs in the function (which are not related to the=20
copying logic), for instance:

$ ./qemu-img create -f qcow2 i1.qcow2 64M
Formatting 'i1.qcow2', fmt=3Dqcow2 size=3D67108864 encryption=3Doff=20
cluster_size=3D65536 lazy_refcounts=3Doff
$ ./qemu-img create -f qcow2 i2.qcow2 64M
Formatting 'i2.qcow2', fmt=3Dqcow2 size=3D67108864 encryption=3Doff=20
cluster_size=3D65536 lazy_refcounts=3Doff
$ ./qemu-img snapshot -c foo i1.qcow2
$ ./qemu-img snapshot -c foo i2.qcow2
$ ./qemu-io -c 'write 0 64M' i2.qcow2
wrote 67108864/67108864 bytes at offset 0
64 MiB, 1 ops; 0:00:01.32 (48.152 MiB/sec and 0.7524 ops/sec)
$ ./qemu-img convert -l snapshot.name=3Dfoo -O qcow2 i{1,2}.qcow2 o.qcow2
qemu-img: error while reading sector 131072: Input/output error

("No support for concatenating multiple snapshot" should be enforced for=20
sn_opts !=3D NULL)

> diff --git a/qemu-img.c b/qemu-img.c
> index e148af8..5c8386e 100644
> --- a/qemu-img.c
> +++ b/qemu-img.c
> @@ -1306,20 +1306,307 @@ out3:
>       return ret;
>   }
>  =20
> +enum ImgConvertBlockStatus {
> +    BLK_DATA,
> +    BLK_ZERO,
> +    BLK_BACKING_FILE,
> +};
> +
> +typedef struct ImgConvertState {
> +    BlockDriverState **src;
> +    int64_t *src_sectors;
> +    int src_cur, src_num;
> +    int64_t src_cur_offset;
> +    int64_t total_sectors;
> +    int64_t allocated_sectors;
> +    enum ImgConvertBlockStatus status;
> +    int64_t sector_next_status;
> +    BlockDriverState *target;
> +    bool has_zero_init;
> +    bool compressed;
> +    bool out_backing;

Maybe "out_backed" (or "out_is_backed" or "out_has_backing") would be=20
better; "out_backing" to me sounds like this should describe the backing=20
file.

> +    int min_sparse;
> +    size_t cluster_sectors;
> +    size_t buf_sectors;
> +} ImgConvertState;
> +
> +static void convert_select_part(ImgConvertState *s, int64_t sector_num=
)
> +{
> +    while (sector_num - s->src_cur_offset >=3D s->src_sectors[s->src_c=
ur]) {
> +        s->src_cur_offset +=3D s->src_sectors[s->src_cur];
> +        s->src_cur++;
> +        assert(s->src_cur < s->src_num);
> +    }
> +}
> +
> +static int convert_iteration_sectors(ImgConvertState *s, int64_t secto=
r_num)
> +{
> +    int64_t ret;
> +    int n;
> +
> +    convert_select_part(s, sector_num);
> +
> +    assert(s->total_sectors > sector_num);
> +    n =3D MIN(s->total_sectors - sector_num, INT_MAX);

Maybe it would be better to use INT_MAX / BDRV_SECTOR_SIZE here (and in=20
other places, too)... In practice, this would only be relevant to reads=20
and writes, though, and those are limited by s->buf_sectors.

> +
> +    if (s->sector_next_status <=3D sector_num) {
> +        ret =3D bdrv_get_block_status(s->src[s->src_cur],
> +                                    sector_num - s->src_cur_offset,
> +                                    n, &n);
> +        if (ret < 0) {
> +            return ret;
> +        }
> +
> +        if (ret & BDRV_BLOCK_ZERO) {
> +            s->status =3D BLK_ZERO;
> +        } else if (ret & BDRV_BLOCK_DATA) {
> +            s->status =3D BLK_DATA;
> +        } else if (!s->out_backing) {
> +            /* Without a target backing file we must copy over the con=
tents of
> +             * the backing file as well. */
> +            /* TODO Check block status of the backing file chain to av=
oid
> +             * needlessly reading zeroes and limiting the iteration to=
 the
> +             * buffer size */
> +            s->status =3D BLK_DATA;
> +        } else {
> +            s->status =3D BLK_BACKING_FILE;
> +        }
> +
> +        s->sector_next_status =3D sector_num + n;
> +    }
> +
> +    n =3D MIN(n, s->sector_next_status - sector_num);
> +    if (s->status =3D=3D BLK_DATA) {
> +        n =3D MIN(n, s->buf_sectors);
> +    }
> +
> +    /* We need to write complete clusters for compressed images, so if=
 an
> +     * unallocated area is shorter than that, we must consider the who=
le
> +     * cluster allocated. */
> +    if (s->compressed) {
> +        if (n < s->cluster_sectors) {
> +            n =3D MIN(s->cluster_sectors, s->total_sectors - sector_nu=
m);
> +            s->status =3D BLK_DATA;
> +        } else {
> +            n =3D QEMU_ALIGN_DOWN(n, s->cluster_sectors);
> +        }
> +    }
> +
> +    return n;
> +}
> +
> +static int convert_read(ImgConvertState *s, int64_t sector_num, int nb=
_sectors,
> +                        uint8_t *buf)
> +{
> +    int n;
> +    int ret;
> +
> +    if (s->status =3D=3D BLK_ZERO || s->status =3D=3D BLK_BACKING_FILE=
) {
> +        return 0;
> +    }
> +
> +    assert(nb_sectors <=3D s->buf_sectors);
> +    while (nb_sectors > 0) {
> +        BlockDriverState *bs;
> +        int64_t bs_sectors;
> +
> +        /* In the case of compression with multiple source files, we c=
an get a
> +         * nb_sectors that spreads into the next part. So we must be a=
ble to
> +         * read across multiple BDSes for one convert_read() call. */
> +        convert_select_part(s, sector_num);
> +        bs =3D s->src[s->src_cur];
> +        bs_sectors =3D s->src_sectors[s->src_cur];
> +
> +        n =3D MIN(nb_sectors, bs_sectors - (sector_num - s->src_cur_of=
fset));
> +        ret =3D bdrv_read(bs, sector_num, buf, n);

Shouldn't this be sector_num - s->src_cur_offset?

> +        if (ret < 0) {
> +            return ret;
> +        }
> +
> +        sector_num +=3D n;
> +        nb_sectors -=3D n;
> +        buf +=3D n * BDRV_SECTOR_SIZE;
> +    }
> +
> +    return 0;
> +}
> +
> +static int convert_write(ImgConvertState *s, int64_t sector_num, int n=
b_sectors,
> +                         const uint8_t *buf)
> +{
> +    int ret;
> +
> +    while (nb_sectors > 0) {
> +

This empty line looks a bit strange to me...

> +        int n =3D nb_sectors;
> +
> +        switch (s->status) {
> +        case BLK_BACKING_FILE:
> +            /* If we have a backing file, leave clusters unallocated t=
hat are
> +             * unallocated in the source image, so that the backing fi=
le is
> +             * visible at the respective offset. */
> +            assert(s->out_backing);
> +            break;
> +
> +        case BLK_DATA:
> +            /* We must always write compressed clusters as a whole, so=
 don't
> +             * try to find zeroed parts in the buffer. We can only sav=
e the
> +             * write if the buffer is completely zeroed. */

But you shouldn't be doing that if n < s->min_sparse, I think (if people=20
specify -S 0, they don't want you to zero anything, and that's=20
guaranteed by the man page). Considering that is_allocated_sectors_min()=20
basically has a min =3D MIN(min, n) part, too, I'd be fine with making -S=
=20
0 a special case here ("if (s->has_zero_init && s->min_sparse &&=20
buffer_is_zero(...))").

Actually, now that I'm looking at is_allocated_sectors_min(), if the=20
first sectors is not allocated, it will return false and thus both the=20
current qemu-img convert and the new one after this patch won't write=20
data. That's a bug, I think (because it is guaranteed by the man page).

> +            if (s->compressed) {
> +                if (s->has_zero_init &&
> +                    buffer_is_zero(buf, n * BDRV_SECTOR_SIZE))
> +                {
> +                    assert(!s->out_backing);
> +                    break;
> +                }
> +
> +                ret =3D bdrv_write_compressed(s->target, sector_num, b=
uf, n);
> +                if (ret < 0) {
> +                    return ret;
> +                }
> +                break;
> +            }
> +
> +            /* If there is real non-zero data, we must write it. Other=
wise we
> +             * can treat it as zero sectors. */
> +            if (is_allocated_sectors_min(buf, n, &n, s->min_sparse)) {

So I think this should be "if (!s->min_sparse || ...)".

> +                ret =3D bdrv_write(s->target, sector_num, buf, n);
> +                if (ret < 0) {
> +                    return ret;
> +                }
> +                break;
> +            }
> +            /* fall-through */
> +
> +        case BLK_ZERO:
> +            if (s->has_zero_init) {

And I don't really know what to do about this. Should we write zeros if=20
!s->min_sparse? Or is this a special case and the user can't expect=20
qemu-img convert to really write zeros if the source image contained=20
unallocated/zero clusters? That would make sense, but once again, the=20
man page clearly states that "If sparse_size is 0, [=E2=80=A6] the destin=
ation=20
image will always be fully allocated." Maybe we could argue that "fully"=20
in this case means "as fully as the source image was allocated".

> +                break;
> +            }
> +            /* TODO Should this use BDRV_REQ_MAY_UNMAP? */

Can't we remove this comment? If BDRV_REQ_MAP_UNMAP results in zeros=20
read, bdrv_make_zero() below will have been invoked and thus=20
s->has_zero_init =3D true (unless you take my suggestion and leave=20
s->has_zero_init false if bdrv_make_zero() failed, but I wouldn't really=20
mind that case here).

> +            ret =3D bdrv_write_zeroes(s->target, sector_num, n, 0);
> +            if (ret < 0) {
> +                return ret;
> +            }
> +            break;
> +        }
> +
> +        sector_num +=3D n;
> +        nb_sectors -=3D n;
> +        buf +=3D n * BDRV_SECTOR_SIZE;
> +    }
> +
> +    return 0;
> +}
> +
> +static int convert_do_copy(ImgConvertState *s)
> +{
> +    uint8_t *buf =3D NULL;
> +    int64_t sector_num, allocated_done;
> +    int ret;
> +    int n;
> +
> +    /* Check whether we have zero initialisation or can get it efficie=
ntly */
> +    s->has_zero_init =3D s->min_sparse && !s->out_backing
> +                     ? bdrv_has_zero_init(s->target)
> +                     : false;
> +
> +    if (!s->has_zero_init && !s->out_backing &&
> +        bdrv_can_write_zeroes_with_unmap(s->target))
> +    {
> +        ret =3D bdrv_make_zero(s->target, BDRV_REQ_MAY_UNMAP);
> +        if (ret < 0) {
> +            goto fail;

I think just leaving s->has_zero_init false should be fine, too (instead=20
of failing).

> +        }
> +        s->has_zero_init =3D 1;

s/1/true/

> +    }
> +
> +    /* Allocate buffer for copied data. For compressed images, only on=
e cluster
> +     * can be copied at a time. */
> +    if (s->compressed) {
> +        if (s->cluster_sectors <=3D 0 || s->cluster_sectors > s->buf_s=
ectors) {
> +            error_report("invalid cluster size");
> +            ret =3D -EINVAL;
> +            goto fail;
> +        }
> +        s->buf_sectors =3D s->cluster_sectors;
> +    }
> +    buf =3D qemu_blockalign(s->target, s->buf_sectors * BDRV_SECTOR_SI=
ZE);
> +
> +    /* Calculate allocated sectors for progress */
> +    s->allocated_sectors =3D 0;
> +    sector_num =3D 0;
> +    while (sector_num < s->total_sectors) {
> +        n =3D convert_iteration_sectors(s, sector_num);
> +        if (n < 0) {
> +            ret =3D n;
> +            goto fail;
> +        }
> +        if (s->status =3D=3D BLK_DATA) {
> +            s->allocated_sectors +=3D n;
> +        }
> +        sector_num +=3D n;
> +    }
> +
> +    /* Do the copy */
> +    s->src_cur =3D 0;
> +    s->src_cur_offset =3D 0;
> +    s->sector_next_status =3D 0;
> +
> +    sector_num =3D 0;
> +    allocated_done =3D 0;
> +
> +    while (sector_num < s->total_sectors) {
> +        n =3D convert_iteration_sectors(s, sector_num);
> +        if (n < 0) {
> +            ret =3D n;
> +            goto fail;
> +        }
> +        if (s->status =3D=3D BLK_DATA) {
> +            allocated_done +=3D n;
> +            qemu_progress_print(100.0 * allocated_done / s->allocated_=
sectors,
> +                                0);
> +        }
> +
> +        ret =3D convert_read(s, sector_num, n, buf);
> +        if (ret < 0) {
> +            error_report("error while reading sector %" PRId64
> +                         ": %s", sector_num, strerror(-ret));
> +            goto fail;
> +        }
> +
> +        ret =3D convert_write(s, sector_num, n, buf);
> +        if (ret < 0) {
> +            error_report("error while writing sector %" PRId64
> +                         ": %s", sector_num, strerror(-ret));
> +            goto fail;
> +        }
> +
> +        sector_num +=3D n;
> +    }
> +
> +    if (s->compressed) {
> +        /* signal EOF to align */
> +        bdrv_write_compressed(s->target, 0, NULL, 0);

Is there a reason for ignoring the return value other than "That's how=20
img_convert() used to do it"?


All in all, this patch looks pretty good to me and certainly makes the=20
function much more clear. But as the bdrv_read() call probably needs to=20
be fixed, and because I think the -S 0 behavior is buggy, I cannot give=20
an R-b at this point (but I will if those are fixed, or if you tell me=20
why they don't need to be).

Max

> +    }
> +
> +    ret =3D 0;
> +fail:
> +    qemu_vfree(buf);
> +    return ret;
> +}
> +
>   static int img_convert(int argc, char **argv)
>   {
> -    int c, n, n1, bs_n, bs_i, compress, cluster_sectors, skip_create;
> +    int c, bs_n, bs_i, compress, cluster_sectors, skip_create;
>       int64_t ret =3D 0;
>       int progress =3D 0, flags, src_flags;
>       const char *fmt, *out_fmt, *cache, *src_cache, *out_baseimg, *out=
_filename;
>       BlockDriver *drv, *proto_drv;
>       BlockBackend **blk =3D NULL, *out_blk =3D NULL;
>       BlockDriverState **bs =3D NULL, *out_bs =3D NULL;
> -    int64_t total_sectors, nb_sectors, sector_num, bs_offset;
> +    int64_t total_sectors;
>       int64_t *bs_sectors =3D NULL;
> -    uint8_t * buf =3D NULL;
>       size_t bufsectors =3D IO_BUF_SIZE / BDRV_SECTOR_SIZE;
> -    const uint8_t *buf1;
>       BlockDriverInfo bdi;
>       QemuOpts *opts =3D NULL;
>       QemuOptsList *create_opts =3D NULL;
> @@ -1330,6 +1617,7 @@ static int img_convert(int argc, char **argv)
>       bool quiet =3D false;
>       Error *local_err =3D NULL;
>       QemuOpts *sn_opts =3D NULL;
> +    ImgConvertState state;
>  =20
>       fmt =3D NULL;
>       out_fmt =3D "raw";
> @@ -1622,9 +1910,6 @@ static int img_convert(int argc, char **argv)
>       }
>       out_bs =3D blk_bs(out_blk);
>  =20
> -    bs_i =3D 0;
> -    bs_offset =3D 0;
> -
>       /* increase bufsectors from the default 4096 (2M) if opt_transfer=
_length
>        * or discard_alignment of the out_bs is greater. Limit to 32768 =
(16MB)
>        * as maximum. */
> @@ -1633,8 +1918,6 @@ static int img_convert(int argc, char **argv)
>                                            out_bs->bl.discard_alignment=
))
>                       );
>  =20
> -    buf =3D qemu_blockalign(out_bs, bufsectors * BDRV_SECTOR_SIZE);
> -
>       if (skip_create) {
>           int64_t output_sectors =3D bdrv_nb_sectors(out_bs);
>           if (output_sectors < 0) {
> @@ -1661,203 +1944,20 @@ static int img_convert(int argc, char **argv)
>           cluster_sectors =3D bdi.cluster_size / BDRV_SECTOR_SIZE;
>       }
>  =20
> -    if (compress) {
> -        if (cluster_sectors <=3D 0 || cluster_sectors > bufsectors) {
> -            error_report("invalid cluster size");
> -            ret =3D -1;
> -            goto out;
> -        }
> -        sector_num =3D 0;
> -
> -        nb_sectors =3D total_sectors;
> -
> -        for(;;) {
> -            int64_t bs_num;
> -            int remainder;
> -            uint8_t *buf2;
> -
> -            nb_sectors =3D total_sectors - sector_num;
> -            if (nb_sectors <=3D 0)
> -                break;
> -            if (nb_sectors >=3D cluster_sectors)
> -                n =3D cluster_sectors;
> -            else
> -                n =3D nb_sectors;
> -
> -            bs_num =3D sector_num - bs_offset;
> -            assert (bs_num >=3D 0);
> -            remainder =3D n;
> -            buf2 =3D buf;
> -            while (remainder > 0) {
> -                int nlow;
> -                while (bs_num =3D=3D bs_sectors[bs_i]) {
> -                    bs_offset +=3D bs_sectors[bs_i];
> -                    bs_i++;
> -                    assert (bs_i < bs_n);
> -                    bs_num =3D 0;
> -                    /* printf("changing part: sector_num=3D%" PRId64 "=
, "
> -                       "bs_i=3D%d, bs_offset=3D%" PRId64 ", bs_sectors=
=3D%" PRId64
> -                       "\n", sector_num, bs_i, bs_offset, bs_sectors[b=
s_i]); */
> -                }
> -                assert (bs_num < bs_sectors[bs_i]);
> -
> -                nlow =3D remainder > bs_sectors[bs_i] - bs_num
> -                    ? bs_sectors[bs_i] - bs_num : remainder;
> -
> -                ret =3D bdrv_read(bs[bs_i], bs_num, buf2, nlow);
> -                if (ret < 0) {
> -                    error_report("error while reading sector %" PRId64=
 ": %s",
> -                                 bs_num, strerror(-ret));
> -                    goto out;
> -                }
> -
> -                buf2 +=3D nlow * 512;
> -                bs_num +=3D nlow;
> -
> -                remainder -=3D nlow;
> -            }
> -            assert (remainder =3D=3D 0);
> -
> -            if (!buffer_is_zero(buf, n * BDRV_SECTOR_SIZE)) {
> -                ret =3D bdrv_write_compressed(out_bs, sector_num, buf,=
 n);
> -                if (ret !=3D 0) {
> -                    error_report("error while compressing sector %" PR=
Id64
> -                                 ": %s", sector_num, strerror(-ret));
> -                    goto out;
> -                }
> -            }
> -            sector_num +=3D n;
> -            qemu_progress_print(100.0 * sector_num / total_sectors, 0)=
;
> -        }
> -        /* signal EOF to align */
> -        bdrv_write_compressed(out_bs, 0, NULL, 0);
> -    } else {
> -        int64_t sectors_to_read, sectors_read, sector_num_next_status;
> -        bool count_allocated_sectors;
> -        int has_zero_init =3D min_sparse ? bdrv_has_zero_init(out_bs) =
: 0;
> -
> -        if (!has_zero_init && bdrv_can_write_zeroes_with_unmap(out_bs)=
) {
> -            ret =3D bdrv_make_zero(out_bs, BDRV_REQ_MAY_UNMAP);
> -            if (ret < 0) {
> -                goto out;
> -            }
> -            has_zero_init =3D 1;
> -        }
> -
> -        sectors_to_read =3D total_sectors;
> -        count_allocated_sectors =3D progress && (out_baseimg || has_ze=
ro_init);
> -restart:
> -        sector_num =3D 0; // total number of sectors converted so far
> -        sectors_read =3D 0;
> -        sector_num_next_status =3D 0;
> -
> -        for(;;) {
> -            nb_sectors =3D total_sectors - sector_num;
> -            if (nb_sectors <=3D 0) {
> -                if (count_allocated_sectors) {
> -                    sectors_to_read =3D sectors_read;
> -                    count_allocated_sectors =3D false;
> -                    goto restart;
> -                }
> -                ret =3D 0;
> -                break;
> -            }
> -
> -            while (sector_num - bs_offset >=3D bs_sectors[bs_i]) {
> -                bs_offset +=3D bs_sectors[bs_i];
> -                bs_i ++;
> -                assert (bs_i < bs_n);
> -                /* printf("changing part: sector_num=3D%" PRId64 ", bs=
_i=3D%d, "
> -                  "bs_offset=3D%" PRId64 ", bs_sectors=3D%" PRId64 "\n=
",
> -                   sector_num, bs_i, bs_offset, bs_sectors[bs_i]); */
> -            }
> -
> -            if ((out_baseimg || has_zero_init) &&
> -                sector_num >=3D sector_num_next_status) {
> -                n =3D nb_sectors > INT_MAX ? INT_MAX : nb_sectors;
> -                ret =3D bdrv_get_block_status(bs[bs_i], sector_num - b=
s_offset,
> -                                            n, &n1);
> -                if (ret < 0) {
> -                    error_report("error while reading block status of =
sector %"
> -                                 PRId64 ": %s", sector_num - bs_offset=
,
> -                                 strerror(-ret));
> -                    goto out;
> -                }
> -                /* If the output image is zero initialized, we are not=
 working
> -                 * on a shared base and the input is zero we can skip =
the next
> -                 * n1 sectors */
> -                if (has_zero_init && !out_baseimg && (ret & BDRV_BLOCK=
_ZERO)) {
> -                    sector_num +=3D n1;
> -                    continue;
> -                }
> -                /* If the output image is being created as a copy on w=
rite
> -                 * image, assume that sectors which are unallocated in=
 the
> -                 * input image are present in both the output's and in=
put's
> -                 * base images (no need to copy them). */
> -                if (out_baseimg) {
> -                    if (!(ret & BDRV_BLOCK_DATA)) {
> -                        sector_num +=3D n1;
> -                        continue;
> -                    }
> -                    /* The next 'n1' sectors are allocated in the inpu=
t image.
> -                     * Copy only those as they may be followed by unal=
located
> -                     * sectors. */
> -                    nb_sectors =3D n1;
> -                }
> -                /* avoid redundant callouts to get_block_status */
> -                sector_num_next_status =3D sector_num + n1;
> -            }
> -
> -            n =3D MIN(nb_sectors, bufsectors);
> -
> -            /* round down request length to an aligned sector, but
> -             * do not bother doing this on short requests. They happen
> -             * when we found an all-zero area, and the next sector to
> -             * write will not be sector_num + n. */
> -            if (cluster_sectors > 0 && n >=3D cluster_sectors) {
> -                int64_t next_aligned_sector =3D (sector_num + n);
> -                next_aligned_sector -=3D next_aligned_sector % cluster=
_sectors;
> -                if (sector_num + n > next_aligned_sector) {
> -                    n =3D next_aligned_sector - sector_num;
> -                }
> -            }
> -
> -            n =3D MIN(n, bs_sectors[bs_i] - (sector_num - bs_offset));
> -
> -            sectors_read +=3D n;
> -            if (count_allocated_sectors) {
> -                sector_num +=3D n;
> -                continue;
> -            }
> +    state =3D (ImgConvertState) {
> +        .src                =3D bs,
> +        .src_sectors        =3D bs_sectors,
> +        .src_num            =3D bs_n,
> +        .total_sectors      =3D total_sectors,
> +        .target             =3D out_bs,
> +        .compressed         =3D compress,
> +        .out_backing        =3D (bool) out_baseimg,
> +        .min_sparse         =3D min_sparse,
> +        .cluster_sectors    =3D cluster_sectors,
> +        .buf_sectors        =3D bufsectors,
> +    };
> +    ret =3D convert_do_copy(&state);
>  =20
> -            n1 =3D n;
> -            ret =3D bdrv_read(bs[bs_i], sector_num - bs_offset, buf, n=
);
> -            if (ret < 0) {
> -                error_report("error while reading sector %" PRId64 ": =
%s",
> -                             sector_num - bs_offset, strerror(-ret));
> -                goto out;
> -            }
> -            /* NOTE: at the same time we convert, we do not write zero
> -               sectors to have a chance to compress the image. Ideally=
, we
> -               should add a specific call to have the info to go faste=
r */
> -            buf1 =3D buf;
> -            while (n > 0) {
> -                if (!has_zero_init ||
> -                    is_allocated_sectors_min(buf1, n, &n1, min_sparse)=
) {
> -                    ret =3D bdrv_write(out_bs, sector_num, buf1, n1);
> -                    if (ret < 0) {
> -                        error_report("error while writing sector %" PR=
Id64
> -                                     ": %s", sector_num, strerror(-ret=
));
> -                        goto out;
> -                    }
> -                }
> -                sector_num +=3D n1;
> -                n -=3D n1;
> -                buf1 +=3D n1 * 512;
> -            }
> -            qemu_progress_print(100.0 * sectors_read / sectors_to_read=
, 0);
> -        }
> -    }
>   out:
>       if (!ret) {
>           qemu_progress_print(100, 0);
> @@ -1865,7 +1965,6 @@ out:
>       qemu_progress_end();
>       qemu_opts_del(opts);
>       qemu_opts_free(create_opts);
> -    qemu_vfree(buf);
>       qemu_opts_del(sn_opts);
>       blk_unref(out_blk);
>       g_free(bs);