From: Max Reitz
Date: Fri, 02 May 2014 15:45:32 +0200
Subject: Re: [Qemu-devel] [PATCH v6 03/14] qcow2: Optimize bdrv_make_empty()
To: Eric Blake, qemu-devel@nongnu.org
Cc: Kevin Wolf, Stefan Hajnoczi
Message-ID: <5363A17C.4000500@redhat.com>
In-Reply-To: <5360268A.8080100@redhat.com>

On 30.04.2014 00:24, Eric Blake wrote:
> On 04/29/2014 09:07 AM, Max Reitz wrote:
>> bdrv_make_empty() is currently only called if the current image
>> represents an external snapshot that has been committed to its base
>> image; it is therefore unlikely to have internal snapshots. In this
>> case, bdrv_make_empty() can be greatly sped up by creating an empty L1
>> table and dropping all data clusters at once by recreating the refcount
>> structure accordingly instead of normally discarding all clusters.
>>
>> If there are snapshots, fall back to the simple implementation (discard
>> all clusters).
>>
>> Signed-off-by: Max Reitz
>> ---
>>  block/qcow2.c | 385 +++++++++++++++++++++++++++++++++++++++++++++++++++++---
>>  1 file changed, 370 insertions(+), 15 deletions(-)
> Wow! This was a fun review, and took me quite some time. I also had to
> re-read docs/specs/qcow2.txt several times to make sure I understood all
> your numbers, but everything in here made sense.
>
> The pdf helped; I didn't spot any math errors there (I did have to
> figure out the mapping between the one-letter variables in the pdf to
> the multi-char variables in your code; such as 'x' matching 'overhead').
>
> Reviewed-by: Eric Blake

Thank you very much. :-)

> More comments below, describing some of my thoughts while reviewing.
>
> This patch undoes anything accomplished earlier by preallocation. I
> suppose that if a user wants to empty an image but still guarantee that
> it is preallocated large enough, they can do it in two steps; or they
> can do the 'commit -d' to skip emptying the image, and then just
> recreate a new qcow2 image with full preallocation and bypass the work
> of this patch. I agree that it doesn't make any sense to complicate
> this patch even more to worry about restoring a level of preallocation.
>
>> +/*
>> + * Creates a reftable pointing to refblocks following right afterwards and an
>> + * empty L1 table at the given @offset. @refblocks is the number of refblocks
>> + * to create.
>> + *
>> + * This combination of structures (reftable+refblocks+L1) is here called a
>> + * "blob".
> Indeed, the bare minimum qcow2 image is one consisting of a reftable and
> refblocks for all host clusters, and an L1 table large enough for all
> guest clusters; making the three tables contiguous and acting on a blob
> makes sense.
>
>> + */
>> +static int create_refcount_l1(BlockDriverState *bs, uint64_t offset,
>> +                              uint64_t refblocks)
>> +{
>> +    BDRVQcowState *s = bs->opaque;
>> +    uint64_t *reftable = NULL;
>> +    uint16_t *refblock = NULL;
>> +    uint64_t reftable_clusters;
>> +    uint64_t rbi;
>> +    uint64_t blob_start, blob_end;
>> +    uint64_t l2_tables, l1_clusters, l1_size2;
>> +    uint8_t l1_size_and_offset[12];
>> +    uint64_t rt_offset;
> Good; most math is in uint64_t, so I don't have to think too hard about
> overflows.
>
>> +    int ret, i;
>> +
>> +    reftable_clusters = DIV_ROUND_UP(refblocks, s->cluster_size / 8);
> Each reftable entry is 8 bytes, so this is correct for how many entries
> are packed in a cluster.
>
>> +    l2_tables = DIV_ROUND_UP(bs->total_sectors / s->cluster_sectors,
>> +                             s->cluster_size / 8);
> Each L2 entry is 8 bytes, so this is correct for how many L2 entries are
> packed in an L2 cluster. It took me a while to figure out that you have
> to know how many L2 clusters would exist in a fully populated image in
> order to size the L1 table, even though you are NOT allocating L2 clusters.
>
>> +    l1_clusters = DIV_ROUND_UP(l2_tables, s->cluster_size / 8);
> Each L1 entry is 8 bytes, so this is also correct.
>
>> +    l1_size2 = l1_clusters << s->cluster_bits;
>> +
>> +    blob_start = offset;
>> +    blob_end = offset + ((reftable_clusters + refblocks + l1_clusters) <<
>> +                         s->cluster_bits);
>> +
>> +    ret = qcow2_pre_write_overlap_check(bs, 0, blob_start,
>> +                                        blob_end - blob_start);
> The caller allocated all three parts of the blob but has not yet put
> them to use; so this makes sense that we should have full access as the
> first user of each cluster in the range.
>
>> +    if (ret < 0) {
>> +        return ret;
>> +    }
>> +
>> +    /* Create the reftable pointing with entries pointing consecutively to the
>> +     * following clusters */
>> +    reftable = g_malloc0_n(reftable_clusters, s->cluster_size);
> The zero initialization is overkill for all but the last cluster, if the
> guest image size is not an even multiple of cluster size. But it
> doesn't hurt to be safe. You COULD have optimized further by leaving 0
> entries for refcount blocks that would all contain 0 entries, but that
> would make the code more complex; furthermore, you call this blob
> initialization function twice in the normal case, but only the first
> call will generate any refcount blocks containing all zeros. In the
> second call, when the refcount block is now hoisted to the front of the
> file, all refcount blocks will be non-zero.
>
>> +
>> +    for (rbi = 0; rbi < refblocks; rbi++) {
>> +        reftable[rbi] = cpu_to_be64(offset + ((reftable_clusters + rbi) <<
>> +                                              s->cluster_bits));
>> +    }
> Straightforward mapping. All but the tail will now be mapped into
> consecutive refblocks.
>
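To put some concrete numbers on the sizing above (purely illustrative figures of my own, not anything from the patch): for a 1 GiB image with 64 KiB clusters, the same DIV_ROUND_UP arithmetic as in create_refcount_l1() gives a three-cluster blob, so the emptied image ends up being four clusters including the header:

#include <stdint.h>
#include <stdio.h>

#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

int main(void)
{
    /* Assumed example parameters: 1 GiB virtual disk, 64 KiB clusters */
    const uint64_t cluster_size    = 64 * 1024;
    const uint64_t cluster_sectors = cluster_size / 512;
    const uint64_t total_sectors   = (1ULL << 30) / 512;

    /* Same formulas as in create_refcount_l1() */
    uint64_t l2_tables   = DIV_ROUND_UP(total_sectors / cluster_sectors,
                                        cluster_size / 8);
    uint64_t l1_clusters = DIV_ROUND_UP(l2_tables, cluster_size / 8);

    /* Blob placed right after the header: the refblocks must cover the
     * header plus the blob itself; guessing a 3-cluster blob turns out to
     * be self-consistent */
    uint64_t refblocks         = DIV_ROUND_UP(1 + 3, cluster_size / 2);
    uint64_t reftable_clusters = DIV_ROUND_UP(refblocks, cluster_size / 8);

    printf("l2_tables=%llu l1_clusters=%llu refblocks=%llu reftable_clusters=%llu\n",
           (unsigned long long)l2_tables, (unsigned long long)l1_clusters,
           (unsigned long long)refblocks, (unsigned long long)reftable_clusters);
    /* Prints 2, 1, 1, 1: the emptied image is header + reftable + refblock +
     * L1 = 4 clusters = 256 KiB, regardless of how big it was before. */
    return 0;
}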
>> +
>> +    ret = bdrv_write(bs->file, offset >> BDRV_SECTOR_BITS, (uint8_t *)reftable,
>> +                     reftable_clusters * s->cluster_sectors);
>> +    if (ret < 0) {
>> +        goto out;
>> +    }
>> +
>> +    offset += reftable_clusters << s->cluster_bits;
>> +
>> +    /* Keep the reftable, as we will need it for the BDRVQcowState anyway */
>> +
>> +    /* Now, create all the refblocks */
>> +    refblock = g_malloc(s->cluster_size);
>> +
>> +    for (rbi = 0; rbi < refblocks; rbi++) {
>> +        for (i = 0; i < s->cluster_size / 2; i++) {
>> +            uint64_t cluster_offset = ((rbi << (s->cluster_bits - 1)) + i)
>> +                                      << s->cluster_bits;
>> +
>> +            /* Only 0 and 1 are possible as refcounts here */
>> +            refblock[i] = cpu_to_be16(!cluster_offset ||
>> +                                      (cluster_offset >= blob_start &&
>> +                                       cluster_offset < blob_end));
> Caller passed in refblocks, but you are correct that each refblock entry
> is 2 bytes long.
>
> The image header and all clusters within the blob will have a refcount
> of 1, all other clusters will have a refcount of 0. Of course, at this
> point, this is just pre-populating the table, it is not actually in use yet.
>
>> +        }
>> +
>> +        ret = bdrv_write(bs->file, offset >> BDRV_SECTOR_BITS,
>> +                         (uint8_t *)refblock, s->cluster_sectors);
>> +        if (ret < 0) {
>> +            goto out;
>> +        }
>> +
>> +        offset += s->cluster_size;
>> +    }
>> +
>> +    g_free(refblock);
>> +    refblock = NULL;
>> +
>> +    /* The L1 table is very simple */
>> +    ret = bdrv_write_zeroes(bs->file, offset >> BDRV_SECTOR_BITS,
>> +                            l1_clusters * s->cluster_sectors, 0);
>> +    if (ret < 0) {
>> +        goto out;
>> +    }
> Sets up the new L1 table to point to all L2 tables being unallocated.
> Any reason you aren't passing BDRV_REQ_MAY_UNMAP, to make the allocated
> size of the image even smaller when supported?

I just didn't think about it at the moment. You're right, it's probably even better to unmap this area.

>> +
>> +    /* Now make sure all changes are stable and clear all caches */
>> +    bdrv_flush(bs);
>> +
>> +    /* This is probably not really necessary, but it cannot hurt, either */
>> +    ret = qcow2_mark_clean(bs);
>> +    if (ret < 0) {
>> +        goto out;
>> +    }
>> +
>> +    ret = qcow2_cache_empty(bs, s->l2_table_cache);
>> +    if (ret < 0) {
>> +        goto out;
>> +    }
>> +
>> +    ret = qcow2_cache_empty(bs, s->refcount_block_cache);
>> +    if (ret < 0) {
>> +        goto out;
>> +    }
>> +
>> +    /* Modify the image header to point to the new blank L1 table. This will
>> +     * leak all currently existing data clusters, which is fine. */
>> +    BLKDBG_EVENT(bs->file, BLKDBG_L1_UPDATE);
>> +
>> +    assert(l1_size2 / 8 <= UINT32_MAX);
>> +    cpu_to_be32w((uint32_t *)&l1_size_and_offset[0], l1_size2 / 8);
>> +    cpu_to_be64w((uint64_t *)&l1_size_and_offset[4], offset);
>> +    ret = bdrv_pwrite_sync(bs->file, offsetof(QCowHeader, l1_size),
>> +                           l1_size_and_offset, sizeof(l1_size_and_offset));
>> +    if (ret < 0) {
>> +        goto out;
>> +    }
> Pivots the active L1 table; so now the guest sees the entire backing
> file due to all L2 entries being unallocated; the refcount table is
> still unpivoted. This failure point is where the blob is leaked, but
> the old image is still consistent.
>
>> +
>> +    /* Adapt the in-memory L1 table accordingly */
>> +    s->l1_table_offset = offset;
>> +    s->l1_size = l1_size2 / 8;
>> +
>> +    s->l1_table = g_realloc(s->l1_table, l1_size2);
>> +    memset(s->l1_table, 0, l1_size2);
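A side note on the 12-byte l1_size_and_offset buffer a few lines up, mostly for other readers: if I read the header layout in docs/specs/qcow2.txt correctly, it works because the 4-byte l1_size field is immediately followed by the 8-byte l1_table_offset field, so a single bdrv_pwrite_sync() at offsetof(QCowHeader, l1_size) updates both fields in one write. A rough standalone sketch of the packing (my code, not qemu's; the helper names are made up):

#include <stdint.h>

/* Store 32-/64-bit values big-endian, byte by byte, host-independent. */
static void put_be32(uint8_t *p, uint32_t v)
{
    p[0] = v >> 24; p[1] = v >> 16; p[2] = v >> 8; p[3] = v;
}

static void put_be64(uint8_t *p, uint64_t v)
{
    put_be32(p, v >> 32);
    put_be32(p + 4, (uint32_t)v);
}

/* buf mirrors l1_size_and_offset[12]: bytes 0-3 hold the new l1_size,
 * bytes 4-11 the new l1_table_offset, exactly as they sit in the header. */
static void pack_l1_update(uint8_t buf[12], uint32_t l1_size, uint64_t l1_offset)
{
    put_be32(buf, l1_size);
    put_be64(buf + 4, l1_offset);
}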
>> +
>> +    /* Modify the image header to point to the refcount table. This will fix all
>> +     * leaks (unless an error occurs). */
>> +    BLKDBG_EVENT(bs->file, BLKDBG_REFTABLE_UPDATE);
>> +
>> +    rt_offset = cpu_to_be64(blob_start);
>> +    ret = bdrv_pwrite_sync(bs->file,
>> +                           offsetof(QCowHeader, refcount_table_offset),
>> +                           &rt_offset, sizeof(rt_offset));
>> +    if (ret < 0) {
>> +        goto out;
>> +    }
> Pivots the refcount table; the old table is now gone and the new table
> is now (potentially) self-consistent (we still require a truncate of any
> clusters already in the image but beyond the size of this new refcount
> table before the refcount table once again correctly describes the
> entire file).
>
>> +
>> +    /* The image is now clean. The only thing left is to adapt the in-memory
>> +     * refcount table to match it. */
>> +    s->refcount_table_offset = blob_start;
>> +    s->refcount_table_size   = reftable_clusters << (s->cluster_bits - 3);
>> +
>> +    for (rbi = 0; rbi < refblocks; rbi++) {
>> +        be64_to_cpus(&reftable[rbi]);
>> +    }
>> +
>> +    g_free(s->refcount_table);
>> +    s->refcount_table = reftable;
>> +    reftable = NULL;
>> +
>> +out:
>> +    g_free(refblock);
>> +    g_free(reftable);
>> +    return ret;
>> +}
> Looks correct.
>
>> +
>> +/*
>> + * Calculates the number of clusters required for an L1 table for an image with
>> + * the given parameters plus the reftable and the refblocks required to cover
>> + * themselves, the L1 table and a given number of clusters @overhead.
>> + *
>> + * @ts: Total number of guest sectors the image provides
>> + * @cb: 1 << @cb is the cluster size in bytes
>> + * @spcb: 1 << @spcb is the number of clusters per sector
>> + * @overhead: Number of clusters which shall additionally be covered by the
>> + *            refcount structures
>> + */
>> +static uint64_t minimal_blob_size(uint64_t ts, int cb, int spcb,
>> +                                  uint64_t overhead)
>> +{
>> +    uint64_t cs = UINT64_C(1) << cb;
>> +    uint64_t spc = UINT64_C(1) << spcb;
>> +
>> +    /* The following statement is a solution for this system of equations:
>> +     *
>> +     * Let req_cls be the required number of clusters required for the reftable,
>> +     * all refblocks and the L1 table.
>> +     *
>> +     * rtc be the clusters required for the reftable, rbc the clusters for all
>> +     * refblocks (i.e., the number of refblocks), l1c the clusters for the L1
>> +     * table and l2c the clusters for all L2 tables (i.e., the number of L2
>> +     * tables).
>> +     *
>> +     * cs be the cluster size (in bytes), ts the total number of sectors, spc
>> +     * the number of sectors per cluster and tc the total number of clusters.
>> +     *
>> +     * overhead is a number of clusters which should additionally be covered by
>> +     * the refcount structures (i.e. all clusters before this blob of req_cls
>> +     * clusters).
>> +     *
>> +     * req_cls >= rtc + rbc + l1c
>> +     *   -- should be obvious
> The blob must be large enough to hold all three tables.
>
>> +     * rbc = ceil((overhead + req_cls) / (cs / 2))
>> +     *   -- as each refblock holds cs/2 entries
> The refblock table must be large enough to describe all host clusters
> (the overhead that occurs before the blob, as well as the blob itself),
> assuming the table is fully utilized, and rounded up to cluster size.
> We could have a smaller refblock table by omitting clusters with all-0
> entries, but it's not worth the complication.
>
>> +     * rtc = ceil(rbc / (cs / 8))
>> +     *   -- as each reftable cluster holds cs/8 entries
> The reftable must be large enough to point to each refblock, and rounded
> up to cluster size.
>
>> +     * tc = ts / spc
>> +     *   -- ts should be a multiple of spc
> Hmm. What guarantees that ts is a multiple of spc? 'qemu-img create -f
> qcow2 blah 512' shows an image with a virtual size of 512 but a cluster
> size of 64k; isn't that a case where ts==1 but spc==128 (spcb==7), so
> tc==0 even though the l2 table would need to describe 1 cluster? But
> your math was done on ceil(ts), not on tc, so even though this comment
> is fishy, I don't think it invalidates your math. I think all that you
> need is to write tc = ceil(ts / spc)

Okay, I just assumed it wouldn't make sense for ts not to be a multiple of spc, but as qcow2 does specify its virtual disk size in bytes, it is entirely possible. I'll adjust this comment.

>> +     * l2c = ceil(tc / (cs / 8))
>> +     *   -- as each L2 table holds cs/8 entries
> Makes sense, for correct tc
>
>> +     * l1c = ceil(l2c / (cs / 8))
>> +     *   -- as each L1 table cluster holds cs/8 entries
> Makes sense, for correct l2c
>
>> +     *
>> +     * The following equation yields a result which satisfies the constraint.
>> +     * The result may be too high, but is never too low.
>> +     *
>> +     * The original calculation (without bitshifting) was:
>> +     *
>> +     * DIV_ROUND_UP(overhead * (1 + cs / 8) +
>> +     *              3 * cs * cs / 16 - 5 * cs / 8 - 1 +
>> +     *              4 * (ts + spc * cs / 8 - 1) / spc,
>> +     *              cs * cs / 16 - cs / 8 - 1)
> Yes, this matches your pdf followup.
>
>> +     *
>> +     */
>> +
>> +    return DIV_ROUND_UP(overhead + (overhead << (cb - 3)) +
>> +                        (3 << (2 * cb - 4)) - (5 << (cb - 3)) - 1 +
>> +                        (4 * (ts + (spc << (cb - 3)) - 1) >> spcb),
>> +                        (1 << (2 * cb - 4)) - cs / 8 - 1);
> And yes, this matches the comment written with division. It could
> definitely be made tighter for the case of overhead large enough where
> we could omit an entire cluster of all-0 refblocks by writing 0 in the
> reftable; but it's not worth the complication.
>
>> +}
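For readers who do not have the pdf at hand, this is the derivation the return statement comes from, transcribed into LaTeX (the letters are mine and need not match the pdf's: r is req_cls, o is overhead, c is cs, s is spc, t_s is ts; tc is taken as ceil(ts / spc), i.e. with Eric's fix folded in):

\begin{align*}
r &\ge r_t + r_b + l_1, &
r_b &= \left\lceil \frac{o + r}{c/2} \right\rceil, &
r_t &= \left\lceil \frac{r_b}{c/8} \right\rceil, \\
t_c &= \left\lceil \frac{t_s}{s} \right\rceil, &
l_2 &= \left\lceil \frac{t_c}{c/8} \right\rceil, &
l_1 &= \left\lceil \frac{l_2}{c/8} \right\rceil.
\end{align*}

Bounding every ceiling by $\lceil a/b \rceil \le (a + b - 1)/b$ and requiring $r$ to be at least the sum of the resulting upper bounds on $r_t$, $r_b$ and $l_1$ yields a linear inequality in $r$,

\begin{equation*}
r \left( \frac{c^2}{16} - \frac{c}{8} - 1 \right)
\ge o \left( 1 + \frac{c}{8} \right) + \frac{3 c^2}{16} - \frac{5 c}{8} - 1
  + \frac{4 \left( t_s + s c / 8 - 1 \right)}{s},
\end{equation*}

and DIV_ROUND_UP(numerator, denominator) is the smallest integer r satisfying it, which is exactly the expression returned above.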
>> +
>>  static int qcow2_make_empty(BlockDriverState *bs)
>>  {
>> +    BDRVQcowState *s = bs->opaque;
>>      int ret = 0;
>> -    uint64_t start_sector;
>> -    int sector_step = INT_MAX / BDRV_SECTOR_SIZE;
>>
>> -    for (start_sector = 0; start_sector < bs->total_sectors;
>> -         start_sector += sector_step)
>> -    {
>> -        /* As this function is generally used after committing an external
>> -         * snapshot, QCOW2_DISCARD_SNAPSHOT seems appropriate. Also, the
>> -         * default action for this kind of discard is to pass the discard,
>> -         * which will ideally result in an actually smaller image file, as
>> -         * is probably desired. */
>> -        ret = qcow2_discard_clusters(bs, start_sector * BDRV_SECTOR_SIZE,
>> -                                     MIN(sector_step,
>> -                                         bs->total_sectors - start_sector),
>> -                                     QCOW2_DISCARD_SNAPSHOT, true);
>> +    if (s->snapshots) {
>> +        uint64_t start_sector;
>> +        int sector_step = INT_MAX / BDRV_SECTOR_SIZE;
>> +
>> +        /* If there are snapshots, every active cluster has to be discarded */
>> +
>> +        for (start_sector = 0; start_sector < bs->total_sectors;
>> +             start_sector += sector_step)
>> +        {
>> +            /* As this function is generally used after committing an external
>> +             * snapshot, QCOW2_DISCARD_SNAPSHOT seems appropriate. Also, the
>> +             * default action for this kind of discard is to pass the discard,
>> +             * which will ideally result in an actually smaller image file, as
>> +             * is probably desired. */
>> +            ret = qcow2_discard_clusters(bs, start_sector * BDRV_SECTOR_SIZE,
>> +                                         MIN(sector_step,
>> +                                             bs->total_sectors - start_sector),
>> +                                         QCOW2_DISCARD_SNAPSHOT, true);
>> +            if (ret < 0) {
>> +                break;
>> +            }
>> +        }
> The fallback is always safe, even if slower.
>
>> +    } else {
>> +        uint64_t min_size_1, min_size_2;
>> +        int64_t blob_start;
>> +        uint64_t blob_end, real_blob_size, clusters;
>> +        uint64_t refblocks, reftable_clusters, l2_tables, l1_clusters;
>> +
>> +        /* If there are no snapshots, we basically want to create a new empty
>> +         * image. This function is however not for creating a new image and
>> +         * renaming it so it shadows the existing but rather for emptying an
>> +         * image, so do exactly that.
>> +         *
>> +         * Therefore, the image should be valid at all points in time and may
>> +         * at worst leak clusters.
> I've convinced myself that you met that constraint - any early exit
> prior to pivoting the reftable probably has leaks, but leaks are easy to
> clean, and the image is consistent at all times.
>
>> +         *
>> +         * Any valid qcow2 image requires an L1 table which is large enough to
>> +         * cover all of the guest cluster range, therefore it is impossible to
>> +         * simply drop the L1 table (make its size 0) and create a minimal
>> +         * refcount structure.
>> +         *
>> +         * Instead, allocate a blob of data which is large enough to hold a new
>> +         * refcount structure (refcount table and all blocks) covering the whole
>> +         * image until the end of that blob, and an empty L1 table covering the
>> +         * whole guest cluster range. Then these structures are initialized and
>> +         * the image header is modified to point to them.
>> +         *
>> +         * Generally, this blob will be allocated at the end of the image (that
>> +         * is, its offset will be greater than its size). If that is indeed the
>> +         * case, the same operation is repeated, but this time the new blob
>> +         * starts right after the image header which will then indeed lead to a
>> +         * minimal image. If this is not the case, the image will be nearly
>> +         * minimal as well, as long as the underlying protocol supports discard.
> Again, depending on whether the all-0 L1 table actually occupies disk
> space determines whether the image is truly minimal in that regards, but
> even if it occupies disk space this description makes sense as
> describing the bare minimum for a qcow2 file that is still valid and
> points to the backing file for all clusters, regardless of how much of
> the underlying file is allocated.
>
>> +         *
>> +         * Note that this implementation never frees allocated clusters. This is
>> +         * because in case of success, the current refcounts are invalid anyway;
> Or more precisely, on success, the current refcounts are thrown away as
> we pivot to the new table of refcounts.
>
>> +         * and in case of failure, it would be too cumbersome to keep track of
>> +         * all allocated cluster ranges and free them then.
>> +         *
>> +         * min_size_1 will contain the number of clusters minimally required for
>> +         * a blob that starts right after the image header; min_size_2 will
>> +         * contain the number of clusters minimally required for the blob which
>> +         * can be allocated based on the existing refcount structure.
>> +         */
>> +
>> +        /* Repeat until a blob could be allocated which is large enough to
>> +         * contain all structures necessary for describing itself. Allocated
>> +         * clusters are not freed, even if they are not suitable, as this would
>> +         * result in exactly the same cluster range being returned during the
>> +         * retry (which is obviously not desirable). In case of success, the
>> +         * current refcounts do not matter anyway; and in case of failure, the
>> +         * clusters are only leaked (which can be easily repaired). */
*= / >> + do { >> + uint64_t fci =3D s->free_cluster_index; >> + >> + /* TODO: Optimize, we do not need refblocks for other par= ts of the >> + * image than the header and this blob */ > Ah, you caught on to the optimization of fewer refblocks. Rather than > wording this as a TODO, I'd just state that any overhead occupied by no= t > being minimal is reclaimed on a successful min_size_1 blob anyways, so > it wasn't worth the complications to the code. > >> + min_size_2 =3D minimal_blob_size(bs->total_sectors, s->cl= uster_bits, >> + s->cluster_bits - BDRV_SEC= TOR_BITS, >> + fci); >> + >> + blob_start =3D qcow2_alloc_clusters(bs, >> + min_size_2 << s->cluste= r_bits); > This call has the side effect of increasing 'fci' for the next iteratio= n > of the do loop, if this iteration didn't guess correctly. > >> + if (blob_start < 0) { >> + return blob_start; >> + } >> + >> + clusters =3D (blob_start >> s->cluster_bits) + m= in_size_2; >> + refblocks =3D DIV_ROUND_UP(clusters, s->cluster_s= ize / 2); >> + reftable_clusters =3D DIV_ROUND_UP(refblocks, s->cluster_= size / 8); >> + l2_tables =3D DIV_ROUND_UP(bs->total_sectors / >> + s->cluster_sectors, >> + s->cluster_size / 8); >> + l1_clusters =3D DIV_ROUND_UP(l2_tables, s->cluster_= size / 8); >> + >> + real_blob_size =3D reftable_clusters + refblocks + l1_clu= sters; >> + } while (min_size_2 < real_blob_size); > But in reality, will the while loop ever be executed more than once? > Since min_size_2 was computed via a nice set of equations to be large > enough, it looks like you are just being paranoid here. Yes, it may be executed more than once. The calculation for the minimal=20 blob size assumes the blob is allocated at $fci. This is however not=20 necessarily the case if the chunk of free space available at $fci is not=20 large enough to hold the blob; in that case, the allocated area may not=20 be large enough (as the overhead (the offset of the allocated area) is=20 bigger than expected). Then, a new iteration is necessary.=20 s->free_cluster_index will have been increased by alloc_clusters_noref()=20 in qcow2-refcount.c to point somewhere after that last allocation so=20 eventually it will point at the end of the image, resulting in a=20 successful allocation with (blob_start >> s->cluster_bits) =3D=3D=20 s->free_cluster_index (the overhead is as expected). >> + >> + ret =3D create_refcount_l1(bs, blob_start, refblocks); >> if (ret < 0) { >> - break; >> + return ret; >> } >> + >> + /* The only overhead is the image header */ >> + min_size_1 =3D minimal_blob_size(bs->total_sectors, s->cluste= r_bits, >> + s->cluster_bits - BDRV_SECTOR_= BITS, 1); >> + >> + /* If we cannot create a new blob before the current one, jus= t discard >> + * the sectors in between and return. Even if the discard doe= s nothing, >> + * only up to min_size_1 clusters plus the refcount blocks fo= r those >> + * are unused. The worst case is therefore an image of double= the size >> + * it needs to be, which is not too bad. */ >> + if ((blob_start >> s->cluster_bits) < 1 + min_size_1) { >> + uint64_t sector, end; >> + int step =3D INT_MAX / BDRV_SECTOR_SIZE; >> + >> + end =3D blob_start >> (s->cluster_bits - BDRV_SECTOR_SIZE= ); >> + >> + /* skip the image header */ >> + for (sector =3D s->cluster_sectors; sector < end; sector = +=3D step) { >> + bdrv_discard(bs->file, sector, MIN(step, end - sector= )); > Why bdrv_discard and not qcow2_discard_clusters(..., > QCOW2_DISCARD_SNAPSHOT, true);? 
>> +
>> +        ret = create_refcount_l1(bs, blob_start, refblocks);
>>          if (ret < 0) {
>> -            break;
>> +            return ret;
>>          }
>> +
>> +        /* The only overhead is the image header */
>> +        min_size_1 = minimal_blob_size(bs->total_sectors, s->cluster_bits,
>> +                                       s->cluster_bits - BDRV_SECTOR_BITS, 1);
>> +
>> +        /* If we cannot create a new blob before the current one, just discard
>> +         * the sectors in between and return. Even if the discard does nothing,
>> +         * only up to min_size_1 clusters plus the refcount blocks for those
>> +         * are unused. The worst case is therefore an image of double the size
>> +         * it needs to be, which is not too bad. */
>> +        if ((blob_start >> s->cluster_bits) < 1 + min_size_1) {
>> +            uint64_t sector, end;
>> +            int step = INT_MAX / BDRV_SECTOR_SIZE;
>> +
>> +            end = blob_start >> (s->cluster_bits - BDRV_SECTOR_SIZE);
>> +
>> +            /* skip the image header */
>> +            for (sector = s->cluster_sectors; sector < end; sector += step) {
>> +                bdrv_discard(bs->file, sector, MIN(step, end - sector));
> Why bdrv_discard and not qcow2_discard_clusters(...,
> QCOW2_DISCARD_SNAPSHOT, true);?

qcow2_discard_clusters() works on guest offsets; the respective clusters are not visible from the guest as the L1 table is empty. Therefore, qcow2_discard_clusters() will not work. Furthermore, the clusters are unallocated from qcow2's point of view, and qcow2_discard_clusters(…, true) should work for allocated clusters only.

> Why can you ignore errors here?

You are correct, I cannot. Or rather, in theory I can, as the image is consistent and the only things left to do are discarding all unused clusters and truncating the image (which are not vital); but it's better to return an error anyway if one occurs. Thus, I'll include it in v7.

>> +            }
>> +
>> +            blob_end = (blob_start + real_blob_size) << s->cluster_bits;
>> +        } else {
>> +            clusters          = 1 + min_size_1;
>> +            refblocks         = DIV_ROUND_UP(clusters, s->cluster_size / 2);
>> +            reftable_clusters = DIV_ROUND_UP(refblocks, s->cluster_size / 8);
>> +            l2_tables         = DIV_ROUND_UP(bs->total_sectors /
>> +                                             s->cluster_sectors,
>> +                                             s->cluster_size / 8);
>> +            l1_clusters       = DIV_ROUND_UP(l2_tables, s->cluster_size / 8);
>> +
>> +            real_blob_size = reftable_clusters + refblocks + l1_clusters;
>> +
>> +            assert(min_size_1 >= real_blob_size);
>> +
>> +            ret = create_refcount_l1(bs, s->cluster_size, refblocks);
>> +            if (ret < 0) {
>> +                return ret;
>> +            }
>> +
>> +            blob_end = (1 + real_blob_size) << s->cluster_bits;
>> +        }
>> +
>> +        ret = bdrv_truncate(bs->file, blob_end);
> The truncate is the key that finally makes the newly used refcount table
> (whether from one call to create_refcount_l1 or two) accurate.

The refcount table is accurate even before the bdrv_truncate(). In fact, it is accurate immediately after it is written. There is random unreferenced data in the image, but judging from the new L1 table and refcount data, these are only unused clusters which can be claimed by allocating clusters (just as unused clusters in a file system may contain any random data). qemu-img check will report a clean image right after create_refcount_l1() returns.

> Note
> that we have a case where if the first attempt at allocating min_size_2
> manages to find room for a blob in the middle of the file rather than at
> the end, and then the second create_refcount_l1() fails (or you don't
> attempt a second one but instead add in error handling for failed
> bdrv_discard), then the resulting qcow2 file is inconsistent in that the
> reftable is smaller than the overall size of the qcow2 image.

Correct. However, it is my understanding that clusters not covered by the refcount table should be treated as unallocated; at least, this is how qemu does it. get_refcount() returns 0 if the index in the refcount table exceeds the refcount table size.

But this is not covered by the qcow2 specification, so perhaps we should add it.

> But as
> long as our image sanity checking code can handle that, and repair the
> image (if I understand correctly, anything beyond the refcount table is
> basically treated the same as any other leaked cluster),

They are not referenced and the reference count is assumed to be 0 (as they are not covered by the refcount table). Therefore, they are not leaked. From qcow2's point of view, they are simply not there.

> I don't think
> it is fatal to this patch.
>
>>      }
>>
>>      return ret;
>>

Thank you again for working through this. I'll make the adjustments you suggested and I confirmed, in v7 (I guess I'll include a diff then ;-)).

Max