From: Max Reitz <mreitz@redhat.com>
To: Eric Blake <eblake@redhat.com>, qemu-devel@nongnu.org
Cc: Kevin Wolf <kwolf@redhat.com>, Peter Lieven <pl@kamp.de>,
	Stefan Hajnoczi <stefanha@redhat.com>
Subject: Re: [Qemu-devel] [PATCH 18/21] qcow2: Add function for refcount order amendment
Date: Wed, 12 Nov 2014 10:55:34 +0100
Message-ID: <54632E96.4070806@redhat.com>
In-Reply-To: <5462DEE8.2010304@redhat.com>

On 2014-11-12 at 05:15, Eric Blake wrote:
> On 11/10/2014 06:45 AM, Max Reitz wrote:
>> Add a function qcow2_change_refcount_order() which allows changing the
>> refcount order of a qcow2 image.
>>
>> Signed-off-by: Max Reitz <mreitz@redhat.com>
>> ---
>>   block/qcow2-refcount.c | 424 +++++++++++++++++++++++++++++++++++++++++++++++++
>>   block/qcow2.h          |   4 +
>>   2 files changed, 428 insertions(+)
> This is fairly big; you may want to get a second review from a
> maintainer rather than blindly trusting me.

I didn't really see the point in splitting it up. Introducing the static
helper functions first would not be very helpful either, so I thought
about what I'd do as a reviewer, and that was "apply the patch and just
read through all the code". Splitting it into multiple patches would not
have helped there (and I don't see how I could split this patch into
logical changes where we start with some inefficient but simple
implementation and improve it over time).

I'll need a review from a maintainer anyway, but I won't get one without 
being able to show another review first...

> My review was not linear, but I left the email in linear order.  Feel
> free to ask for clarification if my presentation is too hard to follow.
>
>> +
>> +/**
>> + * This "operation" for walk_over_reftable() allocates the refblock on disk (if
>> + * it is not empty) and inserts its offset into the new reftable. The size of
>> + * this new reftable is increased as required.
>> + */
>> +static int alloc_refblock(BlockDriverState *bs, uint64_t **reftable,
>> +                          uint64_t reftable_index, uint64_t *reftable_size,
>> +                          void *refblock, bool refblock_empty, Error **errp)
>> +{
>> +    BDRVQcowState *s = bs->opaque;
>> +    int64_t offset;
>> +
>> +    if (!refblock_empty && reftable_index >= *reftable_size) {
>> +        uint64_t *new_reftable;
>> +        uint64_t new_reftable_size;
>> +
>> +        new_reftable_size = ROUND_UP(reftable_index + 1,
>> +                                     s->cluster_size / sizeof(uint64_t));
>> +        if (new_reftable_size > QCOW_MAX_REFTABLE_SIZE / sizeof(uint64_t)) {
>> +            error_setg(errp,
>> +                       "This operation would make the refcount table grow "
>> +                       "beyond the maximum size supported by QEMU, aborting");
>> +            return -ENOTSUP;
>> +        }
>> +
>> +        new_reftable = g_try_realloc(*reftable, new_reftable_size *
>> +                                                sizeof(uint64_t));
> Safe from overflow based on checks a few lines earlier.  Good.
>
>> +        if (!new_reftable) {
>> +            error_setg(errp, "Failed to increase reftable buffer size");
>> +            return -ENOMEM;
>> +        }
>> +
>> +        memset(new_reftable + *reftable_size, 0,
>> +               (new_reftable_size - *reftable_size) * sizeof(uint64_t));
>> +
>> +        *reftable      = new_reftable;
>> +        *reftable_size = new_reftable_size;
> Just to check my math here:
>
> Suppose we have an image with 512-byte clusters, and are changing from
> 16-bit refcount (order 4) to 64-bit refcount.  Also, suppose the
> existing image has exactly filled a one-cluster refcount table (that is,
> there are 64 refcount blocks, each with all 256
> refcount entries full, for a total image size of exactly 8M).  The
> original image occupies a header (1 cluster), L1 and L2 tables, and
> data; but 65 of the 16k clusters tied up in the image are dedicated to
> the refcount structures.
>
> Meanwhile, the new refcount table will have to point to 256 refcount
> blocks, each holding only 64 entries, which in turn implies that the
> refcount table now has to be at least 4 clusters long.  But as this
> requires at least 260 clusters to represent, then even if we were able
> to reuse the 65 clusters of the original table, we'd still be allocating
> at least 195 clusters; in reality, your code doesn't free any old
> clusters until after allocating the new, because it is easier to keep
> the old table live until the new table is populated.  The process of
> allocating the new clusters means we actually end up with a new refcount
> table of 5 clusters long, where not all 320 refblocks will be populated.
>   But as long as we are keeping the old table up-to-date for the refblock
> allocations, it ALSO means that we caused a rollover of the old table
> from 1 cluster into 2, which itself consumes several clusters (the
> larger table must be contiguous, and we must also set up a refblock to
> describe the larger table, so we've added at least three clusters
> associated to the original table during the course of preparing the new
> table).
>
> Hmm - that means I found a bug in your implementation.  See [3] below.
>
>> +    }
>> +
>> +    if (refblock_empty) {
>> +        if (reftable_index < *reftable_size) {
>> +            (*reftable)[reftable_index] = 0;
> Necessary since you used g_try_realloc which leaves the new reftable
> uninitialized.  Reasonable (rather than a memset) since the caller will
> be visiting every single refblock in the table anyways.
>
>> +        }
>> +    } else {
>> +        offset = qcow2_alloc_clusters(bs, s->cluster_size);
> As mentioned above, this action will potentially change
> s->refcount_table_size of the original table, which in turn makes the
> caller execute its loops more often to cover the increased allocation.
> Does qcow2_alloc_clusters() guarantee that the just-allocated cluster is
> zero-initialized (and/or should we add a flag to the function to allow
> the caller to choose whether to force zero allocation instead of leaving
> uninitialized)?  See [4] below for why I ask.
>
>> +        if (offset < 0) {
>> +            error_setg_errno(errp, -offset, "Failed to allocate refblock");
>> +            return offset;
>> +        }
>> +        (*reftable)[reftable_index++] = offset;
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>> +/**
>> + * This "operation" for walk_over_reftable() writes the refblock to disk at the
>> + * offset specified by the new reftable's entry. It does not modify the new
>> + * reftable or change any refcounts.
>> + */
>> +static int flush_refblock(BlockDriverState *bs, uint64_t **reftable,
>> +                          uint64_t reftable_index, uint64_t *reftable_size,
>> +                          void *refblock, bool refblock_empty, Error **errp)
>> +{
>> +    BDRVQcowState *s = bs->opaque;
>> +    int64_t offset;
>> +    int ret;
>> +
>> +    if (refblock_empty) {
>> +        if (reftable_index < *reftable_size) {
>> +            assert((*reftable)[reftable_index] == 0);
>> +        }
>> +    } else {
>> +        /* The first pass with alloc_refblock() made the reftable large enough
>> +         */
>> +        assert(reftable_index < *reftable_size);
> Okay, I see why you couldn't hoist this assert outside of the if - the
> caller may call this with refblock_empty for any refblocks at the tail
> of the final partial reftable cluster.
>
>> +        offset = (*reftable)[reftable_index];
>> +        assert(offset != 0);
>> +
>> +        ret = qcow2_pre_write_overlap_check(bs, 0, offset, s->cluster_size);
>> +        if (ret < 0) {
>> +            error_setg_errno(errp, -ret, "Overlap check failed");
>> +            return ret;
>> +        }
>> +
>> +        ret = bdrv_pwrite(bs->file, offset, refblock, s->cluster_size);
>> +        if (ret < 0) {
>> +            error_setg_errno(errp, -ret, "Failed to write refblock");
>> +            return ret;
> If we fail here, do we leak all clusters written so far?  At least the
> image is still consistent.  After reading further, I think I answered
> myself at point [5].
>
>> +        }
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>> +/**
>> + * This function walks over the existing reftable and every referenced refblock;
>> + * if @new_set_refcount is non-NULL, it is called for every refcount entry to
>> + * create an equal new entry in the passed @new_refblock. Once that
>> + * @new_refblock is completely filled, @operation will be called.
>> + *
>> + * @operation is expected to combine the @new_refblock and its entry in the new
>> + * reftable (which is described by the parameters starting with "reftable").
>> + * @refblock_empty is set if all entries in the refblock are zero.
>> + *
>> + * @status_cb and @cb_opaque are used for the amend operation's status callback.
>> + * @index is the index of the walk_over_reftable() calls and @total is the total
>> + * number of walk_over_reftable() calls per amend operation. Both are used for
>> + * calculating the parameters for the status callback.
> Nice writeup; I was referring to it frequently during review.
>
>> + */
>> +static int walk_over_reftable(BlockDriverState *bs, uint64_t **new_reftable,
>> +                              uint64_t *new_reftable_index,
>> +                              uint64_t *new_reftable_size,
>> +                              void *new_refblock, int new_refblock_size,
>> +                              int new_refcount_bits,
>> +                              int (*operation)(BlockDriverState *bs,
>> +                                               uint64_t **reftable,
>> +                                               uint64_t reftable_index,
>> +                                               uint64_t *reftable_size,
>> +                                               void *refblock,
>> +                                               bool refblock_empty,
>> +                                               Error **errp),
> Worth a typedef?  Maybe not; I managed.
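
Possibly; following the pattern of Qcow2SetRefcountFunc, it would be
something like this (the name is made up, of course):

    typedef int Qcow2RefblockOpFunc(BlockDriverState *bs,
                                    uint64_t **reftable,
                                    uint64_t reftable_index,
                                    uint64_t *reftable_size,
                                    void *refblock, bool refblock_empty,
                                    Error **errp);
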
>
>> +                              Qcow2SetRefcountFunc *new_set_refcount,
>> +                              BlockDriverAmendStatusCB *status_cb,
>> +                              void *cb_opaque, int index, int total,
>> +                              Error **errp)
> After several reads of the patch, I see that this walk function gets
> called twice - first with a NULL new_set_refcount (merely to figure out
> how big the new reftable should be, as well as allocating all necessary
> non-zero refcount blocks, but not committing the top-level reftable to
> any particular file location); the second walk then commits the new
> refcounts to disk (updating each non-zero entry in all the new refcount
> blocks to match their original counterparts, but no allocation
> required).  Pretty slick to ensure that we are sure that the new table
> is feasible before actually swapping over to it, while still allowing a
> fairly clean rollback on early failure.
>
>> +{
>> +    BDRVQcowState *s = bs->opaque;
>> +    uint64_t reftable_index;
>> +    bool new_refblock_empty = true;
>> +    int refblock_index;
>> +    int new_refblock_index = 0;
>> +    int ret;
>> +
>> +    for (reftable_index = 0; reftable_index < s->refcount_table_size;
>> +         reftable_index++)
> Outer loop - for each cluster of the top-level reference table, visit
> each child table and update the status callback.  On the first walk,
> s->refcount_table_size might be increasing during calls to operation().
>
>> +    {
>> +        uint64_t refblock_offset = s->refcount_table[reftable_index]
>> +                                 & REFT_OFFSET_MASK;
>> +
>> +        status_cb(bs, (uint64_t)index * s->refcount_table_size + reftable_index,
>> +                  (uint64_t)total * s->refcount_table_size, cb_opaque);
>> +
> This never quite reaches 100%, and the caller also never reaches 100%.
> I think you want one more call to status_cb() at the end of the loop (at
> either site [1] or [2]) that passes an equal index and total to make it
> obvious that this (portion of the) long-running conversion is complete.
>   Since s->refcount_table_size may grow during the loop, the callback
> does not necessarily have a constant total size; good thing we already
> documented that progress bars need not have a constant total.
>
>> +        if (refblock_offset) {
>> +            void *refblock;
>> +
>> +            if (offset_into_cluster(s, refblock_offset)) {
>> +                qcow2_signal_corruption(bs, true, -1, -1, "Refblock offset %#"
>> +                                        PRIx64 " unaligned (reftable index: %#"
>> +                                        PRIx64 ")", refblock_offset,
>> +                                        reftable_index);
>> +                error_setg(errp,
>> +                           "Image is corrupt (unaligned refblock offset)");
>> +                return -EIO;
>> +            }
>> +
>> +            ret = qcow2_cache_get(bs, s->refcount_block_cache, refblock_offset,
>> +                                  &refblock);
>> +            if (ret < 0) {
>> +                error_setg_errno(errp, -ret, "Failed to retrieve refblock");
>> +                return ret;
>> +            }
>> +
>> +            for (refblock_index = 0; refblock_index < s->refcount_block_size;
>> +                 refblock_index++)
>> +            {
> If a child table (refcount block) exists, visit each refcount entry
> within the table (at least one refcount in that visit should be
> non-empty, otherwise we could garbage-collect the refblock and put a 0
> entry in the outer loop).
>
>> +                uint64_t refcount;
>> +
>> +                if (new_refblock_index >= new_refblock_size) {
>> +                    /* new_refblock is now complete */
>> +                    ret = operation(bs, new_reftable, *new_reftable_index,
>> +                                    new_reftable_size, new_refblock,
>> +                                    new_refblock_empty, errp);
> The new refcount table will either be filled faster than the original
> (when going from small to large refcount - calling operation() multiple
> times per inner loop) or will be filled slower than the original (when
> going from large to small; operation() will only be called after several
> outer loops).
>
>> +                    if (ret < 0) {
>> +                        qcow2_cache_put(bs, s->refcount_block_cache, &refblock);
>> +                        return ret;
>> +                    }
>> +
>> +                    (*new_reftable_index)++;
>> +                    new_refblock_index = 0;
>> +                    new_refblock_empty = true;
>> +                }
>> +
>> +                refcount = s->get_refcount(refblock, refblock_index);
>> +                if (new_refcount_bits < 64 && refcount >> new_refcount_bits) {
> Technically, this get_refcount() call is dead code on the second walk,
> since the first walk already validated things, so you could push all of
> this code...
>
>> +                    uint64_t offset;
>> +
>> +                    qcow2_cache_put(bs, s->refcount_block_cache, &refblock);
>> +
>> +                    offset = ((reftable_index << s->refcount_block_bits)
>> +                              + refblock_index) << s->cluster_bits;
>> +
>> +                    error_setg(errp, "Cannot decrease refcount entry width to "
>> +                               "%i bits: Cluster at offset %#" PRIx64 " has a "
>> +                               "refcount of %" PRIu64, new_refcount_bits,
>> +                               offset, refcount);
>> +                    return -EINVAL;
>> +                }
>> +
>> +                if (new_set_refcount) {
>> +                    new_set_refcount(new_refblock, new_refblock_index++, refcount);
>> +                } else {
> ...here, in the branch only run on the first walk.

Well, yes, but I wanted to keep this function as agnostic as possible to
what the caller wants to do with it. I'd rather decide based on whether
index == 0, because that is a better way of discerning the first walk.

>> +                    new_refblock_index++;
>> +                }
>> +                new_refblock_empty = new_refblock_empty && refcount == 0;
> Worth condensing to 'new_refblock_empty &= !refcount'?  Maybe not.

I personally would find that harder to read.

>> +            }
>> +
>> +            ret = qcow2_cache_put(bs, s->refcount_block_cache, &refblock);
>> +            if (ret < 0) {
>> +                error_setg_errno(errp, -ret, "Failed to put refblock back into "
>> +                                 "the cache");
>> +                return ret;
>> +            }
>> +        } else {
>> +            /* No refblock means every refcount is 0 */
>> +            for (refblock_index = 0; refblock_index < s->refcount_block_size;
>> +                 refblock_index++)
> Again, visiting each (implied) entry for the given refcount block of the
> outer loop.  When enlarging the width, each of the new blocks will also
> be all zero; but when shrinking the width, even though all entries on
> this pass are zero, we may be combining this pass with another outer
> loop with non-zero data for a non-zero block in the resulting new table.
>
>> +            {
>> +                if (new_refblock_index >= new_refblock_size) {
>> +                    /* new_refblock is now complete */
>> +                    ret = operation(bs, new_reftable, *new_reftable_index,
>> +                                    new_reftable_size, new_refblock,
>> +                                    new_refblock_empty, errp);
>> +                    if (ret < 0) {
>> +                        return ret;
>> +                    }
>> +
>> +                    (*new_reftable_index)++;
>> +                    new_refblock_index = 0;
>> +                    new_refblock_empty = true;
>> +                }
>> +
>> +                if (new_set_refcount) {
>> +                    new_set_refcount(new_refblock, new_refblock_index++, 0);
> Would it be worth guaranteeing that every new refblock is 0-initialized
> when allocated, so that you can skip setting a refcount to 0?  This
> question depends on the answer about block allocation asked at [4] above.

This function sets a value in the new_refblock buffer, not in the cluster
on disk. Therefore, in order to omit this call, we would have to memset()
new_refblock to 0 after each call to operation(). I don't think that's
worth it; this is more explicit and won't cost much performance.
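
(That is, the alternative would roughly be to follow every successful
operation() call with something like

    memset(new_refblock, 0, s->cluster_size);

and to drop the explicit zero-setting; I don't think that would make
anything simpler.)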

>> +                } else {
>> +                    new_refblock_index++;
>> +                }
>> +            }
>> +        }
>> +    }
>> +
>> +    if (new_refblock_index > 0) {
>> +        /* Complete the potentially existing partially filled final refblock */
>> +        if (new_set_refcount) {
>> +            for (; new_refblock_index < new_refblock_size;
>> +                 new_refblock_index++)
>> +            {
>> +                new_set_refcount(new_refblock, new_refblock_index, 0);
> Again, if you 0-initialize refblocks when allocated, you could skip this
> (another instance of [4] above).
>
>> +            }
>> +        }
>> +
>> +        ret = operation(bs, new_reftable, *new_reftable_index,
>> +                        new_reftable_size, new_refblock, new_refblock_empty,
>> +                        errp);
>> +        if (ret < 0) {
>> +            return ret;
>> +        }
>> +
>> +        (*new_reftable_index)++;
>> +    }
> site [1] mentioned above, as a good place to make a final status
> callback at 100%.  But if you do it here, it means that we call the
> status callback twice with the same values (the 100% value of the first
> loop is the 0% value of the second loop) - not the end of the world, but
> may impact any testsuite that tracks progress reports.

Well, why not.
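
Something like this at the end of walk_over_reftable() (your site [1])
should do, I think; untested, of course:

    /* Report this walk as complete */
    status_cb(bs, (uint64_t)(index + 1) * s->refcount_table_size,
              (uint64_t)total * s->refcount_table_size, cb_opaque);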

>> +
>> +    return 0;
>> +}
>> +
>> +int qcow2_change_refcount_order(BlockDriverState *bs, int refcount_order,
>> +                                BlockDriverAmendStatusCB *status_cb,
>> +                                void *cb_opaque, Error **errp)
>> +{
>> +    BDRVQcowState *s = bs->opaque;
>> +    Qcow2GetRefcountFunc *new_get_refcount;
>> +    Qcow2SetRefcountFunc *new_set_refcount;
>> +    void *new_refblock = qemu_blockalign(bs->file, s->cluster_size);
>> +    uint64_t *new_reftable = NULL, new_reftable_size = 0;
>> +    uint64_t *old_reftable, old_reftable_size, old_reftable_offset;
>> +    uint64_t new_reftable_index = 0;
>> +    uint64_t i;
>> +    int64_t new_reftable_offset;
>> +    int new_refblock_size, new_refcount_bits = 1 << refcount_order;
>> +    int old_refcount_order;
>> +    int ret;
>> +
>> +    assert(s->qcow_version >= 3);
>> +    assert(refcount_order >= 0 && refcount_order <= 6);
>> +
>> +    /* see qcow2_open() */
>> +    new_refblock_size = 1 << (s->cluster_bits - (refcount_order - 3));
> Safe (cluster_bits is always at least 9, and at most 21 in our current
> implementation, so we are shifting anywhere from 6 to 24 positions).
>
>> +
>> +    get_refcount_functions(refcount_order,
>> +                           &new_get_refcount, &new_set_refcount);
>> +
>> +
>> +    /* First, allocate the structures so they are present in the refcount
>> +     * structures */
>> +    ret = walk_over_reftable(bs, &new_reftable, &new_reftable_index,
>> +                             &new_reftable_size, NULL, new_refblock_size,
>> +                             new_refcount_bits, &alloc_refblock, NULL,
>> +                             status_cb, cb_opaque, 0, 2, errp);
>> +    if (ret < 0) {
>> +        goto done;
>> +    }
>> +
>> +    /* The new_reftable_size is now valid and will not be changed anymore,
>> +     * so we can now allocate the reftable */
>> +    new_reftable_offset = qcow2_alloc_clusters(bs, new_reftable_size *
>> +                                                   sizeof(uint64_t));
> And here is your bug, that I hinted at with the mention of [3] above.
> This allocation can potentially cause an overflow of the existing
> reftable to occupy one more cluster.

There is an additional bug, though: the new reftable may be allocated in
an area covered by an existing refblock which was completely empty (or at
least by a part of an existing refblock which was to be turned into a new
refblock that was then completely empty, and thus omitted from
allocation).

> Remember my thought experiment
> above, how an 8 megabyte image rolls from 1 to 2 clusters during the
> course of allocating refblocks for the new table?  What if the original
> image wasn't completely full, but things are perfectly sized with enough
> free clusters, then all of the refblock allocations done during the
> first walk will still fit, and it is only this final allocation of the
> new reftable that will cause the rollover, at which point we've failed
> to account for the new refblock size.  That is, I think I could craft an
> image that would trigger either an assertion failure or an out-of-bounds
> array access during the second walk.

You're completely right; this will be a pain to fix, though... The
simplest way would probably be to check whether new_reftable_size has to
be increased due to this operation and, if so, rerun walk_over_reftable()
with alloc_refblock() only allocating refblocks for indices where the new
reftable does not already point to one. This would be repeated until
new_reftable_size stays constant. And the really simplest incarnation of
this would be to have a flag indicating whether any allocations were done
and to repeat until everything is stable.
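
Very roughly (assuming alloc_refblock() skips any index for which the new
reftable already has an offset, and that a hypothetical "bool *allocated"
parameter gets passed through to it so it can report whether it allocated
anything):

    bool new_allocation;

    do {
        new_allocation = false;
        new_reftable_index = 0;
        ret = walk_over_reftable(bs, &new_reftable, &new_reftable_index,
                                 &new_reftable_size, NULL,
                                 new_refblock_size, new_refcount_bits,
                                 &alloc_refblock,
                                 &new_allocation, /* new parameter */
                                 NULL, status_cb, cb_opaque, 0, 2, errp);
        if (ret < 0) {
            goto done;
        }
    } while (new_allocation);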

Another way would be to somehow integrate the allocation of the new
reftable into walk_over_reftable() itself and then only have to take care
of the additional reftable entries.

But the first way is probably the correct one: due to the reallocation of
the old reftable, some intermediate refblocks which were empty before are
now filled.

I'll have to craft some test images myself, not least to be able to 
include them in the iotest.

>> +    if (new_reftable_offset < 0) {
>> +        error_setg_errno(errp, -new_reftable_offset,
>> +                         "Failed to allocate the new reftable");
>> +        ret = new_reftable_offset;
>> +        goto done;
> If we fail here, do we leak allocations of the refblocks?  I guess not;
> based on another forward reference to point [5].
>
>> +    }
>> +
>> +    new_reftable_index = 0;
>> +
>> +    /* Second, write the new refblocks */
>> +    ret = walk_over_reftable(bs, &new_reftable, &new_reftable_index,
>> +                             &new_reftable_size, new_refblock,
>> +                             new_refblock_size, new_refcount_bits,
>> +                             &flush_refblock, new_set_refcount,
>> +                             status_cb, cb_opaque, 1, 2, errp);
>> +    if (ret < 0) {
>> +        goto done;
>> +    }
> If we fail here, it looks like we DO leak the clusters allocated for the
> new reftable (again, point [5]).
>
>> +
>> +
>> +    /* Write the new reftable */
>> +    ret = qcow2_pre_write_overlap_check(bs, 0, new_reftable_offset,
>> +                                        new_reftable_size * sizeof(uint64_t));
>> +    if (ret < 0) {
>> +        error_setg_errno(errp, -ret, "Overlap check failed");
>> +        goto done;
>> +    }
>> +
>> +    for (i = 0; i < new_reftable_size; i++) {
>> +        cpu_to_be64s(&new_reftable[i]);
>> +    }
>> +
>> +    ret = bdrv_pwrite(bs->file, new_reftable_offset, new_reftable,
>> +                      new_reftable_size * sizeof(uint64_t));
>> +
>> +    for (i = 0; i < new_reftable_size; i++) {
>> +        be64_to_cpus(&new_reftable[i]);
>> +    }
>> +
>> +    if (ret < 0) {
>> +        error_setg_errno(errp, -ret, "Failed to write the new reftable");
>> +        goto done;
>> +    }
> Looks like you correctly maintain the in-memory copy in preferred cpu
> byte order, while writing to disk in big-endian order.
>
>> +
>> +
>> +    /* Empty the refcount cache */
>> +    ret = qcow2_cache_flush(bs, s->refcount_block_cache);
>> +    if (ret < 0) {
>> +        error_setg_errno(errp, -ret, "Failed to flush the refblock cache");
>> +        goto done;
>> +    }
>> +
>> +    /* Update the image header to point to the new reftable; this only updates
>> +     * the fields which are relevant to qcow2_update_header(); other fields
>> +     * such as s->refcount_table or s->refcount_bits stay stale for now
>> +     * (because we have to restore everything if qcow2_update_header() fails) */
>> +    old_refcount_order  = s->refcount_order;
>> +    old_reftable_size   = s->refcount_table_size;
>> +    old_reftable_offset = s->refcount_table_offset;
>> +
>> +    s->refcount_order        = refcount_order;
>> +    s->refcount_table_size   = new_reftable_size;
>> +    s->refcount_table_offset = new_reftable_offset;
>> +
>> +    ret = qcow2_update_header(bs);
>> +    if (ret < 0) {
>> +        s->refcount_order        = old_refcount_order;
>> +        s->refcount_table_size   = old_reftable_size;
>> +        s->refcount_table_offset = old_reftable_offset;
>> +        error_setg_errno(errp, -ret, "Failed to update the qcow2 header");
>> +        goto done;
>> +    }
> Failures up to here still have issues leaking the new reftable
> allocation (point [5]).
>
>> +
>> +    /* Now update the rest of the in-memory information */
>> +    old_reftable = s->refcount_table;
>> +    s->refcount_table = new_reftable;
>> +
>> +    /* For cleaning up all old refblocks */
>> +    new_reftable      = old_reftable;
>> +    new_reftable_size = old_reftable_size;
>> +
>> +    s->refcount_bits = 1 << refcount_order;
>> +    if (refcount_order < 6) {
>> +        s->refcount_max = (UINT64_C(1) << s->refcount_bits) - 1;
>> +    } else {
>> +        s->refcount_max = INT64_MAX;
>> +    }
> Is it worth factoring this computation into a common helper, since it
> appeared in an earlier patch as well?

Well, it does appear in qcow2_open(), as do the next two lines. The only
reason to factor them out would be so that neither place is forgotten if
one of them is changed; the only change I can imagine, though, is
replacing INT64_MAX with UINT64_MAX at some point in the future, and I
guess that point is still far away (because, as you agreed, there is no
reasonable way anyone would need the 64th bit), so it should be fine as
it is.
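
(If we ever do factor it out, it would just be a small helper along these
lines; the name is made up:

    static void qcow2_update_refcount_params(BDRVQcowState *s,
                                             int refcount_order)
    {
        s->refcount_bits = 1 << refcount_order;
        if (refcount_order < 6) {
            s->refcount_max = (UINT64_C(1) << s->refcount_bits) - 1;
        } else {
            s->refcount_max = INT64_MAX;
        }
        s->refcount_block_bits = s->cluster_bits - (refcount_order - 3);
        s->refcount_block_size = 1 << s->refcount_block_bits;
    }

But as said, I don't think it's necessary yet.)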

>> +
>> +    s->refcount_block_bits = s->cluster_bits - (refcount_order - 3);
>> +    s->refcount_block_size = 1 << s->refcount_block_bits;
>> +
>> +    s->get_refcount = new_get_refcount;
>> +    s->set_refcount = new_set_refcount;
>> +
>> +    /* And free the old reftable (the old refblocks are freed below the "done"
>> +     * label) */
>> +    qcow2_free_clusters(bs, old_reftable_offset,
>> +                        old_reftable_size * sizeof(uint64_t),
>> +                        QCOW2_DISCARD_NEVER);
> site [2] mentioned above, as a possible point where you might want to
> ensure the callback is called with equal progress and total values to
> ensure the caller knows the job is done.  Except this site doesn't have
> quite as much information as site [1] about what total size all the
> other status callbacks were using.
>
>> +
>> +done:
>> +    if (new_reftable) {
>> +        /* On success, new_reftable actually points to the old reftable (and
>> +         * new_reftable_size is the old reftable's size); but that is just
>> +         * fine */
>> +        for (i = 0; i < new_reftable_size; i++) {
>> +            uint64_t offset = new_reftable[i] & REFT_OFFSET_MASK;
>> +            if (offset) {
>> +                qcow2_free_clusters(bs, offset, s->cluster_size,
>> +                                    QCOW2_DISCARD_NEVER);
>> +            }
>> +        }
>> +        g_free(new_reftable);
> So here is point [5] - if we failed early, this tries to clean up all
> allocated refblocks associated with the new table.  It does NOT clean up
> any refblocks allocated due to resizing the old table to be slightly
> larger, but that should be fine (not a leak, so much as an image that is
> now a couple clusters larger than the minimum required size).  However,
> while you clean up the clusters associated with refblocks (layer 2), the
> cleanup of old clusters associated with the reftable (layer 1) happened
> before the done: label on success, but that means that on failure, you
> are NOT cleaning up the clusters associated with the new reftable.

Oops, right, will fix.
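
The simplest fix is probably to hand new_reftable_offset over to the
cleanup path on success as well (just like new_reftable itself), keep it
initialized to 0, and then free it below done: whenever it is set,
roughly:

    if (new_reftable_offset > 0) {
        qcow2_free_clusters(bs, new_reftable_offset,
                            new_reftable_size * sizeof(uint64_t),
                            QCOW2_DISCARD_NEVER);
    }

On success, that would then free the old reftable, so the explicit
qcow2_free_clusters() call before the done: label can go away.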

>> +    }
>> +
>> +    qemu_vfree(new_refblock);
>> +    return ret;
>> +}
>> diff --git a/block/qcow2.h b/block/qcow2.h
>> index fe12c54..5b96519 100644
>> --- a/block/qcow2.h
>> +++ b/block/qcow2.h
>> @@ -526,6 +526,10 @@ int qcow2_check_metadata_overlap(BlockDriverState *bs, int ign, int64_t offset,
>>   int qcow2_pre_write_overlap_check(BlockDriverState *bs, int ign, int64_t offset,
>>                                     int64_t size);
>>   
>> +int qcow2_change_refcount_order(BlockDriverState *bs, int refcount_order,
>> +                                BlockDriverAmendStatusCB *status_cb,
>> +                                void *cb_opaque, Error **errp);
>> +
>>   /* qcow2-cluster.c functions */
>>   int qcow2_grow_l1_table(BlockDriverState *bs, uint64_t min_size,
>>                           bool exact_size);
>>
> Interesting patch.  Hope my review helps you prepare a better v2.

If all else fails, I'll just split the amend stuff off from this series.
But I'll work it out somehow. And your review will definitely help,
thanks a lot!

Max

