Re: [Qemu-devel] [PATCH v5 08/11] qcow2: Rebuild refcount structure during check

qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed

From: Max Reitz <mreitz@redhat.com>
To: Eric Blake <eblake@redhat.com>, qemu-devel@nongnu.org
Cc: "Kevin Wolf" <kwolf@redhat.com>,
	"Stefan Hajnoczi" <stefanha@redhat.com>,
	"Benoît Canet" <benoit.canet@nodalink.com>
Subject: Re: [Qemu-devel] [PATCH v5 08/11] qcow2: Rebuild refcount structure during check
Date: Sat, 11 Oct 2014 12:17:20 +0200	[thread overview]
Message-ID: <543903B0.3080501@redhat.com> (raw)
In-Reply-To: <5435C414.5030203@redhat.com>

Am 09.10.2014 um 01:09 schrieb Eric Blake:
> On 08/29/2014 03:41 PM, Max Reitz wrote:
>> The previous commit introduced the "rebuild" variable to qcow2's
>> implementation of the image consistency check. Now make use of this by
>> adding a function which creates a completely new refcount structure
>> based solely on the in-memory information gathered before.
>>
>> The old refcount structure will be leaked, however.
> Might be worth mentioning in the commit message that a later commit will
> deal with the leak.
>
>> Signed-off-by: Max Reitz <mreitz@redhat.com>
>> ---
>>   block/qcow2-refcount.c | 286 ++++++++++++++++++++++++++++++++++++++++++++++++-
>>   1 file changed, 283 insertions(+), 3 deletions(-)
>>
>> diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
>> index 6300cec..318c152 100644
>> --- a/block/qcow2-refcount.c
>> +++ b/block/qcow2-refcount.c
>> @@ -1603,6 +1603,266 @@ static void compare_refcounts(BlockDriverState *bs, BdrvCheckResult *res,
>>   }
>>   
>>   /*
>> + * Allocates a cluster using an in-memory refcount table (IMRT) in contrast to
>> + * the on-disk refcount structures.
>> + *
>> + * *first_free_cluster does not necessarily point to the first free cluster, but
>> + * may point to one cluster as close as possible before it. The offset returned
>> + * will never be before that cluster.
> Took me a couple reads of the comment and code to understand that.  If
> I'm correct, this alternative wording may be better:
>
> On input, *first_free_cluster tells where to start looking, and need not
> actually be a free cluster; the returned offset will not be before that
> cluster.  On output, *first_free_cluster points to the actual first free
> cluster found.
>
> Or, depending on the semantics you intended [1]:
>
> On input, *first_free_cluster tells where to start looking, and need not
> actually be a free cluster; the returned offset will not be before that
> cluster.  On output, *first_free_cluster points to the first gap found,
> even if that gap was too small to be used as the returned offset.

Yes, *first_free_cluster has nothing to do with the allocated cluster 
range. The offset of that allocated range will be returned by the 
function. *first_free_cluster should merely always point somewhere 
before or at the first gap (of any size) just so alloc_clusters_imrt() 
does not have to start searching at the beginning of the IMRT next time 
it is called. However, this also makes it useful to limit the search of 
the cluster range to the end of the IMRT (or even after the end) by the 
caller setting it to some arbitrary value, so it has a dual-use.

>> + *
>> + * Note that *first_free_cluster is a cluster index whereas the return value is
>> + * an offset.
>> + */
>> +static int64_t alloc_clusters_imrt(BlockDriverState *bs,
>> +                                   int cluster_count,
>> +                                   uint16_t **refcount_table,
>> +                                   int64_t *nb_clusters,
>> +                                   int64_t *first_free_cluster)
>> +{
>> +    BDRVQcowState *s = bs->opaque;
>> +    int64_t cluster = *first_free_cluster, i;
>> +    bool first_gap = true;
>> +    int contiguous_free_clusters;
>> +
>> +    /* Starting at *first_free_cluster, find a range of at least cluster_count
>> +     * continuously free clusters */
>> +    for (contiguous_free_clusters = 0;
>> +         cluster < *nb_clusters && contiguous_free_clusters < cluster_count;
>> +         cluster++)
>> +    {
>> +        if (!(*refcount_table)[cluster]) {
>> +            contiguous_free_clusters++;
>> +            if (first_gap) {
>> +                /* If this is the first free cluster found, update
>> +                 * *first_free_cluster accordingly */
>> +                *first_free_cluster = cluster;
>> +                first_gap = false;
>> +            }
>> +        } else if (contiguous_free_clusters) {
>> +            contiguous_free_clusters = 0;
>> +        }
> [1] Should you be resetting first_gap in the 'else'?  If you don't, then
> *first_free_cluster is NOT the start of the cluster just allocated, but
> the first free cluster encountered on the way to the eventual
> allocation.  I guess it depends on how the callers are using the
> information; since the function is static, I guess I'll find out later
> in my review.

Yes, *first_free_cluster is not that allocated cluster. I don't keep the 
offset in a dedicated variable, because it'll always be cluster - 
contiguous_free_clusters (see the next comment).

>> +    }
>> +
>> +    /* If contiguous_free_clusters is greater than zero, it contains the number
>> +     * of continuously free clusters until the current cluster; the first free
>> +     * cluster in the current "gap" is therefore
>> +     * cluster - contiguous_free_clusters */
>> +
>> +    /* If no such range could be found, grow the in-memory refcount table
>> +     * accordingly to append free clusters at the end of the image */
>> +    if (contiguous_free_clusters < cluster_count) {
>> +        int64_t old_nb_clusters = *nb_clusters;
>> +
>> +        /* There already is a gap of contiguous_free_clusters; we need
> s/gap/tail/, since we are at the end of the table?

Well, tail doesn't imply that it's empty. I could change it to 
"contiguous_free_clusters clusters are already empty at the image end".

>> +         * cluster_count clusters; therefore, we have to allocate
>> +         * cluster_count - contiguous_free_clusters new clusters at the end of
>> +         * the image (which is the current value of cluster; note that cluster
>> +         * may exceed old_nb_clusters if *first_free_cluster pointed beyond the
>> +         * image end) */
>> +        *nb_clusters = cluster + cluster_count - contiguous_free_clusters;
>> +        *refcount_table = g_try_realloc(*refcount_table,
>> +                                        *nb_clusters * sizeof(uint16_t));
>> +        if (!*refcount_table) {
>> +            return -ENOMEM;
>> +        }
>> +
>> +        memset(*refcount_table + old_nb_clusters, 0,
>> +               (*nb_clusters - old_nb_clusters) * sizeof(uint16_t));
> Is this calculation unnecessarily hard-coded to refcount_order==4?

Seems like it. Shame on me. ;-)

>> +    }
>> +
>> +    /* Go back to the first free cluster */
>> +    cluster -= contiguous_free_clusters;
>> +    for (i = 0; i < cluster_count; i++) {
>> +        (*refcount_table)[cluster + i] = 1;
>> +    }
>> +
>> +    return cluster << s->cluster_bits;
>> +}
>> +
>> +/*
>> + * Creates a new refcount structure based solely on the in-memory information
>> + * given through *refcount_table. All necessary allocations will be reflected
>> + * in that array.
>> + *
>> + * On success, the old refcount structure is leaked (it will be covered by the
>> + * new refcount structure).
>> + */
>> +static int rebuild_refcount_structure(BlockDriverState *bs,
>> +                                      BdrvCheckResult *res,
>> +                                      uint16_t **refcount_table,
>> +                                      int64_t *nb_clusters)
>> +{
>> +    BDRVQcowState *s = bs->opaque;
>> +    int64_t first_free_cluster = 0, rt_ofs = -1, cluster = 0;
>> +    int64_t rb_ofs, rb_start, rb_index;
>> +    uint32_t reftable_size = 0;
>> +    uint64_t *reftable = NULL;
>> +    uint16_t *on_disk_rb;
>> +    int i, ret = 0;
> ret is 0...
>
>> +    struct {
>> +        uint64_t rt_offset;
>> +        uint32_t rt_clusters;
>> +    } QEMU_PACKED rt_offset_and_clusters;
>> +
>> +    qcow2_cache_empty(bs, s->refcount_block_cache);
>> +
>> +write_refblocks:
>> +    for (; cluster < *nb_clusters; cluster++) {
>> +        if (!(*refcount_table)[cluster]) {
>> +            continue;
>> +        }
>> +
>> +        rb_index = cluster >> s->refcount_block_bits;
>> +        rb_start = rb_index << s->refcount_block_bits;
>> +
>> +        /* Don't allocate a cluster in a refblock already written to disk */
>> +        if (first_free_cluster < rb_start) {
>> +            first_free_cluster = rb_start;
>> +        }
>> +        rb_ofs = alloc_clusters_imrt(bs, 1, refcount_table, nb_clusters,
>> +                                     &first_free_cluster);
> [1] looking back at my earlier question, you are starting each iteration
> no earlier than the current rb_start.  But if you end up jumping back to
> write_refblocks, are you guaranteed that rb_start is safely far enough
> into the file, even if first_free_cluster is pointing to a gap that was
> too small for an allocation?

"cluster" is never decreased, it only grows. Therefore, rb_start will 
also grow or at least stay the same. The if condition before the 
alloc_clusters_imrt() call should keep first_free_clusters far enough 
into the file, shouldn't it? Other than that, what do you mean by 
"safely"? All that's important is that we don't allocate a cluster 
before the current refblock ("Don't allocate a cluster in a refblock 
already written to disk"). Apart from that, all allocated offsets are 
fine; and alloc_clusters_imrt() will always return a correctly allocated 
range, at or after first_free_clusters, so there shouldn't be any problem.

>> +        if (rb_ofs < 0) {
>> +            fprintf(stderr, "ERROR allocating refblock: %s\n", strerror(-ret));
> ...but if we hit this error on the first time through the for loop,
> strerror(0) is NOT what you meant to print.  Did you mean
> strerror(-rb_ofs) here?

Yes, it should be -rb_ofs.

>> +            res->check_errors++;
>> +            ret = rb_ofs;
> Narrowing from int64_t to int; but I guess we know that if rb_ofs < 0,
> it is only -1, and not something weird like -0x100000000.  Is the goal
> that ret is -1/0, or are you trying to encode negative errno values in
> the return?

alloc_clusters_imrt() returns -errno (0/-ENOMEM only, actually), so this 
is -errno as well.

>> +            goto fail;
>> +        }
>> +
>> +        if (reftable_size <= rb_index) {
>> +            uint32_t old_rt_size = reftable_size;
>> +            reftable_size = ROUND_UP((rb_index + 1) * sizeof(uint64_t),
>> +                                     s->cluster_size) / sizeof(uint64_t);
>> +            reftable = g_try_realloc(reftable,
>> +                                     reftable_size * sizeof(uint64_t));
>> +            if (!reftable) {
>> +                res->check_errors++;
>> +                ret = -ENOMEM;
>> +                goto fail;
>> +            }
>> +
>> +            memset(reftable + old_rt_size, 0,
>> +                   (reftable_size - old_rt_size) * sizeof(uint64_t));
>> +
>> +            /* The offset we have for the reftable is now no longer valid;
>> +             * this will leak that range, but we can easily fix that by running
>> +             * a leak-fixing check after this rebuild operation */
>> +            rt_ofs = -1;
>> +        }
>> +        reftable[rb_index] = rb_ofs;
>> +
>> +        /* If this is apparently the last refblock (for now), try to squeeze the
>> +         * reftable in */
>> +        if (rb_index == (*nb_clusters - 1) >> s->refcount_block_bits &&
>> +            rt_ofs < 0)
>> +        {
>> +            rt_ofs = alloc_clusters_imrt(bs, size_to_clusters(s, reftable_size *
>> +                                                              sizeof(uint64_t)),
>> +                                         refcount_table, nb_clusters,
>> +                                         &first_free_cluster);
>> +            if (rt_ofs < 0) {
>> +                fprintf(stderr, "ERROR allocating reftable: %s\n",
>> +                        strerror(-ret));
> Again, -ret looks wrong here.

Yes, should be -rt_ofs.

>> +                res->check_errors++;
>> +                ret = rt_ofs;
>> +                goto fail;
>> +            }
>> +        }
>> +
>> +        ret = qcow2_pre_write_overlap_check(bs, 0, rb_ofs, s->cluster_size);
>> +        if (ret < 0) {
>> +            fprintf(stderr, "ERROR writing refblock: %s\n", strerror(-ret));
>> +            goto fail;
>> +        }
>> +
>> +        on_disk_rb = g_malloc0(s->cluster_size);
> Why g_try_malloc earlier, but abort()ing g_malloc0 here?

I use g_try_realloc/g_try_malloc for the reftable and g_malloc for the 
refblocks. The reftable can be arbitrarily large; a refblock is pretty 
limited in size (it's a cluster). If g_malloc fails because there's no 
room for a single cluster anymore, two things will happen: (1) The whole 
qcow2 driver will explode (because it still uses g_malloc for most if 
not all cluster allocations, afaik); (2) Linux will kill qemu because of 
OOM, we won't even be able to catch that by using g_try_malloc. As far 
as I've seen, using the try variants is only really useful if you may 
have some absolutely absurd size value.

>> +        for (i = 0; i < s->cluster_size / sizeof(uint16_t) &&
>> +                    rb_start + i < *nb_clusters; i++)
>> +        {
>> +            on_disk_rb[i] = cpu_to_be16((*refcount_table)[rb_start + i]);
>> +        }
>> +
>> +        ret = bdrv_write(bs->file, rb_ofs / BDRV_SECTOR_SIZE,
>> +                         (void *)on_disk_rb, s->cluster_sectors);
>> +        g_free(on_disk_rb);
>> +        if (ret < 0) {
>> +            fprintf(stderr, "ERROR writing refblock: %s\n", strerror(-ret));
>> +            goto fail;
>> +        }
>> +
>> +        /* Go to the end of this refblock */
>> +        cluster = rb_start + s->cluster_size / sizeof(uint16_t) - 1;
>> +    }
>> +
>> +    if (rt_ofs < 0) {
>> +        int64_t post_rb_start = ROUND_UP(*nb_clusters,
>> +                                         s->cluster_size / sizeof(uint16_t));
>> +
>> +        /* Not pretty but simple */
>> +        if (first_free_cluster < post_rb_start) {
>> +            first_free_cluster = post_rb_start;
>> +        }
>> +        rt_ofs = alloc_clusters_imrt(bs, size_to_clusters(s, reftable_size *
>> +                                                          sizeof(uint64_t)),
>> +                                     refcount_table, nb_clusters,
>> +                                     &first_free_cluster);
>> +        if (rt_ofs < 0) {
>> +            fprintf(stderr, "ERROR allocating reftable: %s\n", strerror(-ret));
> Another wrong -ret?

I guess it's just a habit to type strerror(-ret)...

next prev parent reply	other threads:[~2014-10-11 10:17 UTC|newest]

Thread overview: 34+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-08-29 21:40 [Qemu-devel] [PATCH v5 00/11] qcow2: Fix image repairing Max Reitz
2014-08-29 21:40 ` [Qemu-devel] [PATCH v5 01/11] qcow2: Calculate refcount block entry count Max Reitz
2014-08-29 23:03   ` Eric Blake
2014-09-02 18:56     ` Max Reitz
2014-10-10 12:29   ` Benoît Canet
2014-10-11 10:27     ` Max Reitz
2014-10-20 14:25   ` Kevin Wolf
2014-10-20 14:39     ` Max Reitz
2014-10-20 14:48       ` Kevin Wolf
2014-10-21 16:26         ` Max Reitz
2014-10-22  8:27           ` Kevin Wolf
2014-08-29 21:40 ` [Qemu-devel] [PATCH v5 02/11] qcow2: Fix leaks in dirty images Max Reitz
2014-10-20 14:28   ` Kevin Wolf
2014-08-29 21:40 ` [Qemu-devel] [PATCH v5 03/11] qcow2: Split qcow2_check_refcounts() Max Reitz
2014-10-20 14:45   ` Kevin Wolf
2014-08-29 21:40 ` [Qemu-devel] [PATCH v5 04/11] qcow2: Pull check_refblocks() up Max Reitz
2014-08-29 21:40 ` [Qemu-devel] [PATCH v5 05/11] qcow2: Reuse refcount table in calculate_refcounts() Max Reitz
2014-08-29 21:40 ` [Qemu-devel] [PATCH v5 06/11] qcow2: Fix refcount blocks beyond image end Max Reitz
2014-08-29 21:40 ` [Qemu-devel] [PATCH v5 07/11] qcow2: Do not perform potentially damaging repairs Max Reitz
2014-08-29 21:41 ` [Qemu-devel] [PATCH v5 08/11] qcow2: Rebuild refcount structure during check Max Reitz
2014-10-08 23:09   ` Eric Blake
2014-10-11 10:17     ` Max Reitz [this message]
2014-10-16 15:27     ` Max Reitz
2014-10-16 15:33       ` Eric Blake
2014-10-10 12:44   ` Benoît Canet
2014-10-11 10:27     ` Max Reitz
2014-10-11 18:56   ` Benoît Canet
2014-10-12  7:32     ` Max Reitz
2014-08-29 21:41 ` [Qemu-devel] [PATCH v5 09/11] qcow2: Clean up after refcount rebuild Max Reitz
2014-08-29 21:41 ` [Qemu-devel] [PATCH v5 10/11] iotests: Fix test outputs Max Reitz
2014-09-02 19:49   ` Eric Blake
2014-08-29 21:41 ` [Qemu-devel] [PATCH v5 11/11] iotests: Add test for potentially damaging repairs Max Reitz
2014-09-02 19:52   ` Eric Blake
2014-10-08 19:25 ` [Qemu-devel] [PATCH v5 00/11] qcow2: Fix image repairing Max Reitz

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=543903B0.3080501@redhat.com \
    --to=mreitz@redhat.com \
    --cc=benoit.canet@nodalink.com \
    --cc=eblake@redhat.com \
    --cc=kwolf@redhat.com \
    --cc=qemu-devel@nongnu.org \
    --cc=stefanha@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).