Re: [Qemu-devel] [PATCH v6 08/11] qcow2: Rebuild refcount structure during check

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Max Reitz <mreitz@redhat.com>
To: Kevin Wolf <kwolf@redhat.com>
Cc: qemu-devel@nongnu.org, "Stefan Hajnoczi" <stefanha@redhat.com>,
	"Benoît Canet" <benoit.canet@nodalink.com>
Subject: Re: [Qemu-devel] [PATCH v6 08/11] qcow2: Rebuild refcount structure during check
Date: Tue, 21 Oct 2014 11:52:41 +0200	[thread overview]
Message-ID: <54462CE9.4020909@redhat.com> (raw)
In-Reply-To: <20141021093125.GC4409@noname.redhat.com>

On 2014-10-21 at 11:31, Kevin Wolf wrote:
> Am 20.10.2014 um 16:35 hat Max Reitz geschrieben:
>> The previous commit introduced the "rebuild" variable to qcow2's
>> implementation of the image consistency check. Now make use of this by
>> adding a function which creates a completely new refcount structure
>> based solely on the in-memory information gathered before.
>>
>> The old refcount structure will be leaked, however. This leak will be
>> dealt with in a follow-up commit.
>>
>> Signed-off-by: Max Reitz <mreitz@redhat.com>
>> ---
>>   block/qcow2-refcount.c | 296 ++++++++++++++++++++++++++++++++++++++++++++++++-
>>   1 file changed, 293 insertions(+), 3 deletions(-)
>>
>> diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
>> index 183fc5b..75e726b 100644
>> --- a/block/qcow2-refcount.c
>> +++ b/block/qcow2-refcount.c
>> @@ -1642,6 +1642,276 @@ static void compare_refcounts(BlockDriverState *bs, BdrvCheckResult *res,
>>   }
>>   
>>   /*
>> + * Allocates a cluster using an in-memory refcount table (IMRT) in contrast to
> s/a cluster/clusters/
>
>> + * the on-disk refcount structures.
>> + *
>> + * On input, *first_free_cluster tells where to start looking, and need not
>> + * actually be a free cluster; the returned offset will not be before that
>> + * cluster.  On output, *first_free_cluster points to the first gap found, even
>> + * if that gap was too small to be used as the returned offset.
>> + *
>> + * Note that *first_free_cluster is a cluster index whereas the return value is
>> + * an offset.
>> + */
>> +static int64_t alloc_clusters_imrt(BlockDriverState *bs,
>> +                                   int cluster_count,
>> +                                   uint16_t **refcount_table,
>> +                                   int64_t *nb_clusters,
> Having cluster_count and nb_clusters at the same time is confusing.
>
> Maybe imrt_nb_clusters for this one?

Sounds good.

>> +                                   int64_t *first_free_cluster)
>> +{
>> +    BDRVQcowState *s = bs->opaque;
>> +    int64_t cluster = *first_free_cluster, i;
>> +    bool first_gap = true;
>> +    int contiguous_free_clusters;
>> +
>> +    /* Starting at *first_free_cluster, find a range of at least cluster_count
>> +     * continuously free clusters */
>> +    for (contiguous_free_clusters = 0;
>> +         cluster < *nb_clusters && contiguous_free_clusters < cluster_count;
>> +         cluster++)
>> +    {
>> +        if (!(*refcount_table)[cluster]) {
>> +            contiguous_free_clusters++;
>> +            if (first_gap) {
>> +                /* If this is the first free cluster found, update
>> +                 * *first_free_cluster accordingly */
>> +                *first_free_cluster = cluster;
>> +                first_gap = false;
>> +            }
>> +        } else if (contiguous_free_clusters) {
>> +            contiguous_free_clusters = 0;
>> +        }
>> +    }
>> +
>> +    /* If contiguous_free_clusters is greater than zero, it contains the number
>> +     * of continuously free clusters until the current cluster; the first free
>> +     * cluster in the current "gap" is therefore
>> +     * cluster - contiguous_free_clusters */
>> +
>> +    /* If no such range could be found, grow the in-memory refcount table
>> +     * accordingly to append free clusters at the end of the image */
>> +    if (contiguous_free_clusters < cluster_count) {
>> +        int64_t old_nb_clusters = *nb_clusters;
>> +
>> +        /* contiguous_free_clusters clusters are already empty at the image end;
>> +         * we need cluster_count clusters; therefore, we have to allocate
>> +         * cluster_count - contiguous_free_clusters new clusters at the end of
>> +         * the image (which is the current value of cluster; note that cluster
>> +         * may exceed old_nb_clusters if *first_free_cluster pointed beyond the
>> +         * image end) */
>> +        *nb_clusters = cluster + cluster_count - contiguous_free_clusters;
>> +        *refcount_table = g_try_realloc(*refcount_table,
>> +                                        *nb_clusters * sizeof(uint16_t));
>> +        if (!*refcount_table) {
>> +            return -ENOMEM;
> This means that on failure the passed refcount table pointer may be
> overwritten by NULL. This is a surprising interface and should at least
> be mentioned in the function comment.
>
> But as the original IMRT is leaked here, too, it's probably better to
> change the interface to leave *refcount_table alone if the failure case.

Okay, I had to look it up, but realloc() does not free the given pointer 
on failure (and I assume g_try_realloc() does not either). Therefore, 
yes, we should just leave *refcount_table as it is.

>> +        }
>> +
>> +        memset(*refcount_table + old_nb_clusters, 0,
>> +               (*nb_clusters - old_nb_clusters) * sizeof(uint16_t));
>> +    }
>> +
>> +    /* Go back to the first free cluster */
>> +    cluster -= contiguous_free_clusters;
>> +    for (i = 0; i < cluster_count; i++) {
>> +        (*refcount_table)[cluster + i] = 1;
>> +    }
>> +
>> +    return cluster << s->cluster_bits;
>> +}
>> +
>> +/*
>> + * Creates a new refcount structure based solely on the in-memory information
>> + * given through *refcount_table. All necessary allocations will be reflected
>> + * in that array.
>> + *
>> + * On success, the old refcount structure is leaked (it will be covered by the
>> + * new refcount structure).
>> + */
>> +static int rebuild_refcount_structure(BlockDriverState *bs,
>> +                                      BdrvCheckResult *res,
>> +                                      uint16_t **refcount_table,
>> +                                      int64_t *nb_clusters)
>> +{
>> +    BDRVQcowState *s = bs->opaque;
>> +    int64_t first_free_cluster = 0, reftable_offset = -1, cluster = 0;
>> +    int64_t refblock_offset, refblock_start, refblock_index;
>> +    uint32_t reftable_size = 0;
>> +    uint64_t *reftable = NULL;
> refcount_table and reftable? Seriously?

Reviewing would have been too easy otherwise. *cough*

One option is s/refcount_table/imrt/. I like it more than the second 
option, but it simply goes against the current naming convention.

The other option would be s/reftable/on_disk_reftable/, following the 
convention used in the next line.

>> +    uint16_t *on_disk_refblock;
>> +    int i, ret = 0;
>> +    struct {
>> +        uint64_t reftable_offset;
>> +        uint32_t reftable_clusters;
>> +    } QEMU_PACKED reftable_offset_and_clusters;
>> +
>> +    qcow2_cache_empty(bs, s->refcount_block_cache);
>> +
>> +write_refblocks:
>> +    for (; cluster < *nb_clusters; cluster++) {
>> +        if (!(*refcount_table)[cluster]) {
>> +            continue;
>> +        }
>> +
>> +        refblock_index = cluster >> s->refcount_block_bits;
>> +        refblock_start = refblock_index << s->refcount_block_bits;
>> +
>> +        /* Don't allocate a cluster in a refblock already written to disk */
>> +        if (first_free_cluster < refblock_start) {
>> +            first_free_cluster = refblock_start;
>> +        }
>> +        refblock_offset = alloc_clusters_imrt(bs, 1, refcount_table,
>> +                                              nb_clusters, &first_free_cluster);
>> +        if (refblock_offset < 0) {
>> +            fprintf(stderr, "ERROR allocating refblock: %s\n",
>> +                    strerror(-refblock_offset));
>> +            res->check_errors++;
>> +            ret = refblock_offset;
>> +            goto fail;
>> +        }
>> +
>> +        if (reftable_size <= refblock_index) {
>> +            uint32_t old_rt_size = reftable_size;
>> +            reftable_size = ROUND_UP((refblock_index + 1) * sizeof(uint64_t),
>> +                                     s->cluster_size) / sizeof(uint64_t);
>> +            reftable = g_try_realloc(reftable,
>> +                                     reftable_size * sizeof(uint64_t));
> Another leak here.

Will fix.

>> +            if (!reftable) {
>> +                res->check_errors++;
>> +                ret = -ENOMEM;
>> +                goto fail;
>> +            }
>> +
>> +            memset(reftable + old_rt_size, 0,
>> +                   (reftable_size - old_rt_size) * sizeof(uint64_t));
>> +
>> +            /* The offset we have for the reftable is now no longer valid;
>> +             * this will leak that range, but we can easily fix that by running
>> +             * a leak-fixing check after this rebuild operation */
>> +            reftable_offset = -1;
>> +        }
>> +        reftable[refblock_index] = refblock_offset;
>> +
>> +        /* If this is apparently the last refblock (for now), try to squeeze the
>> +         * reftable in */
>> +        if (refblock_index == (*nb_clusters - 1) >> s->refcount_block_bits &&
>> +            reftable_offset < 0)
>> +        {
>> +            uint64_t reftable_clusters = size_to_clusters(s, reftable_size *
>> +                                                          sizeof(uint64_t));
>> +            reftable_offset = alloc_clusters_imrt(bs, reftable_clusters,
>> +                                                  refcount_table, nb_clusters,
>> +                                                  &first_free_cluster);
>> +            if (reftable_offset < 0) {
>> +                fprintf(stderr, "ERROR allocating reftable: %s\n",
>> +                        strerror(-reftable_offset));
>> +                res->check_errors++;
>> +                ret = reftable_offset;
>> +                goto fail;
>> +            }
>> +        }
>> +
>> +        ret = qcow2_pre_write_overlap_check(bs, 0, refblock_offset,
>> +                                            s->cluster_size);
>> +        if (ret < 0) {
>> +            fprintf(stderr, "ERROR writing refblock: %s\n", strerror(-ret));
>> +            goto fail;
>> +        }
>> +
>> +        on_disk_refblock = g_malloc0(s->cluster_size);
> qemu_blockalign?

Meh, I think I have to write a qemu_blockalign0() some time...

If I do so, it's one more patch you have to review; if I just use 
qemu_blockalign(), I find the memset() directly afterwards rather ugly, 
but would be fine with it. I guess it's up to you, then.

>> +        for (i = 0; i < s->cluster_size / sizeof(uint16_t) &&
>> +                    refblock_start + i < *nb_clusters; i++)
>> +        {
>> +            on_disk_refblock[i] =
>> +                cpu_to_be16((*refcount_table)[refblock_start + i]);
>> +        }
>> +
>> +        ret = bdrv_write(bs->file, refblock_offset / BDRV_SECTOR_SIZE,
>> +                         (void *)on_disk_refblock, s->cluster_sectors);
>> +        g_free(on_disk_refblock);
>> +        if (ret < 0) {
>> +            fprintf(stderr, "ERROR writing refblock: %s\n", strerror(-ret));
>> +            goto fail;
>> +        }
>> +
>> +        /* Go to the end of this refblock */
>> +        cluster = refblock_start + s->cluster_size / sizeof(uint16_t) - 1;
>> +    }
>> +
>> +    if (reftable_offset < 0) {
>> +        uint64_t post_refblock_start, reftable_clusters;
>> +
>> +        post_refblock_start = ROUND_UP(*nb_clusters,
>> +                                       s->cluster_size / sizeof(uint16_t));
>> +        reftable_clusters = size_to_clusters(s,
>> +                                             reftable_size * sizeof(uint64_t));
>> +        /* Not pretty but simple */
>> +        if (first_free_cluster < post_refblock_start) {
>> +            first_free_cluster = post_refblock_start;
>> +        }
>> +        reftable_offset = alloc_clusters_imrt(bs, reftable_clusters,
>> +                                              refcount_table, nb_clusters,
>> +                                              &first_free_cluster);
>> +        if (reftable_offset < 0) {
>> +            fprintf(stderr, "ERROR allocating reftable: %s\n",
>> +                    strerror(-reftable_offset));
>> +            res->check_errors++;
>> +            ret = reftable_offset;
>> +            goto fail;
>> +        }
>> +
>> +        goto write_refblocks;
>> +    }
> Ouch. :-)
>
> Well, it should work.
>
>> +    assert(reftable);
>> +
>> +    for (refblock_index = 0; refblock_index < reftable_size; refblock_index++) {
>> +        cpu_to_be64s(&reftable[refblock_index]);
>> +    }
>> +
>> +    ret = qcow2_pre_write_overlap_check(bs, 0, reftable_offset,
>> +                                        reftable_size * sizeof(uint64_t));
>> +    if (ret < 0) {
>> +        fprintf(stderr, "ERROR writing reftable: %s\n", strerror(-ret));
>> +        goto fail;
>> +    }
>> +
>> +    ret = bdrv_write(bs->file, reftable_offset / BDRV_SECTOR_SIZE,
>> +                     (void *)reftable,
>> +                     reftable_size * sizeof(uint64_t) / BDRV_SECTOR_SIZE);
> Why not bdrv_pwrite when you only have byte offset and length?

I don't like pwrite probably only because it takes an int byte length. 
reftable_size * sizeof(uint64_t) should be well below INT_MAX, but I 
don't see why we should use pwrite if we are writing full sectors to a 
sector-aligned offset.

>> +    if (ret < 0) {
>> +        fprintf(stderr, "ERROR writing reftable: %s\n", strerror(-ret));
>> +        goto fail;
>> +    }
>> +
>> +    /* Enter new reftable into the image header */
>> +    cpu_to_be64w(&reftable_offset_and_clusters.reftable_offset,
>> +                 reftable_offset);
>> +    cpu_to_be32w(&reftable_offset_and_clusters.reftable_clusters,
>> +                 size_to_clusters(s, reftable_size * sizeof(uint64_t)));
>> +    ret = bdrv_pwrite_sync(bs->file, offsetof(QCowHeader,
>> +                                              refcount_table_offset),
>> +                           &reftable_offset_and_clusters,
>> +                           sizeof(reftable_offset_and_clusters));
>> +    if (ret < 0) {
>> +        fprintf(stderr, "ERROR setting reftable: %s\n", strerror(-ret));
>> +        goto fail;
>> +    }
>> +
>> +    for (refblock_index = 0; refblock_index < reftable_size; refblock_index++) {
>> +        be64_to_cpus(&reftable[refblock_index]);
>> +    }
>> +    s->refcount_table = reftable;
>> +    s->refcount_table_offset = reftable_offset;
>> +    s->refcount_table_size = reftable_size;
>> +
>> +    return 0;
>> +
>> +fail:
>> +    g_free(reftable);
>> +    return ret;
>> +}
>> +
>> +/*
>>    * Checks an image for refcount consistency.
>>    *
>>    * Returns 0 if no errors are found, the number of errors in case the image is
>> @@ -1651,6 +1921,7 @@ int qcow2_check_refcounts(BlockDriverState *bs, BdrvCheckResult *res,
>>                             BdrvCheckMode fix)
>>   {
>>       BDRVQcowState *s = bs->opaque;
>> +    BdrvCheckResult pre_compare_res;
>>       int64_t size, highest_cluster, nb_clusters;
>>       uint16_t *refcount_table = NULL;
>>       bool rebuild = false;
>> @@ -1677,11 +1948,30 @@ int qcow2_check_refcounts(BlockDriverState *bs, BdrvCheckResult *res,
>>           goto fail;
>>       }
>>   
>> -    compare_refcounts(bs, res, fix, &rebuild, &highest_cluster, refcount_table,
>> +    /* In case we don't need to rebuild the refcount structure (but want to fix
>> +     * something), this function is immediately called again, in which case the
>> +     * result should be ignored */
>> +    pre_compare_res = *res;
>> +    compare_refcounts(bs, res, 0, &rebuild, &highest_cluster, refcount_table,
>>                         nb_clusters);
>>   
>> -    if (rebuild) {
>> -        fprintf(stderr, "ERROR need to rebuild refcount structures\n");
>> +    if (rebuild && (fix & BDRV_FIX_ERRORS)) {
>> +        fprintf(stderr, "Rebuilding refcount structure\n");
>> +        ret = rebuild_refcount_structure(bs, res, &refcount_table,
>> +                                         &nb_clusters);
>> +        if (ret < 0) {
>> +            goto fail;
>> +        }
>> +    } else if (fix) {
>> +        if (rebuild) {
>> +            fprintf(stderr, "ERROR need to rebuild refcount structures\n");
> Is it safe in this case to continue with fixing leaks? Should some error
> counter be increased?

Good point. We should bail out here. res->corruptions should be enough, 
but incrementing res->check_errors won't hurt.

>> +        }
>> +
>> +        if (res->leaks || res->corruptions) {
>> +            *res = pre_compare_res;
>> +            compare_refcounts(bs, res, fix, &rebuild, &highest_cluster,
>> +                              refcount_table, nb_clusters);
>> +        }
>>       }
> Kevin

Thanks for you review!

Max

next prev parent reply	other threads:[~2014-10-21  9:53 UTC|newest]

Thread overview: 30+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-10-20 14:35 [Qemu-devel] [PATCH v6 00/11] qcow2: Fix image repairing Max Reitz
2014-10-20 14:35 ` [Qemu-devel] [PATCH v6 01/11] qcow2: Calculate refcount block entry count Max Reitz
2014-10-20 14:35 ` [Qemu-devel] [PATCH v6 02/11] qcow2: Fix leaks in dirty images Max Reitz
2014-10-20 14:35 ` [Qemu-devel] [PATCH v6 03/11] qcow2: Split qcow2_check_refcounts() Max Reitz
2014-10-20 14:35 ` [Qemu-devel] [PATCH v6 04/11] qcow2: Pull check_refblocks() up Max Reitz
2014-10-20 15:07   ` Kevin Wolf
2014-10-20 14:35 ` [Qemu-devel] [PATCH v6 05/11] qcow2: Reuse refcount table in calculate_refcounts() Max Reitz
2014-10-20 15:09   ` Kevin Wolf
2014-10-20 14:35 ` [Qemu-devel] [PATCH v6 06/11] qcow2: Fix refcount blocks beyond image end Max Reitz
2014-10-20 16:44   ` Kevin Wolf
2014-10-21  7:14     ` Max Reitz
2014-10-20 14:35 ` [Qemu-devel] [PATCH v6 07/11] qcow2: Do not perform potentially damaging repairs Max Reitz
2014-10-21  7:52   ` Kevin Wolf
2014-10-21 10:10     ` Max Reitz
2014-10-20 14:35 ` [Qemu-devel] [PATCH v6 08/11] qcow2: Rebuild refcount structure during check Max Reitz
2014-10-21  9:31   ` Kevin Wolf
2014-10-21  9:52     ` Max Reitz [this message]
2014-10-21 10:12       ` Kevin Wolf
2014-10-20 14:35 ` [Qemu-devel] [PATCH v6 09/11] qcow2: Clean up after refcount rebuild Max Reitz
2014-10-21  9:59   ` Kevin Wolf
2014-10-21 10:16     ` Max Reitz
2014-10-21 14:55       ` Max Reitz
2014-10-21 15:11         ` Kevin Wolf
2014-10-21 15:17           ` Max Reitz
2014-10-20 14:35 ` [Qemu-devel] [PATCH v6 10/11] iotests: Fix test outputs Max Reitz
2014-10-21 13:42   ` Kevin Wolf
2014-10-20 14:35 ` [Qemu-devel] [PATCH v6 11/11] iotests: Add test for potentially damaging repairs Max Reitz
2014-10-21 14:12   ` Kevin Wolf
2014-10-21 14:20     ` Max Reitz
2014-10-21 15:16       ` Kevin Wolf

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=54462CE9.4020909@redhat.com \
    --to=mreitz@redhat.com \
    --cc=benoit.canet@nodalink.com \
    --cc=kwolf@redhat.com \
    --cc=qemu-devel@nongnu.org \
    --cc=stefanha@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.