From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([2001:4830:134:3::10]:41065) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1XgW7a-0004dB-6Y for qemu-devel@nongnu.org; Tue, 21 Oct 2014 05:53:00 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1XgW7T-0002XY-UB for qemu-devel@nongnu.org; Tue, 21 Oct 2014 05:52:54 -0400 Received: from mx1.redhat.com ([209.132.183.28]:55685) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1XgW7T-0002XQ-Ki for qemu-devel@nongnu.org; Tue, 21 Oct 2014 05:52:47 -0400 Message-ID: <54462CE9.4020909@redhat.com> Date: Tue, 21 Oct 2014 11:52:41 +0200 From: Max Reitz MIME-Version: 1.0 References: <1413815733-22829-1-git-send-email-mreitz@redhat.com> <1413815733-22829-9-git-send-email-mreitz@redhat.com> <20141021093125.GC4409@noname.redhat.com> In-Reply-To: <20141021093125.GC4409@noname.redhat.com> Content-Type: text/plain; charset=windows-1252; format=flowed Content-Transfer-Encoding: 7bit Subject: Re: [Qemu-devel] [PATCH v6 08/11] qcow2: Rebuild refcount structure during check List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Kevin Wolf Cc: qemu-devel@nongnu.org, Stefan Hajnoczi , =?windows-1252?Q?Beno=EEt_Canet?= On 2014-10-21 at 11:31, Kevin Wolf wrote: > Am 20.10.2014 um 16:35 hat Max Reitz geschrieben: >> The previous commit introduced the "rebuild" variable to qcow2's >> implementation of the image consistency check. Now make use of this by >> adding a function which creates a completely new refcount structure >> based solely on the in-memory information gathered before. >> >> The old refcount structure will be leaked, however. This leak will be >> dealt with in a follow-up commit. >> >> Signed-off-by: Max Reitz >> --- >> block/qcow2-refcount.c | 296 ++++++++++++++++++++++++++++++++++++++++++++++++- >> 1 file changed, 293 insertions(+), 3 deletions(-) >> >> diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c >> index 183fc5b..75e726b 100644 >> --- a/block/qcow2-refcount.c >> +++ b/block/qcow2-refcount.c >> @@ -1642,6 +1642,276 @@ static void compare_refcounts(BlockDriverState *bs, BdrvCheckResult *res, >> } >> >> /* >> + * Allocates a cluster using an in-memory refcount table (IMRT) in contrast to > s/a cluster/clusters/ > >> + * the on-disk refcount structures. >> + * >> + * On input, *first_free_cluster tells where to start looking, and need not >> + * actually be a free cluster; the returned offset will not be before that >> + * cluster. On output, *first_free_cluster points to the first gap found, even >> + * if that gap was too small to be used as the returned offset. >> + * >> + * Note that *first_free_cluster is a cluster index whereas the return value is >> + * an offset. >> + */ >> +static int64_t alloc_clusters_imrt(BlockDriverState *bs, >> + int cluster_count, >> + uint16_t **refcount_table, >> + int64_t *nb_clusters, > Having cluster_count and nb_clusters at the same time is confusing. > > Maybe imrt_nb_clusters for this one? Sounds good. >> + int64_t *first_free_cluster) >> +{ >> + BDRVQcowState *s = bs->opaque; >> + int64_t cluster = *first_free_cluster, i; >> + bool first_gap = true; >> + int contiguous_free_clusters; >> + >> + /* Starting at *first_free_cluster, find a range of at least cluster_count >> + * continuously free clusters */ >> + for (contiguous_free_clusters = 0; >> + cluster < *nb_clusters && contiguous_free_clusters < cluster_count; >> + cluster++) >> + { >> + if (!(*refcount_table)[cluster]) { >> + contiguous_free_clusters++; >> + if (first_gap) { >> + /* If this is the first free cluster found, update >> + * *first_free_cluster accordingly */ >> + *first_free_cluster = cluster; >> + first_gap = false; >> + } >> + } else if (contiguous_free_clusters) { >> + contiguous_free_clusters = 0; >> + } >> + } >> + >> + /* If contiguous_free_clusters is greater than zero, it contains the number >> + * of continuously free clusters until the current cluster; the first free >> + * cluster in the current "gap" is therefore >> + * cluster - contiguous_free_clusters */ >> + >> + /* If no such range could be found, grow the in-memory refcount table >> + * accordingly to append free clusters at the end of the image */ >> + if (contiguous_free_clusters < cluster_count) { >> + int64_t old_nb_clusters = *nb_clusters; >> + >> + /* contiguous_free_clusters clusters are already empty at the image end; >> + * we need cluster_count clusters; therefore, we have to allocate >> + * cluster_count - contiguous_free_clusters new clusters at the end of >> + * the image (which is the current value of cluster; note that cluster >> + * may exceed old_nb_clusters if *first_free_cluster pointed beyond the >> + * image end) */ >> + *nb_clusters = cluster + cluster_count - contiguous_free_clusters; >> + *refcount_table = g_try_realloc(*refcount_table, >> + *nb_clusters * sizeof(uint16_t)); >> + if (!*refcount_table) { >> + return -ENOMEM; > This means that on failure the passed refcount table pointer may be > overwritten by NULL. This is a surprising interface and should at least > be mentioned in the function comment. > > But as the original IMRT is leaked here, too, it's probably better to > change the interface to leave *refcount_table alone if the failure case. Okay, I had to look it up, but realloc() does not free the given pointer on failure (and I assume g_try_realloc() does not either). Therefore, yes, we should just leave *refcount_table as it is. >> + } >> + >> + memset(*refcount_table + old_nb_clusters, 0, >> + (*nb_clusters - old_nb_clusters) * sizeof(uint16_t)); >> + } >> + >> + /* Go back to the first free cluster */ >> + cluster -= contiguous_free_clusters; >> + for (i = 0; i < cluster_count; i++) { >> + (*refcount_table)[cluster + i] = 1; >> + } >> + >> + return cluster << s->cluster_bits; >> +} >> + >> +/* >> + * Creates a new refcount structure based solely on the in-memory information >> + * given through *refcount_table. All necessary allocations will be reflected >> + * in that array. >> + * >> + * On success, the old refcount structure is leaked (it will be covered by the >> + * new refcount structure). >> + */ >> +static int rebuild_refcount_structure(BlockDriverState *bs, >> + BdrvCheckResult *res, >> + uint16_t **refcount_table, >> + int64_t *nb_clusters) >> +{ >> + BDRVQcowState *s = bs->opaque; >> + int64_t first_free_cluster = 0, reftable_offset = -1, cluster = 0; >> + int64_t refblock_offset, refblock_start, refblock_index; >> + uint32_t reftable_size = 0; >> + uint64_t *reftable = NULL; > refcount_table and reftable? Seriously? Reviewing would have been too easy otherwise. *cough* One option is s/refcount_table/imrt/. I like it more than the second option, but it simply goes against the current naming convention. The other option would be s/reftable/on_disk_reftable/, following the convention used in the next line. >> + uint16_t *on_disk_refblock; >> + int i, ret = 0; >> + struct { >> + uint64_t reftable_offset; >> + uint32_t reftable_clusters; >> + } QEMU_PACKED reftable_offset_and_clusters; >> + >> + qcow2_cache_empty(bs, s->refcount_block_cache); >> + >> +write_refblocks: >> + for (; cluster < *nb_clusters; cluster++) { >> + if (!(*refcount_table)[cluster]) { >> + continue; >> + } >> + >> + refblock_index = cluster >> s->refcount_block_bits; >> + refblock_start = refblock_index << s->refcount_block_bits; >> + >> + /* Don't allocate a cluster in a refblock already written to disk */ >> + if (first_free_cluster < refblock_start) { >> + first_free_cluster = refblock_start; >> + } >> + refblock_offset = alloc_clusters_imrt(bs, 1, refcount_table, >> + nb_clusters, &first_free_cluster); >> + if (refblock_offset < 0) { >> + fprintf(stderr, "ERROR allocating refblock: %s\n", >> + strerror(-refblock_offset)); >> + res->check_errors++; >> + ret = refblock_offset; >> + goto fail; >> + } >> + >> + if (reftable_size <= refblock_index) { >> + uint32_t old_rt_size = reftable_size; >> + reftable_size = ROUND_UP((refblock_index + 1) * sizeof(uint64_t), >> + s->cluster_size) / sizeof(uint64_t); >> + reftable = g_try_realloc(reftable, >> + reftable_size * sizeof(uint64_t)); > Another leak here. Will fix. >> + if (!reftable) { >> + res->check_errors++; >> + ret = -ENOMEM; >> + goto fail; >> + } >> + >> + memset(reftable + old_rt_size, 0, >> + (reftable_size - old_rt_size) * sizeof(uint64_t)); >> + >> + /* The offset we have for the reftable is now no longer valid; >> + * this will leak that range, but we can easily fix that by running >> + * a leak-fixing check after this rebuild operation */ >> + reftable_offset = -1; >> + } >> + reftable[refblock_index] = refblock_offset; >> + >> + /* If this is apparently the last refblock (for now), try to squeeze the >> + * reftable in */ >> + if (refblock_index == (*nb_clusters - 1) >> s->refcount_block_bits && >> + reftable_offset < 0) >> + { >> + uint64_t reftable_clusters = size_to_clusters(s, reftable_size * >> + sizeof(uint64_t)); >> + reftable_offset = alloc_clusters_imrt(bs, reftable_clusters, >> + refcount_table, nb_clusters, >> + &first_free_cluster); >> + if (reftable_offset < 0) { >> + fprintf(stderr, "ERROR allocating reftable: %s\n", >> + strerror(-reftable_offset)); >> + res->check_errors++; >> + ret = reftable_offset; >> + goto fail; >> + } >> + } >> + >> + ret = qcow2_pre_write_overlap_check(bs, 0, refblock_offset, >> + s->cluster_size); >> + if (ret < 0) { >> + fprintf(stderr, "ERROR writing refblock: %s\n", strerror(-ret)); >> + goto fail; >> + } >> + >> + on_disk_refblock = g_malloc0(s->cluster_size); > qemu_blockalign? Meh, I think I have to write a qemu_blockalign0() some time... If I do so, it's one more patch you have to review; if I just use qemu_blockalign(), I find the memset() directly afterwards rather ugly, but would be fine with it. I guess it's up to you, then. >> + for (i = 0; i < s->cluster_size / sizeof(uint16_t) && >> + refblock_start + i < *nb_clusters; i++) >> + { >> + on_disk_refblock[i] = >> + cpu_to_be16((*refcount_table)[refblock_start + i]); >> + } >> + >> + ret = bdrv_write(bs->file, refblock_offset / BDRV_SECTOR_SIZE, >> + (void *)on_disk_refblock, s->cluster_sectors); >> + g_free(on_disk_refblock); >> + if (ret < 0) { >> + fprintf(stderr, "ERROR writing refblock: %s\n", strerror(-ret)); >> + goto fail; >> + } >> + >> + /* Go to the end of this refblock */ >> + cluster = refblock_start + s->cluster_size / sizeof(uint16_t) - 1; >> + } >> + >> + if (reftable_offset < 0) { >> + uint64_t post_refblock_start, reftable_clusters; >> + >> + post_refblock_start = ROUND_UP(*nb_clusters, >> + s->cluster_size / sizeof(uint16_t)); >> + reftable_clusters = size_to_clusters(s, >> + reftable_size * sizeof(uint64_t)); >> + /* Not pretty but simple */ >> + if (first_free_cluster < post_refblock_start) { >> + first_free_cluster = post_refblock_start; >> + } >> + reftable_offset = alloc_clusters_imrt(bs, reftable_clusters, >> + refcount_table, nb_clusters, >> + &first_free_cluster); >> + if (reftable_offset < 0) { >> + fprintf(stderr, "ERROR allocating reftable: %s\n", >> + strerror(-reftable_offset)); >> + res->check_errors++; >> + ret = reftable_offset; >> + goto fail; >> + } >> + >> + goto write_refblocks; >> + } > Ouch. :-) > > Well, it should work. > >> + assert(reftable); >> + >> + for (refblock_index = 0; refblock_index < reftable_size; refblock_index++) { >> + cpu_to_be64s(&reftable[refblock_index]); >> + } >> + >> + ret = qcow2_pre_write_overlap_check(bs, 0, reftable_offset, >> + reftable_size * sizeof(uint64_t)); >> + if (ret < 0) { >> + fprintf(stderr, "ERROR writing reftable: %s\n", strerror(-ret)); >> + goto fail; >> + } >> + >> + ret = bdrv_write(bs->file, reftable_offset / BDRV_SECTOR_SIZE, >> + (void *)reftable, >> + reftable_size * sizeof(uint64_t) / BDRV_SECTOR_SIZE); > Why not bdrv_pwrite when you only have byte offset and length? I don't like pwrite probably only because it takes an int byte length. reftable_size * sizeof(uint64_t) should be well below INT_MAX, but I don't see why we should use pwrite if we are writing full sectors to a sector-aligned offset. >> + if (ret < 0) { >> + fprintf(stderr, "ERROR writing reftable: %s\n", strerror(-ret)); >> + goto fail; >> + } >> + >> + /* Enter new reftable into the image header */ >> + cpu_to_be64w(&reftable_offset_and_clusters.reftable_offset, >> + reftable_offset); >> + cpu_to_be32w(&reftable_offset_and_clusters.reftable_clusters, >> + size_to_clusters(s, reftable_size * sizeof(uint64_t))); >> + ret = bdrv_pwrite_sync(bs->file, offsetof(QCowHeader, >> + refcount_table_offset), >> + &reftable_offset_and_clusters, >> + sizeof(reftable_offset_and_clusters)); >> + if (ret < 0) { >> + fprintf(stderr, "ERROR setting reftable: %s\n", strerror(-ret)); >> + goto fail; >> + } >> + >> + for (refblock_index = 0; refblock_index < reftable_size; refblock_index++) { >> + be64_to_cpus(&reftable[refblock_index]); >> + } >> + s->refcount_table = reftable; >> + s->refcount_table_offset = reftable_offset; >> + s->refcount_table_size = reftable_size; >> + >> + return 0; >> + >> +fail: >> + g_free(reftable); >> + return ret; >> +} >> + >> +/* >> * Checks an image for refcount consistency. >> * >> * Returns 0 if no errors are found, the number of errors in case the image is >> @@ -1651,6 +1921,7 @@ int qcow2_check_refcounts(BlockDriverState *bs, BdrvCheckResult *res, >> BdrvCheckMode fix) >> { >> BDRVQcowState *s = bs->opaque; >> + BdrvCheckResult pre_compare_res; >> int64_t size, highest_cluster, nb_clusters; >> uint16_t *refcount_table = NULL; >> bool rebuild = false; >> @@ -1677,11 +1948,30 @@ int qcow2_check_refcounts(BlockDriverState *bs, BdrvCheckResult *res, >> goto fail; >> } >> >> - compare_refcounts(bs, res, fix, &rebuild, &highest_cluster, refcount_table, >> + /* In case we don't need to rebuild the refcount structure (but want to fix >> + * something), this function is immediately called again, in which case the >> + * result should be ignored */ >> + pre_compare_res = *res; >> + compare_refcounts(bs, res, 0, &rebuild, &highest_cluster, refcount_table, >> nb_clusters); >> >> - if (rebuild) { >> - fprintf(stderr, "ERROR need to rebuild refcount structures\n"); >> + if (rebuild && (fix & BDRV_FIX_ERRORS)) { >> + fprintf(stderr, "Rebuilding refcount structure\n"); >> + ret = rebuild_refcount_structure(bs, res, &refcount_table, >> + &nb_clusters); >> + if (ret < 0) { >> + goto fail; >> + } >> + } else if (fix) { >> + if (rebuild) { >> + fprintf(stderr, "ERROR need to rebuild refcount structures\n"); > Is it safe in this case to continue with fixing leaks? Should some error > counter be increased? Good point. We should bail out here. res->corruptions should be enough, but incrementing res->check_errors won't hurt. >> + } >> + >> + if (res->leaks || res->corruptions) { >> + *res = pre_compare_res; >> + compare_refcounts(bs, res, fix, &rebuild, &highest_cluster, >> + refcount_table, nb_clusters); >> + } >> } > Kevin Thanks for you review! Max