From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:53344)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <mreitz@redhat.com>) id 1VF130-0004MV-3m
	for qemu-devel@nongnu.org; Thu, 29 Aug 2013 08:10:04 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <mreitz@redhat.com>) id 1VF12r-00089q-P9
	for qemu-devel@nongnu.org; Thu, 29 Aug 2013 08:09:57 -0400
Received: from mx1.redhat.com ([209.132.183.28]:28123)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <mreitz@redhat.com>) id 1VF12r-000899-C0
	for qemu-devel@nongnu.org; Thu, 29 Aug 2013 08:09:49 -0400
Received: from int-mx09.intmail.prod.int.phx2.redhat.com
	(int-mx09.intmail.prod.int.phx2.redhat.com [10.5.11.22])
	by mx1.redhat.com (8.14.4/8.14.4) with ESMTP id r7TC9lrT007009
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK)
	for <qemu-devel@nongnu.org>; Thu, 29 Aug 2013 08:09:47 -0400
Message-ID: <521F3A07.2000002@redhat.com>
Date: Thu, 29 Aug 2013 14:09:43 +0200
From: Max Reitz <mreitz@redhat.com>
MIME-Version: 1.0
References: <1377701706-965-1-git-send-email-mreitz@redhat.com>
	<1377701706-965-5-git-send-email-mreitz@redhat.com>
	<20130829113654.GG2961@dhcp-200-207.str.redhat.com>
In-Reply-To: <20130829113654.GG2961@dhcp-200-207.str.redhat.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Subject: Re: [Qemu-devel] [PATCH v2 4/5] qcow2: More complete consistency
	check
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: Kevin Wolf <kwolf@redhat.com>
Cc: qemu-devel@nongnu.org, Stefan Hajnoczi <stefanha@redhat.com>

Am 29.08.2013 13:36, schrieb Kevin Wolf:
> Am 28.08.2013 um 16:55 hat Max Reitz geschrieben:
>> The qcow2_check_refcounts function has been extended to be able to fix
>> OFLAG_COPIED errors and multiple references on refcount blocks.
>>
>> If no corruptions remain after an image repair (and no errors have been
>> encountered), clear the corrupt flag in qcow2_check.
>>
>> Signed-off-by: Max Reitz <mreitz@redhat.com>
> This would be easier to review if you kept the code changes of the
> actual check (mostly code movement, I guess) and the repair of each type
> of errors in separate patches.
>
>>   block/qcow2-cluster.c  |   4 +-
>>   block/qcow2-refcount.c | 249 ++++++++++++++++++++++++++++++++++++++++++-------
>>   block/qcow2.c          |   6 +-
>>   block/qcow2.h          |   1 +
>>   include/block/block.h  |   1 +
>>   5 files changed, 222 insertions(+), 39 deletions(-)
>> @@ -1240,7 +1243,8 @@ int qcow2_check_refcounts(BlockDriverState *bs, BdrvCheckResult *res,
>>   {
>>       BDRVQcowState *s = bs->opaque;
>>       int64_t size, i, highest_cluster;
>> -    int nb_clusters, refcount1, refcount2;
>> +    uint64_t *l2_table = NULL;
>> +    int nb_clusters, refcount1, refcount2, j;
>>       QCowSnapshot *sn;
>>       uint16_t *refcount_table;
>>       int ret;
>> @@ -1305,10 +1309,85 @@ int qcow2_check_refcounts(BlockDriverState *bs, BdrvCheckResult *res,
>>               inc_refcounts(bs, res, refcount_table, nb_clusters,
>>                   offset, s->cluster_size);
>>               if (refcount_table[cluster] != 1) {
>> -                fprintf(stderr, "ERROR refcount block %" PRId64
>> +                fprintf(stderr, "%s refcount block %" PRId64
>>                       " refcount=%d\n",
>> +                    fix & BDRV_FIX_ERRORS ? "Repairing" :
>> +                                            "ERROR",
>>                       i, refcount_table[cluster]);
>> -                res->corruptions++;
>> +                if (fix & BDRV_FIX_ERRORS) {
> This is a quite long if block. Probably worth splitting into its own
> function. (The res->corruptions++ part could be in a common fail: part
> then.)
Seems reasonable.

>> +                    int64_t new_offset;
>> +                    void *refcount_block;
>> +
>> +                    /* allocate new refcount block */
>> +                    new_offset = qcow2_alloc_clusters(bs, s->cluster_size);
> How trustworthy is qcow2_alloc_clusters() when we know that our refcount
> information is corrupted? Couldn't we end up accidentally overwriting
> things?
I personally don't really see any other way to do this than to just fail 
without any attempt on repairing. If the refcounts are completely 
broken, I don't see a reasonable way of allocating a free cluster. 
Furthermore, the following pre-write check should at least prevent this 
from overwriting any metadata.

>> +                    if (new_offset < 0) {
>> +                        fprintf(stderr, "Could not allocate new cluster\n");
>> +                        res->corruptions++;
>> +                        continue;
>> +                    }
>> +                    /* fetch current content */
>> +                    ret = qcow2_cache_get(bs, s->refcount_block_cache, offset,
>> +                            &refcount_block);
>> +                    if (ret < 0) {
>> +                        fprintf(stderr, "Could not fetch refcount block\n");
> strerror(-ret) would be useful information in almost all of the error
> cases.
Yes, right.

>> +                        qcow2_free_clusters(bs, new_offset, s->cluster_size,
>> +                                QCOW2_DISCARD_ALWAYS);
>> +                        res->corruptions++;
>> +                        continue;
>> +                    }
>> +                    /* new block has not yet been entered into refcount table,
>> +                     * therefore it is no refcount block yet (regarding this
>> +                     * check) */
>> +                    ret = qcow2_pre_write_overlap_check(bs, QCOW2_OL_DEFAULT,
>> +                            new_offset, s->cluster_sectors * BDRV_SECTOR_SIZE);
>> +                    if (ret < 0) {
>> +                        fprintf(stderr, "Could not write refcount block (would "
>> +                                "overlap with existing metadata)\n");
>> +                        /* the image will be marked corrupt here, so don't even
>> +                         * attempt on freeing the cluster */
>> +                        res->corruptions++;
>> +                        goto fail;
>> +                    }
>> +                    /* write to new block */
>> +                    ret = bdrv_write(bs->file, new_offset >> BDRV_SECTOR_BITS,
>> +                            refcount_block, s->cluster_sectors);
>> +                    if (ret < 0) {
>> +                        fprintf(stderr, "Could not write refcount block\n");
>> +                        qcow2_free_clusters(bs, new_offset, s->cluster_size,
>> +                                QCOW2_DISCARD_ALWAYS);
>> +                        res->corruptions++;
>> +                        continue;
>> +                    }
>> +                    /* update refcount table */
>> +                    assert(!(new_offset & (s->cluster_size - 1)));
>> +                    s->refcount_table[i] = new_offset;
>> +                    ret = write_reftable_entry(bs, i);
>> +                    if (ret < 0) {
>> +                        fprintf(stderr, "Could not update refcount table\n");
>> +                        s->refcount_table[i] = offset;
>> +                        qcow2_free_clusters(bs, new_offset, s->cluster_size,
>> +                                QCOW2_DISCARD_ALWAYS);
>> +                        res->corruptions++;
>> +                        continue;
>> +                    }
>> +                    qcow2_cache_put(bs, s->refcount_block_cache,
>> +                            &refcount_block);
>> +                    /* update refcounts */
>> +                    if ((new_offset >> s->cluster_bits) >= nb_clusters) {
>> +                        /* increase refcount_table size if necessary */
>> +                        int old_nb_clusters = nb_clusters;
>> +                        nb_clusters = (new_offset >> s->cluster_bits) + 1;
>> +                        refcount_table = g_realloc(refcount_table,
>> +                                nb_clusters * sizeof(uint16_t));
>> +                        memset(&refcount_table[old_nb_clusters], 0, (nb_clusters
>> +                                - old_nb_clusters) * sizeof(uint16_t));
>> +                    }
>> +                    refcount_table[cluster]--;
>> +                    inc_refcounts(bs, res, refcount_table, nb_clusters,
>> +                            new_offset, s->cluster_size);
> res->corruptions_fixed++?
Oh, yes, I forgot that.

>> +                } else {
>> +                    res->corruptions++;
>> +                }
>>               }
>>           }
>>       }
>> @@ -1364,10 +1443,108 @@ int qcow2_check_refcounts(BlockDriverState *bs, BdrvCheckResult *res,
>>           }
>>       }
>>   
>> +    l2_table = g_malloc(s->l2_size * sizeof(uint64_t));
> qemu_blockalign() is better than g_malloc() because it avoids using a
> bounce buffer for O_DIRECT images.
Ah, okay. I'll include that in my other series as well.

>> +
>> +    /* check OFLAG_COPIED */
>> +    for (i = 0; i < s->l1_size; i++) {
>> +        uint64_t l1_entry = s->l1_table[i];
>> +        uint64_t l2_offset = l1_entry & L1E_OFFSET_MASK;
>> +        bool l2_dirty = false;
>> +        int refcount;
>> +
>> +        if (!l2_offset) {
>> +            continue;
>> +        }
>> +
>> +        refcount = get_refcount(bs, l2_offset >> s->cluster_bits);
>> +        if (refcount < 0) {
>> +            /* don't print message nor increment check_errors, since the above
>> +             * loop will have done this already */
>> +            continue;
>> +        }
>> +        if ((refcount == 1) != ((l1_entry & QCOW_OFLAG_COPIED) != 0)) {
>> +            fprintf(stderr, "%s OFLAG_COPIED L2 cluster: l1_entry=%" PRIx64
>> +                    " refcount=%d\n",
>> +                    fix & BDRV_FIX_ERRORS ? "Repairing" :
>> +                                            "ERROR",
>> +                    l1_entry, refcount);
>> +            if (fix & BDRV_FIX_ERRORS) {
>> +                s->l1_table[i] = refcount == 1
>> +                               ? l1_entry |  QCOW_OFLAG_COPIED
>> +                               : l1_entry & ~QCOW_OFLAG_COPIED;
>> +                ret = qcow2_write_l1_entry(bs, i);
>> +                if (ret < 0) {
>> +                    res->check_errors++;
>> +                    goto fail;
>> +                }
> else res->corruptions_fixed++?
>
>> +            } else {
>> +                res->corruptions++;
>> +            }
>> +        }
>> +
>> +        ret = bdrv_pread(bs->file, l2_offset, l2_table,
>> +                s->l2_size * sizeof(uint64_t));
>> +        if (ret != s->l2_size * sizeof(uint64_t)) {
>> +            fprintf(stderr, "ERROR: Could not read L2 table\n");
>> +            res->check_errors++;
>> +            if (ret >= 0) {
>> +                ret = -EIO;
>> +            }
> Doesn't happen. You can just check ret < 0 instead of ret != ... in the
> first place, bdrv_pread() never returns short reads.
Okay, I think I just saw it that way in some other code.

>> +            goto fail;
>> +        }
>> +
>> +        for (j = 0; j < s->l2_size; j++) {
>> +            uint64_t l2_entry = be64_to_cpu(l2_table[j]);
>> +            uint64_t data_offset;
>> +
>> +            if (qcow2_get_cluster_type(l2_entry) != QCOW2_CLUSTER_NORMAL) {
>> +                continue;
>> +            }
>> +
>> +            data_offset = l2_entry & L2E_OFFSET_MASK;
>> +
>> +            refcount = get_refcount(bs, data_offset >> s->cluster_bits);
>> +            if (refcount < 0) {
>> +                /* don't print message nor increment check_errors */
> Why?
I didn't want to duplicate the whole comment from above, but maybe it 
makes sense if it's not obvious:

/* don't print message nor increment check_errors, since the above
  * loop will have done this already */

>> +                continue;
>> +            }
>> +            if ((refcount == 1) != ((l2_entry & QCOW_OFLAG_COPIED) != 0)) {
>> +                fprintf(stderr, "%s OFLAG_COPIED data cluster: l2_entry=%"
>> +                        PRIx64 " refcount=%d\n",
>> +                        fix & BDRV_FIX_ERRORS ? "Repairing" :
>> +                                                "ERROR",
>> +                        l2_entry, refcount);
>> +                if (fix & BDRV_FIX_ERRORS) {
>> +                    l2_table[j] = cpu_to_be64(refcount == 1
>> +                                ? l2_entry |  QCOW_OFLAG_COPIED
>> +                                : l2_entry & ~QCOW_OFLAG_COPIED);
>> +                    l2_dirty = true;
> res->corruptions_fixed++
>
>> +                } else {
>> +                    res->corruptions++;
>> +                }
>> +            }
>> +        }
>> +
>> +        if (l2_dirty) {
>> +            ret = bdrv_pwrite(bs->file, l2_offset, l2_table,
>> +                    s->l2_size * sizeof(uint64_t));
>> +            if (ret != s->l2_size * sizeof(uint64_t)) {
>> +                fprintf(stderr, "ERROR: Could not write L2 table\n");
>> +                res->check_errors++;
>> +                if (ret >= 0) {
>> +                    ret = -EIO;
>> +                }
> bdrv_pwrite() also doesn't return short writes.
>
>> +                goto fail;
>> +            }
>> +        }
>> +    }
>> +
>> +
>>       res->image_end_offset = (highest_cluster + 1) * s->cluster_size;
>>       ret = 0;
>>   
>>   fail:
>> +    g_free(l2_table);
>>       g_free(refcount_table);
>>   
>>       return ret;
> Kevin
Max