From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:40905)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <mreitz@redhat.com>) id 1Xo6vS-0006ux-F4
	for qemu-devel@nongnu.org; Tue, 11 Nov 2014 03:35:52 -0500
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <mreitz@redhat.com>) id 1Xo6vM-0001bE-9E
	for qemu-devel@nongnu.org; Tue, 11 Nov 2014 03:35:46 -0500
Received: from mx1.redhat.com ([209.132.183.28]:56125)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <mreitz@redhat.com>) id 1Xo6vL-0001b6-W6
	for qemu-devel@nongnu.org; Tue, 11 Nov 2014 03:35:40 -0500
Message-ID: <5461CA51.2080403@redhat.com>
Date: Tue, 11 Nov 2014 09:35:29 +0100
From: Max Reitz <mreitz@redhat.com>
MIME-Version: 1.0
References: <1415627159-15941-1-git-send-email-mreitz@redhat.com>
	<1415627159-15941-7-git-send-email-mreitz@redhat.com>
	<54613C7E.4020707@redhat.com>
In-Reply-To: <54613C7E.4020707@redhat.com>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Subject: Re: [Qemu-devel] [PATCH 06/21] qcow2: Helper function for refcount
	modification
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: Eric Blake <eblake@redhat.com>, qemu-devel@nongnu.org
Cc: Kevin Wolf <kwolf@redhat.com>, Peter Lieven <pl@kamp.de>, Stefan Hajnoczi <stefanha@redhat.com>

On 2014-11-10 at 23:30, Eric Blake wrote:
> On 11/10/2014 06:45 AM, Max Reitz wrote:
>> Since refcounts do not always have to be a uint16_t, all refcount blocks
>> and arrays in memory should not have a specific type (thus they become
>> pointers to void) and for accessing them, two helper functions are used
>> (a getter and a setter). Those functions are called indirectly through
>> function pointers in the BDRVQcowState so they may later be exchanged
>> for different refcount orders.
>>
>> At the same time, replace all sizeof(**refcount_table) etc. in the qcow2
>> check code by s->refcount_bits / 8. Note that this might lead to wrong
>> values due to truncating division, but currently s->refcount_bits is
>> always 16, and before the upcoming patch which removes this limitation
>> another patch will make the division round up correctly.
> Thanks for pointing out that this transition is still in progress, and
> needs more patches.  I agree that for this patch, the division is safe.
>
>> Signed-off-by: Max Reitz <mreitz@redhat.com>
>> ---
>>   block/qcow2-refcount.c | 152 +++++++++++++++++++++++++++++--------------------
>>   block/qcow2.h          |   8 +++
>>   2 files changed, 98 insertions(+), 62 deletions(-)
>>
>> @@ -116,20 +137,24 @@ int64_t qcow2_get_refcount(BlockDriverState *bs, int64_t cluster_index)
>>       }
>>   
>>       ret = qcow2_cache_get(bs, s->refcount_block_cache, refcount_block_offset,
>> -        (void**) &refcount_block);
>> +                          &refcount_block);
>>       if (ret < 0) {
>>           return ret;
>>       }
>>   
>>       block_index = cluster_index & (s->refcount_block_size - 1);
>> -    refcount = be16_to_cpu(refcount_block[block_index]);
>> +    refcount = s->get_refcount(refcount_block, block_index);
>>   
>> -    ret = qcow2_cache_put(bs, s->refcount_block_cache,
>> -        (void**) &refcount_block);
>> +    ret = qcow2_cache_put(bs, s->refcount_block_cache, &refcount_block);
>>       if (ret < 0) {
>>           return ret;
>>       }
>>   
>> +    if (refcount < 0) {
>> +        /* overflow */
>> +        return -ERANGE;
>> +    }
> Should you be checking for overflow prior to calling qcow2_cache_put?

I don't think so; we want to free the cache entry reference even if the 
refblock seems unusable.

>> @@ -362,7 +387,7 @@ static int alloc_refcount_block(BlockDriverState *bs,
>>           s->cluster_size;
>>       uint64_t table_offset = meta_offset + blocks_clusters * s->cluster_size;
>>       uint64_t *new_table = g_try_new0(uint64_t, table_size);
>> -    uint16_t *new_blocks = g_try_malloc0(blocks_clusters * s->cluster_size);
>> +    void *new_blocks = g_try_malloc0(blocks_clusters * s->cluster_size);
> Can this multiplication ever overflow?  Would we be better off with a
> g_try_new0 approach?

Yes, you're right. Patch 7 introduces a helper function which will allow 
checking such overflows in a central place, but I haven't done so in 
this version.

>> @@ -1118,12 +1142,13 @@ fail:
>>    */
>>   static int inc_refcounts(BlockDriverState *bs,
>>                            BdrvCheckResult *res,
>> -                         uint16_t **refcount_table,
>> +                         void **refcount_table,
>>                            int64_t *refcount_table_size,
>>                            int64_t offset, int64_t size)
>>   {
>>       BDRVQcowState *s = bs->opaque;
>> -    uint64_t start, last, cluster_offset, k;
>> +    uint64_t start, last, cluster_offset, k, refcount;
> Why uint64_t, when you limit to int64_t in other patches?

Because we don't need to check for errors here. But I guess we should 
really use int64_t everywhere to be consistent (so that if something 
breaks because a cluster has more than INT64_MAX references, it breaks 
everywhere).

>> +    int64_t i;
>>   
>>       if (size <= 0) {
>>           return 0;
>> @@ -1136,12 +1161,12 @@ static int inc_refcounts(BlockDriverState *bs,
>>           k = cluster_offset >> s->cluster_bits;
>>           if (k >= *refcount_table_size) {
>>               int64_t old_refcount_table_size = *refcount_table_size;
>> -            uint16_t *new_refcount_table;
>> +            void *new_refcount_table;
>>   
>>               *refcount_table_size = k + 1;
>>               new_refcount_table = g_try_realloc(*refcount_table,
>>                                                  *refcount_table_size *
>> -                                               sizeof(**refcount_table));
>> +                                               s->refcount_bits / 8);
> This multiplies before dividing.  Can it ever overflow, where writing
> *refcount_table_size * (s->refcount_bits / 8) would be safer?  Also, is
> it better to use a malloc variant that checks for overflow (I think it
> is g_try_renew?) instead of open-coding the multiply?
>
>>               if (!new_refcount_table) {
>>                   *refcount_table_size = old_refcount_table_size;
>>                   res->check_errors++;
>> @@ -1149,16 +1174,19 @@ static int inc_refcounts(BlockDriverState *bs,
>>               }
>>               *refcount_table = new_refcount_table;
>>   
>> -            memset(*refcount_table + old_refcount_table_size, 0,
>> -                   (*refcount_table_size - old_refcount_table_size) *
>> -                   sizeof(**refcount_table));
>> +            for (i = old_refcount_table_size; i < *refcount_table_size; i++) {
>> +                s->set_refcount(*refcount_table, i, 0);
>> +            }
> This feels slower than memset.

It is, yes. But this is the check function, I don't think performance is 
all that important here (especially not operations in RAM).

> Any chance we can add an optimization
> that brings back the speed of memset (may require an additional callback
> in addition to the getter and setter)?

For sub-byte refcount widths, we would have to manually set all non-byte 
aligned entries to 0 and then use memset() on the rest. Not impossible, 
but I think too complicated for a place where performance is not critical.

>> @@ -1178,7 +1206,7 @@ enum {
>>    * error occurred.
>>    */
>>   static int check_refcounts_l2(BlockDriverState *bs, BdrvCheckResult *res,
>> -    uint16_t **refcount_table, int64_t *refcount_table_size, int64_t l2_offset,
>> +    void **refcount_table, int64_t *refcount_table_size, int64_t l2_offset,
>>       int flags)
> I noticed you cleaned up indentation in a lot of the patch, but not
> here.  Any reason?

Maybe I remembered touching that header once already and not fixing up 
the indentation. Will do in v2.

>> @@ -1541,7 +1569,7 @@ static int check_refblocks(BlockDriverState *bs, BdrvCheckResult *res,
>>   
>>                   new_refcount_table = g_try_realloc(*refcount_table,
>>                                                      *nb_clusters *
>> -                                                   sizeof(**refcount_table));
>> +                                                   s->refcount_bits / 8);
> Another possible overflow or g_try_renew site?
>
>>                   if (!new_refcount_table) {
>>                       *nb_clusters = old_nb_clusters;
>>                       res->check_errors++;
>> @@ -1549,9 +1577,9 @@ static int check_refblocks(BlockDriverState *bs, BdrvCheckResult *res,
>>                   }
>>                   *refcount_table = new_refcount_table;
>>   
>> -                memset(*refcount_table + old_nb_clusters, 0,
>> -                       (*nb_clusters - old_nb_clusters) *
>> -                       sizeof(**refcount_table));
>> +                for (j = old_nb_clusters; j < *nb_clusters; j++) {
>> +                    s->set_refcount(*refcount_table, j, 0);
>> +                }
> Another memset pessimation.  Maybe even having a callback to expand the
> table, and factor out more of the common code of reallocating the table
> and clearing all new entries.

That sounds useful, yes. I'll look into it, but I'll probably still go 
without memset().

>> @@ -1611,7 +1640,7 @@ static int calculate_refcounts(BlockDriverState *bs, BdrvCheckResult *res,
>>       int ret;
>>   
>>       if (!*refcount_table) {
>> -        *refcount_table = g_try_new0(uint16_t, *nb_clusters);
>> +        *refcount_table = g_try_malloc0(*nb_clusters * s->refcount_bits / 8);
> Feels like a step backwards in overflow detection?
>
>> @@ -1787,22 +1816,22 @@ static int64_t alloc_clusters_imrt(BlockDriverState *bs,
>>           *imrt_nb_clusters = cluster + cluster_count - contiguous_free_clusters;
>>           new_refcount_table = g_try_realloc(*refcount_table,
>>                                              *imrt_nb_clusters *
>> -                                           sizeof(**refcount_table));
>> +                                           s->refcount_bits / 8);
> Another possible overflow
>
>>           if (!new_refcount_table) {
>>               *imrt_nb_clusters = old_imrt_nb_clusters;
>>               return -ENOMEM;
>>           }
>>           *refcount_table = new_refcount_table;
>>   
>> -        memset(*refcount_table + old_imrt_nb_clusters, 0,
>> -               (*imrt_nb_clusters - old_imrt_nb_clusters) *
>> -               sizeof(**refcount_table));
>> +        for (i = old_imrt_nb_clusters; i < *imrt_nb_clusters; i++) {
>> +            s->set_refcount(*refcount_table, i, 0);
>> +        }
>>       }
> and another resize where we pessimize memset
>
>
>> @@ -1911,12 +1940,11 @@ write_refblocks:
>>           }
>>   
>>           on_disk_refblock = qemu_blockalign0(bs->file, s->cluster_size);
>> -        for (i = 0; i < s->refcount_block_size &&
>> -                    refblock_start + i < *nb_clusters; i++)
>> -        {
>> -            on_disk_refblock[i] =
>> -                cpu_to_be16((*refcount_table)[refblock_start + i]);
>> -        }
>> +
>> +        memcpy(on_disk_refblock, (void *)((uintptr_t)*refcount_table +
>> +                                 (refblock_index << s->refcount_block_bits)),
>> +               MIN(s->refcount_block_size, *nb_clusters - refblock_start)
>> +               * s->refcount_bits / 8);
>>   
> This one's different in that you move TO a memcpy instead of open-coded
> loop.

Yes, because the set_refcount() and get_refcount() helpers store the 
refcounts already in big endian, so now we can directly use the data 
without conversion.

> But I still worry if multiply before /8 could be a problem.
>
>> @@ -2064,7 +2092,7 @@ int qcow2_check_refcounts(BlockDriverState *bs, BdrvCheckResult *res,
>>           /* Because the old reftable has been exchanged for a new one the
>>            * references have to be recalculated */
>>           rebuild = false;
>> -        memset(refcount_table, 0, nb_clusters * sizeof(uint16_t));
>> +        memset(refcount_table, 0, nb_clusters * s->refcount_bits / 8);
> Another /8 possible overflow.
>
>>           ret = calculate_refcounts(bs, res, 0, &rebuild, &refcount_table,
>>                                     &nb_clusters);
>>           if (ret < 0) {
>> diff --git a/block/qcow2.h b/block/qcow2.h
>> index 0f8eb15..1c63221 100644
>> --- a/block/qcow2.h
>> +++ b/block/qcow2.h
>> @@ -213,6 +213,11 @@ typedef struct Qcow2DiscardRegion {
>>       QTAILQ_ENTRY(Qcow2DiscardRegion) next;
>>   } Qcow2DiscardRegion;
>>   
>> +typedef uint64_t Qcow2GetRefcountFunc(const void *refcount_array,
>> +                                      uint64_t index);
>> +typedef void Qcow2SetRefcountFunc(void *refcount_array,
>> +                                  uint64_t index, uint64_t value);
> Do you want int64_t for any of the types here, to make it obvious that
> you can't exceed 2^63?

Yes, I do.

> Looks like you are on track to a sane conversion, but I'm worried enough
> about the math that it probably needs a respin (either comments stating
> why we know we don't overflow, or else safer constructs).

I'll add a comment in the commit message why overflows do not really get 
more probable than they were before, but the real overflow prevention 
will happen in patch 7.

I think I'll factor out the refcount array resize which automatically 
sets the newly allocated entries to zero. Now that I think about it... I 
can actually get away without sub-byte operations. Since .*realloc() 
will only allocate full bytes anyway, I only need to set that newly 
allocated area to zero (with memset()). So it'll be back to memset(). 
(I'm still wondering why there's no g_try_realloc0() or something; 
probably because the glib does not require the libc heap manager to know 
the exact size of each heap object which would be necessary for that to 
work)

Max