From: Max Reitz
Date: Tue, 11 Nov 2014 10:43:06 +0100
Subject: Re: [Qemu-devel] [PATCH 06/21] qcow2: Helper function for refcount modification
To: Eric Blake, qemu-devel@nongnu.org
Cc: Kevin Wolf, Peter Lieven, Stefan Hajnoczi

On 2014-11-11 at 09:35, Max Reitz wrote:
> On 2014-11-10 at 23:30, Eric Blake wrote:
>> On 11/10/2014 06:45 AM, Max Reitz wrote:
>>> Since refcounts do not always have to be a uint16_t, all refcount
>>> blocks and arrays in memory should not have a specific type (thus
>>> they become pointers to void), and for accessing them, two helper
>>> functions are used (a getter and a setter). Those functions are
>>> called indirectly through function pointers in the BDRVQcowState so
>>> they may later be exchanged for different refcount orders.
>>>
>>> At the same time, replace all sizeof(**refcount_table) etc. in the
>>> qcow2 check code by s->refcount_bits / 8. Note that this might lead
>>> to wrong values due to truncating division, but currently
>>> s->refcount_bits is always 16, and before the upcoming patch which
>>> removes this limitation another patch will make the division round
>>> up correctly.
>> Thanks for pointing out that this transition is still in progress, and
>> needs more patches. I agree that for this patch, the division is safe.
>>
>>> Signed-off-by: Max Reitz
>>> ---
>>>  block/qcow2-refcount.c | 152 +++++++++++++++++++++++++++++--------------------
>>>  block/qcow2.h          |   8 +++
>>>  2 files changed, 98 insertions(+), 62 deletions(-)
>>>
>>> @@ -116,20 +137,24 @@ int64_t qcow2_get_refcount(BlockDriverState *bs, int64_t cluster_index)
>>>      }
>>>      ret = qcow2_cache_get(bs, s->refcount_block_cache, refcount_block_offset,
>>> -                          (void**) &refcount_block);
>>> +                          &refcount_block);
>>>      if (ret < 0) {
>>>          return ret;
>>>      }
>>>      block_index = cluster_index & (s->refcount_block_size - 1);
>>> -    refcount = be16_to_cpu(refcount_block[block_index]);
>>> +    refcount = s->get_refcount(refcount_block, block_index);
>>> -    ret = qcow2_cache_put(bs, s->refcount_block_cache,
>>> -                          (void**) &refcount_block);
>>> +    ret = qcow2_cache_put(bs, s->refcount_block_cache, &refcount_block);
>>>      if (ret < 0) {
>>>          return ret;
>>>      }
>>> +    if (refcount < 0) {
>>> +        /* overflow */
>>> +        return -ERANGE;
>>> +    }
>> Should you be checking for overflow prior to calling qcow2_cache_put?
>
> I don't think so; we want to free the cache entry reference even if
> the refblock seems unusable.
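For illustration, with refcount_order fixed at 4 (16-bit refcounts),
the pair of accessors behind those function pointers boils down to
something like the following sketch. The names here are made up, not
the patch's literal code; the entries stay big-endian in memory,
matching the on-disk refblock layout:

#include <stdint.h>

static uint64_t get_refcount_16(const void *refcount_array, uint64_t index)
{
    const uint8_t *a = refcount_array;

    /* assemble the big-endian 16-bit entry */
    return ((uint64_t)a[2 * index] << 8) | a[2 * index + 1];
}

static void set_refcount_16(void *refcount_array, uint64_t index,
                            uint64_t value)
{
    uint8_t *a = refcount_array;

    /* store the entry big-endian */
    a[2 * index] = value >> 8;
    a[2 * index + 1] = value;
}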
>
>>> @@ -362,7 +387,7 @@ static int alloc_refcount_block(BlockDriverState *bs,
>>>                                            s->cluster_size;
>>>      uint64_t table_offset = meta_offset + blocks_clusters * s->cluster_size;
>>>      uint64_t *new_table = g_try_new0(uint64_t, table_size);
>>> -    uint16_t *new_blocks = g_try_malloc0(blocks_clusters * s->cluster_size);
>>> +    void *new_blocks = g_try_malloc0(blocks_clusters * s->cluster_size);
>> Can this multiplication ever overflow? Would we be better off with a
>> g_try_new0 approach?
>
> Yes, you're right. Patch 7 introduces a helper function which will
> allow checking such overflows in a central place, but I haven't done
> so in this version.
>
>>> @@ -1118,12 +1142,13 @@ fail:
>>>   */
>>>  static int inc_refcounts(BlockDriverState *bs,
>>>                           BdrvCheckResult *res,
>>> -                         uint16_t **refcount_table,
>>> +                         void **refcount_table,
>>>                           int64_t *refcount_table_size,
>>>                           int64_t offset, int64_t size)
>>>  {
>>>      BDRVQcowState *s = bs->opaque;
>>> -    uint64_t start, last, cluster_offset, k;
>>> +    uint64_t start, last, cluster_offset, k, refcount;
>> Why uint64_t, when you limit to int64_t in other patches?
>
> Because we don't need to check for errors here. But I guess we should
> really use int64_t everywhere to be consistent (so that if something
> breaks because a cluster has more than INT64_MAX references, it breaks
> everywhere).
>
>>> +    int64_t i;
>>>      if (size <= 0) {
>>>          return 0;
>>> @@ -1136,12 +1161,12 @@ static int inc_refcounts(BlockDriverState *bs,
>>>          k = cluster_offset >> s->cluster_bits;
>>>          if (k >= *refcount_table_size) {
>>>              int64_t old_refcount_table_size = *refcount_table_size;
>>> -            uint16_t *new_refcount_table;
>>> +            void *new_refcount_table;
>>>              *refcount_table_size = k + 1;
>>>              new_refcount_table = g_try_realloc(*refcount_table,
>>>                                                 *refcount_table_size *
>>> -                                               sizeof(**refcount_table));
>>> +                                               s->refcount_bits / 8);
>> This multiplies before dividing. Can it ever overflow, where writing
>> *refcount_table_size * (s->refcount_bits / 8) would be safer? Also, is
>> it better to use a malloc variant that checks for overflow (I think it
>> is g_try_renew?) instead of open-coding the multiply?
>>
>>>              if (!new_refcount_table) {
>>>                  *refcount_table_size = old_refcount_table_size;
>>>                  res->check_errors++;
>>> @@ -1149,16 +1174,19 @@ static int inc_refcounts(BlockDriverState *bs,
>>>              }
>>>              *refcount_table = new_refcount_table;
>>> -            memset(*refcount_table + old_refcount_table_size, 0,
>>> -                   (*refcount_table_size - old_refcount_table_size) *
>>> -                   sizeof(**refcount_table));
>>> +            for (i = old_refcount_table_size; i < *refcount_table_size; i++) {
>>> +                s->set_refcount(*refcount_table, i, 0);
>>> +            }
>> This feels slower than memset.
>
> It is, yes. But this is the check function; I don't think performance
> is all that important here (especially not for operations in RAM).
>
>> Any chance we can add an optimization that brings back the speed of
>> memset (may require an additional callback in addition to the getter
>> and setter)?
>
> For sub-byte refcount widths, we would have to manually set all
> non-byte-aligned entries to 0 and then use memset() on the rest. Not
> impossible, but I think too complicated for a place where performance
> is not critical.
>
>>> @@ -1178,7 +1206,7 @@ enum {
>>>   * error occurred.
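For what it's worth, the overflow worry can also be handled without a
typed allocator. A guard like the following sketch (a hypothetical
helper, not part of this series; patch 7's central helper is the real
plan) rejects entry counts whose byte size would wrap before calling
g_try_realloc():

#include <glib.h>
#include <stdint.h>

/* Hypothetical: grow an array of 'entries' refcount entries of
 * 'refcount_bits' bits each (whole-byte widths assumed), refusing
 * sizes whose byte count would overflow gsize. */
static void *try_grow_refcount_array(void *array, uint64_t entries,
                                     unsigned refcount_bits)
{
    unsigned entry_size = refcount_bits / 8;

    if (entry_size == 0 || entries > G_MAXSIZE / entry_size) {
        return NULL; /* byte count would not fit in gsize */
    }
    return g_try_realloc(array, (gsize)entries * entry_size);
}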
>>>   */
>>>  static int check_refcounts_l2(BlockDriverState *bs, BdrvCheckResult *res,
>>> -    uint16_t **refcount_table, int64_t *refcount_table_size, int64_t l2_offset,
>>> +    void **refcount_table, int64_t *refcount_table_size, int64_t l2_offset,
>>>      int flags)
>> I noticed you cleaned up indentation in a lot of the patch, but not
>> here. Any reason?
>
> Maybe I remembered touching that header once already and not fixing up
> the indentation. Will do in v2.
>
>>> @@ -1541,7 +1569,7 @@ static int check_refblocks(BlockDriverState *bs, BdrvCheckResult *res,
>>>              new_refcount_table = g_try_realloc(*refcount_table,
>>>                                                 *nb_clusters *
>>> -                                               sizeof(**refcount_table));
>>> +                                               s->refcount_bits / 8);
>> Another possible overflow or g_try_renew site?
>>
>>>              if (!new_refcount_table) {
>>>                  *nb_clusters = old_nb_clusters;
>>>                  res->check_errors++;
>>> @@ -1549,9 +1577,9 @@ static int check_refblocks(BlockDriverState *bs, BdrvCheckResult *res,
>>>              }
>>>              *refcount_table = new_refcount_table;
>>> -            memset(*refcount_table + old_nb_clusters, 0,
>>> -                   (*nb_clusters - old_nb_clusters) *
>>> -                   sizeof(**refcount_table));
>>> +            for (j = old_nb_clusters; j < *nb_clusters; j++) {
>>> +                s->set_refcount(*refcount_table, j, 0);
>>> +            }
>> Another memset pessimization. Maybe even have a callback to expand the
>> table, and factor out more of the common code of reallocating the table
>> and clearing all new entries.
>
> That sounds useful, yes. I'll look into it, but I'll probably still go
> without memset().
>
>>> @@ -1611,7 +1640,7 @@ static int calculate_refcounts(BlockDriverState *bs, BdrvCheckResult *res,
>>>      int ret;
>>>      if (!*refcount_table) {
>>> -        *refcount_table = g_try_new0(uint16_t, *nb_clusters);
>>> +        *refcount_table = g_try_malloc0(*nb_clusters * s->refcount_bits / 8);
>> Feels like a step backwards in overflow detection?
>>
>>> @@ -1787,22 +1816,22 @@ static int64_t alloc_clusters_imrt(BlockDriverState *bs,
>>>          *imrt_nb_clusters = cluster + cluster_count - contiguous_free_clusters;
>>>          new_refcount_table = g_try_realloc(*refcount_table,
>>>                                             *imrt_nb_clusters *
>>> -                                           sizeof(**refcount_table));
>>> +                                           s->refcount_bits / 8);
>> Another possible overflow.
>>
>>>          if (!new_refcount_table) {
>>>              *imrt_nb_clusters = old_imrt_nb_clusters;
>>>              return -ENOMEM;
>>>          }
>>>          *refcount_table = new_refcount_table;
>>> -        memset(*refcount_table + old_imrt_nb_clusters, 0,
>>> -               (*imrt_nb_clusters - old_imrt_nb_clusters) *
>>> -               sizeof(**refcount_table));
>>> +        for (i = old_imrt_nb_clusters; i < *imrt_nb_clusters; i++) {
>>> +            s->set_refcount(*refcount_table, i, 0);
>>> +        }
>>>      }
>> ...and another resize where we pessimize memset.
>>
>>> @@ -1911,12 +1940,11 @@ write_refblocks:
>>>          }
>>>          on_disk_refblock = qemu_blockalign0(bs->file, s->cluster_size);
>>> -        for (i = 0; i < s->refcount_block_size &&
>>> -             refblock_start + i < *nb_clusters; i++)
>>> -        {
>>> -            on_disk_refblock[i] =
>>> -                cpu_to_be16((*refcount_table)[refblock_start + i]);
>>> -        }
>>> +
>>> +        memcpy(on_disk_refblock, (void *)((uintptr_t)*refcount_table +
>>> +                                          (refblock_index << s->refcount_block_bits)),
>>> +               MIN(s->refcount_block_size, *nb_clusters - refblock_start)
>>> +               * s->refcount_bits / 8);
>> This one's different in that you move TO a memcpy instead of an
>> open-coded loop.
>
> Yes, because the set_refcount() and get_refcount() helpers store the
> refcounts already in big endian, so now we can directly use the data
> without conversion.
>
>> But I still worry if multiply before /8 could be a problem.
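To spell out why that memcpy is valid, here is a sketch under the
16-bit assumption, with explicit byte arithmetic instead of the
uintptr_t cast: the in-memory check array now uses the same big-endian
representation as an on-disk refcount block, so one refblock is simply
a byte slice of the array. (The function and parameter names are made
up for illustration.)

#include <stdint.h>
#include <string.h>

static void copy_refblock(void *on_disk_refblock,
                          const void *refcount_array,
                          uint64_t refblock_index,
                          uint64_t entries_to_copy,
                          uint64_t entries_per_block,
                          unsigned refcount_bits)
{
    /* byte offset of this refblock's first entry in the big array */
    const uint8_t *src = (const uint8_t *)refcount_array +
        refblock_index * entries_per_block * (refcount_bits / 8);

    memcpy(on_disk_refblock, src, entries_to_copy * (refcount_bits / 8));
}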
>>
>>> @@ -2064,7 +2092,7 @@ int qcow2_check_refcounts(BlockDriverState *bs, BdrvCheckResult *res,
>>>          /* Because the old reftable has been exchanged for a new one the
>>>           * references have to be recalculated */
>>>          rebuild = false;
>>> -        memset(refcount_table, 0, nb_clusters * sizeof(uint16_t));
>>> +        memset(refcount_table, 0, nb_clusters * s->refcount_bits / 8);
>> Another /8 possible overflow.
>>
>>>          ret = calculate_refcounts(bs, res, 0, &rebuild, &refcount_table,
>>>                                    &nb_clusters);
>>>          if (ret < 0) {
>>> diff --git a/block/qcow2.h b/block/qcow2.h
>>> index 0f8eb15..1c63221 100644
>>> --- a/block/qcow2.h
>>> +++ b/block/qcow2.h
>>> @@ -213,6 +213,11 @@ typedef struct Qcow2DiscardRegion {
>>>      QTAILQ_ENTRY(Qcow2DiscardRegion) next;
>>>  } Qcow2DiscardRegion;
>>> +typedef uint64_t Qcow2GetRefcountFunc(const void *refcount_array,
>>> +                                      uint64_t index);
>>> +typedef void Qcow2SetRefcountFunc(void *refcount_array,
>>> +                                  uint64_t index, uint64_t value);
>> Do you want int64_t for any of the types here, to make it obvious that
>> you can't exceed 2^63?
>
> Yes, I do.

To be honest, I just read your reply and not the original code (which
I'm doing now while working on v2). I think I don't want int64_t after
all. I generally do want to use int64_t for all refcount values, but in
this case, as these are just helpers for directly accessing refcount
arrays (which I think should not fail, see my reply for patch 8), I'd
rather keep them using uint64_t.

Max

>> Looks like you are on track to a sane conversion, but I'm worried enough
>> about the math that it probably needs a respin (either comments stating
>> why we know we don't overflow, or else safer constructs).
>
> I'll add a comment in the commit message why overflows do not really
> get more probable than they were before, but the real overflow
> prevention will happen in patch 7.
>
> I think I'll factor out the refcount array resize which automatically
> sets the newly allocated entries to zero. Now that I think about it...
> I can actually get away without sub-byte operations. Since .*realloc()
> will only allocate full bytes anyway, I only need to set that newly
> allocated area to zero (with memset()). So it'll be back to memset().
> (I'm still wondering why there's no g_try_realloc0() or something;
> probably because glib does not require the libc heap manager to know
> the exact size of each heap object, which would be necessary for that
> to work.)
>
> Max
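The factored-out resize described there could end up looking roughly
like this sketch (assumed name and signature, not the actual v2 code).
Since realloc() hands out whole bytes, zeroing the newly allocated tail
with memset() works even for sub-byte refcount entries; this is
effectively the missing g_try_realloc0():

#include <glib.h>
#include <string.h>

static void *realloc_refcount_array(void *array, size_t old_bytes,
                                    size_t new_bytes)
{
    void *new_array = g_try_realloc(array, new_bytes);

    if (!new_array && new_bytes > 0) {
        return NULL; /* allocation failed; caller keeps the old array */
    }
    if (new_bytes > old_bytes) {
        /* zero only the bytes that did not exist before */
        memset((char *)new_array + old_bytes, 0, new_bytes - old_bytes);
    }
    return new_array;
}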