From: Max Reitz <mreitz@redhat.com>
Date: Thu, 23 Oct 2014 10:36:06 +0200
To: Kevin Wolf
Cc: qemu-devel@nongnu.org, Stefan Hajnoczi
Subject: Re: [Qemu-devel] [PATCH v13 03/14] qcow2: Optimize bdrv_make_empty()
Message-ID: <5448BDF6.9080406@redhat.com>
In-Reply-To: <20141023082957.GC3522@noname.redhat.com>
References: <1413982283-10186-1-git-send-email-mreitz@redhat.com>
 <1413982283-10186-4-git-send-email-mreitz@redhat.com>
 <20141022183516.GA6975@noname.redhat.com>
 <5448B263.4050408@redhat.com>
 <20141023082957.GC3522@noname.redhat.com>

On 2014-10-23 at 10:29, Kevin Wolf wrote:
> Am 23.10.2014 um 09:46 hat Max Reitz geschrieben:
>> On 2014-10-22 at 20:35, Kevin Wolf wrote:
>>> Am 22.10.2014 um 14:51 hat Max Reitz geschrieben:
>>>> bdrv_make_empty() is currently only called if the current image
>>>> represents an external snapshot that has been committed to its base
>>>> image; it is therefore unlikely to have internal snapshots. In this
>>>> case, bdrv_make_empty() can be greatly sped up by emptying the L1 and
>>>> refcount table (while having the dirty flag set, which only works for
>>>> compat=1.1) and creating a trivial refcount structure.
>>>>
>>>> If there are snapshots or for compat=0.10, fall back to the simple
>>>> implementation (discard all clusters).
>>>>
>>>> Signed-off-by: Max Reitz <mreitz@redhat.com>
>>> Hey, this feels actually reviewable this time. :-)
>> I'm still unsure which version I like more. If it wasn't for the
>> math, I'd prefer the other version.
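To make the commit message's "trivial refcount structure" concrete: as I
read the patch below, the fast path leaves the header in cluster 0, puts a
one-cluster refcount table in cluster 1, a single refcount block in
cluster 2, and the L1 table from cluster 3 on. A minimal standalone sketch
of that arithmetic (mine, not from the patch; cluster size and L1 size are
assumed example values):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint64_t cluster_size = 65536;  /* assumed: default qcow2 cluster size */
        uint64_t l1_size = 32;          /* assumed: number of L1 entries */

        /* clusters occupied by the L1 table, i.e. DIV_ROUND_UP() */
        uint64_t l1_entries_per_cluster = cluster_size / sizeof(uint64_t);
        uint64_t l1_clusters =
            (l1_size + l1_entries_per_cluster - 1) / l1_entries_per_cluster;

        /* cluster 0: header (untouched)
         * cluster 1: refcount table (one cluster)
         * cluster 2: refcount block (one cluster)
         * clusters 3 .. 2 + l1_clusters: L1 table
         * => the patch zeroes (2 + l1_clusters) clusters from cluster 1 on */
        uint64_t used_clusters = 3 + l1_clusters;

        /* with the default 16-bit refcounts, one refcount block holds
         * cluster_size / 2 entries; the single block in cluster 2 has to
         * cover every used cluster (cf. Kevin's remark further down) */
        uint64_t refblock_entries = cluster_size / sizeof(uint16_t);

        printf("zeroing %llu clusters starting at offset %llu\n",
               (unsigned long long)(2 + l1_clusters),
               (unsigned long long)cluster_size);
        printf("everything covered by one refblock: %s\n",
               used_clusters <= refblock_entries ? "yes" : "no");
        return 0;
    }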
>>
>>>> diff --git a/block/blkdebug.c b/block/blkdebug.c
>>>> index e046b92..862d93b 100644
>>>> --- a/block/blkdebug.c
>>>> +++ b/block/blkdebug.c
>>>> @@ -195,6 +195,8 @@ static const char *event_names[BLKDBG_EVENT_MAX] = {
>>>>      [BLKDBG_PWRITEV]                        = "pwritev",
>>>>      [BLKDBG_PWRITEV_ZERO]                   = "pwritev_zero",
>>>>      [BLKDBG_PWRITEV_DONE]                   = "pwritev_done",
>>>> +
>>>> +    [BLKDBG_EMPTY_IMAGE_PREPARE]            = "empty_image_prepare",
>>>>  };
>>>>
>>>>  static int get_event_by_name(const char *name, BlkDebugEvent *event)
>>>> diff --git a/block/qcow2.c b/block/qcow2.c
>>>> index 1ef3a5f..16dece2 100644
>>>> --- a/block/qcow2.c
>>>> +++ b/block/qcow2.c
>>>> @@ -2232,24 +2232,137 @@ fail:
>>>>  static int qcow2_make_empty(BlockDriverState *bs)
>>>>  {
>>>> +    BDRVQcowState *s = bs->opaque;
>>>>      int ret = 0;
>>>> -    uint64_t start_sector;
>>>> -    int sector_step = INT_MAX / BDRV_SECTOR_SIZE;
>>>>
>>>> -    for (start_sector = 0; start_sector < bs->total_sectors;
>>>> -         start_sector += sector_step)
>>>> -    {
>>>> -        /* As this function is generally used after committing an external
>>>> -         * snapshot, QCOW2_DISCARD_SNAPSHOT seems appropriate. Also, the
>>>> -         * default action for this kind of discard is to pass the discard,
>>>> -         * which will ideally result in an actually smaller image file, as
>>>> -         * is probably desired. */
>>>> -        ret = qcow2_discard_clusters(bs, start_sector * BDRV_SECTOR_SIZE,
>>>> -                                     MIN(sector_step,
>>>> -                                         bs->total_sectors - start_sector),
>>>> -                                     QCOW2_DISCARD_SNAPSHOT, true);
>>>> +    if (s->snapshots || s->qcow_version < 3) {
>>>> +        uint64_t start_sector;
>>>> +        int sector_step = INT_MAX / BDRV_SECTOR_SIZE;
>>>> +
>>>> +        /* If there are snapshots, every active cluster has to be discarded; and
>>>> +         * because compat=0.10 does not support setting the dirty flag, we have
>>>> +         * to use this fallback there as well */
>>>> +
>>>> +        for (start_sector = 0; start_sector < bs->total_sectors;
>>>> +             start_sector += sector_step)
>>>> +        {
>>>> +            /* As this function is generally used after committing an external
>>>> +             * snapshot, QCOW2_DISCARD_SNAPSHOT seems appropriate. Also, the
>>>> +             * default action for this kind of discard is to pass the discard,
>>>> +             * which will ideally result in an actually smaller image file, as
>>>> +             * is probably desired. */
>>>> +            ret = qcow2_discard_clusters(bs, start_sector * BDRV_SECTOR_SIZE,
>>>> +                                         MIN(sector_step,
>>>> +                                             bs->total_sectors - start_sector),
>>>> +                                         QCOW2_DISCARD_SNAPSHOT, true);
>>>> +            if (ret < 0) {
>>>> +                break;
>>>> +            }
>>>> +        }
>>> My first thought was to add a return here, so the indentation level for
>>> the rest is one less. Probably a matter of taste, though.
>> I'd rather put the second branch into its own function.
> Works for me.
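A sketch of what that split could look like (written against the QEMU
internals the patch already uses; the helper name is invented here, and the
fallback loop is the one from the patch):

    /* Hypothetical split of qcow2_make_empty(): the fast path moves into
     * its own helper so that the fallback loses one indentation level. */
    static int make_empty_by_rewriting_metadata(BlockDriverState *bs);

    static int qcow2_make_empty(BlockDriverState *bs)
    {
        BDRVQcowState *s = bs->opaque;
        uint64_t start_sector;
        int sector_step = INT_MAX / BDRV_SECTOR_SIZE;
        int ret = 0;

        if (!s->snapshots && s->qcow_version >= 3) {
            /* no internal snapshots and compat=1.1: take the fast path */
            return make_empty_by_rewriting_metadata(bs);
        }

        /* images with snapshots or compat=0.10: discard all clusters, as
         * in the patch's fallback branch */
        for (start_sector = 0; start_sector < bs->total_sectors;
             start_sector += sector_step)
        {
            ret = qcow2_discard_clusters(bs, start_sector * BDRV_SECTOR_SIZE,
                                         MIN(sector_step,
                                             bs->total_sectors - start_sector),
                                         QCOW2_DISCARD_SNAPSHOT, true);
            if (ret < 0) {
                break;
            }
        }

        return ret;
    }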
>
>>>> +    } else {
>>>> +        int l1_clusters;
>>>> +        int64_t offset;
>>>> +        uint64_t *new_reftable;
>>>> +        uint64_t rt_entry;
>>>> +        struct {
>>>> +            uint64_t l1_offset;
>>>> +            uint64_t reftable_offset;
>>>> +            uint32_t reftable_clusters;
>>>> +        } QEMU_PACKED l1_ofs_rt_ofs_cls;
>>>> +
>>>> +        ret = qcow2_cache_empty(bs, s->l2_table_cache);
>>>>          if (ret < 0) {
>>>> -            break;
>>>> +            return ret;
>>>> +        }
>>>> +
>>>> +        ret = qcow2_cache_empty(bs, s->refcount_block_cache);
>>>> +        if (ret < 0) {
>>>> +            return ret;
>>>> +        }
>>>> +
>>>> +        /* Refcounts will be broken utterly */
>>>> +        ret = qcow2_mark_dirty(bs);
>>>> +        if (ret < 0) {
>>>> +            return ret;
>>>> +        }
>>>> +
>>>> +        l1_clusters = DIV_ROUND_UP(s->l1_size,
>>>> +                                   s->cluster_size / sizeof(uint64_t));
>>>> +        new_reftable = g_try_new0(uint64_t, s->cluster_size / sizeof(uint64_t));
>>>> +        if (!new_reftable) {
>>>> +            return -ENOMEM;
>>>> +        }
>>>> +
>>>> +        BLKDBG_EVENT(bs->file, BLKDBG_EMPTY_IMAGE_PREPARE);
>>> Up to here, the failure cases are trivially okay. The worst thing that
>>> could happen is that the image is needlessly marked as dirty.
>>>
>>>> +        /* Overwrite enough clusters at the beginning of the sectors to place
>>>> +         * the refcount table, a refcount block and the L1 table in; this may
>>>> +         * overwrite parts of the existing refcount and L1 table, which is not
>>>> +         * an issue because the dirty flag is set, complete data loss is in fact
>>>> +         * desired and partial data loss is consequently fine as well */
>>>> +        ret = bdrv_write_zeroes(bs->file, s->cluster_size / BDRV_SECTOR_SIZE,
>>>> +                                (2 + l1_clusters) * s->cluster_size /
>>>> +                                BDRV_SECTOR_SIZE, 0);
>>> If we crash at this point, we're _not_ okay any more. --verbose follows:
>>>
>>> On disk, we may have overwritten a refcount table or a refcount block
>>> with zeros. This is fine, we have the dirty flag set, so destroying any
>>> refcounting structure does no harm.
>>>
>>> We may also have overwritten an L1 or L2 table. As the commit correctly
>>> explains, this is partially doing what the function is supposed to do
>>> for the whole image. Affected data clusters are now read from the
>>> backing file. Good.
>>>
>>> However, we may also have overwritten data clusters that are still
>>> accessible using an L1/L2 table that hasn't been hit by this write
>>> operation. We're reading corrupted (zeroed out) data now instead of
>>> going to the backing file. Bug!
>> Oh, right, I forgot about the L1 table not always being at the start
>> of the file.
>>
>>> In my original suggestion I had an item where the L1 table was zeroed
>>> out first, before the start of the image is zeroed. This would have
>>> avoided the bug.
>>>
>>>> +        if (ret < 0) {
>>>> +            g_free(new_reftable);
>>>> +            return ret;
>>>> +        }
>>> If we fail here (without crashing), the first clusters could be in their
>>> original state or partially zeroed. Assuming that you fixed the above
>>> bug, the on-disk state would be okay if we opened the image now, because
>>> the dirty flag would trigger an image repair; but we don't get the
>>> repair when taking this failure path, and we may have zeroed a refcount
>>> table/block. This is probably a problem, and we may have to make the BDS
>>> unusable.
>>>
>>> The in-memory state of the L1 table is hopefully zeroed out, so it's
>>> consistent with what is on disk.
>>>
>>> The in-memory state of the refcount table looks like it's not in sync
>>> with the on-disk state.
>>> Note that while the dirty flag allows the
>>> on-disk state to be anything, the in-memory state is what we keep using
>>> after a failure. The in-memory state isn't accurate at this point, but
>>> we only create leaks. Lots of them, because we zeroed the L1 table, but
>>> that's good enough. If refcounts are updated later, the old offsets
>>> should still be valid.
>> If we set at least parts of the in-memory reftable to zero,
>> everything probably breaks. Try to allocate a new cluster while the
>> beginning of the reftable is zero. So we cannot take the on-disk
>> reftable into memory.
>>
>> Doing it the other way around, writing the in-memory reftable to
>> disk on error, won't work either. The refblocks may have been zeroed
>> out, so we have exactly the same problem.
>>
>> Therefore, to make the BDS usable after an error, we have to (in the
>> error path) read the on-disk reftable into memory, call the qcow2
>> repair function and hope for the best.
>>
>> Or, you know, we could go back to v11, which had my other version of
>> this patch which always kept everything consistent. :-P
> I'm not sure how important it is to keep the BDS usable after such an
> unlikely error.
>
> I started to write up a suggestion that we could do without
> qcow2_alloc_clusters(), but just build up the first refcount block
> ourselves. It doesn't really help us, because the new state is not what
> we need here, but it made me aware that it assumes that the L1 table is
> small enough to fit in the first refcount block, and we would have to
> fall back to discard if it isn't. Come to think of it, I suspect you
> already make the same assumption without checking it.
>
>
> Anyway, if we want to keep the BDS usable, what is the right state?
> We need references for the header, for the L1 table, for the refcount
> table and all refcount blocks. If we decide to build a new in-memory
> refcount table, the only refcount blocks we need are the ones still
> referenced there, i.e. we don't have to worry about the location of
> refcount blocks on the disk.
>
> In fact, we can also safely change the refcount table offset to
> cluster 1 and handle it together with the header. We'll then need one
> refcount block for the header/reftable and potentially another one for
> the L1 table, which still must be in its original place (adding the
> assumption that the L1 table doesn't cross a refblock boundary :-/).
>
> Our refblock cache can hold at least 4 tables, so creating two refblocks
> purely in memory is doable.
>
> That leaves the question: is it worth the hassle? Probably not.

Would you be fine with setting bs->drv to NULL in the problematic error paths?

Max
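For illustration, the error-path change Max is asking about might look like
this in the zeroing step quoted above (a sketch under the assumption that a
NULL bs->drv is enough to fence off the BDS; not part of the patch):

    ret = bdrv_write_zeroes(bs->file, s->cluster_size / BDRV_SECTOR_SIZE,
                            (2 + l1_clusters) * s->cluster_size /
                            BDRV_SECTOR_SIZE, 0);
    if (ret < 0) {
        g_free(new_reftable);
        /* The on-disk refcount structures may already be partially zeroed
         * while the in-memory state still refers to them; instead of
         * attempting a repair, make every further access to this BDS
         * fail (assumption from the discussion above). */
        bs->drv = NULL;
        return ret;
    }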