From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:55918)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <kwolf@redhat.com>) id 1XhDxT-00081Z-Nq
	for qemu-devel@nongnu.org; Thu, 23 Oct 2014 04:41:30 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <kwolf@redhat.com>) id 1XhDxN-000056-HL
	for qemu-devel@nongnu.org; Thu, 23 Oct 2014 04:41:23 -0400
Received: from mx1.redhat.com ([209.132.183.28]:52386)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <kwolf@redhat.com>) id 1XhDxN-00004y-7B
	for qemu-devel@nongnu.org; Thu, 23 Oct 2014 04:41:17 -0400
Received: from int-mx14.intmail.prod.int.phx2.redhat.com
	(int-mx14.intmail.prod.int.phx2.redhat.com [10.5.11.27])
	by mx1.redhat.com (8.14.4/8.14.4) with ESMTP id s9N8fGXA001291
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256
	verify=FAIL)
	for <qemu-devel@nongnu.org>; Thu, 23 Oct 2014 04:41:16 -0400
Date: Thu, 23 Oct 2014 10:41:13 +0200
From: Kevin Wolf <kwolf@redhat.com>
Message-ID: <20141023084113.GD3522@noname.redhat.com>
References: <1413982283-10186-1-git-send-email-mreitz@redhat.com>
	<1413982283-10186-4-git-send-email-mreitz@redhat.com>
	<20141022183516.GA6975@noname.redhat.com>
	<5448B263.4050408@redhat.com>
	<20141023082957.GC3522@noname.redhat.com>
	<5448BDF6.9080406@redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <5448BDF6.9080406@redhat.com>
Subject: Re: [Qemu-devel] [PATCH v13 03/14] qcow2: Optimize bdrv_make_empty()
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: Max Reitz <mreitz@redhat.com>
Cc: qemu-devel@nongnu.org, Stefan Hajnoczi <stefanha@redhat.com>

Am 23.10.2014 um 10:36 hat Max Reitz geschrieben:
> On 2014-10-23 at 10:29, Kevin Wolf wrote:
> >Am 23.10.2014 um 09:46 hat Max Reitz geschrieben:
> >>On 2014-10-22 at 20:35, Kevin Wolf wrote:
> >>>Am 22.10.2014 um 14:51 hat Max Reitz geschrieben:
> >>>>bdrv_make_empty() is currently only called if the current image
> >>>>represents an external snapshot that has been committed to its base
> >>>>image; it is therefore unlikely to have internal snapshots. In this
> >>>>case, bdrv_make_empty() can be greatly sped up by emptying the L1 and
> >>>>refcount table (while having the dirty flag set, which only works for
> >>>>compat=1.1) and creating a trivial refcount structure.
> >>>>
> >>>>If there are snapshots or for compat=0.10, fall back to the simple
> >>>>implementation (discard all clusters).
> >>>>
> >>>>Signed-off-by: Max Reitz <mreitz@redhat.com>
> >>>Hey, this feels actually reviewable this time. :-)
> >>I'm still unsure which version I like more. If it wasn't for the
> >>math, I'd prefer the other version.
> >>
> >>>>diff --git a/block/blkdebug.c b/block/blkdebug.c
> >>>>index e046b92..862d93b 100644
> >>>>--- a/block/blkdebug.c
> >>>>+++ b/block/blkdebug.c
> >>>>@@ -195,6 +195,8 @@ static const char *event_names[BLKDBG_EVENT_MAX] = {
> >>>>      [BLKDBG_PWRITEV]                        = "pwritev",
> >>>>      [BLKDBG_PWRITEV_ZERO]                   = "pwritev_zero",
> >>>>      [BLKDBG_PWRITEV_DONE]                   = "pwritev_done",
> >>>>+
> >>>>+    [BLKDBG_EMPTY_IMAGE_PREPARE]            = "empty_image_prepare",
> >>>>  };
> >>>>  static int get_event_by_name(const char *name, BlkDebugEvent *event)
> >>>>diff --git a/block/qcow2.c b/block/qcow2.c
> >>>>index 1ef3a5f..16dece2 100644
> >>>>--- a/block/qcow2.c
> >>>>+++ b/block/qcow2.c
> >>>>@@ -2232,24 +2232,137 @@ fail:
> >>>>  static int qcow2_make_empty(BlockDriverState *bs)
> >>>>  {
> >>>>+    BDRVQcowState *s = bs->opaque;
> >>>>      int ret = 0;
> >>>>-    uint64_t start_sector;
> >>>>-    int sector_step = INT_MAX / BDRV_SECTOR_SIZE;
> >>>>-    for (start_sector = 0; start_sector < bs->total_sectors;
> >>>>-         start_sector += sector_step)
> >>>>-    {
> >>>>-        /* As this function is generally used after committing an external
> >>>>-         * snapshot, QCOW2_DISCARD_SNAPSHOT seems appropriate. Also, the
> >>>>-         * default action for this kind of discard is to pass the discard,
> >>>>-         * which will ideally result in an actually smaller image file, as
> >>>>-         * is probably desired. */
> >>>>-        ret = qcow2_discard_clusters(bs, start_sector * BDRV_SECTOR_SIZE,
> >>>>-                                     MIN(sector_step,
> >>>>-                                         bs->total_sectors - start_sector),
> >>>>-                                     QCOW2_DISCARD_SNAPSHOT, true);
> >>>>+    if (s->snapshots || s->qcow_version < 3) {
> >>>>+        uint64_t start_sector;
> >>>>+        int sector_step = INT_MAX / BDRV_SECTOR_SIZE;
> >>>>+
> >>>>+        /* If there are snapshots, every active cluster has to be discarded; and
> >>>>+         * because compat=0.10 does not support setting the dirty flag, we have
> >>>>+         * to use this fallback there as well */
> >>>>+
> >>>>+        for (start_sector = 0; start_sector < bs->total_sectors;
> >>>>+             start_sector += sector_step)
> >>>>+        {
> >>>>+            /* As this function is generally used after committing an external
> >>>>+             * snapshot, QCOW2_DISCARD_SNAPSHOT seems appropriate. Also, the
> >>>>+             * default action for this kind of discard is to pass the discard,
> >>>>+             * which will ideally result in an actually smaller image file, as
> >>>>+             * is probably desired. */
> >>>>+            ret = qcow2_discard_clusters(bs, start_sector * BDRV_SECTOR_SIZE,
> >>>>+                                         MIN(sector_step,
> >>>>+                                             bs->total_sectors - start_sector),
> >>>>+                                         QCOW2_DISCARD_SNAPSHOT, true);
> >>>>+            if (ret < 0) {
> >>>>+                break;
> >>>>+            }
> >>>>+        }
> >>>My first though was to add a return here, so the indentation level for
> >>>the rest is one less. Probably a matter of taste, though.
> >>I'd rather put the second branch into an own function.
> >Works for me.
> >
> >>>>+    } else {
> >>>>+        int l1_clusters;
> >>>>+        int64_t offset;
> >>>>+        uint64_t *new_reftable;
> >>>>+        uint64_t rt_entry;
> >>>>+        struct {
> >>>>+            uint64_t l1_offset;
> >>>>+            uint64_t reftable_offset;
> >>>>+            uint32_t reftable_clusters;
> >>>>+        } QEMU_PACKED l1_ofs_rt_ofs_cls;
> >>>>+
> >>>>+        ret = qcow2_cache_empty(bs, s->l2_table_cache);
> >>>>          if (ret < 0) {
> >>>>-            break;
> >>>>+            return ret;
> >>>>+        }
> >>>>+
> >>>>+        ret = qcow2_cache_empty(bs, s->refcount_block_cache);
> >>>>+        if (ret < 0) {
> >>>>+            return ret;
> >>>>+        }
> >>>>+
> >>>>+        /* Refcounts will be broken utterly */
> >>>>+        ret = qcow2_mark_dirty(bs);
> >>>>+        if (ret < 0) {
> >>>>+            return ret;
> >>>>+        }
> >>>>+
> >>>>+        l1_clusters = DIV_ROUND_UP(s->l1_size,
> >>>>+                                   s->cluster_size / sizeof(uint64_t));
> >>>>+        new_reftable = g_try_new0(uint64_t, s->cluster_size / sizeof(uint64_t));
> >>>>+        if (!new_reftable) {
> >>>>+            return -ENOMEM;
> >>>>+        }
> >>>>+
> >>>>+        BLKDBG_EVENT(bs->file, BLKDBG_EMPTY_IMAGE_PREPARE);
> >>>Until here, the failure cases are trivially okay. The worst thing that
> >>>could happen is that the image is needlessly marked as dirty.
> >>>
> >>>>+        /* Overwrite enough clusters at the beginning of the sectors to place
> >>>>+         * the refcount table, a refcount block and the L1 table in; this may
> >>>>+         * overwrite parts of the existing refcount and L1 table, which is not
> >>>>+         * an issue because the dirty flag is set, complete data loss is in fact
> >>>>+         * desired and partial data loss is consequently fine as well */
> >>>>+        ret = bdrv_write_zeroes(bs->file, s->cluster_size / BDRV_SECTOR_SIZE,
> >>>>+                                (2 + l1_clusters) * s->cluster_size /
> >>>>+                                BDRV_SECTOR_SIZE, 0);
> >>>If we crash at this point, we're _not_ okay any more. --verbose follows:
> >>>
> >>>On disk, we may have overwritten a refcount table or a refcount block
> >>>with zeros. This is fine, we have the dirty flag set, so destroying any
> >>>refcounting structure does no harm.
> >>>
> >>>We may also have overwritten an L1 or L2 table. As the commit correctly
> >>>explains, this is doing partially what the function is supposed to do
> >>>for the whole image. Affected data clusters are now read from the
> >>>backing file. Good.
> >>>
> >>>However, we may also have overwritten data clusters that are still
> >>>accessible using an L1/L2 table that hasn't been hit by this write
> >>>operation. We're reading corrupted (zeroed out) data now instead of
> >>>going to the backing file. Bug!
> >>Oh, right, I forgot about the L1 table not always being at the start
> >>of the file.
> >>
> >>>In my original suggestion I had an item where the L1 table was zeroed
> >>>out first before the start of the image is zeroed. This would have
> >>>avoided the bug.
> >>>
> >>>>+        if (ret < 0) {
> >>>>+            g_free(new_reftable);
> >>>>+            return ret;
> >>>>+        }
> >>>If we fail here (without crashing), the first clusters could be in their
> >>>original state or partially zeroed. Assuming that you fixed the above
> >>>bug, the on-disk state would be okay if we opened the image now because
> >>>the dirty flag would trigger an image repair; but we don't get the
> >>>repair when taking this failure path and we may have zeroed a refcount
> >>>table/block. This is probably a problem and we may have to make the BDS
> >>>unusable.
> >>>
> >>>The in-memory state of the L1 table is hopefully zeroed out, so it's
> >>>consistent with what is on disk.
> >>>
> >>>The in-memory state of the refcount table looks like it's not in sync
> >>>with the on-disk state. Note that while the dirty flag allows that the
> >>>on-disk state can be anything, the in-memory state is what we keep using
> >>>after a failure. The in-memory state isn't accurate at this point, but
> >>>we only create leaks. Lots of them, because we zeroed the L1 table, but
> >>>that's good enough. If refcounts are updated later, the old offsets
> >>>should still be valid.
> >>If we set at least parts of the in-memory reftable to zero,
> >>everything probably breaks. Try to allocate a new cluster while the
> >>beginning of the reftable is zero. So we cannot take the on-disk
> >>reftable into memory.
> >>
> >>Doing it the other way around, writing the in-memory reftable to
> >>disk on error won't work either. The refblocks may have been zeroed
> >>out, so we have exactly the same problem.
> >>
> >>Therefore, to make the BDS usable after error, we have to (in the
> >>error path) read the on-disk reftable into memory, call the qcow2
> >>repair function and hope for the best.
> >>
> >>Or, you know, we could go back to v11 which had my other version of
> >>this patch which always kept everything consistent. :-P
> >I'm not sure how important it is to keep the BDS usable after such an
> >unlikely error.
> >
> >I started to write up a suggestion that we could do without
> >qcow2_alloc_clusters(), but just build up the first refcount block
> >ourselves. Doesn't really help us, because the new state is not what we
> >need here, but it made me aware that it assumes that the L1 table is
> >small enough to fit in the first refcount block and we would have to
> >fall back to discard if it isn't. Come to think of it, I suspect you
> >already make the same assumption without checking it.
> >
> >
> >Anyway, if we want to keep the BDS usable what is the right state?
> >We need references for the header, for the L1 table, for the refcount
> >table and all refcount blocks. If we decide to build a new in-memory
> >refcount table, the refcount blocks are only the refcount blocks that
> >are still referenced there, i.e. we don't have to worry about the
> >location of refcount blocks on the disk.
> >
> >In fact, we can also safely change the refcount table offset to
> >cluster 1 and handle it together with the header. We'll then need one
> >refcount block for header/reftable and potentially another one for the
> >L1 table, which still must be in its original place (adding the
> >assumption that the L1 table doesn't cross a refblock boundary :-/).
> >
> >Our refblock cache can hold 4 tables at least, so creating two refblocks
> >purely in memory is doable.
> >
> >Leaves the question, is it worth the hassle?
> 
> Probably not. Would you be fine with setting bs->drv to NULL in the
> problematic error paths?

By calling qcow2_signal_corruption(), right? I think that's fine.

What about the assumption that 3 + l1_size fits in one refcount block?
Do we need to check it even now?

Kevin