From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([2001:4830:134:3::10]:55918) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1XhDxT-00081Z-Nq for qemu-devel@nongnu.org; Thu, 23 Oct 2014 04:41:30 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1XhDxN-000056-HL for qemu-devel@nongnu.org; Thu, 23 Oct 2014 04:41:23 -0400 Received: from mx1.redhat.com ([209.132.183.28]:52386) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1XhDxN-00004y-7B for qemu-devel@nongnu.org; Thu, 23 Oct 2014 04:41:17 -0400 Received: from int-mx14.intmail.prod.int.phx2.redhat.com (int-mx14.intmail.prod.int.phx2.redhat.com [10.5.11.27]) by mx1.redhat.com (8.14.4/8.14.4) with ESMTP id s9N8fGXA001291 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=FAIL) for ; Thu, 23 Oct 2014 04:41:16 -0400 Date: Thu, 23 Oct 2014 10:41:13 +0200 From: Kevin Wolf Message-ID: <20141023084113.GD3522@noname.redhat.com> References: <1413982283-10186-1-git-send-email-mreitz@redhat.com> <1413982283-10186-4-git-send-email-mreitz@redhat.com> <20141022183516.GA6975@noname.redhat.com> <5448B263.4050408@redhat.com> <20141023082957.GC3522@noname.redhat.com> <5448BDF6.9080406@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <5448BDF6.9080406@redhat.com> Subject: Re: [Qemu-devel] [PATCH v13 03/14] qcow2: Optimize bdrv_make_empty() List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Max Reitz Cc: qemu-devel@nongnu.org, Stefan Hajnoczi Am 23.10.2014 um 10:36 hat Max Reitz geschrieben: > On 2014-10-23 at 10:29, Kevin Wolf wrote: > >Am 23.10.2014 um 09:46 hat Max Reitz geschrieben: > >>On 2014-10-22 at 20:35, Kevin Wolf wrote: > >>>Am 22.10.2014 um 14:51 hat Max Reitz geschrieben: > >>>>bdrv_make_empty() is currently only called if the current image > >>>>represents an external snapshot that has been committed to its base > >>>>image; it is therefore unlikely to have internal snapshots. In this > >>>>case, bdrv_make_empty() can be greatly sped up by emptying the L1 and > >>>>refcount table (while having the dirty flag set, which only works for > >>>>compat=1.1) and creating a trivial refcount structure. > >>>> > >>>>If there are snapshots or for compat=0.10, fall back to the simple > >>>>implementation (discard all clusters). > >>>> > >>>>Signed-off-by: Max Reitz > >>>Hey, this feels actually reviewable this time. :-) > >>I'm still unsure which version I like more. If it wasn't for the > >>math, I'd prefer the other version. > >> > >>>>diff --git a/block/blkdebug.c b/block/blkdebug.c > >>>>index e046b92..862d93b 100644 > >>>>--- a/block/blkdebug.c > >>>>+++ b/block/blkdebug.c > >>>>@@ -195,6 +195,8 @@ static const char *event_names[BLKDBG_EVENT_MAX] = { > >>>> [BLKDBG_PWRITEV] = "pwritev", > >>>> [BLKDBG_PWRITEV_ZERO] = "pwritev_zero", > >>>> [BLKDBG_PWRITEV_DONE] = "pwritev_done", > >>>>+ > >>>>+ [BLKDBG_EMPTY_IMAGE_PREPARE] = "empty_image_prepare", > >>>> }; > >>>> static int get_event_by_name(const char *name, BlkDebugEvent *event) > >>>>diff --git a/block/qcow2.c b/block/qcow2.c > >>>>index 1ef3a5f..16dece2 100644 > >>>>--- a/block/qcow2.c > >>>>+++ b/block/qcow2.c > >>>>@@ -2232,24 +2232,137 @@ fail: > >>>> static int qcow2_make_empty(BlockDriverState *bs) > >>>> { > >>>>+ BDRVQcowState *s = bs->opaque; > >>>> int ret = 0; > >>>>- uint64_t start_sector; > >>>>- int sector_step = INT_MAX / BDRV_SECTOR_SIZE; > >>>>- for (start_sector = 0; start_sector < bs->total_sectors; > >>>>- start_sector += sector_step) > >>>>- { > >>>>- /* As this function is generally used after committing an external > >>>>- * snapshot, QCOW2_DISCARD_SNAPSHOT seems appropriate. Also, the > >>>>- * default action for this kind of discard is to pass the discard, > >>>>- * which will ideally result in an actually smaller image file, as > >>>>- * is probably desired. */ > >>>>- ret = qcow2_discard_clusters(bs, start_sector * BDRV_SECTOR_SIZE, > >>>>- MIN(sector_step, > >>>>- bs->total_sectors - start_sector), > >>>>- QCOW2_DISCARD_SNAPSHOT, true); > >>>>+ if (s->snapshots || s->qcow_version < 3) { > >>>>+ uint64_t start_sector; > >>>>+ int sector_step = INT_MAX / BDRV_SECTOR_SIZE; > >>>>+ > >>>>+ /* If there are snapshots, every active cluster has to be discarded; and > >>>>+ * because compat=0.10 does not support setting the dirty flag, we have > >>>>+ * to use this fallback there as well */ > >>>>+ > >>>>+ for (start_sector = 0; start_sector < bs->total_sectors; > >>>>+ start_sector += sector_step) > >>>>+ { > >>>>+ /* As this function is generally used after committing an external > >>>>+ * snapshot, QCOW2_DISCARD_SNAPSHOT seems appropriate. Also, the > >>>>+ * default action for this kind of discard is to pass the discard, > >>>>+ * which will ideally result in an actually smaller image file, as > >>>>+ * is probably desired. */ > >>>>+ ret = qcow2_discard_clusters(bs, start_sector * BDRV_SECTOR_SIZE, > >>>>+ MIN(sector_step, > >>>>+ bs->total_sectors - start_sector), > >>>>+ QCOW2_DISCARD_SNAPSHOT, true); > >>>>+ if (ret < 0) { > >>>>+ break; > >>>>+ } > >>>>+ } > >>>My first though was to add a return here, so the indentation level for > >>>the rest is one less. Probably a matter of taste, though. > >>I'd rather put the second branch into an own function. > >Works for me. > > > >>>>+ } else { > >>>>+ int l1_clusters; > >>>>+ int64_t offset; > >>>>+ uint64_t *new_reftable; > >>>>+ uint64_t rt_entry; > >>>>+ struct { > >>>>+ uint64_t l1_offset; > >>>>+ uint64_t reftable_offset; > >>>>+ uint32_t reftable_clusters; > >>>>+ } QEMU_PACKED l1_ofs_rt_ofs_cls; > >>>>+ > >>>>+ ret = qcow2_cache_empty(bs, s->l2_table_cache); > >>>> if (ret < 0) { > >>>>- break; > >>>>+ return ret; > >>>>+ } > >>>>+ > >>>>+ ret = qcow2_cache_empty(bs, s->refcount_block_cache); > >>>>+ if (ret < 0) { > >>>>+ return ret; > >>>>+ } > >>>>+ > >>>>+ /* Refcounts will be broken utterly */ > >>>>+ ret = qcow2_mark_dirty(bs); > >>>>+ if (ret < 0) { > >>>>+ return ret; > >>>>+ } > >>>>+ > >>>>+ l1_clusters = DIV_ROUND_UP(s->l1_size, > >>>>+ s->cluster_size / sizeof(uint64_t)); > >>>>+ new_reftable = g_try_new0(uint64_t, s->cluster_size / sizeof(uint64_t)); > >>>>+ if (!new_reftable) { > >>>>+ return -ENOMEM; > >>>>+ } > >>>>+ > >>>>+ BLKDBG_EVENT(bs->file, BLKDBG_EMPTY_IMAGE_PREPARE); > >>>Until here, the failure cases are trivially okay. The worst thing that > >>>could happen is that the image is needlessly marked as dirty. > >>> > >>>>+ /* Overwrite enough clusters at the beginning of the sectors to place > >>>>+ * the refcount table, a refcount block and the L1 table in; this may > >>>>+ * overwrite parts of the existing refcount and L1 table, which is not > >>>>+ * an issue because the dirty flag is set, complete data loss is in fact > >>>>+ * desired and partial data loss is consequently fine as well */ > >>>>+ ret = bdrv_write_zeroes(bs->file, s->cluster_size / BDRV_SECTOR_SIZE, > >>>>+ (2 + l1_clusters) * s->cluster_size / > >>>>+ BDRV_SECTOR_SIZE, 0); > >>>If we crash at this point, we're _not_ okay any more. --verbose follows: > >>> > >>>On disk, we may have overwritten a refcount table or a refcount block > >>>with zeros. This is fine, we have the dirty flag set, so destroying any > >>>refcounting structure does no harm. > >>> > >>>We may also have overwritten an L1 or L2 table. As the commit correctly > >>>explains, this is doing partially what the function is supposed to do > >>>for the whole image. Affected data clusters are now read from the > >>>backing file. Good. > >>> > >>>However, we may also have overwritten data clusters that are still > >>>accessible using an L1/L2 table that hasn't been hit by this write > >>>operation. We're reading corrupted (zeroed out) data now instead of > >>>going to the backing file. Bug! > >>Oh, right, I forgot about the L1 table not always being at the start > >>of the file. > >> > >>>In my original suggestion I had an item where the L1 table was zeroed > >>>out first before the start of the image is zeroed. This would have > >>>avoided the bug. > >>> > >>>>+ if (ret < 0) { > >>>>+ g_free(new_reftable); > >>>>+ return ret; > >>>>+ } > >>>If we fail here (without crashing), the first clusters could be in their > >>>original state or partially zeroed. Assuming that you fixed the above > >>>bug, the on-disk state would be okay if we opened the image now because > >>>the dirty flag would trigger an image repair; but we don't get the > >>>repair when taking this failure path and we may have zeroed a refcount > >>>table/block. This is probably a problem and we may have to make the BDS > >>>unusable. > >>> > >>>The in-memory state of the L1 table is hopefully zeroed out, so it's > >>>consistent with what is on disk. > >>> > >>>The in-memory state of the refcount table looks like it's not in sync > >>>with the on-disk state. Note that while the dirty flag allows that the > >>>on-disk state can be anything, the in-memory state is what we keep using > >>>after a failure. The in-memory state isn't accurate at this point, but > >>>we only create leaks. Lots of them, because we zeroed the L1 table, but > >>>that's good enough. If refcounts are updated later, the old offsets > >>>should still be valid. > >>If we set at least parts of the in-memory reftable to zero, > >>everything probably breaks. Try to allocate a new cluster while the > >>beginning of the reftable is zero. So we cannot take the on-disk > >>reftable into memory. > >> > >>Doing it the other way around, writing the in-memory reftable to > >>disk on error won't work either. The refblocks may have been zeroed > >>out, so we have exactly the same problem. > >> > >>Therefore, to make the BDS usable after error, we have to (in the > >>error path) read the on-disk reftable into memory, call the qcow2 > >>repair function and hope for the best. > >> > >>Or, you know, we could go back to v11 which had my other version of > >>this patch which always kept everything consistent. :-P > >I'm not sure how important it is to keep the BDS usable after such an > >unlikely error. > > > >I started to write up a suggestion that we could do without > >qcow2_alloc_clusters(), but just build up the first refcount block > >ourselves. Doesn't really help us, because the new state is not what we > >need here, but it made me aware that it assumes that the L1 table is > >small enough to fit in the first refcount block and we would have to > >fall back to discard if it isn't. Come to think of it, I suspect you > >already make the same assumption without checking it. > > > > > >Anyway, if we want to keep the BDS usable what is the right state? > >We need references for the header, for the L1 table, for the refcount > >table and all refcount blocks. If we decide to build a new in-memory > >refcount table, the refcount blocks are only the refcount blocks that > >are still referenced there, i.e. we don't have to worry about the > >location of refcount blocks on the disk. > > > >In fact, we can also safely change the refcount table offset to > >cluster 1 and handle it together with the header. We'll then need one > >refcount block for header/reftable and potentially another one for the > >L1 table, which still must be in its original place (adding the > >assumption that the L1 table doesn't cross a refblock boundary :-/). > > > >Our refblock cache can hold 4 tables at least, so creating two refblocks > >purely in memory is doable. > > > >Leaves the question, is it worth the hassle? > > Probably not. Would you be fine with setting bs->drv to NULL in the > problematic error paths? By calling qcow2_signal_corruption(), right? I think that's fine. What about the assumption that 3 + l1_size fits in one refcount block? Do we need to check it even now? Kevin