From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([2001:4830:134:3::10]:40411) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1Xh18z-0007pR-HX for qemu-devel@nongnu.org; Wed, 22 Oct 2014 15:00:31 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1Xh18t-00065v-7n for qemu-devel@nongnu.org; Wed, 22 Oct 2014 15:00:25 -0400 Received: from mx1.redhat.com ([209.132.183.28]:17149) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1Xh18s-00065W-Qa for qemu-devel@nongnu.org; Wed, 22 Oct 2014 15:00:19 -0400 Received: from int-mx11.intmail.prod.int.phx2.redhat.com (int-mx11.intmail.prod.int.phx2.redhat.com [10.5.11.24]) by mx1.redhat.com (8.14.4/8.14.4) with ESMTP id s9MJ0IaU016675 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=FAIL) for ; Wed, 22 Oct 2014 15:00:18 -0400 Date: Wed, 22 Oct 2014 20:35:16 +0200 From: Kevin Wolf Message-ID: <20141022183516.GA6975@noname.redhat.com> References: <1413982283-10186-1-git-send-email-mreitz@redhat.com> <1413982283-10186-4-git-send-email-mreitz@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1413982283-10186-4-git-send-email-mreitz@redhat.com> Subject: Re: [Qemu-devel] [PATCH v13 03/14] qcow2: Optimize bdrv_make_empty() List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Max Reitz Cc: qemu-devel@nongnu.org, Stefan Hajnoczi Am 22.10.2014 um 14:51 hat Max Reitz geschrieben: > bdrv_make_empty() is currently only called if the current image > represents an external snapshot that has been committed to its base > image; it is therefore unlikely to have internal snapshots. In this > case, bdrv_make_empty() can be greatly sped up by emptying the L1 and > refcount table (while having the dirty flag set, which only works for > compat=1.1) and creating a trivial refcount structure. > > If there are snapshots or for compat=0.10, fall back to the simple > implementation (discard all clusters). > > Signed-off-by: Max Reitz Hey, this feels actually reviewable this time. :-) > diff --git a/block/blkdebug.c b/block/blkdebug.c > index e046b92..862d93b 100644 > --- a/block/blkdebug.c > +++ b/block/blkdebug.c > @@ -195,6 +195,8 @@ static const char *event_names[BLKDBG_EVENT_MAX] = { > [BLKDBG_PWRITEV] = "pwritev", > [BLKDBG_PWRITEV_ZERO] = "pwritev_zero", > [BLKDBG_PWRITEV_DONE] = "pwritev_done", > + > + [BLKDBG_EMPTY_IMAGE_PREPARE] = "empty_image_prepare", > }; > > static int get_event_by_name(const char *name, BlkDebugEvent *event) > diff --git a/block/qcow2.c b/block/qcow2.c > index 1ef3a5f..16dece2 100644 > --- a/block/qcow2.c > +++ b/block/qcow2.c > @@ -2232,24 +2232,137 @@ fail: > > static int qcow2_make_empty(BlockDriverState *bs) > { > + BDRVQcowState *s = bs->opaque; > int ret = 0; > - uint64_t start_sector; > - int sector_step = INT_MAX / BDRV_SECTOR_SIZE; > > - for (start_sector = 0; start_sector < bs->total_sectors; > - start_sector += sector_step) > - { > - /* As this function is generally used after committing an external > - * snapshot, QCOW2_DISCARD_SNAPSHOT seems appropriate. Also, the > - * default action for this kind of discard is to pass the discard, > - * which will ideally result in an actually smaller image file, as > - * is probably desired. */ > - ret = qcow2_discard_clusters(bs, start_sector * BDRV_SECTOR_SIZE, > - MIN(sector_step, > - bs->total_sectors - start_sector), > - QCOW2_DISCARD_SNAPSHOT, true); > + if (s->snapshots || s->qcow_version < 3) { > + uint64_t start_sector; > + int sector_step = INT_MAX / BDRV_SECTOR_SIZE; > + > + /* If there are snapshots, every active cluster has to be discarded; and > + * because compat=0.10 does not support setting the dirty flag, we have > + * to use this fallback there as well */ > + > + for (start_sector = 0; start_sector < bs->total_sectors; > + start_sector += sector_step) > + { > + /* As this function is generally used after committing an external > + * snapshot, QCOW2_DISCARD_SNAPSHOT seems appropriate. Also, the > + * default action for this kind of discard is to pass the discard, > + * which will ideally result in an actually smaller image file, as > + * is probably desired. */ > + ret = qcow2_discard_clusters(bs, start_sector * BDRV_SECTOR_SIZE, > + MIN(sector_step, > + bs->total_sectors - start_sector), > + QCOW2_DISCARD_SNAPSHOT, true); > + if (ret < 0) { > + break; > + } > + } My first though was to add a return here, so the indentation level for the rest is one less. Probably a matter of taste, though. > + } else { > + int l1_clusters; > + int64_t offset; > + uint64_t *new_reftable; > + uint64_t rt_entry; > + struct { > + uint64_t l1_offset; > + uint64_t reftable_offset; > + uint32_t reftable_clusters; > + } QEMU_PACKED l1_ofs_rt_ofs_cls; > + > + ret = qcow2_cache_empty(bs, s->l2_table_cache); > if (ret < 0) { > - break; > + return ret; > + } > + > + ret = qcow2_cache_empty(bs, s->refcount_block_cache); > + if (ret < 0) { > + return ret; > + } > + > + /* Refcounts will be broken utterly */ > + ret = qcow2_mark_dirty(bs); > + if (ret < 0) { > + return ret; > + } > + > + l1_clusters = DIV_ROUND_UP(s->l1_size, > + s->cluster_size / sizeof(uint64_t)); > + new_reftable = g_try_new0(uint64_t, s->cluster_size / sizeof(uint64_t)); > + if (!new_reftable) { > + return -ENOMEM; > + } > + > + BLKDBG_EVENT(bs->file, BLKDBG_EMPTY_IMAGE_PREPARE); Until here, the failure cases are trivially okay. The worst thing that could happen is that the image is needlessly marked as dirty. > + /* Overwrite enough clusters at the beginning of the sectors to place > + * the refcount table, a refcount block and the L1 table in; this may > + * overwrite parts of the existing refcount and L1 table, which is not > + * an issue because the dirty flag is set, complete data loss is in fact > + * desired and partial data loss is consequently fine as well */ > + ret = bdrv_write_zeroes(bs->file, s->cluster_size / BDRV_SECTOR_SIZE, > + (2 + l1_clusters) * s->cluster_size / > + BDRV_SECTOR_SIZE, 0); If we crash at this point, we're _not_ okay any more. --verbose follows: On disk, we may have overwritten a refcount table or a refcount block with zeros. This is fine, we have the dirty flag set, so destroying any refcounting structure does no harm. We may also have overwritten an L1 or L2 table. As the commit correctly explains, this is doing partially what the function is supposed to do for the whole image. Affected data clusters are now read from the backing file. Good. However, we may also have overwritten data clusters that are still accessible using an L1/L2 table that hasn't been hit by this write operation. We're reading corrupted (zeroed out) data now instead of going to the backing file. Bug! In my original suggestion I had an item where the L1 table was zeroed out first before the start of the image is zeroed. This would have avoided the bug. > + if (ret < 0) { > + g_free(new_reftable); > + return ret; > + } If we fail here (without crashing), the first clusters could be in their original state or partially zeroed. Assuming that you fixed the above bug, the on-disk state would be okay if we opened the image now because the dirty flag would trigger an image repair; but we don't get the repair when taking this failure path and we may have zeroed a refcount table/block. This is probably a problem and we may have to make the BDS unusable. The in-memory state of the L1 table is hopefully zeroed out, so it's consistent with what is on disk. The in-memory state of the refcount table looks like it's not in sync with the on-disk state. Note that while the dirty flag allows that the on-disk state can be anything, the in-memory state is what we keep using after a failure. The in-memory state isn't accurate at this point, but we only create leaks. Lots of them, because we zeroed the L1 table, but that's good enough. If refcounts are updated later, the old offsets should still be valid. All of the following code shouldn't change what a guest is seeing, but only fix up the refcounts. > + BLKDBG_EVENT(bs->file, BLKDBG_L1_UPDATE); > + BLKDBG_EVENT(bs->file, BLKDBG_REFTABLE_UPDATE); > + > + /* "Create" an empty reftable (one cluster) directly after the image > + * header and an empty L1 table three clusters after the image header; > + * the cluster between those two will be used as the first refblock */ > + cpu_to_be64w(&l1_ofs_rt_ofs_cls.l1_offset, 3 * s->cluster_size); > + cpu_to_be64w(&l1_ofs_rt_ofs_cls.reftable_offset, s->cluster_size); > + cpu_to_be32w(&l1_ofs_rt_ofs_cls.reftable_clusters, 1); > + ret = bdrv_pwrite_sync(bs->file, offsetof(QCowHeader, l1_table_offset), > + &l1_ofs_rt_ofs_cls, sizeof(l1_ofs_rt_ofs_cls)); If we crash here, we have a new, but still an all-zero L1 table, which ensures that the guest sees the right thing. We also get a new refcount table, which is invalid anyway. It is completely zeroed out, image repair should be able to cope with it. > + if (ret < 0) { > + g_free(new_reftable); > + return ret; > + } Failure without a crash. It should be safe to assume that either the whole header has been updated (and the flush failed) or it's in its old state. Probably the same problem as above with potentially overwritten refcount blocks while we don't get a repair on this failure path. Without accurate in-memory refcounts, we can't keep using the BDS. > + s->l1_table_offset = 3 * s->cluster_size; > + memset(s->l1_table, 0, s->l1_size * sizeof(uint64_t)); This line has been moved to above by the assumed L1 fix. > + s->refcount_table_offset = s->cluster_size; > + s->refcount_table_size = s->cluster_size / sizeof(uint64_t); > + > + g_free(s->refcount_table); > + s->refcount_table = new_reftable; > + > + BLKDBG_EVENT(bs->file, BLKDBG_REFBLOCK_ALLOC); > + > + /* Enter the first refblock into the reftable */ > + rt_entry = cpu_to_be64(2 * s->cluster_size); > + ret = bdrv_pwrite_sync(bs->file, s->cluster_size, > + &rt_entry, sizeof(rt_entry)); On-disk status: L1 table empty, refcount table has a refblock, but still no references. Dirty flag ensures we're safe. > + if (ret < 0) { > + return ret; > + } Failure path without crash: s->refcount_table_offset/size have been updated, but we still don't have accurate in-memory information. Still can't keep using the BDS after failing here. > + s->refcount_table[0] = 2 * s->cluster_size; > + > + ret = bdrv_truncate(bs->file, (3 + l1_clusters) * s->cluster_size); > + if (ret < 0) { > + return ret; > + } Nothing interesting happens here (except that this is the heart of the functionality :-)). The on-disk state didn't reference any of the truncated clusters any more. The in-memory state isn't fixed yet. > + s->free_cluster_index = 0; > + offset = qcow2_alloc_clusters(bs, 3 * s->cluster_size + > + s->l1_size * sizeof(uint64_t)); > + if (offset < 0) { > + return offset; BDS still unusable here. > + } else if (offset > 0) { > + error_report("First cluster in emptied image is in use"); > + abort(); > + } Finally, at this point we have correct in-memory information again. Not sure if it's worth the effort of trying to move this further up. We could at least easily move the bdrv_truncate() to below this, which would make it's error path less bad. > + ret = qcow2_mark_clean(bs); > + if (ret < 0) { > + return ret; > } And this writes them to disk. If we fail, we can keep going on, as the in-memory information is correct and after a qemu restart, the dirty flag will trigger a repair. > } > > diff --git a/include/block/block.h b/include/block/block.h > index 341054d..b1f4385 100644 > --- a/include/block/block.h > +++ b/include/block/block.h > @@ -498,6 +498,8 @@ typedef enum { > BLKDBG_PWRITEV_ZERO, > BLKDBG_PWRITEV_DONE, > > + BLKDBG_EMPTY_IMAGE_PREPARE, > + > BLKDBG_EVENT_MAX, > } BlkDebugEvent; Kevin