From: Max Reitz <mreitz@redhat.com>
Date: Thu, 23 Oct 2014 10:36:06 +0200
To: Kevin Wolf
Cc: qemu-devel@nongnu.org, Stefan Hajnoczi
Subject: Re: [Qemu-devel] [PATCH v13 03/14] qcow2: Optimize bdrv_make_empty()
Message-ID: <5448BDF6.9080406@redhat.com>
In-Reply-To: <20141023082957.GC3522@noname.redhat.com>
References: <1413982283-10186-1-git-send-email-mreitz@redhat.com>
 <1413982283-10186-4-git-send-email-mreitz@redhat.com>
 <20141022183516.GA6975@noname.redhat.com>
 <5448B263.4050408@redhat.com>
 <20141023082957.GC3522@noname.redhat.com>

On 2014-10-23 at 10:29, Kevin Wolf wrote:
> Am 23.10.2014 um 09:46 hat Max Reitz geschrieben:
>> On 2014-10-22 at 20:35, Kevin Wolf wrote:
>>> Am 22.10.2014 um 14:51 hat Max Reitz geschrieben:
>>>> bdrv_make_empty() is currently only called if the current image
>>>> represents an external snapshot that has been committed to its base
>>>> image; it is therefore unlikely to have internal snapshots. In this
>>>> case, bdrv_make_empty() can be greatly sped up by emptying the L1 and
>>>> refcount table (while having the dirty flag set, which only works for
>>>> compat=1.1) and creating a trivial refcount structure.
>>>>
>>>> If there are snapshots or for compat=0.10, fall back to the simple
>>>> implementation (discard all clusters).
>>>>
>>>> Signed-off-by: Max Reitz <mreitz@redhat.com>
>>> Hey, this feels actually reviewable this time. :-)
>> I'm still unsure which version I like more. If it wasn't for the
>> math, I'd prefer the other version.
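To make the commit message's "trivial refcount structure" concrete: as I
read the patch below, the fast path leaves the header in cluster 0, puts a
one-cluster refcount table in cluster 1, a single refcount block in
cluster 2, and the L1 table from cluster 3 on. A minimal standalone sketch
of that arithmetic (mine, not from the patch; cluster size and L1 size are
assumed example values):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint64_t cluster_size = 65536;  /* assumed: default qcow2 cluster size */
        uint64_t l1_size = 32;          /* assumed: number of L1 entries */

        /* clusters occupied by the L1 table, i.e. DIV_ROUND_UP() */
        uint64_t l1_entries_per_cluster = cluster_size / sizeof(uint64_t);
        uint64_t l1_clusters =
            (l1_size + l1_entries_per_cluster - 1) / l1_entries_per_cluster;

        /* cluster 0: header (untouched)
         * cluster 1: refcount table (one cluster)
         * cluster 2: refcount block (one cluster)
         * clusters 3 .. 2 + l1_clusters: L1 table
         * => the patch zeroes (2 + l1_clusters) clusters from cluster 1 on */
        uint64_t used_clusters = 3 + l1_clusters;

        /* with the default 16-bit refcounts, one refcount block holds
         * cluster_size / 2 entries; the single block in cluster 2 has to
         * cover every used cluster (cf. Kevin's remark further down) */
        uint64_t refblock_entries = cluster_size / sizeof(uint16_t);

        printf("zeroing %llu clusters starting at offset %llu\n",
               (unsigned long long)(2 + l1_clusters),
               (unsigned long long)cluster_size);
        printf("everything covered by one refblock: %s\n",
               used_clusters <= refblock_entries ? "yes" : "no");
        return 0;
    }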
>>
>>>> diff --git a/block/blkdebug.c b/block/blkdebug.c
>>>> index e046b92..862d93b 100644
>>>> --- a/block/blkdebug.c
>>>> +++ b/block/blkdebug.c
>>>> @@ -195,6 +195,8 @@ static const char *event_names[BLKDBG_EVENT_MAX] = {
>>>>      [BLKDBG_PWRITEV]                        = "pwritev",
>>>>      [BLKDBG_PWRITEV_ZERO]                   = "pwritev_zero",
>>>>      [BLKDBG_PWRITEV_DONE]                   = "pwritev_done",
>>>> +
>>>> +    [BLKDBG_EMPTY_IMAGE_PREPARE]            = "empty_image_prepare",
>>>>  };
>>>>
>>>>  static int get_event_by_name(const char *name, BlkDebugEvent *event)
>>>> diff --git a/block/qcow2.c b/block/qcow2.c
>>>> index 1ef3a5f..16dece2 100644
>>>> --- a/block/qcow2.c
>>>> +++ b/block/qcow2.c
>>>> @@ -2232,24 +2232,137 @@ fail:
>>>>  static int qcow2_make_empty(BlockDriverState *bs)
>>>>  {
>>>> +    BDRVQcowState *s = bs->opaque;
>>>>      int ret = 0;
>>>> -    uint64_t start_sector;
>>>> -    int sector_step = INT_MAX / BDRV_SECTOR_SIZE;
>>>>
>>>> -    for (start_sector = 0; start_sector < bs->total_sectors;
>>>> -         start_sector += sector_step)
>>>> -    {
>>>> -        /* As this function is generally used after committing an external
>>>> -         * snapshot, QCOW2_DISCARD_SNAPSHOT seems appropriate. Also, the
>>>> -         * default action for this kind of discard is to pass the discard,
>>>> -         * which will ideally result in an actually smaller image file, as
>>>> -         * is probably desired. */
>>>> -        ret = qcow2_discard_clusters(bs, start_sector * BDRV_SECTOR_SIZE,
>>>> -                                     MIN(sector_step,
>>>> -                                         bs->total_sectors - start_sector),
>>>> -                                     QCOW2_DISCARD_SNAPSHOT, true);
>>>> +    if (s->snapshots || s->qcow_version < 3) {
>>>> +        uint64_t start_sector;
>>>> +        int sector_step = INT_MAX / BDRV_SECTOR_SIZE;
>>>> +
>>>> +        /* If there are snapshots, every active cluster has to be discarded; and
>>>> +         * because compat=0.10 does not support setting the dirty flag, we have
>>>> +         * to use this fallback there as well */
>>>> +
>>>> +        for (start_sector = 0; start_sector < bs->total_sectors;
>>>> +             start_sector += sector_step)
>>>> +        {
>>>> +            /* As this function is generally used after committing an external
>>>> +             * snapshot, QCOW2_DISCARD_SNAPSHOT seems appropriate. Also, the
>>>> +             * default action for this kind of discard is to pass the discard,
>>>> +             * which will ideally result in an actually smaller image file, as
>>>> +             * is probably desired. */
>>>> +            ret = qcow2_discard_clusters(bs, start_sector * BDRV_SECTOR_SIZE,
>>>> +                                         MIN(sector_step,
>>>> +                                             bs->total_sectors - start_sector),
>>>> +                                         QCOW2_DISCARD_SNAPSHOT, true);
>>>> +            if (ret < 0) {
>>>> +                break;
>>>> +            }
>>>> +        }
>>> My first thought was to add a return here, so the indentation level for
>>> the rest is one less. Probably a matter of taste, though.
>> I'd rather put the second branch into its own function.
> Works for me.
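A sketch of what that split could look like (written against the QEMU
internals the patch already uses; the helper name is invented here, and the
fallback loop is the one from the patch):

    /* Hypothetical split of qcow2_make_empty(): the fast path moves into
     * its own helper so that the fallback loses one indentation level. */
    static int make_empty_by_rewriting_metadata(BlockDriverState *bs);

    static int qcow2_make_empty(BlockDriverState *bs)
    {
        BDRVQcowState *s = bs->opaque;
        uint64_t start_sector;
        int sector_step = INT_MAX / BDRV_SECTOR_SIZE;
        int ret = 0;

        if (!s->snapshots && s->qcow_version >= 3) {
            /* no internal snapshots and compat=1.1: take the fast path */
            return make_empty_by_rewriting_metadata(bs);
        }

        /* images with snapshots or compat=0.10: discard all clusters, as
         * in the patch's fallback branch */
        for (start_sector = 0; start_sector < bs->total_sectors;
             start_sector += sector_step)
        {
            ret = qcow2_discard_clusters(bs, start_sector * BDRV_SECTOR_SIZE,
                                         MIN(sector_step,
                                             bs->total_sectors - start_sector),
                                         QCOW2_DISCARD_SNAPSHOT, true);
            if (ret < 0) {
                break;
            }
        }

        return ret;
    }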
>
>>>> +    } else {
>>>> +        int l1_clusters;
>>>> +        int64_t offset;
>>>> +        uint64_t *new_reftable;
>>>> +        uint64_t rt_entry;
>>>> +        struct {
>>>> +            uint64_t l1_offset;
>>>> +            uint64_t reftable_offset;
>>>> +            uint32_t reftable_clusters;
>>>> +        } QEMU_PACKED l1_ofs_rt_ofs_cls;
>>>> +
>>>> +        ret = qcow2_cache_empty(bs, s->l2_table_cache);
>>>>          if (ret < 0) {
>>>> -            break;
>>>> +            return ret;
>>>> +        }
>>>> +
>>>> +        ret = qcow2_cache_empty(bs, s->refcount_block_cache);
>>>> +        if (ret < 0) {
>>>> +            return ret;
>>>> +        }
>>>> +
>>>> +        /* Refcounts will be broken utterly */
>>>> +        ret = qcow2_mark_dirty(bs);
>>>> +        if (ret < 0) {
>>>> +            return ret;
>>>> +        }
>>>> +
>>>> +        l1_clusters = DIV_ROUND_UP(s->l1_size,
>>>> +                                   s->cluster_size / sizeof(uint64_t));
>>>> +        new_reftable = g_try_new0(uint64_t, s->cluster_size / sizeof(uint64_t));
>>>> +        if (!new_reftable) {
>>>> +            return -ENOMEM;
>>>> +        }
>>>> +
>>>> +        BLKDBG_EVENT(bs->file, BLKDBG_EMPTY_IMAGE_PREPARE);
>>> Up to here, the failure cases are trivially okay. The worst thing that
>>> could happen is that the image is needlessly marked as dirty.
>>>
>>>> +        /* Overwrite enough clusters at the beginning of the sectors to place
>>>> +         * the refcount table, a refcount block and the L1 table in; this may
>>>> +         * overwrite parts of the existing refcount and L1 table, which is not
>>>> +         * an issue because the dirty flag is set, complete data loss is in fact
>>>> +         * desired and partial data loss is consequently fine as well */
>>>> +        ret = bdrv_write_zeroes(bs->file, s->cluster_size / BDRV_SECTOR_SIZE,
>>>> +                                (2 + l1_clusters) * s->cluster_size /
>>>> +                                BDRV_SECTOR_SIZE, 0);
>>> If we crash at this point, we're _not_ okay any more. --verbose follows:
>>>
>>> On disk, we may have overwritten a refcount table or a refcount block
>>> with zeros. This is fine, we have the dirty flag set, so destroying any
>>> refcounting structure does no harm.
>>>
>>> We may also have overwritten an L1 or L2 table. As the commit correctly
>>> explains, this is partially doing what the function is supposed to do
>>> for the whole image. Affected data clusters are now read from the
>>> backing file. Good.
>>>
>>> However, we may also have overwritten data clusters that are still
>>> accessible using an L1/L2 table that hasn't been hit by this write
>>> operation. We're reading corrupted (zeroed out) data now instead of
>>> going to the backing file. Bug!
>> Oh, right, I forgot about the L1 table not always being at the start
>> of the file.
>>
>>> In my original suggestion I had an item where the L1 table was zeroed
>>> out first, before the start of the image is zeroed. This would have
>>> avoided the bug.
>>>
>>>> +        if (ret < 0) {
>>>> +            g_free(new_reftable);
>>>> +            return ret;
>>>> +        }
>>> If we fail here (without crashing), the first clusters could be in their
>>> original state or partially zeroed. Assuming that you fixed the above
>>> bug, the on-disk state would be okay if we opened the image now, because
>>> the dirty flag would trigger an image repair; but we don't get the
>>> repair when taking this failure path, and we may have zeroed a refcount
>>> table/block. This is probably a problem, and we may have to make the BDS
>>> unusable.
>>>
>>> The in-memory state of the L1 table is hopefully zeroed out, so it's
>>> consistent with what is on disk.
>>>
>>> The in-memory state of the refcount table looks like it's not in sync
>>> with the on-disk state.
>>> Note that while the dirty flag allows the
>>> on-disk state to be anything, the in-memory state is what we keep using
>>> after a failure. The in-memory state isn't accurate at this point, but
>>> we only create leaks. Lots of them, because we zeroed the L1 table, but
>>> that's good enough. If refcounts are updated later, the old offsets
>>> should still be valid.
>> If we set at least parts of the in-memory reftable to zero,
>> everything probably breaks. Try to allocate a new cluster while the
>> beginning of the reftable is zero. So we cannot take the on-disk
>> reftable into memory.
>>
>> Doing it the other way around, writing the in-memory reftable to
>> disk on error, won't work either. The refblocks may have been zeroed
>> out, so we have exactly the same problem.
>>
>> Therefore, to make the BDS usable after an error, we have to (in the
>> error path) read the on-disk reftable into memory, call the qcow2
>> repair function and hope for the best.
>>
>> Or, you know, we could go back to v11, which had my other version of
>> this patch which always kept everything consistent. :-P
> I'm not sure how important it is to keep the BDS usable after such an
> unlikely error.
>
> I started to write up a suggestion that we could do without
> qcow2_alloc_clusters(), but just build up the first refcount block
> ourselves. It doesn't really help us, because the new state is not what
> we need here, but it made me aware that it assumes that the L1 table is
> small enough to fit in the first refcount block, and we would have to
> fall back to discard if it isn't. Come to think of it, I suspect you
> already make the same assumption without checking it.
>
>
> Anyway, if we want to keep the BDS usable, what is the right state?
> We need references for the header, for the L1 table, for the refcount
> table and all refcount blocks. If we decide to build a new in-memory
> refcount table, the only refcount blocks we need are the ones still
> referenced there, i.e. we don't have to worry about the location of
> refcount blocks on the disk.
>
> In fact, we can also safely change the refcount table offset to
> cluster 1 and handle it together with the header. We'll then need one
> refcount block for the header/reftable and potentially another one for
> the L1 table, which still must be in its original place (adding the
> assumption that the L1 table doesn't cross a refblock boundary :-/).
>
> Our refblock cache can hold at least 4 tables, so creating two refblocks
> purely in memory is doable.
>
> That leaves the question: is it worth the hassle? Probably not.

Would you be fine with setting bs->drv to NULL in the problematic error paths?

Max
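For illustration, the error-path change Max is asking about might look like
this in the zeroing step quoted above (a sketch under the assumption that a
NULL bs->drv is enough to fence off the BDS; not part of the patch):

    ret = bdrv_write_zeroes(bs->file, s->cluster_size / BDRV_SECTOR_SIZE,
                            (2 + l1_clusters) * s->cluster_size /
                            BDRV_SECTOR_SIZE, 0);
    if (ret < 0) {
        g_free(new_reftable);
        /* The on-disk refcount structures may already be partially zeroed
         * while the in-memory state still refers to them; instead of
         * attempting a repair, make every further access to this BDS
         * fail (assumption from the discussion above). */
        bs->drv = NULL;
        return ret;
    }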