From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:43809)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <mreitz@redhat.com>) id 1X51Cp-0006v8-3n
	for qemu-devel@nongnu.org; Wed, 09 Jul 2014 19:23:26 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <mreitz@redhat.com>) id 1X51Cf-000539-Ud
	for qemu-devel@nongnu.org; Wed, 09 Jul 2014 19:23:19 -0400
Received: from mx1.redhat.com ([209.132.183.28]:23632)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <mreitz@redhat.com>) id 1X51Cf-000535-Ly
	for qemu-devel@nongnu.org; Wed, 09 Jul 2014 19:23:09 -0400
Received: from int-mx13.intmail.prod.int.phx2.redhat.com
	(int-mx13.intmail.prod.int.phx2.redhat.com [10.5.11.26])
	by mx1.redhat.com (8.14.4/8.14.4) with ESMTP id s69NN9eQ005692
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256
	verify=OK)
	for <qemu-devel@nongnu.org>; Wed, 9 Jul 2014 19:23:09 -0400
Message-ID: <53BDCEE0.20400@redhat.com>
Date: Thu, 10 Jul 2014 01:23:12 +0200
From: Max Reitz <mreitz@redhat.com>
MIME-Version: 1.0
References: <1402167080-20316-1-git-send-email-mreitz@redhat.com>
	<1402167080-20316-4-git-send-email-mreitz@redhat.com>
	<20140630113339.GE4334@noname.str.redhat.com>
In-Reply-To: <20140630113339.GE4334@noname.str.redhat.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Subject: Re: [Qemu-devel] [PATCH v8 03/14] qcow2: Optimize bdrv_make_empty()
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: Kevin Wolf <kwolf@redhat.com>
Cc: qemu-devel@nongnu.org, Stefan Hajnoczi <stefanha@redhat.com>

On 30.06.2014 13:33, Kevin Wolf wrote:
> Am 07.06.2014 um 20:51 hat Max Reitz geschrieben:
>> bdrv_make_empty() is currently only called if the current image
>> represents an external snapshot that has been committed to its base
>> image; it is therefore unlikely to have internal snapshots. In this
>> case, bdrv_make_empty() can be greatly sped up by creating an empty L1
>> table and dropping all data clusters at once by recreating the refcount
>> structure accordingly instead of normally discarding all clusters.
>>
>> If there are snapshots, fall back to the simple implementation (discard
>> all clusters).
>>
>> Signed-off-by: Max Reitz <mreitz@redhat.com>
>> Reviewed-by: Eric Blake <eblake@redhat.com>
> This approach looks a bit too complicated to me, and calculating the
> required metadata size seems error-prone.
>
> How about this:
>
> 1. Set the dirty flag in the header so we can mess with the L1 table
>     without keeping the refcounts consistent
>
> 2. Overwrite the L1 table with zeros
>
> 3. Overwrite the first n clusters after the header with zeros
>     (n = 2 + l1_clusters).
>
> 4. Update the header:
>     refcount_table_offset = cluster_size
>     refcount_table_clusters = 1
>     l1_table_offset = 3 * cluster_size
>
> 6. bdrv_truncate to n + 1 clusters
>
> 7. Now update the first 8 bytes at cluster_size (the first new refcount
>     table entry) to point to 2 * cluster_size (new refcount block)
>
> 8. Reset refcount block and L2 cache
>
> 9. Allocate n + 1 clusters (the header, too) and make sure you get
>     offset 0
>
> 10. Remove the dirty flag
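
For reference, the target layout implied by steps 3 to 6 can be written
down as plain arithmetic. This is only a minimal sketch: the struct and
function names are made up for illustration and are not part of QEMU, and
the refcount block offset is taken from step 7.

```c
#include <stdint.h>

/* Hypothetical helper type; the field names mirror the header fields
 * mentioned in the proposal, but this is not a QEMU structure. */
typedef struct {
    uint64_t n;                     /* clusters zeroed after the header */
    uint64_t refcount_table_offset; /* new reftable: second cluster */
    uint64_t refcount_block_offset; /* new refblock: third cluster */
    uint64_t l1_table_offset;       /* new L1 table: fourth cluster */
    uint64_t image_size;            /* length passed to bdrv_truncate */
} EmptyLayout;

/* Compute the layout for a given cluster size and L1 table size
 * (in clusters), following the numbers given in the proposal. */
static EmptyLayout empty_layout(uint64_t cluster_size, uint64_t l1_clusters)
{
    EmptyLayout l;

    l.n = 2 + l1_clusters;                    /* reftable + refblock + L1 */
    l.refcount_table_offset = cluster_size;
    l.refcount_block_offset = 2 * cluster_size;
    l.l1_table_offset = 3 * cluster_size;
    l.image_size = (l.n + 1) * cluster_size;  /* header + n clusters */
    return l;
}
```

With these numbers the L1 table ends exactly at the truncated image size,
since 3 + l1_clusters equals n + 1.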

Okay, after some fixing around and getting it to work, I noticed what
seems to me a rather big problem: If something bad happens between 3
and 7 (especially between 4 and 7), the image cannot be repaired. The
reason is that the refcount table is empty and a new refcount block
cannot be allocated, because the consistency checks correctly signal an
overlap with the refcount table (I guess; I would have expected the
image header instead, but well...). This is because nothing is allocated,
so the first cluster offset returned by an allocation will probably be
zero (the image header) or $cluster_size (where the reftable resides).

So I think we absolutely have to make sure that whenever
refcount_table_offset is changed on disk, the reftable it points to
already contains a valid offset. We could pull 7 before 4, but then we'd
have to guarantee that 3 did not already overwrite the reftable (which
it probably does). Well, maybe we could change 3 so that it checks whether
the reftable is already part of that area and, if it is, overwrites its
first entry not with zero but with 2 * cluster_size (unless the reftable
itself is located at 2 * cluster_size, in which case we'd have to take
some other offset). Then we could either try to write a new reftable
anyway or just place everything behind that old reftable, simply ignoring
the "lost" space.
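
To make the ordering argument concrete, here is a toy model (not QEMU
code; all names are hypothetical). It tracks only the first entry of each
cluster and the header's reftable pointer, in units of clusters, and
counts the crash points at which the reftable the header points to has an
empty first entry. Whether a non-zero first entry is really sufficient for
a repair is an assumption of this sketch. The case where the old reftable
sits at 2 * cluster_size is deliberately left unhandled.

```c
#include <stdint.h>
#include <string.h>

#define TOY_CLUSTERS 16

/* Toy on-disk state: first 8 bytes of every cluster plus the header's
 * refcount_table_offset, both expressed as cluster indices. */
typedef struct {
    uint64_t first_word[TOY_CLUSTERS];
    unsigned rt_cluster;
} ToyImage;

/* Invariant under test: the reftable referenced by the header already
 * contains a non-zero first entry. */
static int toy_repairable(const ToyImage *img)
{
    return img->first_word[img->rt_cluster] != 0;
}

static int toy_init(ToyImage *img, unsigned old_rt, unsigned n)
{
    if (old_rt == 0 || old_rt >= TOY_CLUSTERS || n >= TOY_CLUSTERS) {
        return -1; /* outside the toy model */
    }
    memset(img, 0, sizeof(*img));
    img->rt_cluster = old_rt;
    img->first_word[old_rt] = old_rt + 1; /* stands in for an old refblock */
    return 0;
}

/* Steps 3, 4, 7 in the proposed order; returns the number of crash
 * points violating the invariant, or -1 on invalid input. */
static int toy_unsafe_points_original(unsigned old_rt, unsigned n)
{
    ToyImage img;
    int unsafe = 0;

    if (toy_init(&img, old_rt, n) < 0) {
        return -1;
    }
    for (unsigned i = 1; i <= n; i++) {     /* step 3: zero n clusters */
        img.first_word[i] = 0;
    }
    unsafe += !toy_repairable(&img);
    img.rt_cluster = 1;                     /* step 4: update the header */
    unsafe += !toy_repairable(&img);
    img.first_word[1] = 2;                  /* step 7: point to refblock */
    unsafe += !toy_repairable(&img);
    return unsafe;
}

/* Modified step 3, then step 7, then step 4. */
static int toy_unsafe_points_reordered(unsigned old_rt, unsigned n)
{
    ToyImage img;
    int unsafe = 0;

    if (old_rt == 2 || toy_init(&img, old_rt, n) < 0) {
        return -1; /* old reftable at cluster 2 needs another offset */
    }
    for (unsigned i = 1; i <= n; i++) {     /* step 3, sparing the reftable */
        img.first_word[i] = (i == old_rt) ? 2 : 0;
    }
    unsafe += !toy_repairable(&img);
    img.first_word[1] = 2;                  /* step 7 pulled before step 4 */
    unsafe += !toy_repairable(&img);
    img.rt_cluster = 1;                     /* step 4 */
    unsafe += !toy_repairable(&img);
    return unsafe;
}
```

In this model, the original order leaves the invariant violated at two
crash points when the old reftable lies inside the zeroed area, and at one
when it does not; the reordered variant leaves none.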

In any case, I doubt it'll be much shorter overall with these additional 
checks. The current code has 340 LOC with extremely verbose commentary; 
my new code (failing to address the problem described above) has 100 LOC 
without any comments.

So I guess the main issue is how *complicated* the code actually is. In
my opinion, the most complicated and hardest-to-review piece of code in
this patch (patch v8 3/14) is minimal_blob_size(), which, as far as I
can tell, we will eventually need in one form or another anyway.
create_refcount_l1() is pretty long, but thanks to the commentary it
should be easy enough to follow.

That said, I still have the code for your proposal here and I'd be
absolutely fine with working further on it. So if you think it'll be
worth it anyway (which I personally don't have any opinion on), I'll
continue with it.

Max