Message-ID: <53BDCEE0.20400@redhat.com>
Date: Thu, 10 Jul 2014 01:23:12 +0200
From: Max Reitz
To: Kevin Wolf
Cc: qemu-devel@nongnu.org, Stefan Hajnoczi
Subject: Re: [Qemu-devel] [PATCH v8 03/14] qcow2: Optimize bdrv_make_empty()
References: <1402167080-20316-1-git-send-email-mreitz@redhat.com> <1402167080-20316-4-git-send-email-mreitz@redhat.com> <20140630113339.GE4334@noname.str.redhat.com>
In-Reply-To: <20140630113339.GE4334@noname.str.redhat.com>

On 30.06.2014 13:33, Kevin Wolf wrote:
> On 07.06.2014 at 20:51, Max Reitz wrote:
>> bdrv_make_empty() is currently only called if the current image
>> represents an external snapshot that has been committed to its base
>> image; it is therefore unlikely to have internal snapshots.
>> In this case, bdrv_make_empty() can be greatly sped up by creating an
>> empty L1 table and dropping all data clusters at once by recreating the
>> refcount structure accordingly instead of normally discarding all
>> clusters.
>>
>> If there are snapshots, fall back to the simple implementation (discard
>> all clusters).
>>
>> Signed-off-by: Max Reitz
>> Reviewed-by: Eric Blake
> This approach looks a bit too complicated to me, and calculating the
> required metadata size seems error-prone.
>
> How about this:
>
> 1. Set the dirty flag in the header so we can mess with the L1 table
>    without keeping the refcounts consistent
>
> 2. Overwrite the L1 table with zeros
>
> 3. Overwrite the first n clusters after the header with zeros
>    (n = 2 + l1_clusters).
>
> 4. Update the header:
>    refcount_table_offset = cluster_size
>    refcount_table_clusters = 1
>    l1_table_offset = 3 * cluster_size
>
> 6. bdrv_truncate to n + 1 clusters
>
> 7. Now update the first 8 bytes at cluster_size (the first new refcount
>    table entry) to point to 2 * cluster_size (new refcount block)
>
> 8. Reset refcount block and L2 cache
>
> 9. Allocate n + 1 clusters (the header, too) and make sure you get
>    offset 0
>
> 10. Remove the dirty flag

Okay, after some fixing around and getting it to work, I noticed what
seems to me a rather big problem: If something bad happens between 3 and
7 (especially between 4 and 7), the image cannot be repaired. The reason
is that the refcount table is empty and a new refcount block cannot be
allocated, because the consistency checks correctly signal an overlap
with the refcount table (I guess; I would have expected the image header
instead, but well...). This happens because nothing is allocated, so the
first cluster offset returned by an allocation will probably be zero
(the image header) or $cluster_size (where the reftable resides).
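
To make the layout from the quoted proposal concrete, here is a small
Python sketch of where each structure would end up. It is only a model of
the arithmetic in steps 3, 4, 6 and 7, not QEMU code; the class name
EmptyImageLayout and the field refcount_block_offset are made up for this
illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmptyImageLayout:
    """Hypothetical model of the image layout after the proposed steps."""

    cluster_size: int  # bytes per cluster
    l1_clusters: int   # clusters occupied by the L1 table

    @property
    def n(self) -> int:
        # Clusters after the header: reftable + refblock + L1 table (step 3).
        return 2 + self.l1_clusters

    @property
    def refcount_table_offset(self) -> int:
        # Step 4: the new reftable sits right behind the header.
        return self.cluster_size

    @property
    def refcount_block_offset(self) -> int:
        # Step 7: the first reftable entry points here.
        return 2 * self.cluster_size

    @property
    def l1_table_offset(self) -> int:
        # Step 4: the L1 table follows the refcount block.
        return 3 * self.cluster_size

    @property
    def file_size(self) -> int:
        # Step 6: header cluster plus the n metadata clusters.
        return (self.n + 1) * self.cluster_size
```

For example, EmptyImageLayout(cluster_size=65536, l1_clusters=1) gives
n = 3 and a file of four clusters: header, reftable, refblock, L1 table.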

So I think we absolutely have to make sure that whenever the
refcount_table_offset is changed on disk, the reftable it points to
already contains a valid offset. We could pull 7 before 4, but then we'd
have to guarantee that 3 did not already overwrite the reftable (which it
probably does). Well, maybe we could change 3 so it checks whether the
reftable is already part of that area, and if it is, overwrite its first
entry not with zero, but with 2 * cluster_size; unless the reftable
itself is located at 2 * cluster_size, in which case we'd have to take
some other offset. Then we could either try to write a new reftable
anyway or just place everything behind that old reftable, just ignoring
the "lost" space.

In any case, I doubt it'll be much shorter overall with these additional
checks. The current code has 340 LOC with extremely verbose commentary;
my new code (failing to address the problem described above) has 100 LOC
without any comments.

So I guess the main issue is how *complicated* the code actually is; in
my opinion, the most complicated and hardest-to-review piece of code in
this patch (patch v8 3/14) is minimal_blob_size(), which, as far as I can
tell, we will need in one form or another eventually anyway.
create_refcount_l1() is pretty long, but thanks to the commentary it
should be well comprehensible.

In any case, I still have the code for your proposal here and I'd be
absolutely fine with working further on it. So if you think it'll be
worth it anyway (which I personally don't have any opinion on), I'll
continue on it.

Max
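
As an illustration of the ordering problem discussed above, here is a toy
Python model of steps 3, 4 and 7. It is a sketch under loud assumptions,
not QEMU code: every step is treated as atomic, the header is a plain
dict, step 6 (truncate) is left out, "repairable" is reduced to "the
first reftable entry is non-zero", and all function names are made up.
The only format detail relied on is that qcow2 stores these entries as
big-endian 64-bit values.

```python
import struct

CLUSTER_SIZE = 512   # toy value, chosen small for the example
L1_CLUSTERS = 1
N = 2 + L1_CLUSTERS  # clusters after the header: reftable, refblock, L1


def make_image(old_reftable_cluster):
    """Toy 16-cluster image whose old reftable has one non-zero entry."""
    image = bytearray(16 * CLUSTER_SIZE)
    old_offset = old_reftable_cluster * CLUSTER_SIZE
    # The old first entry points at some old refblock (cluster 15 here).
    struct.pack_into(">Q", image, old_offset, 15 * CLUSTER_SIZE)
    header = {"refcount_table_offset": old_offset}
    return image, header


def step_zero_area(image, header):
    # Step 3: zero the first n clusters after the header.
    image[CLUSTER_SIZE:(N + 1) * CLUSTER_SIZE] = bytes(N * CLUSTER_SIZE)


def step_update_header(image, header):
    # Step 4: point the header at the new reftable.
    header["refcount_table_offset"] = CLUSTER_SIZE


def step_write_entry(image, header):
    # Step 7: make the new reftable's first entry point at the refblock.
    struct.pack_into(">Q", image, CLUSTER_SIZE, 2 * CLUSTER_SIZE)


def reftable_usable(image, header):
    """Assumed invariant: the reftable the header points to is not empty."""
    (entry,) = struct.unpack_from(">Q", image, header["refcount_table_offset"])
    return entry != 0


def survives_every_crash(steps, old_reftable_cluster):
    """Check the invariant after each step, as if we crashed right there."""
    image, header = make_image(old_reftable_cluster)
    for step in steps:
        step(image, header)
        if not reftable_usable(image, header):
            return False
    return True


ORIGINAL = [step_zero_area, step_update_header, step_write_entry]
REORDERED = [step_zero_area, step_write_entry, step_update_header]
```

In this model, survives_every_crash(ORIGINAL, 8) is false (the header
points at an empty reftable between 4 and 7), and
survives_every_crash(REORDERED, 8) is true. With the old reftable inside
the zeroed area, e.g. survives_every_crash(REORDERED, 2), the reordering
alone is not enough, which is the case the extra check in step 3 would
have to cover.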