From: Max Reitz
Date: Fri, 02 May 2014 15:45:32 +0200
Subject: Re: [Qemu-devel] [PATCH v6 03/14] qcow2: Optimize bdrv_make_empty()
To: Eric Blake, qemu-devel@nongnu.org
Cc: Kevin Wolf, Stefan Hajnoczi
Message-ID: <5363A17C.4000500@redhat.com>
In-Reply-To: <5360268A.8080100@redhat.com>

On 30.04.2014 00:24, Eric Blake wrote:
> On 04/29/2014 09:07 AM, Max Reitz wrote:
>> bdrv_make_empty() is currently only called if the current image
>> represents an external snapshot that has been committed to its base
>> image; it is therefore unlikely to have internal snapshots. In this
>> case, bdrv_make_empty() can be greatly sped up by creating an empty L1
>> table and dropping all data clusters at once by recreating the refcount
>> structure accordingly instead of normally discarding all clusters.
>>
>> If there are snapshots, fall back to the simple implementation (discard
>> all clusters).
>>
>> Signed-off-by: Max Reitz
>> ---
>>  block/qcow2.c | 385 +++++++++++++++++++++++++++++++++++++++++++++++++++++---
>>  1 file changed, 370 insertions(+), 15 deletions(-)
> Wow! This was a fun review, and took me quite some time. I also had to
> re-read docs/specs/qcow2.txt several times to make sure I understood all
> your numbers, but everything in here made sense.
>
> The pdf helped; I didn't spot any math errors there (I did have to
> figure out the mapping between the one-letter variables in the pdf to
> the multi-char variables in your code; such as 'x' matching 'overhead').
>
> Reviewed-by: Eric Blake

Thank you very much. :-)

> More comments below, describing some of my thoughts while reviewing.
>
> This patch undoes anything accomplished earlier by preallocation. I
> suppose that if a user wants to empty an image but still guarantee that
> it is preallocated large enough, they can do it in two steps; or they
> can do the 'commit -d' to skip emptying the image, and then just
> recreate a new qcow2 image with full preallocation and bypass the work
> of this patch. I agree that it doesn't make any sense to complicate
> this patch even more to worry about restoring a level of preallocation.
>
>> +/*
>> + * Creates a reftable pointing to refblocks following right afterwards and an
>> + * empty L1 table at the given @offset. @refblocks is the number of refblocks
>> + * to create.
>> + *
>> + * This combination of structures (reftable+refblocks+L1) is here called a
>> + * "blob".
> Indeed, the bare minimum qcow2 image is one consisting of a reftable and
> refblocks for all host clusters, and an L1 table large enough for all
> guest clusters; making the three tables contiguous and acting on a blob
> makes sense.
>
>> + */
>> +static int create_refcount_l1(BlockDriverState *bs, uint64_t offset,
>> +                              uint64_t refblocks)
>> +{
>> +    BDRVQcowState *s = bs->opaque;
>> +    uint64_t *reftable = NULL;
>> +    uint16_t *refblock = NULL;
>> +    uint64_t reftable_clusters;
>> +    uint64_t rbi;
>> +    uint64_t blob_start, blob_end;
>> +    uint64_t l2_tables, l1_clusters, l1_size2;
>> +    uint8_t l1_size_and_offset[12];
>> +    uint64_t rt_offset;
> Good; most math is in uint64_t, so I don't have to think too hard about
> overflows.
>
>> +    int ret, i;
>> +
>> +    reftable_clusters = DIV_ROUND_UP(refblocks, s->cluster_size / 8);
> Each reftable entry is 8 bytes, so this is correct for how many entries
> are packed in a cluster.
>
>> +    l2_tables = DIV_ROUND_UP(bs->total_sectors / s->cluster_sectors,
>> +                             s->cluster_size / 8);
> Each L2 entry is 8 bytes, so this is correct for how many L2 entries are
> packed in an L2 cluster. It took me a while to figure out that you have
> to know how many L2 clusters would exist in a fully populated image in
> order to size the L1 table, even though you are NOT allocating L2 clusters.
>
>> +    l1_clusters = DIV_ROUND_UP(l2_tables, s->cluster_size / 8);
> Each L1 entry is 8 bytes, so this is also correct.
>
>> +    l1_size2 = l1_clusters << s->cluster_bits;
>> +
>> +    blob_start = offset;
>> +    blob_end = offset + ((reftable_clusters + refblocks + l1_clusters) <<
>> +                         s->cluster_bits);
>> +
>> +    ret = qcow2_pre_write_overlap_check(bs, 0, blob_start,
>> +                                        blob_end - blob_start);
> The caller allocated all three parts of the blob but has not yet put
> them to use; so this makes sense that we should have full access as the
> first user of each cluster in the range.
>
>> +    if (ret < 0) {
>> +        return ret;
>> +    }
>> +
>> +    /* Create the reftable pointing with entries pointing consecutively to the
>> +     * following clusters */
>> +    reftable = g_malloc0_n(reftable_clusters, s->cluster_size);
> The zero initialization is overkill for all but the last cluster, if the
> guest image size is not an even multiple of cluster size. But it
> doesn't hurt to be safe. You COULD have optimized further by leaving 0
> entries for refcount blocks that would all contain 0 entries, but that
> would make the code more complex; furthermore, you call this blob
> initialization function twice in the normal case, but only the first
> call will generate any refcount blocks containing all zeros. In the
> second call, when the refcount block is now hoisted to the front of the
> file, all refcount blocks will be non-zero.
>
>> +
>> +    for (rbi = 0; rbi < refblocks; rbi++) {
>> +        reftable[rbi] = cpu_to_be64(offset + ((reftable_clusters + rbi) <<
>> +                                              s->cluster_bits));
>> +    }
> Straightforward mapping. All but the tail will now be mapped into
> consecutive refblocks.
>
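To put some concrete numbers on the sizing above (purely illustrative figures of my own, not anything from the patch): for a 1 GiB image with 64 KiB clusters, the same DIV_ROUND_UP arithmetic as in create_refcount_l1() gives a three-cluster blob, so the emptied image ends up being four clusters including the header:

#include <stdint.h>
#include <stdio.h>

#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

int main(void)
{
    /* Assumed example parameters: 1 GiB virtual disk, 64 KiB clusters */
    const uint64_t cluster_size    = 64 * 1024;
    const uint64_t cluster_sectors = cluster_size / 512;
    const uint64_t total_sectors   = (1ULL << 30) / 512;

    /* Same formulas as in create_refcount_l1() */
    uint64_t l2_tables   = DIV_ROUND_UP(total_sectors / cluster_sectors,
                                        cluster_size / 8);
    uint64_t l1_clusters = DIV_ROUND_UP(l2_tables, cluster_size / 8);

    /* Blob placed right after the header: the refblocks must cover the
     * header plus the blob itself; guessing a 3-cluster blob turns out to
     * be self-consistent */
    uint64_t refblocks         = DIV_ROUND_UP(1 + 3, cluster_size / 2);
    uint64_t reftable_clusters = DIV_ROUND_UP(refblocks, cluster_size / 8);

    printf("l2_tables=%llu l1_clusters=%llu refblocks=%llu reftable_clusters=%llu\n",
           (unsigned long long)l2_tables, (unsigned long long)l1_clusters,
           (unsigned long long)refblocks, (unsigned long long)reftable_clusters);
    /* Prints 2, 1, 1, 1: the emptied image is header + reftable + refblock +
     * L1 = 4 clusters = 256 KiB, regardless of how big it was before. */
    return 0;
}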
>> +
>> +    ret = bdrv_write(bs->file, offset >> BDRV_SECTOR_BITS, (uint8_t *)reftable,
>> +                     reftable_clusters * s->cluster_sectors);
>> +    if (ret < 0) {
>> +        goto out;
>> +    }
>> +
>> +    offset += reftable_clusters << s->cluster_bits;
>> +
>> +    /* Keep the reftable, as we will need it for the BDRVQcowState anyway */
>> +
>> +    /* Now, create all the refblocks */
>> +    refblock = g_malloc(s->cluster_size);
>> +
>> +    for (rbi = 0; rbi < refblocks; rbi++) {
>> +        for (i = 0; i < s->cluster_size / 2; i++) {
>> +            uint64_t cluster_offset = ((rbi << (s->cluster_bits - 1)) + i)
>> +                                      << s->cluster_bits;
>> +
>> +            /* Only 0 and 1 are possible as refcounts here */
>> +            refblock[i] = cpu_to_be16(!cluster_offset ||
>> +                                      (cluster_offset >= blob_start &&
>> +                                       cluster_offset < blob_end));
> Caller passed in refblocks, but you are correct that each refblock entry
> is 2 bytes long.
>
> The image header and all clusters within the blob will have a refcount
> of 1, all other clusters will have a refcount of 0. Of course, at this
> point, this is just pre-populating the table, it is not actually in use yet.
>
>> +        }
>> +
>> +        ret = bdrv_write(bs->file, offset >> BDRV_SECTOR_BITS,
>> +                         (uint8_t *)refblock, s->cluster_sectors);
>> +        if (ret < 0) {
>> +            goto out;
>> +        }
>> +
>> +        offset += s->cluster_size;
>> +    }
>> +
>> +    g_free(refblock);
>> +    refblock = NULL;
>> +
>> +    /* The L1 table is very simple */
>> +    ret = bdrv_write_zeroes(bs->file, offset >> BDRV_SECTOR_BITS,
>> +                            l1_clusters * s->cluster_sectors, 0);
>> +    if (ret < 0) {
>> +        goto out;
>> +    }
> Sets up the new L1 table to point to all L2 tables being unallocated.
> Any reason you aren't passing BDRV_REQ_MAY_UNMAP, to make the allocated
> size of the image even smaller when supported?

I just didn't think about it at the moment. You're right, it's probably even better to unmap this area.

>> +
>> +    /* Now make sure all changes are stable and clear all caches */
>> +    bdrv_flush(bs);
>> +
>> +    /* This is probably not really necessary, but it cannot hurt, either */
>> +    ret = qcow2_mark_clean(bs);
>> +    if (ret < 0) {
>> +        goto out;
>> +    }
>> +
>> +    ret = qcow2_cache_empty(bs, s->l2_table_cache);
>> +    if (ret < 0) {
>> +        goto out;
>> +    }
>> +
>> +    ret = qcow2_cache_empty(bs, s->refcount_block_cache);
>> +    if (ret < 0) {
>> +        goto out;
>> +    }
>> +
>> +    /* Modify the image header to point to the new blank L1 table. This will
>> +     * leak all currently existing data clusters, which is fine. */
>> +    BLKDBG_EVENT(bs->file, BLKDBG_L1_UPDATE);
>> +
>> +    assert(l1_size2 / 8 <= UINT32_MAX);
>> +    cpu_to_be32w((uint32_t *)&l1_size_and_offset[0], l1_size2 / 8);
>> +    cpu_to_be64w((uint64_t *)&l1_size_and_offset[4], offset);
>> +    ret = bdrv_pwrite_sync(bs->file, offsetof(QCowHeader, l1_size),
>> +                           l1_size_and_offset, sizeof(l1_size_and_offset));
>> +    if (ret < 0) {
>> +        goto out;
>> +    }
> Pivots the active L1 table; so now the guest sees the entire backing
> file due to all L2 entries being unallocated; the refcount table is
> still unpivoted. This failure point is where the blob is leaked, but
> the old image is still consistent.
>
>> +
>> +    /* Adapt the in-memory L1 table accordingly */
>> +    s->l1_table_offset = offset;
>> +    s->l1_size = l1_size2 / 8;
>> +
>> +    s->l1_table = g_realloc(s->l1_table, l1_size2);
>> +    memset(s->l1_table, 0, l1_size2);
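A side note on the 12-byte l1_size_and_offset buffer a few lines up, mostly for other readers: if I read the header layout in docs/specs/qcow2.txt correctly, it works because the 4-byte l1_size field is immediately followed by the 8-byte l1_table_offset field, so a single bdrv_pwrite_sync() at offsetof(QCowHeader, l1_size) updates both fields in one write. A rough standalone sketch of the packing (my code, not qemu's; the helper names are made up):

#include <stdint.h>

/* Store 32-/64-bit values big-endian, byte by byte, host-independent. */
static void put_be32(uint8_t *p, uint32_t v)
{
    p[0] = v >> 24; p[1] = v >> 16; p[2] = v >> 8; p[3] = v;
}

static void put_be64(uint8_t *p, uint64_t v)
{
    put_be32(p, v >> 32);
    put_be32(p + 4, (uint32_t)v);
}

/* buf mirrors l1_size_and_offset[12]: bytes 0-3 hold the new l1_size,
 * bytes 4-11 the new l1_table_offset, exactly as they sit in the header. */
static void pack_l1_update(uint8_t buf[12], uint32_t l1_size, uint64_t l1_offset)
{
    put_be32(buf, l1_size);
    put_be64(buf + 4, l1_offset);
}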
>> +
>> +    /* Modify the image header to point to the refcount table. This will fix all
>> +     * leaks (unless an error occurs). */
>> +    BLKDBG_EVENT(bs->file, BLKDBG_REFTABLE_UPDATE);
>> +
>> +    rt_offset = cpu_to_be64(blob_start);
>> +    ret = bdrv_pwrite_sync(bs->file,
>> +                           offsetof(QCowHeader, refcount_table_offset),
>> +                           &rt_offset, sizeof(rt_offset));
>> +    if (ret < 0) {
>> +        goto out;
>> +    }
> Pivots the refcount table; the old table is now gone and the new table
> is now (potentially) self-consistent (we still require a truncate of any
> clusters already in the image but beyond the size of this new refcount
> table before the refcount table once again correctly describes the
> entire file).
>
>> +
>> +    /* The image is now clean. The only thing left is to adapt the in-memory
>> +     * refcount table to match it. */
>> +    s->refcount_table_offset = blob_start;
>> +    s->refcount_table_size   = reftable_clusters << (s->cluster_bits - 3);
>> +
>> +    for (rbi = 0; rbi < refblocks; rbi++) {
>> +        be64_to_cpus(&reftable[rbi]);
>> +    }
>> +
>> +    g_free(s->refcount_table);
>> +    s->refcount_table = reftable;
>> +    reftable = NULL;
>> +
>> +out:
>> +    g_free(refblock);
>> +    g_free(reftable);
>> +    return ret;
>> +}
> Looks correct.
>
>> +
>> +/*
>> + * Calculates the number of clusters required for an L1 table for an image with
>> + * the given parameters plus the reftable and the refblocks required to cover
>> + * themselves, the L1 table and a given number of clusters @overhead.
>> + *
>> + * @ts: Total number of guest sectors the image provides
>> + * @cb: 1 << @cb is the cluster size in bytes
>> + * @spcb: 1 << @spcb is the number of clusters per sector
>> + * @overhead: Number of clusters which shall additionally be covered by the
>> + *            refcount structures
>> + */
>> +static uint64_t minimal_blob_size(uint64_t ts, int cb, int spcb,
>> +                                  uint64_t overhead)
>> +{
>> +    uint64_t cs = UINT64_C(1) << cb;
>> +    uint64_t spc = UINT64_C(1) << spcb;
>> +
>> +    /* The following statement is a solution for this system of equations:
>> +     *
>> +     * Let req_cls be the required number of clusters required for the reftable,
>> +     * all refblocks and the L1 table.
>> +     *
>> +     * rtc be the clusters required for the reftable, rbc the clusters for all
>> +     * refblocks (i.e., the number of refblocks), l1c the clusters for the L1
>> +     * table and l2c the clusters for all L2 tables (i.e., the number of L2
>> +     * tables).
>> +     *
>> +     * cs be the cluster size (in bytes), ts the total number of sectors, spc
>> +     * the number of sectors per cluster and tc the total number of clusters.
>> +     *
>> +     * overhead is a number of clusters which should additionally be covered by
>> +     * the refcount structures (i.e. all clusters before this blob of req_cls
>> +     * clusters).
>> +     *
>> +     * req_cls >= rtc + rbc + l1c
>> +     *   -- should be obvious
> The blob must be large enough to hold all three tables.
>
>> +     * rbc = ceil((overhead + req_cls) / (cs / 2))
>> +     *   -- as each refblock holds cs/2 entries
> The refblock table must be large enough to describe all host clusters
> (the overhead that occurs before the blob, as well as the blob itself),
> assuming the table is fully utilized, and rounded up to cluster size.
> We could have a smaller refblock table by omitting clusters with all-0
> entries, but it's not worth the complication.
>
>> +     * rtc = ceil(rbc / (cs / 8))
>> +     *   -- as each reftable cluster holds cs/8 entries
> The reftable must be large enough to point to each refblock, and rounded
> up to cluster size.
>
>> +     * tc = ts / spc
>> +     *   -- ts should be a multiple of spc
> Hmm. What guarantees that ts is a multiple of spc? 'qemu-img create -f
> qcow2 blah 512' shows an image with a virtual size of 512 but a cluster
> size of 64k; isn't that a case where ts==1 but spc==128 (spcb==7), so
> tc==0 even though the l2 table would need to describe 1 cluster? But
> your math was done on ceil(ts), not on tc, so even though this comment
> is fishy, I don't think it invalidates your math. I think all that you
> need is to write tc = ceil(ts / spc)

Okay, I just assumed it wouldn't make sense for ts not to be a multiple of spc, but as qcow2 does specify its virtual disk size in bytes, it is entirely possible. I'll adjust this comment.

>> +     * l2c = ceil(tc / (cs / 8))
>> +     *   -- as each L2 table holds cs/8 entries
> Makes sense, for correct tc
>
>> +     * l1c = ceil(l2c / (cs / 8))
>> +     *   -- as each L1 table cluster holds cs/8 entries
> Makes sense, for correct l2c
>
>> +     *
>> +     * The following equation yields a result which satisfies the constraint.
>> +     * The result may be too high, but is never too low.
>> +     *
>> +     * The original calculation (without bitshifting) was:
>> +     *
>> +     * DIV_ROUND_UP(overhead * (1 + cs / 8) +
>> +     *              3 * cs * cs / 16 - 5 * cs / 8 - 1 +
>> +     *              4 * (ts + spc * cs / 8 - 1) / spc,
>> +     *              cs * cs / 16 - cs / 8 - 1)
> Yes, this matches your pdf followup.
>
>> +     *
>> +     */
>> +
>> +    return DIV_ROUND_UP(overhead + (overhead << (cb - 3)) +
>> +                        (3 << (2 * cb - 4)) - (5 << (cb - 3)) - 1 +
>> +                        (4 * (ts + (spc << (cb - 3)) - 1) >> spcb),
>> +                        (1 << (2 * cb - 4)) - cs / 8 - 1);
> And yes, this matches the comment written with division. It could
> definitely be made tighter for the case of overhead large enough where
> we could omit an entire cluster of all-0 refblocks by writing 0 in the
> reftable; but it's not worth the complication.
>
>> +}
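For readers who do not have the pdf at hand, this is the derivation the return statement comes from, transcribed into LaTeX (the letters are mine and need not match the pdf's: r is req_cls, o is overhead, c is cs, s is spc, t_s is ts; tc is taken as ceil(ts / spc), i.e. with Eric's fix folded in):

\begin{align*}
r &\ge r_t + r_b + l_1, &
r_b &= \left\lceil \frac{o + r}{c/2} \right\rceil, &
r_t &= \left\lceil \frac{r_b}{c/8} \right\rceil, \\
t_c &= \left\lceil \frac{t_s}{s} \right\rceil, &
l_2 &= \left\lceil \frac{t_c}{c/8} \right\rceil, &
l_1 &= \left\lceil \frac{l_2}{c/8} \right\rceil.
\end{align*}

Bounding every ceiling by $\lceil a/b \rceil \le (a + b - 1)/b$ and requiring $r$ to be at least the sum of the resulting upper bounds on $r_t$, $r_b$ and $l_1$ yields a linear inequality in $r$,

\begin{equation*}
r \left( \frac{c^2}{16} - \frac{c}{8} - 1 \right)
\ge o \left( 1 + \frac{c}{8} \right) + \frac{3 c^2}{16} - \frac{5 c}{8} - 1
  + \frac{4 \left( t_s + s c / 8 - 1 \right)}{s},
\end{equation*}

and DIV_ROUND_UP(numerator, denominator) is the smallest integer r satisfying it, which is exactly the expression returned above.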
>> +
>>  static int qcow2_make_empty(BlockDriverState *bs)
>>  {
>> +    BDRVQcowState *s = bs->opaque;
>>      int ret = 0;
>> -    uint64_t start_sector;
>> -    int sector_step = INT_MAX / BDRV_SECTOR_SIZE;
>>
>> -    for (start_sector = 0; start_sector < bs->total_sectors;
>> -         start_sector += sector_step)
>> -    {
>> -        /* As this function is generally used after committing an external
>> -         * snapshot, QCOW2_DISCARD_SNAPSHOT seems appropriate. Also, the
>> -         * default action for this kind of discard is to pass the discard,
>> -         * which will ideally result in an actually smaller image file, as
>> -         * is probably desired. */
>> -        ret = qcow2_discard_clusters(bs, start_sector * BDRV_SECTOR_SIZE,
>> -                                     MIN(sector_step,
>> -                                         bs->total_sectors - start_sector),
>> -                                     QCOW2_DISCARD_SNAPSHOT, true);
>> +    if (s->snapshots) {
>> +        uint64_t start_sector;
>> +        int sector_step = INT_MAX / BDRV_SECTOR_SIZE;
>> +
>> +        /* If there are snapshots, every active cluster has to be discarded */
>> +
>> +        for (start_sector = 0; start_sector < bs->total_sectors;
>> +             start_sector += sector_step)
>> +        {
>> +            /* As this function is generally used after committing an external
>> +             * snapshot, QCOW2_DISCARD_SNAPSHOT seems appropriate. Also, the
>> +             * default action for this kind of discard is to pass the discard,
>> +             * which will ideally result in an actually smaller image file, as
>> +             * is probably desired. */
>> +            ret = qcow2_discard_clusters(bs, start_sector * BDRV_SECTOR_SIZE,
>> +                                         MIN(sector_step,
>> +                                             bs->total_sectors - start_sector),
>> +                                         QCOW2_DISCARD_SNAPSHOT, true);
>> +            if (ret < 0) {
>> +                break;
>> +            }
>> +        }
> The fallback is always safe, even if slower.
>
>> +    } else {
>> +        uint64_t min_size_1, min_size_2;
>> +        int64_t blob_start;
>> +        uint64_t blob_end, real_blob_size, clusters;
>> +        uint64_t refblocks, reftable_clusters, l2_tables, l1_clusters;
>> +
>> +        /* If there are no snapshots, we basically want to create a new empty
>> +         * image. This function is however not for creating a new image and
>> +         * renaming it so it shadows the existing but rather for emptying an
>> +         * image, so do exactly that.
>> +         *
>> +         * Therefore, the image should be valid at all points in time and may
>> +         * at worst leak clusters.
> I've convinced myself that you met that constraint - any early exit
> prior to pivoting the reftable probably has leaks, but leaks are easy to
> clean, and the image is consistent at all times.
>
>> +         *
>> +         * Any valid qcow2 image requires an L1 table which is large enough to
>> +         * cover all of the guest cluster range, therefore it is impossible to
>> +         * simply drop the L1 table (make its size 0) and create a minimal
>> +         * refcount structure.
>> +         *
>> +         * Instead, allocate a blob of data which is large enough to hold a new
>> +         * refcount structure (refcount table and all blocks) covering the whole
>> +         * image until the end of that blob, and an empty L1 table covering the
>> +         * whole guest cluster range. Then these structures are initialized and
>> +         * the image header is modified to point to them.
>> +         *
>> +         * Generally, this blob will be allocated at the end of the image (that
>> +         * is, its offset will be greater than its size). If that is indeed the
>> +         * case, the same operation is repeated, but this time the new blob
>> +         * starts right after the image header which will then indeed lead to a
>> +         * minimal image. If this is not the case, the image will be nearly
>> +         * minimal as well, as long as the underlying protocol supports discard.
> Again, depending on whether the all-0 L1 table actually occupies disk
> space determines whether the image is truly minimal in that regards, but
> even if it occupies disk space this description makes sense as
> describing the bare minimum for a qcow2 file that is still valid and
> points to the backing file for all clusters, regardless of how much of
> the underlying file is allocated.
>
>> +         *
>> +         * Note that this implementation never frees allocated clusters. This is
>> +         * because in case of success, the current refcounts are invalid anyway;
> Or more precisely, on success, the current refcounts are thrown away as
> we pivot to the new table of refcounts.
>
>> +         * and in case of failure, it would be too cumbersome to keep track of
>> +         * all allocated cluster ranges and free them then.
>> +         *
>> +         * min_size_1 will contain the number of clusters minimally required for
>> +         * a blob that starts right after the image header; min_size_2 will
>> +         * contain the number of clusters minimally required for the blob which
>> +         * can be allocated based on the existing refcount structure.
>> +         */
>> +
>> +        /* Repeat until a blob could be allocated which is large enough to
>> +         * contain all structures necessary for describing itself. Allocated
>> +         * clusters are not freed, even if they are not suitable, as this would
>> +         * result in exactly the same cluster range being returned during the
>> +         * retry (which is obviously not desirable). In case of success, the
>> +         * current refcounts do not matter anyway; and in case of failure, the
>> +         * clusters are only leaked (which can be easily repaired). */
*= / >> + do { >> + uint64_t fci =3D s->free_cluster_index; >> + >> + /* TODO: Optimize, we do not need refblocks for other par= ts of the >> + * image than the header and this blob */ > Ah, you caught on to the optimization of fewer refblocks. Rather than > wording this as a TODO, I'd just state that any overhead occupied by no= t > being minimal is reclaimed on a successful min_size_1 blob anyways, so > it wasn't worth the complications to the code. > >> + min_size_2 =3D minimal_blob_size(bs->total_sectors, s->cl= uster_bits, >> + s->cluster_bits - BDRV_SEC= TOR_BITS, >> + fci); >> + >> + blob_start =3D qcow2_alloc_clusters(bs, >> + min_size_2 << s->cluste= r_bits); > This call has the side effect of increasing 'fci' for the next iteratio= n > of the do loop, if this iteration didn't guess correctly. > >> + if (blob_start < 0) { >> + return blob_start; >> + } >> + >> + clusters =3D (blob_start >> s->cluster_bits) + m= in_size_2; >> + refblocks =3D DIV_ROUND_UP(clusters, s->cluster_s= ize / 2); >> + reftable_clusters =3D DIV_ROUND_UP(refblocks, s->cluster_= size / 8); >> + l2_tables =3D DIV_ROUND_UP(bs->total_sectors / >> + s->cluster_sectors, >> + s->cluster_size / 8); >> + l1_clusters =3D DIV_ROUND_UP(l2_tables, s->cluster_= size / 8); >> + >> + real_blob_size =3D reftable_clusters + refblocks + l1_clu= sters; >> + } while (min_size_2 < real_blob_size); > But in reality, will the while loop ever be executed more than once? > Since min_size_2 was computed via a nice set of equations to be large > enough, it looks like you are just being paranoid here. Yes, it may be executed more than once. The calculation for the minimal=20 blob size assumes the blob is allocated at $fci. This is however not=20 necessarily the case if the chunk of free space available at $fci is not=20 large enough to hold the blob; in that case, the allocated area may not=20 be large enough (as the overhead (the offset of the allocated area) is=20 bigger than expected). Then, a new iteration is necessary.=20 s->free_cluster_index will have been increased by alloc_clusters_noref()=20 in qcow2-refcount.c to point somewhere after that last allocation so=20 eventually it will point at the end of the image, resulting in a=20 successful allocation with (blob_start >> s->cluster_bits) =3D=3D=20 s->free_cluster_index (the overhead is as expected). >> + >> + ret =3D create_refcount_l1(bs, blob_start, refblocks); >> if (ret < 0) { >> - break; >> + return ret; >> } >> + >> + /* The only overhead is the image header */ >> + min_size_1 =3D minimal_blob_size(bs->total_sectors, s->cluste= r_bits, >> + s->cluster_bits - BDRV_SECTOR_= BITS, 1); >> + >> + /* If we cannot create a new blob before the current one, jus= t discard >> + * the sectors in between and return. Even if the discard doe= s nothing, >> + * only up to min_size_1 clusters plus the refcount blocks fo= r those >> + * are unused. The worst case is therefore an image of double= the size >> + * it needs to be, which is not too bad. */ >> + if ((blob_start >> s->cluster_bits) < 1 + min_size_1) { >> + uint64_t sector, end; >> + int step =3D INT_MAX / BDRV_SECTOR_SIZE; >> + >> + end =3D blob_start >> (s->cluster_bits - BDRV_SECTOR_SIZE= ); >> + >> + /* skip the image header */ >> + for (sector =3D s->cluster_sectors; sector < end; sector = +=3D step) { >> + bdrv_discard(bs->file, sector, MIN(step, end - sector= )); > Why bdrv_discard and not qcow2_discard_clusters(..., > QCOW2_DISCARD_SNAPSHOT, true);? 
>> +
>> +        ret = create_refcount_l1(bs, blob_start, refblocks);
>>          if (ret < 0) {
>> -            break;
>> +            return ret;
>>          }
>> +
>> +        /* The only overhead is the image header */
>> +        min_size_1 = minimal_blob_size(bs->total_sectors, s->cluster_bits,
>> +                                       s->cluster_bits - BDRV_SECTOR_BITS, 1);
>> +
>> +        /* If we cannot create a new blob before the current one, just discard
>> +         * the sectors in between and return. Even if the discard does nothing,
>> +         * only up to min_size_1 clusters plus the refcount blocks for those
>> +         * are unused. The worst case is therefore an image of double the size
>> +         * it needs to be, which is not too bad. */
>> +        if ((blob_start >> s->cluster_bits) < 1 + min_size_1) {
>> +            uint64_t sector, end;
>> +            int step = INT_MAX / BDRV_SECTOR_SIZE;
>> +
>> +            end = blob_start >> (s->cluster_bits - BDRV_SECTOR_SIZE);
>> +
>> +            /* skip the image header */
>> +            for (sector = s->cluster_sectors; sector < end; sector += step) {
>> +                bdrv_discard(bs->file, sector, MIN(step, end - sector));
> Why bdrv_discard and not qcow2_discard_clusters(...,
> QCOW2_DISCARD_SNAPSHOT, true);?

qcow2_discard_clusters() works on guest offsets; the respective clusters are not visible from the guest as the L1 table is empty. Therefore, qcow2_discard_clusters() will not work. Furthermore, the clusters are unallocated from qcow2's point of view, and qcow2_discard_clusters(…, true) should work for allocated clusters only.

> Why can you ignore errors here?

You are correct, I cannot. Or rather, in theory I can, as the image is consistent and the only things left to do are discarding all unused clusters and truncating the image (which are not vital); but it's better to return an error anyway if one occurs. Thus, I'll include it in v7.

>> +            }
>> +
>> +            blob_end = (blob_start + real_blob_size) << s->cluster_bits;
>> +        } else {
>> +            clusters          = 1 + min_size_1;
>> +            refblocks         = DIV_ROUND_UP(clusters, s->cluster_size / 2);
>> +            reftable_clusters = DIV_ROUND_UP(refblocks, s->cluster_size / 8);
>> +            l2_tables         = DIV_ROUND_UP(bs->total_sectors /
>> +                                             s->cluster_sectors,
>> +                                             s->cluster_size / 8);
>> +            l1_clusters       = DIV_ROUND_UP(l2_tables, s->cluster_size / 8);
>> +
>> +            real_blob_size = reftable_clusters + refblocks + l1_clusters;
>> +
>> +            assert(min_size_1 >= real_blob_size);
>> +
>> +            ret = create_refcount_l1(bs, s->cluster_size, refblocks);
>> +            if (ret < 0) {
>> +                return ret;
>> +            }
>> +
>> +            blob_end = (1 + real_blob_size) << s->cluster_bits;
>> +        }
>> +
>> +        ret = bdrv_truncate(bs->file, blob_end);
> The truncate is the key that finally makes the newly used refcount table
> (whether from one call to create_refcount_l1 or two) accurate.

The refcount table is accurate even before the bdrv_truncate(). In fact, it is accurate immediately after it is written. There is random unreferenced data in the image, but judging from the new L1 table and refcount data, these are only unused clusters which can be claimed by allocating clusters (just as unused clusters in a file system may contain any random data). qemu-img check will report a clean image right after create_refcount_l1() returns.

> Note
> that we have a case where if the first attempt at allocating min_size_2
> manages to find room for a blob in the middle of the file rather than at
> the end, and then the second create_refcount_l1() fails (or you don't
> attempt a second one but instead add in error handling for failed
> bdrv_discard), then the resulting qcow2 file is inconsistent in that the
> reftable is smaller than the overall size of the qcow2 image.

Correct. However, it is my understanding that clusters not covered by the refcount table should be treated as unallocated; at least, this is how qemu does it. get_refcount() returns 0 if the index in the refcount table exceeds the refcount table size.

But this is not covered by the qcow2 specification, so perhaps we should add it.

> But as
> long as our image sanity checking code can handle that, and repair the
> image (if I understand correctly, anything beyond the refcount table is
> basically treated the same as any other leaked cluster),

They are not referenced and the reference count is assumed to be 0 (as they are not covered by the refcount table). Therefore, they are not leaked. From qcow2's point of view, they are simply not there.

> I don't think
> it is fatal to this patch.
>
>>      }
>>
>>      return ret;
>>

Thank you again for working through this. I'll make the adjustments you suggested and I confirmed, in v7 (I guess I'll include a diff then ;-)).

Max