Date: Wed, 14 May 2014 16:23:56 +0200
From: Kevin Wolf
To: Fam Zheng
Cc: qemu-devel@nongnu.org, Stefan Hajnoczi
Subject: Re: [Qemu-devel] [PATCH v3] vmdk: Optimize cluster allocation
Message-ID: <20140514142356.GJ3610@noname.redhat.com>
In-Reply-To: <1399528635-30159-1-git-send-email-famz@redhat.com>
References: <1399528635-30159-1-git-send-email-famz@redhat.com>

On 08.05.2014 at 07:57, Fam Zheng wrote:
> This drops the unnecessary bdrv_truncate() from, and also improves,
> the cluster allocation code path.
> [...]
>
> Tested that this passes qemu-iotests for all VMDK subformats.
>
> Signed-off-by: Fam Zheng

Unfortunately, this is seriously broken: the resulting images are only
compatible with this implementation itself and appear not to work any
more with real VMDKs.

> ---
> V3: A new implementation following Kevin's suggestion.
>
> Signed-off-by: Fam Zheng
> ---
>  block/vmdk.c | 184 +++++++++++++++++++++++++++++++++++++++--------------------
>  1 file changed, 121 insertions(+), 63 deletions(-)
>
> diff --git a/block/vmdk.c b/block/vmdk.c
> index 06a1f9f..8c34d5e 100644
> --- a/block/vmdk.c
> +++ b/block/vmdk.c
> @@ -106,6 +106,7 @@ typedef struct VmdkExtent {
>      uint32_t l2_cache_counts[L2_CACHE_SIZE];
>
>      int64_t cluster_sectors;
> +    int64_t next_cluster_offset;
>      char *type;
>  } VmdkExtent;
>
> @@ -397,6 +398,7 @@ static int vmdk_add_extent(BlockDriverState *bs,
>  {
>      VmdkExtent *extent;
>      BDRVVmdkState *s = bs->opaque;
> +    int64_t ret;
>
>      if (cluster_sectors > 0x200000) {
>          /* 0x200000 * 512Bytes = 1GB for one cluster is unrealistic */
> @@ -428,6 +430,13 @@ static int vmdk_add_extent(BlockDriverState *bs,
>      extent->l2_size = l2_size;
>      extent->cluster_sectors = flat ? sectors : cluster_sectors;
>
> +    ret = bdrv_getlength(extent->file);
> +

Why this empty line?

> +    if (ret < 0) {
> +        return ret;
> +    }
> +    extent->next_cluster_offset = ROUND_UP(ret, BDRV_SECTOR_SIZE);
> +
>      if (s->num_extents > 1) {
>          extent->end_sector = (*(extent - 1)).end_sector + extent->sectors;
>      } else {
> @@ -954,42 +963,77 @@ static int vmdk_refresh_limits(BlockDriverState *bs)
>      return 0;
>  }
>
> +/**
> + * get_whole_cluster
> + *
> + * Copy the cluster of the backing file that covers @sector_num, or write
> + * zeroes if there is no backing file, to the cluster at @cluster_sector_num.
> + *
> + * If @skip_start < @skip_end, the relative range [@skip_start, @skip_end)
> + * is neither copied nor zeroed; it is left for the caller to fill with the
> + * request's user data.
> + */
> static int get_whole_cluster(BlockDriverState *bs,
> -                             VmdkExtent *extent,
> -                             uint64_t cluster_offset,
> -                             uint64_t offset,
> -                             bool allocate)
> +                             VmdkExtent *extent,
> +                             uint64_t cluster_sector_num,
> +                             uint64_t sector_num,
> +                             uint64_t skip_start, uint64_t skip_end)
>  {
>      int ret = VMDK_OK;
> -    uint8_t *whole_grain = NULL;
> +    int64_t cluster_bytes;
> +    uint8_t *whole_grain;
> +
> +    /* For COW, align request sector_num to cluster start */
> +    sector_num -= sector_num % extent->cluster_sectors;

QEMU_ALIGN_DOWN?

> +    cluster_bytes = extent->cluster_sectors << BDRV_SECTOR_BITS;
> +    whole_grain = qemu_blockalign(bs, cluster_bytes);
> +    memset(whole_grain, 0, cluster_bytes);

This is completely unnecessary for cases with backing files, and
unnecessary for the skipped part in any case.
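Untested sketch of what I have in mind, reusing the variables from this
patch: only the head and tail of the buffer that are actually written
out below need zeroes, and only when there is no backing file to read
them from:

    /* Untested sketch, not a drop-in replacement for the memset()
     * above: zero only the head [0, skip_start) and the tail
     * [skip_end, cluster_sectors) of the buffer, and only if no
     * backing file fills them below. The skipped range is never
     * written from whole_grain at all, so it needs no initialisation. */
    if (!bs->backing_hd) {
        if (skip_start > 0) {
            memset(whole_grain, 0, skip_start << BDRV_SECTOR_BITS);
        }
        if (skip_end < extent->cluster_sectors) {
            memset(whole_grain + (skip_end << BDRV_SECTOR_BITS), 0,
                   (extent->cluster_sectors - skip_end) << BDRV_SECTOR_BITS);
        }
    }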
> +    assert(skip_end <= sector_num + extent->cluster_sectors);
>      /* we will be here if it's first write on non-exist grain(cluster).
>       * try to read from parent image, if exist */
> -    if (bs->backing_hd) {
> -        whole_grain =
> -            qemu_blockalign(bs, extent->cluster_sectors << BDRV_SECTOR_BITS);
> -        if (!vmdk_is_cid_valid(bs)) {
> -            ret = VMDK_ERROR;
> -            goto exit;
> -        }
> +    if (bs->backing_hd && !vmdk_is_cid_valid(bs)) {
> +        ret = VMDK_ERROR;
> +        goto exit;
> +    }
>
> -        /* floor offset to cluster */
> -        offset -= offset % (extent->cluster_sectors * 512);
> -        ret = bdrv_read(bs->backing_hd, offset >> 9, whole_grain,
> -                        extent->cluster_sectors);
> +    /* Read backing data before skip range */
> +    if (skip_start > 0) {
> +        if (bs->backing_hd) {
> +            ret = bdrv_read(bs->backing_hd, sector_num,
> +                            whole_grain, skip_start);
> +            if (ret < 0) {
> +                ret = VMDK_ERROR;
> +                goto exit;
> +            }
> +        }
> +        ret = bdrv_write(extent->file, cluster_sector_num, whole_grain,
> +                         skip_start);
>          if (ret < 0) {
>              ret = VMDK_ERROR;
>              goto exit;
>          }
> -
> -        /* Write grain only into the active image */
> -        ret = bdrv_write(extent->file, cluster_offset, whole_grain,
> -                         extent->cluster_sectors);
> +    }
> +    /* Read backing data after skip range */
> +    if (skip_end < extent->cluster_sectors) {
> +        if (bs->backing_hd) {
> +            ret = bdrv_read(bs->backing_hd, sector_num + skip_end,
> +                            whole_grain + (skip_end << BDRV_SECTOR_BITS),
> +                            extent->cluster_sectors - skip_end);
> +            if (ret < 0) {
> +                ret = VMDK_ERROR;
> +                goto exit;
> +            }
> +        }
> +        ret = bdrv_write(extent->file, cluster_sector_num + skip_end,
> +                         whole_grain + (skip_end << BDRV_SECTOR_BITS),
> +                         extent->cluster_sectors - skip_end);
>          if (ret < 0) {
>              ret = VMDK_ERROR;
>              goto exit;
>          }
>      }
> +
>  exit:
>      qemu_vfree(whole_grain);
>      return ret;
> @@ -1026,17 +1070,40 @@ static int vmdk_L2update(VmdkExtent *extent, VmdkMetaData *m_data)
>      return VMDK_OK;
>  }
>
> +/**
> + * get_cluster_offset
> + *
> + * Look up the cluster offset in the extent file by sector number, and store
> + * it in @cluster_offset.
> + *
> + * For a flat extent, the start offset as parsed from the description file
> + * is returned.
> + *
> + * For a sparse extent, look it up in the L1/L2 tables. If @allocate is
> + * true, return an offset for a new cluster and update the L2 cache. If
> + * there is a backing file, COW is done before returning; otherwise, zeroes
> + * are written to the allocated cluster. Both COW and zero writing skip the
> + * sector range [@skip_start_sector, @skip_end_sector) passed in by the
> + * caller, because the caller has new data to write there.
> + *
> + * Returns: VMDK_OK if cluster exists and mapped in the image.
> + *          VMDK_UNALLOC if cluster is not mapped and @allocate is false.
> + *          VMDK_ERROR if failed.
> + */
> static int get_cluster_offset(BlockDriverState *bs,
> -                              VmdkExtent *extent,
> -                              VmdkMetaData *m_data,
> -                              uint64_t offset,
> -                              int allocate,
> -                              uint64_t *cluster_offset)
> +                              VmdkExtent *extent,
> +                              VmdkMetaData *m_data,
> +                              uint64_t offset,
> +                              bool allocate,
> +                              uint64_t *cluster_offset,
> +                              uint64_t skip_start_sector,
> +                              uint64_t skip_end_sector)
>  {
>      unsigned int l1_index, l2_offset, l2_index;
>      int min_index, i, j;
>      uint32_t min_count, *l2_table;
>      bool zeroed = false;
> +    int64_t ret;
>
>      if (m_data) {
>          m_data->valid = 0;
> @@ -1109,33 +1176,29 @@ static int get_cluster_offset(BlockDriverState *bs,
>          return zeroed ? VMDK_ZEROED : VMDK_UNALLOC;
>      }
>
> -        /* Avoid the L2 tables update for the images that have snapshots. */
> -        *cluster_offset = bdrv_getlength(extent->file);
> -        if (!extent->compressed) {
> -            bdrv_truncate(
> -                extent->file,
> -                *cluster_offset + (extent->cluster_sectors << 9)
> -            );
> -        }
> +        *cluster_offset = extent->next_cluster_offset;
> +        extent->next_cluster_offset +=
> +            extent->cluster_sectors << BDRV_SECTOR_BITS;
>
> -        *cluster_offset >>= 9;
> -        l2_table[l2_index] = cpu_to_le32(*cluster_offset);
> +        l2_table[l2_index] = cpu_to_le32(*cluster_offset >> BDRV_SECTOR_BITS);

Something is fishy with the whole VMDK cluster allocation. The L2 table
entry is set here (to a cluster number), and later again to
m_data->offset, which is now a byte offset. I don't quite understand
why the L2 table is updated _twice_, and why we have both
*cluster_offset and m_data->offset, which should always be the same
value (but aren't in practice with this patch applied).
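Just to illustrate with a purely hypothetical, untested sketch (the
helper offset_to_l2_entry() is made up, not something in the tree): if
the convention were that *cluster_offset always carries a byte offset
while m_data->offset always carries the sector number that belongs in
the table, both paths would convert explicitly and vmdk_L2update()
could remain the only place that writes the entry:

    /* Made-up helper: L2 entries are sector numbers in the VMDK
     * on-disk format. */
    static inline uint32_t offset_to_l2_entry(uint64_t cluster_offset_bytes)
    {
        return cluster_offset_bytes >> BDRV_SECTOR_BITS;
    }

    /* Allocating path: *cluster_offset stays a byte offset... */
    *cluster_offset = extent->next_cluster_offset;
    extent->next_cluster_offset += extent->cluster_sectors << BDRV_SECTOR_BITS;
    if (m_data) {
        /* ...and m_data->offset carries exactly the value that
         * vmdk_L2update() stores in the table, in the same unit. */
        m_data->offset = offset_to_l2_entry(*cluster_offset);
    }

    /* Already-allocated path: convert the entry back to bytes, so both
     * paths hand the same unit back to the caller. */
    *cluster_offset = (uint64_t)le32_to_cpu(l2_table[l2_index])
                      << BDRV_SECTOR_BITS;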
>          /* First of all we write grain itself, to avoid race condition
>           * that may to corrupt the image.
>           * This problem may occur because of insufficient space on host disk
>           * or inappropriate VM shutdown.
>           */
> -        if (get_whole_cluster(
> -                bs, extent, *cluster_offset, offset, allocate) == -1) {
> -            return VMDK_ERROR;
> +        ret = get_whole_cluster(bs, extent,
> +                                *cluster_offset >> BDRV_SECTOR_BITS,
> +                                offset >> BDRV_SECTOR_BITS,
> +                                skip_start_sector, skip_end_sector);
> +        if (ret) {
> +            return ret;
>          }
>
>          if (m_data) {
>              m_data->offset = *cluster_offset;
>          }
>      }
> -    *cluster_offset <<= 9;

In the case where the cluster was already allocated, *cluster_offset is
now a sector number; in the case of a newly allocated cluster, it is a
byte offset. That can't be right.

The bug is partly cancelled out because m_data->offset is also a byte
offset and the L2 table entry that was stored above gets overwritten
with it, so you end up writing byte offsets to the L2 table. This isn't
VMDK any more, obviously, but it explains why qemu-iotests couldn't
catch it. Perhaps we should add some binary sample image even for
formats that we implement r/w.

>      return VMDK_OK;
>  }

Kevin