From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:38582)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <kwolf@redhat.com>) id 1Wka6H-00089b-Oz
	for qemu-devel@nongnu.org; Wed, 14 May 2014 10:24:11 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <kwolf@redhat.com>) id 1Wka6B-0003va-IC
	for qemu-devel@nongnu.org; Wed, 14 May 2014 10:24:05 -0400
Received: from mx1.redhat.com ([209.132.183.28]:55697)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <kwolf@redhat.com>) id 1Wka6B-0003vV-Ac
	for qemu-devel@nongnu.org; Wed, 14 May 2014 10:23:59 -0400
Received: from int-mx02.intmail.prod.int.phx2.redhat.com
	(int-mx02.intmail.prod.int.phx2.redhat.com [10.5.11.12])
	by mx1.redhat.com (8.14.4/8.14.4) with ESMTP id s4EENwaP029839
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK)
	for <qemu-devel@nongnu.org>; Wed, 14 May 2014 10:23:58 -0400
Date: Wed, 14 May 2014 16:23:56 +0200
From: Kevin Wolf <kwolf@redhat.com>
Message-ID: <20140514142356.GJ3610@noname.redhat.com>
References: <1399528635-30159-1-git-send-email-famz@redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <1399528635-30159-1-git-send-email-famz@redhat.com>
Subject: Re: [Qemu-devel] [PATCH v3] vmdk: Optimize cluster allocation
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: Fam Zheng <famz@redhat.com>
Cc: qemu-devel@nongnu.org, Stefan Hajnoczi <stefanha@redhat.com>

Am 08.05.2014 um 07:57 hat Fam Zheng geschrieben:
> This drops the unnecessary bdrv_truncate() from, and also improves,
> cluster allocation code path.
> [...]
> 
> Tested that this passes qemu-iotests for all VMDK subformats.
> 
> Signed-off-by: Fam Zheng <famz@redhat.com>

Unfortunately, this is seriously broken and only compatible with itself,
but appears not to work any more with real VMDKs.

> ---
> V3: A new implementation following Kevin's suggestion.
> 
> Signed-off-by: Fam Zheng <famz@redhat.com>
> ---
>  block/vmdk.c | 184 +++++++++++++++++++++++++++++++++++++++--------------------
>  1 file changed, 121 insertions(+), 63 deletions(-)
> 
> diff --git a/block/vmdk.c b/block/vmdk.c
> index 06a1f9f..8c34d5e 100644
> --- a/block/vmdk.c
> +++ b/block/vmdk.c
> @@ -106,6 +106,7 @@ typedef struct VmdkExtent {
>      uint32_t l2_cache_counts[L2_CACHE_SIZE];
>  
>      int64_t cluster_sectors;
> +    int64_t next_cluster_offset;
>      char *type;
>  } VmdkExtent;
>  
> @@ -397,6 +398,7 @@ static int vmdk_add_extent(BlockDriverState *bs,
>  {
>      VmdkExtent *extent;
>      BDRVVmdkState *s = bs->opaque;
> +    int64_t ret;
>  
>      if (cluster_sectors > 0x200000) {
>          /* 0x200000 * 512Bytes = 1GB for one cluster is unrealistic */
> @@ -428,6 +430,13 @@ static int vmdk_add_extent(BlockDriverState *bs,
>      extent->l2_size = l2_size;
>      extent->cluster_sectors = flat ? sectors : cluster_sectors;
>  
> +    ret = bdrv_getlength(extent->file);
> +

Why this empty line?

> +    if (ret < 0) {
> +        return ret;
> +    }
> +    extent->next_cluster_offset = ROUND_UP(ret, BDRV_SECTOR_SIZE);
> +
>      if (s->num_extents > 1) {
>          extent->end_sector = (*(extent - 1)).end_sector + extent->sectors;
>      } else {
> @@ -954,42 +963,77 @@ static int vmdk_refresh_limits(BlockDriverState *bs)
>      return 0;
>  }
>  
> +/**
> + * get_whole_cluster
> + *
> + * Copy backing file's cluster that covers @sector_num, otherwise write zero,
> + * to the cluster at @cluster_sector_num.
> + *
> + * If @skip_start < @skip_end, the relative range [@skip_start, @skip_end) is
> + * not copied or written, and leave it for call to write user data in the
> + * request.
> + */
>  static int get_whole_cluster(BlockDriverState *bs,
> -                VmdkExtent *extent,
> -                uint64_t cluster_offset,
> -                uint64_t offset,
> -                bool allocate)
> +                             VmdkExtent *extent,
> +                             uint64_t cluster_sector_num,
> +                             uint64_t sector_num,
> +                             uint64_t skip_start, uint64_t skip_end)
>  {
>      int ret = VMDK_OK;
> -    uint8_t *whole_grain = NULL;
> +    int64_t cluster_bytes;
> +    uint8_t *whole_grain;
> +
> +    /* For COW, align request sector_num to cluster start */
> +    sector_num -= sector_num % extent->cluster_sectors;

QEMU_ALIGN_DOWN?

> +    cluster_bytes = extent->cluster_sectors << BDRV_SECTOR_BITS;
> +    whole_grain = qemu_blockalign(bs, cluster_bytes);
> +    memset(whole_grain, 0, cluster_bytes);

This is completely unnecessary for cases with backing files, and
unnecessary for the skipped part in any case.

> +    assert(skip_end <= sector_num + extent->cluster_sectors);
>      /* we will be here if it's first write on non-exist grain(cluster).
>       * try to read from parent image, if exist */
> -    if (bs->backing_hd) {
> -        whole_grain =
> -            qemu_blockalign(bs, extent->cluster_sectors << BDRV_SECTOR_BITS);
> -        if (!vmdk_is_cid_valid(bs)) {
> -            ret = VMDK_ERROR;
> -            goto exit;
> -        }
> +    if (bs->backing_hd && !vmdk_is_cid_valid(bs)) {
> +        ret = VMDK_ERROR;
> +        goto exit;
> +    }
>  
> -        /* floor offset to cluster */
> -        offset -= offset % (extent->cluster_sectors * 512);
> -        ret = bdrv_read(bs->backing_hd, offset >> 9, whole_grain,
> -                extent->cluster_sectors);
> +    /* Read backing data before skip range */
> +    if (skip_start > 0) {
> +        if (bs->backing_hd) {
> +            ret = bdrv_read(bs->backing_hd, sector_num,
> +                            whole_grain, skip_start);
> +            if (ret < 0) {
> +                ret = VMDK_ERROR;
> +                goto exit;
> +            }
> +        }
> +        ret = bdrv_write(extent->file, cluster_sector_num, whole_grain,
> +                         skip_start);
>          if (ret < 0) {
>              ret = VMDK_ERROR;
>              goto exit;
>          }
> -
> -        /* Write grain only into the active image */
> -        ret = bdrv_write(extent->file, cluster_offset, whole_grain,
> -                extent->cluster_sectors);
> +    }
> +    /* Read backing data after skip range */
> +    if (skip_end < extent->cluster_sectors) {
> +        if (bs->backing_hd) {
> +            ret = bdrv_read(bs->backing_hd, sector_num + skip_end,
> +                            whole_grain + (skip_end << BDRV_SECTOR_BITS),
> +                            extent->cluster_sectors - skip_end);
> +            if (ret < 0) {
> +                ret = VMDK_ERROR;
> +                goto exit;
> +            }
> +        }
> +        ret = bdrv_write(extent->file, cluster_sector_num + skip_end,
> +                         whole_grain + (skip_end << BDRV_SECTOR_BITS),
> +                         extent->cluster_sectors - skip_end);
>          if (ret < 0) {
>              ret = VMDK_ERROR;
>              goto exit;
>          }
>      }
> +
>  exit:
>      qemu_vfree(whole_grain);
>      return ret;
> @@ -1026,17 +1070,40 @@ static int vmdk_L2update(VmdkExtent *extent, VmdkMetaData *m_data)
>      return VMDK_OK;
>  }
>  
> +/**
> + * get_cluster_offset
> + *
> + * Look up cluster offset in extent file by sector number, and stor in

store

> + * @cluster_offset.
> + *
> + * For flat extent, the start offset as parsed from the description file is
> + * returned.
> + *
> + * For sparse extent, look up in L1, L2 table. If allocate is true, return an
> + * offset for a new cluster and update L2 cache. If there is a backing file,
> + * COW is done before returning; otherwise, zeroes are written to the allocated
> + * cluster. Both COW and zero writting skips the sector range
> + * [@skip_start_sector, @skip_end_sector) passed in by caller, because caller
> + * has new data to write there.
> + *
> + * Returns: VMDK_OK if cluster exists and mapped in the image.
> + *          VMDK_UNALLOC if cluster is not mapped and @allocate is false.
> + *          VMDK_ERROR if failed.
> + */
>  static int get_cluster_offset(BlockDriverState *bs,
> -                                    VmdkExtent *extent,
> -                                    VmdkMetaData *m_data,
> -                                    uint64_t offset,
> -                                    int allocate,
> -                                    uint64_t *cluster_offset)
> +                              VmdkExtent *extent,
> +                              VmdkMetaData *m_data,
> +                              uint64_t offset,
> +                              bool allocate,
> +                              uint64_t *cluster_offset,
> +                              uint64_t skip_start_sector,
> +                              uint64_t skip_end_sector)
>  {
>      unsigned int l1_index, l2_offset, l2_index;
>      int min_index, i, j;
>      uint32_t min_count, *l2_table;
>      bool zeroed = false;
> +    int64_t ret;
>  
>      if (m_data) {
>          m_data->valid = 0;
> @@ -1109,33 +1176,29 @@ static int get_cluster_offset(BlockDriverState *bs,
>              return zeroed ? VMDK_ZEROED : VMDK_UNALLOC;
>          }
>  
> -        /* Avoid the L2 tables update for the images that have snapshots. */
> -        *cluster_offset = bdrv_getlength(extent->file);
> -        if (!extent->compressed) {
> -            bdrv_truncate(
> -                extent->file,
> -                *cluster_offset + (extent->cluster_sectors << 9)
> -            );
> -        }
> +        *cluster_offset = extent->next_cluster_offset;
> +        extent->next_cluster_offset +=
> +            extent->cluster_sectors << BDRV_SECTOR_BITS;
>  
> -        *cluster_offset >>= 9;
> -        l2_table[l2_index] = cpu_to_le32(*cluster_offset);
> +        l2_table[l2_index] = cpu_to_le32(*cluster_offset >> BDRV_SECTOR_BITS);

Something is fishy with the whole VMDK cluster allocation. The L2 table
entry is set here (to a cluster number), and later again to
m_data->offset, which is a byte offset now.

I don't quite understand why the L2 table is updated _twice_ and why we
have both *cluster_offset and m_data->offset, which should always both
be the same value (but aren't in practice with this patch applied)

>          /* First of all we write grain itself, to avoid race condition
>           * that may to corrupt the image.
>           * This problem may occur because of insufficient space on host disk
>           * or inappropriate VM shutdown.
>           */
> -        if (get_whole_cluster(
> -                bs, extent, *cluster_offset, offset, allocate) == -1) {
> -            return VMDK_ERROR;
> +        ret = get_whole_cluster(bs, extent,
> +                                *cluster_offset >> BDRV_SECTOR_BITS,
> +                                offset >> BDRV_SECTOR_BITS,
> +                                skip_start_sector, skip_end_sector);
> +        if (ret) {
> +            return ret;
>          }
>  
>          if (m_data) {
>              m_data->offset = *cluster_offset;
>          }
>      }
> -    *cluster_offset <<= 9;

In the case where the cluster was already allocated, *cluster_offset is
now a sector number. In the case of a newly allocated cluster, it is a
byte offset. Can't be right.

The bug is partly cancelled out because m_data->offset is also byte
allocated and the L2 table that is stored above gets overwritten with
it, so you end up writing byte offsets to the L2 table. This isn't VMDK
any more, obviously, but it explains why qemu-iotests couldn't catch it.
Perhaps we should add some binary sample image even for formats that we
implement r/w.

>      return VMDK_OK;
>  }

Kevin