* [Qemu-devel] [PATCH 0/2] Allocate multiple clusters for VMDK I/O
@ 2017-03-11 11:54 Ashijeet Acharya
2017-03-11 11:54 ` [Qemu-devel] [PATCH 1/2] vmdk: Optimize I/O by allocating multiple clusters Ashijeet Acharya
` (2 more replies)
0 siblings, 3 replies; 11+ messages in thread
From: Ashijeet Acharya @ 2017-03-11 11:54 UTC (permalink / raw)
To: famz; +Cc: kwolf, jsnow, mreitz, qemu-devel, qemu-block, Ashijeet Acharya
This series optimizes the I/O performance of the VMDK driver.
Patch 1 makes the VMDK driver allocate multiple clusters at once. Previously it
allocated one cluster at a time, which slowed down its performance considerably.
Patch 2 changes the metadata update code to update the L2 tables for multiple
clusters at once.
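As a rough illustration of the idea (this is illustrative C only, not the patch
code; the 64 KiB grain size, the 512-entry L2 table and the vmdk_split_request()
helper are assumptions made up for this example), a request is now chopped into
runs of grains that never cross an L2 table boundary, and the partially covered
first and last grains are the ones that may still need COW (the sketch only
flags the head):

#include <inttypes.h>
#include <stdio.h>

/* Illustrative defaults: 128 sectors * 512 bytes per grain, 512 L2 entries. */
#define GRAIN_SIZE (64 * 1024)
#define L2_SIZE    512

/* Hypothetical helper: split [offset, offset + bytes) at grain/L2 limits. */
static void vmdk_split_request(uint64_t offset, uint64_t bytes)
{
    while (bytes > 0) {
        uint64_t in_grain  = offset % GRAIN_SIZE;           /* head needs COW? */
        uint64_t l2_index  = (offset / GRAIN_SIZE) % L2_SIZE;
        uint64_t max_bytes = (L2_SIZE - l2_index) * GRAIN_SIZE - in_grain;
        uint64_t n_bytes   = bytes < max_bytes ? bytes : max_bytes;

        printf("allocate %" PRIu64 " bytes at offset %" PRIu64 " (head COW: %s)\n",
               n_bytes, offset, in_grain ? "yes" : "no");
        offset += n_bytes;
        bytes  -= n_bytes;
    }
}

int main(void)
{
    /* A 10 MiB write starting 32 KiB into the second grain. */
    vmdk_split_request(GRAIN_SIZE + 32 * 1024, 10 * 1024 * 1024);
    return 0;
}

Before this series the same 10 MiB request would go through the allocation path
once per 64 KiB grain, i.e. roughly 160 times, instead of once per
L2-table-bounded run.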
Note: These changes pass all 41/41 tests applicable to the VMDK driver.
Ashijeet Acharya (2):
vmdk: Optimize I/O by allocating multiple clusters
vmdk: Update metadata for multiple clusters
block/vmdk.c | 596 ++++++++++++++++++++++++++++++++++++++++++++---------------
1 file changed, 444 insertions(+), 152 deletions(-)
--
2.6.2
* [Qemu-devel] [PATCH 1/2] vmdk: Optimize I/O by allocating multiple clusters
2017-03-11 11:54 [Qemu-devel] [PATCH 0/2] Allocate multiple clusters for VMDK I/O Ashijeet Acharya
@ 2017-03-11 11:54 ` Ashijeet Acharya
2017-03-23 19:10 ` Kevin Wolf
2017-03-11 11:54 ` [Qemu-devel] [PATCH 2/2] vmdk: Update metadata for " Ashijeet Acharya
2017-03-21 7:51 ` [Qemu-devel] [PATCH 0/2] Allocate multiple clusters for VMDK I/O Stefan Hajnoczi
2 siblings, 1 reply; 11+ messages in thread
From: Ashijeet Acharya @ 2017-03-11 11:54 UTC (permalink / raw)
To: famz; +Cc: kwolf, jsnow, mreitz, qemu-devel, qemu-block, Ashijeet Acharya
The vmdk driver in block/vmdk.c used to allocate clusters one at a time, which
slowed down its I/O performance.
Make the vmdk driver allocate multiple clusters at once to reduce the overhead
of multiple separate cluster allocation calls. The number of clusters allocated
at once depends on the L2 table boundaries. Also, the first and the last
clusters are allocated separately, as we may need to perform COW for them.
Signed-off-by: Ashijeet Acharya <ashijeetacharya@gmail.com>
---
block/vmdk.c | 513 ++++++++++++++++++++++++++++++++++++++++++-----------------
1 file changed, 366 insertions(+), 147 deletions(-)
diff --git a/block/vmdk.c b/block/vmdk.c
index a9bd22b..3dc178b 100644
--- a/block/vmdk.c
+++ b/block/vmdk.c
@@ -136,6 +136,7 @@ typedef struct VmdkMetaData {
unsigned int l2_offset;
int valid;
uint32_t *l2_cache_entry;
+ uint32_t nb_clusters;
} VmdkMetaData;
typedef struct VmdkGrainMarker {
@@ -242,6 +243,26 @@ static void vmdk_free_last_extent(BlockDriverState *bs)
s->extents = g_renew(VmdkExtent, s->extents, s->num_extents);
}
+static inline uint64_t vmdk_find_offset_in_cluster(VmdkExtent *extent,
+ int64_t offset)
+{
+ uint64_t extent_begin_offset, extent_relative_offset;
+ uint64_t cluster_size = extent->cluster_sectors * BDRV_SECTOR_SIZE;
+
+ extent_begin_offset =
+ (extent->end_sector - extent->sectors) * BDRV_SECTOR_SIZE;
+ extent_relative_offset = offset - extent_begin_offset;
+ return extent_relative_offset % cluster_size;
+}
+
+static inline uint64_t size_to_clusters(VmdkExtent *extent, uint64_t size)
+{
+ uint64_t cluster_size, round_off_size;
+ cluster_size = extent->cluster_sectors * BDRV_SECTOR_SIZE;
+ round_off_size = cluster_size - (size % cluster_size);
+ return ((size + round_off_size) / cluster_size) - 1;
+}
+
static uint32_t vmdk_read_cid(BlockDriverState *bs, int parent)
{
char *desc;
@@ -1016,8 +1037,135 @@ static void vmdk_refresh_limits(BlockDriverState *bs, Error **errp)
}
}
-/**
- * get_whole_cluster
+static int vmdk_L2update(VmdkExtent *extent, VmdkMetaData *m_data,
+ uint32_t offset)
+{
+ offset = cpu_to_le32(offset);
+ /* update L2 table */
+ if (bdrv_pwrite_sync(extent->file,
+ ((int64_t)m_data->l2_offset * 512)
+ + (m_data->l2_index * sizeof(offset)),
+ &offset, sizeof(offset)) < 0) {
+ return VMDK_ERROR;
+ }
+ /* update backup L2 table */
+ if (extent->l1_backup_table_offset != 0) {
+ m_data->l2_offset = extent->l1_backup_table[m_data->l1_index];
+ if (bdrv_pwrite_sync(extent->file,
+ ((int64_t)m_data->l2_offset * 512)
+ + (m_data->l2_index * sizeof(offset)),
+ &offset, sizeof(offset)) < 0) {
+ return VMDK_ERROR;
+ }
+ }
+ if (m_data->l2_cache_entry) {
+ *m_data->l2_cache_entry = offset;
+ }
+
+ return VMDK_OK;
+}
+
+/*
+ * vmdk_L2load
+ *
+ * Loads a new L2 table into memory. If the table is in the cache, the cache
+ * is used; otherwise the L2 table is loaded from the image file.
+ *
+ * Returns:
+ * VMDK_OK: on success
+ * VMDK_ERROR: in error cases
+ */
+static int vmdk_L2load(VmdkExtent *extent, uint64_t offset, int l2_offset,
+ uint32_t **new_l2_table, int *new_l2_index)
+{
+ int min_index, i, j;
+ uint32_t *l2_table;
+ uint32_t min_count;
+
+ for (i = 0; i < L2_CACHE_SIZE; i++) {
+ if (l2_offset == extent->l2_cache_offsets[i]) {
+ /* increment the hit count */
+ if (++extent->l2_cache_counts[i] == 0xffffffff) {
+ for (j = 0; j < L2_CACHE_SIZE; j++) {
+ extent->l2_cache_counts[j] >>= 1;
+ }
+ }
+ l2_table = extent->l2_cache + (i * extent->l2_size);
+ goto found;
+ }
+ }
+ /* not found: load a new entry in the least used one */
+ min_index = 0;
+ min_count = 0xffffffff;
+ for (i = 0; i < L2_CACHE_SIZE; i++) {
+ if (extent->l2_cache_counts[i] < min_count) {
+ min_count = extent->l2_cache_counts[i];
+ min_index = i;
+ }
+ }
+ l2_table = extent->l2_cache + (min_index * extent->l2_size);
+ if (bdrv_pread(extent->file,
+ (int64_t)l2_offset * 512,
+ l2_table,
+ extent->l2_size * sizeof(uint32_t)
+ ) != extent->l2_size * sizeof(uint32_t)) {
+ return VMDK_ERROR;
+ }
+
+ extent->l2_cache_offsets[min_index] = l2_offset;
+ extent->l2_cache_counts[min_index] = 1;
+found:
+ *new_l2_index = ((offset >> 9) / extent->cluster_sectors) % extent->l2_size;
+ *new_l2_table = l2_table;
+
+ return VMDK_OK;
+}
+
+/*
+ * get_cluster_table
+ *
+ * For a given offset, load the L2 table that covers it.
+ *
+ * Returns:
+ * VMDK_OK: on success
+ *
+ * VMDK_UNALLOC: if cluster is not mapped
+ *
+ * VMDK_ERROR: in error cases
+ */
+static int get_cluster_table(VmdkExtent *extent, uint64_t offset,
+ int *new_l1_index, int *new_l2_offset,
+ int *new_l2_index, uint32_t **new_l2_table)
+{
+ int l1_index, l2_offset, l2_index;
+ uint32_t *l2_table;
+ int ret;
+
+ offset -= (extent->end_sector - extent->sectors) * SECTOR_SIZE;
+ l1_index = (offset >> 9) / extent->l1_entry_sectors;
+ if (l1_index >= extent->l1_size) {
+ return VMDK_ERROR;
+ }
+ l2_offset = extent->l1_table[l1_index];
+ if (!l2_offset) {
+ return VMDK_UNALLOC;
+ }
+
+ ret = vmdk_L2load(extent, offset, l2_offset, &l2_table, &l2_index);
+ if (ret < 0) {
+ return ret;
+ }
+
+ *new_l1_index = l1_index;
+ *new_l2_offset = l2_offset;
+ *new_l2_index = l2_index;
+ *new_l2_table = l2_table;
+
+ return VMDK_OK;
+}
+
+/*
+ * do_alloc_cluster_offset
*
* Copy backing file's cluster that covers @sector_num, otherwise write zero,
* to the cluster at @cluster_sector_num.
@@ -1025,13 +1173,18 @@ static void vmdk_refresh_limits(BlockDriverState *bs, Error **errp)
* If @skip_start_sector < @skip_end_sector, the relative range
* [@skip_start_sector, @skip_end_sector) is not copied or written, and leave
* it for call to write user data in the request.
+ *
+ * Returns:
+ * VMDK_OK: on success
+ *
+ * VMDK_ERROR: in error cases
*/
-static int get_whole_cluster(BlockDriverState *bs,
- VmdkExtent *extent,
- uint64_t cluster_offset,
- uint64_t offset,
- uint64_t skip_start_bytes,
- uint64_t skip_end_bytes)
+static int do_alloc_cluster_offset(BlockDriverState *bs,
+ VmdkExtent *extent,
+ uint64_t cluster_offset,
+ uint64_t offset,
+ uint64_t skip_start_bytes,
+ uint64_t skip_end_bytes)
{
int ret = VMDK_OK;
int64_t cluster_bytes;
@@ -1098,36 +1251,176 @@ exit:
return ret;
}
-static int vmdk_L2update(VmdkExtent *extent, VmdkMetaData *m_data,
- uint32_t offset)
+/*
+ * handle_alloc
+ *
+ * Allocates new clusters for an area that either is yet unallocated or needs a
+ * copy on write. If *cluster_offset is non-zero, clusters are only allocated if
+ * the new allocation can match the specified host offset.
+ *
+ * Returns:
+ * VMDK_OK: if new clusters were allocated, *bytes may be decreased if
+ * the new allocation doesn't cover all of the requested area.
+ * *cluster_offset is updated to contain the offset of the
+ * first newly allocated cluster.
+ *
+ * VMDK_UNALLOC: if no clusters could be allocated. *cluster_offset is left
+ * unchanged.
+ *
+ * VMDK_ERROR: in error cases
+ */
+static int handle_alloc(BlockDriverState *bs, VmdkExtent *extent,
+ uint64_t offset, uint64_t *cluster_offset,
+ int64_t *bytes, VmdkMetaData *m_data,
+ bool allocate, uint32_t *total_alloc_clusters)
{
- offset = cpu_to_le32(offset);
- /* update L2 table */
- if (bdrv_pwrite_sync(extent->file,
- ((int64_t)m_data->l2_offset * 512)
- + (m_data->l2_index * sizeof(offset)),
- &offset, sizeof(offset)) < 0) {
- return VMDK_ERROR;
+ int l1_index, l2_offset, l2_index;
+ uint32_t *l2_table;
+ uint32_t cluster_sector;
+ uint32_t nb_clusters;
+ bool zeroed = false;
+ uint64_t skip_start_bytes, skip_end_bytes;
+ int ret;
+
+ ret = get_cluster_table(extent, offset, &l1_index, &l2_offset,
+ &l2_index, &l2_table);
+ if (ret < 0) {
+ return ret;
}
- /* update backup L2 table */
- if (extent->l1_backup_table_offset != 0) {
- m_data->l2_offset = extent->l1_backup_table[m_data->l1_index];
- if (bdrv_pwrite_sync(extent->file,
- ((int64_t)m_data->l2_offset * 512)
- + (m_data->l2_index * sizeof(offset)),
- &offset, sizeof(offset)) < 0) {
- return VMDK_ERROR;
+
+ cluster_sector = le32_to_cpu(l2_table[l2_index]);
+
+ skip_start_bytes = vmdk_find_offset_in_cluster(extent, offset);
+ /* Calculate the number of clusters to look for. Here it will return one
+ * cluster less than the actual value calculated as we may need to perform
+ * COW for the last one. */
+ nb_clusters = size_to_clusters(extent, skip_start_bytes + *bytes);
+
+ nb_clusters = MIN(nb_clusters, extent->l2_size - l2_index);
+ assert(nb_clusters <= INT_MAX);
+
+ /* update bytes according to final nb_clusters value */
+ if (nb_clusters != 0) {
+ *bytes = ((nb_clusters * extent->cluster_sectors) << 9)
+ - skip_start_bytes;
+ } else {
+ nb_clusters = 1;
+ }
+ *total_alloc_clusters += nb_clusters;
+ skip_end_bytes = skip_start_bytes + MIN(*bytes,
+ extent->cluster_sectors * BDRV_SECTOR_SIZE
+ - skip_start_bytes);
+
+ if (extent->has_zero_grain && cluster_sector == VMDK_GTE_ZEROED) {
+ zeroed = true;
+ }
+
+ if (!cluster_sector || zeroed) {
+ if (!allocate) {
+ return zeroed ? VMDK_ZEROED : VMDK_UNALLOC;
+ }
+
+ cluster_sector = extent->next_cluster_sector;
+ extent->next_cluster_sector += extent->cluster_sectors
+ * nb_clusters;
+
+ ret = do_alloc_cluster_offset(bs, extent,
+ cluster_sector * BDRV_SECTOR_SIZE,
+ offset, skip_start_bytes,
+ skip_end_bytes);
+ if (ret < 0) {
+ return ret;
+ }
+ if (m_data) {
+ m_data->valid = 1;
+ m_data->l1_index = l1_index;
+ m_data->l2_index = l2_index;
+ m_data->l2_offset = l2_offset;
+ m_data->l2_cache_entry = &l2_table[l2_index];
+ m_data->nb_clusters = nb_clusters;
}
}
- if (m_data->l2_cache_entry) {
- *m_data->l2_cache_entry = offset;
+ *cluster_offset = cluster_sector << BDRV_SECTOR_BITS;
+ return VMDK_OK;
+}
+
+/*
+ * vmdk_alloc_cluster_offset
+ *
+ * For a given offset on the virtual disk, find the cluster offset in vmdk
+ * file. If the offset is not found, allocate a new cluster.
+ *
+ * If the cluster is newly allocated, m_data->nb_clusters is set to the number
+ * of contiguous clusters that have been allocated. In this case, the other
+ * fields of m_data are valid and contain information about the first allocated
+ * cluster.
+ *
+ * Returns:
+ *
+ * VMDK_OK: on success and @cluster_offset was set
+ *
+ * VMDK_UNALLOC: if no clusters were allocated and @cluster_offset is
+ * set to zero
+ *
+ * VMDK_ERROR: in error cases
+ */
+static int vmdk_alloc_cluster_offset(BlockDriverState *bs,
+ VmdkExtent *extent,
+ VmdkMetaData *m_data, uint64_t offset,
+ bool allocate, uint64_t *cluster_offset,
+ int64_t bytes,
+ uint32_t *total_alloc_clusters)
+{
+ uint64_t start, remaining;
+ uint64_t new_cluster_offset;
+ int64_t n_bytes;
+ int ret;
+
+ if (extent->flat) {
+ *cluster_offset = extent->flat_start_offset;
+ return VMDK_OK;
+ }
+
+ start = offset;
+ remaining = bytes;
+ new_cluster_offset = 0;
+ *cluster_offset = 0;
+ n_bytes = 0;
+ if (m_data) {
+ m_data->valid = 0;
+ }
+
+ /* Due to L2 table boundaries, all bytes may not get allocated at once */
+ while (true) {
+
+ if (!*cluster_offset) {
+ *cluster_offset = new_cluster_offset;
+ }
+
+ start += n_bytes;
+ remaining -= n_bytes;
+ new_cluster_offset += n_bytes;
+
+ if (remaining == 0) {
+ break;
+ }
+
+ n_bytes = remaining;
+
+ ret = handle_alloc(bs, extent, start, &new_cluster_offset, &n_bytes,
+ m_data, allocate, total_alloc_clusters);
+
+ if (ret < 0) {
+ return ret;
+
+ }
}
return VMDK_OK;
}
/**
- * get_cluster_offset
+ * vmdk_get_cluster_offset
*
* Look up cluster offset in extent file by sector number, and store in
* @cluster_offset.
@@ -1135,84 +1428,34 @@ static int vmdk_L2update(VmdkExtent *extent, VmdkMetaData *m_data,
* For flat extents, the start offset as parsed from the description file is
* returned.
*
- * For sparse extents, look up in L1, L2 table. If allocate is true, return an
- * offset for a new cluster and update L2 cache. If there is a backing file,
- * COW is done before returning; otherwise, zeroes are written to the allocated
- * cluster. Both COW and zero writing skips the sector range
- * [@skip_start_sector, @skip_end_sector) passed in by caller, because caller
- * has new data to write there.
+ * For sparse extents, look up in L1, L2 table.
*
* Returns: VMDK_OK if cluster exists and mapped in the image.
- * VMDK_UNALLOC if cluster is not mapped and @allocate is false.
- * VMDK_ERROR if failed.
+ * VMDK_UNALLOC if cluster is not mapped.
+ * VMDK_ERROR if failed.
*/
-static int get_cluster_offset(BlockDriverState *bs,
- VmdkExtent *extent,
- VmdkMetaData *m_data,
- uint64_t offset,
- bool allocate,
- uint64_t *cluster_offset,
- uint64_t skip_start_bytes,
- uint64_t skip_end_bytes)
+static int vmdk_get_cluster_offset(BlockDriverState *bs,
+ VmdkExtent *extent,
+ uint64_t offset,
+ uint64_t *cluster_offset)
{
- unsigned int l1_index, l2_offset, l2_index;
- int min_index, i, j;
- uint32_t min_count, *l2_table;
+ int l1_index, l2_offset, l2_index;
+ uint32_t *l2_table;
bool zeroed = false;
int64_t ret;
int64_t cluster_sector;
- if (m_data) {
- m_data->valid = 0;
- }
if (extent->flat) {
*cluster_offset = extent->flat_start_offset;
return VMDK_OK;
}
- offset -= (extent->end_sector - extent->sectors) * SECTOR_SIZE;
- l1_index = (offset >> 9) / extent->l1_entry_sectors;
- if (l1_index >= extent->l1_size) {
- return VMDK_ERROR;
- }
- l2_offset = extent->l1_table[l1_index];
- if (!l2_offset) {
- return VMDK_UNALLOC;
- }
- for (i = 0; i < L2_CACHE_SIZE; i++) {
- if (l2_offset == extent->l2_cache_offsets[i]) {
- /* increment the hit count */
- if (++extent->l2_cache_counts[i] == 0xffffffff) {
- for (j = 0; j < L2_CACHE_SIZE; j++) {
- extent->l2_cache_counts[j] >>= 1;
- }
- }
- l2_table = extent->l2_cache + (i * extent->l2_size);
- goto found;
- }
- }
- /* not found: load a new entry in the least used one */
- min_index = 0;
- min_count = 0xffffffff;
- for (i = 0; i < L2_CACHE_SIZE; i++) {
- if (extent->l2_cache_counts[i] < min_count) {
- min_count = extent->l2_cache_counts[i];
- min_index = i;
- }
- }
- l2_table = extent->l2_cache + (min_index * extent->l2_size);
- if (bdrv_pread(extent->file,
- (int64_t)l2_offset * 512,
- l2_table,
- extent->l2_size * sizeof(uint32_t)
- ) != extent->l2_size * sizeof(uint32_t)) {
- return VMDK_ERROR;
+ ret = get_cluster_table(extent, offset, &l1_index, &l2_offset,
+ &l2_index, &l2_table);
+ if (ret < 0) {
+ return ret;
}
- extent->l2_cache_offsets[min_index] = l2_offset;
- extent->l2_cache_counts[min_index] = 1;
- found:
- l2_index = ((offset >> 9) / extent->cluster_sectors) % extent->l2_size;
cluster_sector = le32_to_cpu(l2_table[l2_index]);
if (extent->has_zero_grain && cluster_sector == VMDK_GTE_ZEROED) {
@@ -1220,31 +1463,9 @@ static int get_cluster_offset(BlockDriverState *bs,
}
if (!cluster_sector || zeroed) {
- if (!allocate) {
- return zeroed ? VMDK_ZEROED : VMDK_UNALLOC;
- }
-
- cluster_sector = extent->next_cluster_sector;
- extent->next_cluster_sector += extent->cluster_sectors;
-
- /* First of all we write grain itself, to avoid race condition
- * that may to corrupt the image.
- * This problem may occur because of insufficient space on host disk
- * or inappropriate VM shutdown.
- */
- ret = get_whole_cluster(bs, extent, cluster_sector * BDRV_SECTOR_SIZE,
- offset, skip_start_bytes, skip_end_bytes);
- if (ret) {
- return ret;
- }
- if (m_data) {
- m_data->valid = 1;
- m_data->l1_index = l1_index;
- m_data->l2_index = l2_index;
- m_data->l2_offset = l2_offset;
- m_data->l2_cache_entry = &l2_table[l2_index];
- }
+ return zeroed ? VMDK_ZEROED : VMDK_UNALLOC;
}
+
*cluster_offset = cluster_sector << BDRV_SECTOR_BITS;
return VMDK_OK;
}
@@ -1266,18 +1487,6 @@ static VmdkExtent *find_extent(BDRVVmdkState *s,
return NULL;
}
-static inline uint64_t vmdk_find_offset_in_cluster(VmdkExtent *extent,
- int64_t offset)
-{
- uint64_t extent_begin_offset, extent_relative_offset;
- uint64_t cluster_size = extent->cluster_sectors * BDRV_SECTOR_SIZE;
-
- extent_begin_offset =
- (extent->end_sector - extent->sectors) * BDRV_SECTOR_SIZE;
- extent_relative_offset = offset - extent_begin_offset;
- return extent_relative_offset % cluster_size;
-}
-
static inline uint64_t vmdk_find_index_in_cluster(VmdkExtent *extent,
int64_t sector_num)
{
@@ -1299,9 +1508,7 @@ static int64_t coroutine_fn vmdk_co_get_block_status(BlockDriverState *bs,
return 0;
}
qemu_co_mutex_lock(&s->lock);
- ret = get_cluster_offset(bs, extent, NULL,
- sector_num * 512, false, &offset,
- 0, 0);
+ ret = vmdk_get_cluster_offset(bs, extent, sector_num * 512, &offset);
qemu_co_mutex_unlock(&s->lock);
index_in_cluster = vmdk_find_index_in_cluster(extent, sector_num);
@@ -1492,13 +1699,13 @@ vmdk_co_preadv(BlockDriverState *bs, uint64_t offset, uint64_t bytes,
ret = -EIO;
goto fail;
}
- ret = get_cluster_offset(bs, extent, NULL,
- offset, false, &cluster_offset, 0, 0);
offset_in_cluster = vmdk_find_offset_in_cluster(extent, offset);
n_bytes = MIN(bytes, extent->cluster_sectors * BDRV_SECTOR_SIZE
- offset_in_cluster);
+ ret = vmdk_get_cluster_offset(bs, extent, offset, &cluster_offset);
+
if (ret != VMDK_OK) {
/* if not allocated, try to read from parent image, if exist */
if (bs->backing && ret != VMDK_ZEROED) {
@@ -1561,7 +1768,9 @@ static int vmdk_pwritev(BlockDriverState *bs, uint64_t offset,
int64_t offset_in_cluster, n_bytes;
uint64_t cluster_offset;
uint64_t bytes_done = 0;
+ uint64_t extent_size;
VmdkMetaData m_data;
+ uint32_t total_alloc_clusters = 0;
if (DIV_ROUND_UP(offset, BDRV_SECTOR_SIZE) > bs->total_sectors) {
error_report("Wrong offset: offset=0x%" PRIx64
@@ -1575,14 +1784,22 @@ static int vmdk_pwritev(BlockDriverState *bs, uint64_t offset,
if (!extent) {
return -EIO;
}
+ extent_size = extent->end_sector * BDRV_SECTOR_SIZE;
+
offset_in_cluster = vmdk_find_offset_in_cluster(extent, offset);
- n_bytes = MIN(bytes, extent->cluster_sectors * BDRV_SECTOR_SIZE
- - offset_in_cluster);
- ret = get_cluster_offset(bs, extent, &m_data, offset,
- !(extent->compressed || zeroed),
- &cluster_offset, offset_in_cluster,
- offset_in_cluster + n_bytes);
+ /* truncate n_bytes to the first cluster because we need to perform COW */
+ if (offset_in_cluster > 0) {
+ n_bytes = MIN(bytes, extent->cluster_sectors * BDRV_SECTOR_SIZE
+ - offset_in_cluster);
+ } else {
+ n_bytes = MIN(bytes, extent_size - offset);
+ }
+
+ ret = vmdk_alloc_cluster_offset(bs, extent, &m_data, offset,
+ !(extent->compressed || zeroed),
+ &cluster_offset, n_bytes,
+ &total_alloc_clusters);
if (extent->compressed) {
if (ret == VMDK_OK) {
/* Refuse write to allocated cluster for streamOptimized */
@@ -1591,19 +1808,22 @@ static int vmdk_pwritev(BlockDriverState *bs, uint64_t offset,
return -EIO;
} else {
/* allocate */
- ret = get_cluster_offset(bs, extent, &m_data, offset,
- true, &cluster_offset, 0, 0);
+ ret = vmdk_alloc_cluster_offset(bs, extent, &m_data, offset,
+ true, &cluster_offset, n_bytes,
+ &total_alloc_clusters);
}
}
if (ret == VMDK_ERROR) {
return -EINVAL;
}
+
if (zeroed) {
/* Do zeroed write, buf is ignored */
- if (extent->has_zero_grain &&
- offset_in_cluster == 0 &&
- n_bytes >= extent->cluster_sectors * BDRV_SECTOR_SIZE) {
- n_bytes = extent->cluster_sectors * BDRV_SECTOR_SIZE;
+ if (extent->has_zero_grain && offset_in_cluster == 0 &&
+ n_bytes >= extent->cluster_sectors * BDRV_SECTOR_SIZE *
+ total_alloc_clusters) {
+ n_bytes = extent->cluster_sectors * BDRV_SECTOR_SIZE *
+ total_alloc_clusters;
if (!zero_dry_run) {
/* update L2 tables */
if (vmdk_L2update(extent, &m_data, VMDK_GTE_ZEROED)
@@ -2224,9 +2444,8 @@ static int vmdk_check(BlockDriverState *bs, BdrvCheckResult *result,
sector_num);
break;
}
- ret = get_cluster_offset(bs, extent, NULL,
- sector_num << BDRV_SECTOR_BITS,
- false, &cluster_offset, 0, 0);
+ ret = vmdk_get_cluster_offset(bs, extent,
+ sector_num << BDRV_SECTOR_BITS, &cluster_offset);
if (ret == VMDK_ERROR) {
fprintf(stderr,
"ERROR: could not get cluster_offset for sector %"
--
2.6.2
* [Qemu-devel] [PATCH 2/2] vmdk: Update metadata for multiple clusters
2017-03-11 11:54 [Qemu-devel] [PATCH 0/2] Allocate multiple clusters for VMDK I/O Ashijeet Acharya
2017-03-11 11:54 ` [Qemu-devel] [PATCH 1/2] vmdk: Optimize I/O by allocating multiple clusters Ashijeet Acharya
@ 2017-03-11 11:54 ` Ashijeet Acharya
2017-03-21 7:51 ` [Qemu-devel] [PATCH 0/2] Allocate multiple clusters for VMDK I/O Stefan Hajnoczi
2 siblings, 0 replies; 11+ messages in thread
From: Ashijeet Acharya @ 2017-03-11 11:54 UTC (permalink / raw)
To: famz; +Cc: kwolf, jsnow, mreitz, qemu-devel, qemu-block, Ashijeet Acharya
Add a next pointer to the VmdkMetaData struct so that each entry can point to
the metadata of the previously allocated cluster run. Modify vmdk_L2update to
update the metadata for an allocation spanning multiple clusters at once.
Signed-off-by: Ashijeet Acharya <ashijeetacharya@gmail.com>
---
block/vmdk.c | 131 ++++++++++++++++++++++++++++++++++++++++++++++-------------
1 file changed, 102 insertions(+), 29 deletions(-)
diff --git a/block/vmdk.c b/block/vmdk.c
index 3dc178b..62ea46f 100644
--- a/block/vmdk.c
+++ b/block/vmdk.c
@@ -137,6 +137,8 @@ typedef struct VmdkMetaData {
int valid;
uint32_t *l2_cache_entry;
uint32_t nb_clusters;
+ uint32_t offset;
+ struct VmdkMetaData *next;
} VmdkMetaData;
typedef struct VmdkGrainMarker {
@@ -1037,29 +1039,81 @@ static void vmdk_refresh_limits(BlockDriverState *bs, Error **errp)
}
}
-static int vmdk_L2update(VmdkExtent *extent, VmdkMetaData *m_data,
- uint32_t offset)
+static int vmdk_alloc_cluster_link_l2(VmdkExtent *extent,
+ VmdkMetaData *m_data, bool zeroed)
{
- offset = cpu_to_le32(offset);
+ int i;
+ uint32_t offset, temp_offset;
+
+ if (zeroed) {
+ temp_offset = VMDK_GTE_ZEROED;
+ } else {
+ temp_offset = m_data->offset;
+ }
+
+ temp_offset = cpu_to_le32(temp_offset);
+
/* update L2 table */
- if (bdrv_pwrite_sync(extent->file,
+ offset = temp_offset;
+ for (i = 0; i < m_data->nb_clusters; i++) {
+ if (bdrv_pwrite_sync(extent->file,
((int64_t)m_data->l2_offset * 512)
- + (m_data->l2_index * sizeof(offset)),
- &offset, sizeof(offset)) < 0) {
- return VMDK_ERROR;
+ + ((m_data->l2_index + i) * sizeof(offset)),
+ &(offset), sizeof(offset)) < 0) {
+ return VMDK_ERROR;
+ }
+ if (!zeroed) {
+ offset += extent->cluster_sectors;
+ }
}
+
/* update backup L2 table */
+ offset = temp_offset;
if (extent->l1_backup_table_offset != 0) {
m_data->l2_offset = extent->l1_backup_table[m_data->l1_index];
- if (bdrv_pwrite_sync(extent->file,
- ((int64_t)m_data->l2_offset * 512)
- + (m_data->l2_index * sizeof(offset)),
- &offset, sizeof(offset)) < 0) {
- return VMDK_ERROR;
+ for (i = 0; i < m_data->nb_clusters; i++) {
+ if (bdrv_pwrite_sync(extent->file,
+ ((int64_t)m_data->l2_offset * 512)
+ + ((m_data->l2_index + i) * sizeof(offset)),
+ &(offset), sizeof(offset)) < 0) {
+ return VMDK_ERROR;
+ }
+ if (!zeroed) {
+ offset += extent->cluster_sectors;
+ }
}
}
+
+ offset = temp_offset;
if (m_data->l2_cache_entry) {
- *m_data->l2_cache_entry = offset;
+ for (i = 0; i < m_data->nb_clusters; i++) {
+ *m_data->l2_cache_entry = offset;
+ m_data->l2_cache_entry++;
+
+ if (!zeroed) {
+ offset += extent->cluster_sectors;
+ }
+ }
+ }
+
+ return VMDK_OK;
+}
+
+static int vmdk_L2update(VmdkExtent *extent, VmdkMetaData *m_data,
+ bool zeroed)
+{
+ int ret;
+
+ while (m_data->next != NULL) {
+ VmdkMetaData *next;
+
+ ret = vmdk_alloc_cluster_link_l2(extent, m_data, zeroed);
+ if (ret < 0) {
+ return ret;
+ }
+
+ next = m_data->next;
+ m_data = next;
}
return VMDK_OK;
@@ -1271,7 +1325,7 @@ exit:
*/
static int handle_alloc(BlockDriverState *bs, VmdkExtent *extent,
uint64_t offset, uint64_t *cluster_offset,
- int64_t *bytes, VmdkMetaData *m_data,
+ int64_t *bytes, VmdkMetaData **m_data,
bool allocate, uint32_t *total_alloc_clusters)
{
int l1_index, l2_offset, l2_index;
@@ -1280,6 +1334,7 @@ static int handle_alloc(BlockDriverState *bs, VmdkExtent *extent,
uint32_t nb_clusters;
bool zeroed = false;
uint64_t skip_start_bytes, skip_end_bytes;
+ VmdkMetaData *old_m_data;
int ret;
ret = get_cluster_table(extent, offset, &l1_index, &l2_offset,
@@ -1331,13 +1386,21 @@ static int handle_alloc(BlockDriverState *bs, VmdkExtent *extent,
if (ret < 0) {
return ret;
}
- if (m_data) {
- m_data->valid = 1;
- m_data->l1_index = l1_index;
- m_data->l2_index = l2_index;
- m_data->l2_offset = l2_offset;
- m_data->l2_cache_entry = &l2_table[l2_index];
- m_data->nb_clusters = nb_clusters;
+
+ if (*m_data) {
+ old_m_data = *m_data;
+ *m_data = g_malloc0(sizeof(**m_data));
+
+ **m_data = (VmdkMetaData) {
+ .valid = 1,
+ .l1_index = l1_index,
+ .l2_index = l2_index,
+ .l2_offset = l2_offset,
+ .l2_cache_entry = &l2_table[l2_index],
+ .nb_clusters = nb_clusters,
+ .offset = cluster_sector,
+ .next = old_m_data,
+ };
}
}
*cluster_offset = cluster_sector << BDRV_SECTOR_BITS;
@@ -1366,7 +1429,7 @@ static int handle_alloc(BlockDriverState *bs, VmdkExtent *extent,
*/
static int vmdk_alloc_cluster_offset(BlockDriverState *bs,
VmdkExtent *extent,
- VmdkMetaData *m_data, uint64_t offset,
+ VmdkMetaData **m_data, uint64_t offset,
bool allocate, uint64_t *cluster_offset,
int64_t bytes,
uint32_t *total_alloc_clusters)
@@ -1386,8 +1449,8 @@ static int vmdk_alloc_cluster_offset(BlockDriverState *bs,
new_cluster_offset = 0;
*cluster_offset = 0;
n_bytes = 0;
- if (m_data) {
- m_data->valid = 0;
+ if (*m_data) {
+ (*m_data)->valid = 0;
}
/* due to L2 table margins all bytes may not get allocated at once */
@@ -1769,9 +1832,11 @@ static int vmdk_pwritev(BlockDriverState *bs, uint64_t offset,
uint64_t cluster_offset;
uint64_t bytes_done = 0;
uint64_t extent_size;
- VmdkMetaData m_data;
+ VmdkMetaData *m_data;
uint32_t total_alloc_clusters = 0;
+ m_data = g_malloc0(sizeof(*m_data));
+
if (DIV_ROUND_UP(offset, BDRV_SECTOR_SIZE) > bs->total_sectors) {
error_report("Wrong offset: offset=0x%" PRIx64
" total_sectors=0x%" PRIx64,
@@ -1780,6 +1845,7 @@ static int vmdk_pwritev(BlockDriverState *bs, uint64_t offset,
}
while (bytes > 0) {
+ m_data->next = NULL;
extent = find_extent(s, offset >> BDRV_SECTOR_BITS, extent);
if (!extent) {
return -EIO;
@@ -1826,7 +1892,7 @@ static int vmdk_pwritev(BlockDriverState *bs, uint64_t offset,
total_alloc_clusters;
if (!zero_dry_run) {
/* update L2 tables */
- if (vmdk_L2update(extent, &m_data, VMDK_GTE_ZEROED)
+ if (vmdk_L2update(extent, m_data, zeroed)
!= VMDK_OK) {
return -EIO;
}
@@ -1840,10 +1906,9 @@ static int vmdk_pwritev(BlockDriverState *bs, uint64_t offset,
if (ret) {
return ret;
}
- if (m_data.valid) {
+ if (m_data->valid) {
/* update L2 tables */
- if (vmdk_L2update(extent, &m_data,
- cluster_offset >> BDRV_SECTOR_BITS)
+ if (vmdk_L2update(extent, m_data, zeroed)
!= VMDK_OK) {
return -EIO;
}
@@ -1853,6 +1918,13 @@ static int vmdk_pwritev(BlockDriverState *bs, uint64_t offset,
offset += n_bytes;
bytes_done += n_bytes;
+ while (m_data->next != NULL) {
+ VmdkMetaData *next;
+ next = m_data->next;
+ g_free(m_data);
+ m_data = next;
+ }
+
/* update CID on the first write every time the virtual disk is
* opened */
if (!s->cid_updated) {
@@ -1863,6 +1935,7 @@ static int vmdk_pwritev(BlockDriverState *bs, uint64_t offset,
s->cid_updated = true;
}
}
+ g_free(m_data);
return 0;
}
--
2.6.2
* Re: [Qemu-devel] [PATCH 0/2] Allocate multiple clusters for VMDK I/O
2017-03-11 11:54 [Qemu-devel] [PATCH 0/2] Allocate multiple clusters for VMDK I/O Ashijeet Acharya
2017-03-11 11:54 ` [Qemu-devel] [PATCH 1/2] vmdk: Optimize I/O by allocating multiple clusters Ashijeet Acharya
2017-03-11 11:54 ` [Qemu-devel] [PATCH 2/2] vmdk: Update metadata for " Ashijeet Acharya
@ 2017-03-21 7:51 ` Stefan Hajnoczi
2017-03-21 9:14 ` Ashijeet Acharya
2 siblings, 1 reply; 11+ messages in thread
From: Stefan Hajnoczi @ 2017-03-21 7:51 UTC (permalink / raw)
To: Ashijeet Acharya
Cc: Fam Zheng, Kevin Wolf, qemu block, qemu-devel, Max Reitz,
John Snow
On Sat, Mar 11, 2017 at 11:54 AM, Ashijeet Acharya
<ashijeetacharya@gmail.com> wrote:
> This series optimizes the I/O performance of VMDK driver.
>
> Patch 1 makes the VMDK driver to allocate multiple clusters at once. Earlier
> it used to allocate cluster by cluster which slowed down its performance to a
> great extent.
>
> Patch 2 changes the metadata update code to update the L2 tables for multiple
> clusters at once.
This patch series is a performance optimization. Benchmark results
are required to justify optimizations. Please include performance
results in the next revision.
A popular disk I/O benchmarking tool is fio (https://github.com/axboe/fio).
I suggest a write-heavy workload with a large block size:
$ cat fio.job
[global]
direct=1
filename=/dev/vdb
ioengine=libaio
runtime=30
ramp_time=5
[job1]
iodepth=4
rw=randwrite
bs=256k
$ for i in 1 2 3 4 5; do fio --output=fio-$i.txt fio.job; done #
WARNING: overwrites /dev/vdb
It's good practice to run the benchmark several times because there is
usually some variation between runs. This allows you to check that
the variance is within a reasonable range (5-10% on a normal machine
that hasn't been specially prepared for benchmarking).
Stefan
* Re: [Qemu-devel] [PATCH 0/2] Allocate multiple clusters for VMDK I/O
2017-03-21 7:51 ` [Qemu-devel] [PATCH 0/2] Allocate multiple clusters for VMDK I/O Stefan Hajnoczi
@ 2017-03-21 9:14 ` Ashijeet Acharya
2017-03-23 15:09 ` Stefan Hajnoczi
0 siblings, 1 reply; 11+ messages in thread
From: Ashijeet Acharya @ 2017-03-21 9:14 UTC (permalink / raw)
To: Stefan Hajnoczi
Cc: Fam Zheng, John Snow, Kevin Wolf, Max Reitz, qemu block,
qemu-devel
On Tue, 21 Mar 2017 at 13:21, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> On Sat, Mar 11, 2017 at 11:54 AM, Ashijeet Acharya
> <ashijeetacharya@gmail.com> wrote:
> > This series optimizes the I/O performance of VMDK driver.
> >
> > Patch 1 makes the VMDK driver to allocate multiple clusters at once.
> Earlier
> > it used to allocate cluster by cluster which slowed down its performance
> to a
> > great extent.
> >
> > Patch 2 changes the metadata update code to update the L2 tables for
> multiple
> > clusters at once.
>
> This patch series is a performance optimization. Benchmark results
> are required to justify optimizations. Please include performance
> results in the next revision.
>
> A popular disk I/O benchmarking is fio (https://github.com/axboe/fio).
> I suggest a write-heavy workload with a large block size:
>
> $ cat fio.job
> [global]
> direct=1
> filename=/dev/vdb
> ioengine=libaio
> runtime=30
> ramp_time=5
>
> [job1]
> iodepth=4
> rw=randwrite
> bs=256k
> $ for i in 1 2 3 4 5; do fio --output=fio-$i.txt fio.job; done #
> WARNING: overwrites /dev/vdb
>
> It's good practice to run the benchmark several times because there is
> usually some variation between runs. This allows you to check that
> the variance is within a reasonable range (5-10% on a normal machine
> that hasn't been specially prepared for benchmarking).
I ran a few 128M write tests using qemu-io and the results showed the time
dropping to almost half; will those work? I will also try to use the tool you
mentioned later today when I am free and include those results as well.
Ashijeet
>
>
> Stefan
>
* Re: [Qemu-devel] [PATCH 0/2] Allocate multiple clusters for VMDK I/O
2017-03-21 9:14 ` Ashijeet Acharya
@ 2017-03-23 15:09 ` Stefan Hajnoczi
2017-03-23 16:22 ` Ashijeet Acharya
0 siblings, 1 reply; 11+ messages in thread
From: Stefan Hajnoczi @ 2017-03-23 15:09 UTC (permalink / raw)
To: Ashijeet Acharya
Cc: Fam Zheng, John Snow, Kevin Wolf, Max Reitz, qemu block,
qemu-devel
On Tue, Mar 21, 2017 at 09:14:08AM +0000, Ashijeet Acharya wrote:
> On Tue, 21 Mar 2017 at 13:21, Stefan Hajnoczi <stefanha@gmail.com> wrote:
>
> > On Sat, Mar 11, 2017 at 11:54 AM, Ashijeet Acharya
> > <ashijeetacharya@gmail.com> wrote:
> > > This series optimizes the I/O performance of VMDK driver.
> > >
> > > Patch 1 makes the VMDK driver to allocate multiple clusters at once.
> > Earlier
> > > it used to allocate cluster by cluster which slowed down its performance
> > to a
> > > great extent.
> > >
> > > Patch 2 changes the metadata update code to update the L2 tables for
> > multiple
> > > clusters at once.
> >
> > This patch series is a performance optimization. Benchmark results
> > are required to justify optimizations. Please include performance
> > results in the next revision.
> >
> > A popular disk I/O benchmarking is fio (https://github.com/axboe/fio).
> > I suggest a write-heavy workload with a large block size:
> >
> > $ cat fio.job
> > [global]
> > direct=1
> > filename=/dev/vdb
> > ioengine=libaio
> > runtime=30
> > ramp_time=5
> >
> > [job1]
> > iodepth=4
> > rw=randwrite
> > bs=256k
> > $ for i in 1 2 3 4 5; do fio --output=fio-$i.txt fio.job; done #
> > WARNING: overwrites /dev/vdb
> >
> > It's good practice to run the benchmark several times because there is
> > usually some variation between runs. This allows you to check that
> > the variance is within a reasonable range (5-10% on a normal machine
> > that hasn't been specially prepared for benchmarking).
>
>
> I ran a few write tests of 128M using qemu-io and the results showed the
> time to drop to almost half, will those work? Although, I will also try to
> use the tool you mentioned later today when I am free and include those
> results as well.
Maybe; it's hard to say without seeing the commands you ran.
Stefan
* Re: [Qemu-devel] [PATCH 0/2] Allocate multiple clusters for VMDK I/O
2017-03-23 15:09 ` Stefan Hajnoczi
@ 2017-03-23 16:22 ` Ashijeet Acharya
2017-03-24 15:24 ` Stefan Hajnoczi
0 siblings, 1 reply; 11+ messages in thread
From: Ashijeet Acharya @ 2017-03-23 16:22 UTC (permalink / raw)
To: Stefan Hajnoczi
Cc: Fam Zheng, John Snow, Kevin Wolf, Max Reitz, qemu block,
qemu-devel
On Thu, Mar 23, 2017 at 8:39 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> On Tue, Mar 21, 2017 at 09:14:08AM +0000, Ashijeet Acharya wrote:
>> On Tue, 21 Mar 2017 at 13:21, Stefan Hajnoczi <stefanha@gmail.com> wrote:
>>
>> > On Sat, Mar 11, 2017 at 11:54 AM, Ashijeet Acharya
>> > <ashijeetacharya@gmail.com> wrote:
>> > > This series optimizes the I/O performance of VMDK driver.
>> > >
>> > > Patch 1 makes the VMDK driver to allocate multiple clusters at once.
>> > Earlier
>> > > it used to allocate cluster by cluster which slowed down its performance
>> > to a
>> > > great extent.
>> > >
>> > > Patch 2 changes the metadata update code to update the L2 tables for
>> > multiple
>> > > clusters at once.
>> >
>> > This patch series is a performance optimization. Benchmark results
>> > are required to justify optimizations. Please include performance
>> > results in the next revision.
>> >
>> > A popular disk I/O benchmarking is fio (https://github.com/axboe/fio).
>> > I suggest a write-heavy workload with a large block size:
>> >
>> > $ cat fio.job
>> > [global]
>> > direct=1
>> > filename=/dev/vdb
>> > ioengine=libaio
>> > runtime=30
>> > ramp_time=5
>> >
>> > [job1]
>> > iodepth=4
>> > rw=randwrite
>> > bs=256k
>> > $ for i in 1 2 3 4 5; do fio --output=fio-$i.txt fio.job; done #
>> > WARNING: overwrites /dev/vdb
>> >
>> > It's good practice to run the benchmark several times because there is
>> > usually some variation between runs. This allows you to check that
>> > the variance is within a reasonable range (5-10% on a normal machine
>> > that hasn't been specially prepared for benchmarking).
>>
>>
>> I ran a few write tests of 128M using qemu-io and the results showed the
>> time to drop to almost half, will those work? Although, I will also try to
>> use the tool you mentioned later today when I am free and include those
>> results as well.
>
> Maybe, it's hard to say without seeing the commands you ran.
These are the commands I ran to test the write requests:
My test file "test1.vmdk" is a 1G empty vmdk image created using the
'qemu-img' tool.
Before optimization:
$ ./bin/qemu-io -f vmdk --cache writeback
qemu-io> open -n -o driver=vmdk test1.vmdk
qemu-io> aio_write 0 128M
qemu-io> wrote 134217728/134217728 bytes at offset 0
128 MiB, 1 ops; 0:00:16.46 (7.772 MiB/sec and 0.0607 ops/sec)
After optimization:
$ ./bin/qemu-io -f vmdk --cache writeback
qemu-io> open -n -o driver=vmdk test1.vmdk
qemu-io> aio_write 0 128M
qemu-io> wrote 134217728/134217728 bytes at offset 0
128 MiB, 1 ops; 0:00:08.19 (15.627 MiB/sec and 0.1221 ops/sec)
Will these work?
I do have to mention, though, that I ran these tests on my PC at home
two weeks ago, and since I am back on my college campus now, I no
longer have access to it. Compared to my PC, my laptop has very low
specs (for example, it embarrassingly takes more than 3 minutes for the
same 128M write request in the 'before optimization' case), so I
won't be able to reproduce those results here. If possible, could any
of you maintainers run these tests with the fio tool on your machines
and send me the results, in case the above ones don't work? Sorry!
Thanks
Ashijeet
* Re: [Qemu-devel] [PATCH 1/2] vmdk: Optimize I/O by allocating multiple clusters
2017-03-11 11:54 ` [Qemu-devel] [PATCH 1/2] vmdk: Optimize I/O by allocating multiple clusters Ashijeet Acharya
@ 2017-03-23 19:10 ` Kevin Wolf
2017-03-23 19:18 ` Ashijeet Acharya
0 siblings, 1 reply; 11+ messages in thread
From: Kevin Wolf @ 2017-03-23 19:10 UTC (permalink / raw)
To: Ashijeet Acharya; +Cc: famz, jsnow, mreitz, qemu-devel, qemu-block
On 11.03.2017 at 12:54, Ashijeet Acharya wrote:
> The vmdk driver in block/vmdk.c used to allocate cluster by cluster
> which slowed down its I/O performance.
>
> Make vmdk driver allocate multiple clusters at once to reduce the
> overhead costs of multiple separate cluster allocation calls. The number
> of clusters allocated at once depends on the L2 table boundaries. Also
> the first and the last clusters are allocated separately as we may need
> to perform COW for them
>
> Signed-off-by: Ashijeet Acharya <ashijeetacharya@gmail.com>
Ashijeet, this is pretty hard to review because you're doing too many
things in a single patch. The rule is to do only one logical change in
each patch whenever it's possible. In particular, pure code motion (or
renames) need to be separated from code modification because when you
move code and modify it at the same time, it is very hard to see the
actual difference.
Do you think you could split this into multiple patches, like this:
1. Move vmdk_find_offset_in_cluster() to the top
2. Rename get_whole_cluster() to do_alloc_cluster_offset()
(To be honest, both names aren't great. Something like qcow2's
perform_cow() would describe better what's going on there.)
3. Factor out some code from get_cluster_offset() into handle_alloc()
4. Rename get_cluster_offset() into vmdk_get_cluster_offset()
...
n. Handle multiple clusters at once
Obviously, I haven't taken the time to fully read and understand all of
your patch, so don't take this too literally, but you see what kind of
granularity the individual patches should be. If you do it like this,
each change becomes really simple and can be reviewed and discussed on
its own. And if we later find a bug in VMDK, git bisect can point out
exactly which step introduced the problem rather than pointing at a
single big commit that is hard to understand.
Good splitting of patches is one of the tricks to get quicker reviews.
Admittedly, it is often not easy, but it is worth spending some thought
on how to make a series really easy to read for others.
Kevin
* Re: [Qemu-devel] [PATCH 1/2] vmdk: Optimize I/O by allocating multiple clusters
2017-03-23 19:10 ` Kevin Wolf
@ 2017-03-23 19:18 ` Ashijeet Acharya
0 siblings, 0 replies; 11+ messages in thread
From: Ashijeet Acharya @ 2017-03-23 19:18 UTC (permalink / raw)
To: Kevin Wolf; +Cc: Fam Zheng, John Snow, Max Reitz, QEMU Developers, qemu block
On Fri, Mar 24, 2017 at 12:40 AM, Kevin Wolf <kwolf@redhat.com> wrote:
> On 11.03.2017 at 12:54, Ashijeet Acharya wrote:
>> The vmdk driver in block/vmdk.c used to allocate cluster by cluster
>> which slowed down its I/O performance.
>>
>> Make vmdk driver allocate multiple clusters at once to reduce the
>> overhead costs of multiple separate cluster allocation calls. The number
>> of clusters allocated at once depends on the L2 table boundaries. Also
>> the first and the last clusters are allocated separately as we may need
>> to perform COW for them
>>
>> Signed-off-by: Ashijeet Acharya <ashijeetacharya@gmail.com>
>
> Ashijeet, this is pretty hard to review because you're doing too many
> things in a single patch. The rule is to do only one logical change in
> each patch whenever it's possible. In particular, pure code motion (or
> renames) need to be separated from code modification because when you
> move code and modify it at the same time, it is very hard to see the
> actual difference.
Hmm, sorry about that.
>
> Do you think you could split this into multiple patches, like this:
>
> 1. Move vmdk_find_offset_in_cluster() to the top
> 2. Rename get_whole_cluster() to do_alloc_cluster_offset()
> (To be honest, both names aren't great. Something like qcow2's
> perform_cow() would describe better what's going on there.)
> 3. Factor out some code from get_cluster_offset() into handle_alloc()
> 4. Rename get_cluster_offset() into vmdk_get_cluster_offset()
> ...
> n. Handle multiple clusters at once
>
> Obviously, I haven't taken the time to fully read and understand all of
> your patch, so don't take this too literally, but you see what kind of
> granularity the individual patches should be. If you do it like this,
> each change becomes really simple and can be reviewed and discussed on
> its own. And if we later find a bug in VMDK, git bisect can point out
> exactly which step introduced the problem rather than pointing at a
> single big commit that is hard to understand.
>
> Good splitting of patches is one of the tricks to get quicker reviews.
> Admittedly, it is often not easy, but it is worth spending some thought
> on how to make a series really easy to read for others.
Sure, I understand your point completely. I wasn't able to work out how to
split things up either, but the list you mentioned above certainly helps make
things clearer. I will post a v2 as soon as I can; you will probably receive
it first thing Monday.
One thing I want to ask: were you able to take a look at my reply
to the cover letter earlier? Will the test results using 'qemu-io' that I
posted there work?
Ashijeet
* Re: [Qemu-devel] [PATCH 0/2] Allocate multiple clusters for VMDK I/O
2017-03-23 16:22 ` Ashijeet Acharya
@ 2017-03-24 15:24 ` Stefan Hajnoczi
2017-03-24 15:37 ` Ashijeet Acharya
0 siblings, 1 reply; 11+ messages in thread
From: Stefan Hajnoczi @ 2017-03-24 15:24 UTC (permalink / raw)
To: Ashijeet Acharya
Cc: Fam Zheng, John Snow, Kevin Wolf, Max Reitz, qemu block,
qemu-devel
On Thu, Mar 23, 2017 at 4:22 PM, Ashijeet Acharya
<ashijeetacharya@gmail.com> wrote:
> On Thu, Mar 23, 2017 at 8:39 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
>> On Tue, Mar 21, 2017 at 09:14:08AM +0000, Ashijeet Acharya wrote:
>>> On Tue, 21 Mar 2017 at 13:21, Stefan Hajnoczi <stefanha@gmail.com> wrote:
>>>
>>> > On Sat, Mar 11, 2017 at 11:54 AM, Ashijeet Acharya
>>> > <ashijeetacharya@gmail.com> wrote:
>>> > > This series optimizes the I/O performance of VMDK driver.
>>> > >
>>> > > Patch 1 makes the VMDK driver to allocate multiple clusters at once.
>>> > Earlier
>>> > > it used to allocate cluster by cluster which slowed down its performance
>>> > to a
>>> > > great extent.
>>> > >
>>> > > Patch 2 changes the metadata update code to update the L2 tables for
>>> > multiple
>>> > > clusters at once.
>>> >
>>> > This patch series is a performance optimization. Benchmark results
>>> > are required to justify optimizations. Please include performance
>>> > results in the next revision.
>>> >
>>> > A popular disk I/O benchmarking is fio (https://github.com/axboe/fio).
>>> > I suggest a write-heavy workload with a large block size:
>>> >
>>> > $ cat fio.job
>>> > [global]
>>> > direct=1
>>> > filename=/dev/vdb
>>> > ioengine=libaio
>>> > runtime=30
>>> > ramp_time=5
>>> >
>>> > [job1]
>>> > iodepth=4
>>> > rw=randwrite
>>> > bs=256k
>>> > $ for i in 1 2 3 4 5; do fio --output=fio-$i.txt fio.job; done #
>>> > WARNING: overwrites /dev/vdb
>>> >
>>> > It's good practice to run the benchmark several times because there is
>>> > usually some variation between runs. This allows you to check that
>>> > the variance is within a reasonable range (5-10% on a normal machine
>>> > that hasn't been specially prepared for benchmarking).
>>>
>>>
>>> I ran a few write tests of 128M using qemu-io and the results showed the
>>> time to drop to almost half, will those work? Although, I will also try to
>>> use the tool you mentioned later today when I am free and include those
>>> results as well.
>>
>> Maybe, it's hard to say without seeing the commands you ran.
>
> These are the commands I ran to test the write requests:
>
> My test file "test1.vmdk" is a 1G empty vmdk image created by using
> 'qemu-img' tool.
>
> Before optimization:
> $ ./bin/qemu-io -f vmdk --cache writeback
> qemu-io> open -n -o driver=vmdk test1.vmdk
> qemu-io> aio_write 0 128M
> qemu-io> wrote 134217728/134217728 bytes at offset 0
> 128 MiB, 1 ops; 0:00:16.46 (7.772 MiB/sec and 0.0607 ops/sec)
>
> After optimization:
> $ ./bin/qemu-io -f vmdk --cache writeback
> qemu-io> open -n -o driver=vmdk test1.vmdk
> qemu-io> aio_write 0 128M
> qemu-io> wrote 134217728/134217728 bytes at offset 0
> 128 MiB, 1 ops; 0:00:08.19 (15.627 MiB/sec and 0.1221 ops/sec)
>
> Will these work?
It is best to avoid --cache writeback in performance tests because
using the host page cache puts the performance at the mercy of the
kernel's page cache.
I have run the following benchmark using "qemu-img bench":
This patch series improves 128 KB sequential write performance to an
empty VMDK file by 29%.
Benchmark command: ./qemu-img bench -w -c 1024 -s 128K -d 1 -t none -f
vmdk test.vmdk
(Please include the 2 lines above in the next revision of the patch.)
The qemu-img bench options used:
* -w issues write requests instead of reads
* -c 1024 terminates after 1024 requests
* -s 128K sets the request size to 128 KB
* -d 1 restricts the benchmark to 1 in-flight request at any time
* -t none uses O_DIRECT to bypass the host page cache
1. Without your patch
$ for i in 1 2 3 4 5; do ./qemu-img create -f vmdk test.vmdk 4G;
./qemu-img bench -w -c 1024 -s 128K -d 1 -t none -f vmdk test.vmdk;
done
Formatting 'test.vmdk', fmt=vmdk size=4294967296 compat6=off hwversion=undefined
Sending 1024 write requests, 131072 bytes each, 1 in parallel
(starting at offset 0, step size 131072)
Run completed in 35.081 seconds.
Formatting 'test.vmdk', fmt=vmdk size=4294967296 compat6=off hwversion=undefined
Sending 1024 write requests, 131072 bytes each, 1 in parallel
(starting at offset 0, step size 131072)
Run completed in 34.548 seconds.
Formatting 'test.vmdk', fmt=vmdk size=4294967296 compat6=off hwversion=undefined
Sending 1024 write requests, 131072 bytes each, 1 in parallel
(starting at offset 0, step size 131072)
Run completed in 34.637 seconds.
Formatting 'test.vmdk', fmt=vmdk size=4294967296 compat6=off hwversion=undefined
Sending 1024 write requests, 131072 bytes each, 1 in parallel
(starting at offset 0, step size 131072)
Run completed in 34.411 seconds.
Formatting 'test.vmdk', fmt=vmdk size=4294967296 compat6=off hwversion=undefined
Sending 1024 write requests, 131072 bytes each, 1 in parallel
(starting at offset 0, step size 131072)
Run completed in 34.599 seconds.
2. With your patch
$ for i in 1 2 3 4 5; do ./qemu-img create -f vmdk test.vmdk 4G;
./qemu-img bench -w -c 1024 -s 128K -d 1 -t none -f vmdk test.vmdk;
done
Formatting 'test.vmdk', fmt=vmdk size=4294967296 compat6=off hwversion=undefined
Sending 1024 write requests, 131072 bytes each, 1 in parallel
(starting at offset 0, step size 131072)
Run completed in 24.974 seconds.
Formatting 'test.vmdk', fmt=vmdk size=4294967296 compat6=off hwversion=undefined
Sending 1024 write requests, 131072 bytes each, 1 in parallel
(starting at offset 0, step size 131072)
Run completed in 24.769 seconds.
Formatting 'test.vmdk', fmt=vmdk size=4294967296 compat6=off hwversion=undefined
Sending 1024 write requests, 131072 bytes each, 1 in parallel
(starting at offset 0, step size 131072)
Run completed in 24.800 seconds.
Formatting 'test.vmdk', fmt=vmdk size=4294967296 compat6=off hwversion=undefined
Sending 1024 write requests, 131072 bytes each, 1 in parallel
(starting at offset 0, step size 131072)
Run completed in 24.928 seconds.
Formatting 'test.vmdk', fmt=vmdk size=4294967296 compat6=off hwversion=undefined
Sending 1024 write requests, 131072 bytes each, 1 in parallel
(starting at offset 0, step size 131072)
Run completed in 24.897 seconds.
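For reference, averaging the five runs gives roughly:
(35.081 + 34.548 + 34.637 + 34.411 + 34.599) / 5 = ~34.66 seconds before
(24.974 + 24.769 + 24.800 + 24.928 + 24.897) / 5 = ~24.87 seconds after
i.e. about a 28% reduction in runtime (roughly 1.39x the throughput),
consistent with the ~29% improvement quoted above.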
Stefan
* Re: [Qemu-devel] [PATCH 0/2] Allocate multiple clusters for VMDK I/O
2017-03-24 15:24 ` Stefan Hajnoczi
@ 2017-03-24 15:37 ` Ashijeet Acharya
0 siblings, 0 replies; 11+ messages in thread
From: Ashijeet Acharya @ 2017-03-24 15:37 UTC (permalink / raw)
To: Stefan Hajnoczi
Cc: Fam Zheng, John Snow, Kevin Wolf, Max Reitz, qemu block,
qemu-devel
On Fri, Mar 24, 2017 at 8:54 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> On Thu, Mar 23, 2017 at 4:22 PM, Ashijeet Acharya
> <ashijeetacharya@gmail.com> wrote:
>> On Thu, Mar 23, 2017 at 8:39 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
>>> On Tue, Mar 21, 2017 at 09:14:08AM +0000, Ashijeet Acharya wrote:
>>>> On Tue, 21 Mar 2017 at 13:21, Stefan Hajnoczi <stefanha@gmail.com> wrote:
>>>>
>>>> > On Sat, Mar 11, 2017 at 11:54 AM, Ashijeet Acharya
>>>> > <ashijeetacharya@gmail.com> wrote:
>>>> > > This series optimizes the I/O performance of VMDK driver.
>>>> > >
>>>> > > Patch 1 makes the VMDK driver to allocate multiple clusters at once.
>>>> > Earlier
>>>> > > it used to allocate cluster by cluster which slowed down its performance
>>>> > to a
>>>> > > great extent.
>>>> > >
>>>> > > Patch 2 changes the metadata update code to update the L2 tables for
>>>> > multiple
>>>> > > clusters at once.
>>
>> These are the commands I ran to test the write requests:
>>
>> My test file "test1.vmdk" is a 1G empty vmdk image created by using
>> 'qemu-img' tool.
>>
>> Before optimization:
>> $ ./bin/qemu-io -f vmdk --cache writeback
>> qemu-io> open -n -o driver=vmdk test1.vmdk
>> qemu-io> aio_write 0 128M
>> qemu-io> wrote 134217728/134217728 bytes at offset 0
>> 128 MiB, 1 ops; 0:00:16.46 (7.772 MiB/sec and 0.0607 ops/sec)
>>
>> After optimization:
>> $ ./bin/qemu-io -f vmdk --cache writeback
>> qemu-io> open -n -o driver=vmdk test1.vmdk
>> qemu-io> aio_write 0 128M
>> qemu-io> wrote 134217728/134217728 bytes at offset 0
>> 128 MiB, 1 ops; 0:00:08.19 (15.627 MiB/sec and 0.1221 ops/sec)
>>
>> Will these work?
>
> It is best to avoid --cache writeback in performance tests because
> using the host page cache puts the performance at the mercy of the
> kernel's page cache.
Okay, understood.
>
> I have run the following benchmark using "qemu-img bench":
>
> This patch series improves 128 KB sequential write performance to an
> empty VMDK file by 29%.
>
> Benchmark command: ./qemu-img bench -w -c 1024 -s 128K -d 1 -t none -f
> vmdk test.vmdk
>
> (Please include the 2 lines above in the next revision of the patch.)
>
Yes, I will do that. Also, this really helped me understand how to
actually test I/O optimizations.
Thanks for your help; this really solved my issue.
Ashijeet