[Qemu-devel] [PATCH v5 0/4] Add shrink image for qcow2

qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed

* [Qemu-devel] [PATCH v5 0/4] Add shrink image for qcow2
@ 2017-07-12 11:46 Pavel Butsykin
  2017-07-12 11:46 ` [Qemu-devel] [PATCH v5 1/4] qemu-img: add --shrink flag for resize Pavel Butsykin
                   ` (4 more replies)
  0 siblings, 5 replies; 16+ messages in thread
From: Pavel Butsykin @ 2017-07-12 11:46 UTC (permalink / raw)
  To: qemu-block, qemu-devel; +Cc: kwolf, mreitz, eblake, armbru, den, pbutsykin

This patch add shrinking of the image file for qcow2. As a result, this allows
us to reduce the virtual image size and free up space on the disk without
copying the image. Image can be fragmented and shrink is done by punching holes
in the image file.

# ./qemu-img create -f qcow2 image.qcow2 4G
Formatting 'image.qcow2', fmt=qcow2 size=4294967296 encryption=off cluster_size=65536 lazy_refcounts=off refcount_bits=16

# ./qemu-io -c "write -P 0x22 0 1G" image.qcow2
wrote 1073741824/1073741824 bytes at offset 0
1 GiB, 1 ops; 0:00:04.59 (222.886 MiB/sec and 0.2177 ops/sec)

# ./qemu-img resize image.qcow2 512M
warning: qemu-img: Shrinking an image will delete all data beyond the shrunken image's end. Before performing such an operation, make sure there is no important data there.
error: qemu-img: Use the --shrink option to perform a shrink operation.

# ./qemu-img resize --shrink image.qcow2 128M
Image resized.

# ./qemu-img info image.qcow2
image: image.qcow2
file format: qcow2
virtual size: 128M (134217728 bytes)
disk size: 128M
cluster_size: 65536
Format specific information:
    compat: 1.1
    lazy refcounts: false
    refcount bits: 16
    corrupt: false

# du -h image.qcow2 
129M    image.qcow2

Changes from v1:
- add --shrink flag for qemu-img resize
- add qcow2_cache_discard
- simplify qcow2_shrink_l1_table() to reduce the likelihood of image corruption
- add new qemu-iotests for shrinking images

Changes from v2:
- replace qprintf() on error_report() (1)
- rewrite warning messages (1)
- enforce --shrink flag for all formats except raw (1)
- split qcow2_cache_discard() (2)
- minor fixes according to comments (3)
- rewrite the last part of qcow2_shrink_reftable() to avoid
  qcow2_free_clusters() calls inside (3)
- improve test for shrinking image (4)

Changes from v3:
- rebase on "Implement a warning_report function" Alistair's patch-set (1)
- spelling fixes (1)
- the man page fix according to the discussion (1)
- add call qcow2_signal_corruption() in case of image corruption (3)

Changes from v4:
- rebase on https://github.com/XanClic/qemu/commits/block Max's block branch

Pavel Butsykin (4):
  qemu-img: add --shrink flag for resize
  qcow2: add qcow2_cache_discard
  qcow2: add shrink image support
  qemu-iotests: add shrinking image test

 block/qcow2-cache.c        |  26 +++++++
 block/qcow2-cluster.c      |  40 +++++++++++
 block/qcow2-refcount.c     | 124 +++++++++++++++++++++++++++++++++
 block/qcow2.c              |  43 +++++++++---
 block/qcow2.h              |  17 +++++
 qapi/block-core.json       |   3 +-
 qemu-img-cmds.hx           |   4 +-
 qemu-img.c                 |  23 ++++++
 qemu-img.texi              |   6 +-
 tests/qemu-iotests/102     |   4 +-
 tests/qemu-iotests/163     | 170 +++++++++++++++++++++++++++++++++++++++++++++
 tests/qemu-iotests/163.out |   5 ++
 tests/qemu-iotests/group   |   1 +
 13 files changed, 451 insertions(+), 15 deletions(-)
 create mode 100644 tests/qemu-iotests/163
 create mode 100644 tests/qemu-iotests/163.out

-- 
2.13.0

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [Qemu-devel] [PATCH v5 1/4] qemu-img: add --shrink flag for resize
  2017-07-12 11:46 [Qemu-devel] [PATCH v5 0/4] Add shrink image for qcow2 Pavel Butsykin
@ 2017-07-12 11:46 ` Pavel Butsykin
  2017-07-12 11:46 ` [Qemu-devel] [PATCH v5 2/4] qcow2: add qcow2_cache_discard Pavel Butsykin
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 16+ messages in thread
From: Pavel Butsykin @ 2017-07-12 11:46 UTC (permalink / raw)
  To: qemu-block, qemu-devel; +Cc: kwolf, mreitz, eblake, armbru, den, pbutsykin

The flag is additional precaution against data loss. Perhaps in the future the
operation shrink without this flag will be blocked for all formats, but for now
we need to maintain compatibility with raw.

Signed-off-by: Pavel Butsykin <pbutsykin@virtuozzo.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
---
 qemu-img-cmds.hx       |  4 ++--
 qemu-img.c             | 23 +++++++++++++++++++++++
 qemu-img.texi          |  6 +++++-
 tests/qemu-iotests/102 |  4 ++--
 4 files changed, 32 insertions(+), 5 deletions(-)

diff --git a/qemu-img-cmds.hx b/qemu-img-cmds.hx
index ac5946bc4f..e36957a2ca 100644
--- a/qemu-img-cmds.hx
+++ b/qemu-img-cmds.hx
@@ -82,9 +82,9 @@ STEXI
 ETEXI
 
 DEF("resize", img_resize,
-    "resize [--object objectdef] [--image-opts] [-q] filename [+ | -]size")
+    "resize [--object objectdef] [--image-opts] [-q] [--shrink] filename [+ | -]size")
 STEXI
-@item resize [--object @var{objectdef}] [--image-opts] [-q] @var{filename} [+ | -]@var{size}
+@item resize [--object @var{objectdef}] [--image-opts] [-q] [--shrink] @var{filename} [+ | -]@var{size}
 ETEXI
 
 DEF("amend", img_amend,
diff --git a/qemu-img.c b/qemu-img.c
index 28022145d5..b4dc4bb5c4 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -64,6 +64,7 @@ enum {
     OPTION_TARGET_IMAGE_OPTS = 263,
     OPTION_SIZE = 264,
     OPTION_PREALLOCATION = 265,
+    OPTION_SHRINK = 266,
 };
 
 typedef enum OutputFormat {
@@ -3430,6 +3431,7 @@ static int img_resize(int argc, char **argv)
         },
     };
     bool image_opts = false;
+    bool shrink = false;
 
     /* Remove size from argv manually so that negative numbers are not treated
      * as options by getopt. */
@@ -3448,6 +3450,7 @@ static int img_resize(int argc, char **argv)
             {"object", required_argument, 0, OPTION_OBJECT},
             {"image-opts", no_argument, 0, OPTION_IMAGE_OPTS},
             {"preallocation", required_argument, 0, OPTION_PREALLOCATION},
+            {"shrink", no_argument, 0, OPTION_SHRINK},
             {0, 0, 0, 0}
         };
         c = getopt_long(argc, argv, ":f:hq",
@@ -3491,6 +3494,9 @@ static int img_resize(int argc, char **argv)
                 return 1;
             }
             break;
+        case OPTION_SHRINK:
+            shrink = true;
+            break;
         }
     }
     if (optind != argc - 1) {
@@ -3564,6 +3570,23 @@ static int img_resize(int argc, char **argv)
         goto out;
     }
 
+    if (total_size < current_size && !shrink) {
+        warn_report("Shrinking an image will delete all data beyond the "
+                    "shrunken image's end. Before performing such an "
+                    "operation, make sure there is no important data there.");
+
+        if (g_strcmp0(bdrv_get_format_name(blk_bs(blk)), "raw") != 0) {
+            error_report(
+              "Use the --shrink option to perform a shrink operation.");
+            ret = -1;
+            goto out;
+        } else {
+            warn_report("Using the --shrink option will suppress this message."
+                        "Note that future versions of qemu-img may refuse to "
+                        "shrink images without this option.");
+        }
+    }
+
     ret = blk_truncate(blk, total_size, prealloc, &err);
     if (!ret) {
         qprintf(quiet, "Image resized.\n");
diff --git a/qemu-img.texi b/qemu-img.texi
index f11f6036ad..9a930f5e6d 100644
--- a/qemu-img.texi
+++ b/qemu-img.texi
@@ -529,7 +529,7 @@ qemu-img rebase -b base.img diff.qcow2
 At this point, @code{modified.img} can be discarded, since
 @code{base.img + diff.qcow2} contains the same information.
 
-@item resize [--preallocation=@var{prealloc}] @var{filename} [+ | -]@var{size}
+@item resize [--shrink] [--preallocation=@var{prealloc}] @var{filename} [+ | -]@var{size}
 
 Change the disk image as if it had been created with @var{size}.
 
@@ -537,6 +537,10 @@ Before using this command to shrink a disk image, you MUST use file system and
 partitioning tools inside the VM to reduce allocated file systems and partition
 sizes accordingly.  Failure to do so will result in data loss!
 
+When shrinking images, the @code{--shrink} option must be given. This informs
+qemu-img that the user acknowledges all loss of data beyond the truncated
+image's end.
+
 After using this command to grow a disk image, you must use file system and
 partitioning tools inside the VM to actually begin using the new space on the
 device.
diff --git a/tests/qemu-iotests/102 b/tests/qemu-iotests/102
index 87db1bb1bf..d7ad8d9840 100755
--- a/tests/qemu-iotests/102
+++ b/tests/qemu-iotests/102
@@ -54,7 +54,7 @@ _make_test_img $IMG_SIZE
 $QEMU_IO -c 'write 0 64k' "$TEST_IMG" | _filter_qemu_io
 # Remove data cluster from image (first cluster: image header, second: reftable,
 # third: refblock, fourth: L1 table, fifth: L2 table)
-$QEMU_IMG resize -f raw "$TEST_IMG" $((5 * 64 * 1024))
+$QEMU_IMG resize -f raw --shrink "$TEST_IMG" $((5 * 64 * 1024))
 
 $QEMU_IO -c map "$TEST_IMG"
 $QEMU_IMG map "$TEST_IMG"
@@ -69,7 +69,7 @@ $QEMU_IO -c 'write 0 64k' "$TEST_IMG" | _filter_qemu_io
 
 qemu_comm_method=monitor _launch_qemu -drive if=none,file="$TEST_IMG",id=drv0
 
-$QEMU_IMG resize -f raw "$TEST_IMG" $((5 * 64 * 1024))
+$QEMU_IMG resize -f raw --shrink "$TEST_IMG" $((5 * 64 * 1024))
 
 _send_qemu_cmd $QEMU_HANDLE 'qemu-io drv0 map' 'allocated' \
     | sed -e 's/^(qemu).*qemu-io drv0 map...$/(qemu) qemu-io drv0 map/'
-- 
2.13.0

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [Qemu-devel] [PATCH v5 2/4] qcow2: add qcow2_cache_discard
  2017-07-12 11:46 [Qemu-devel] [PATCH v5 0/4] Add shrink image for qcow2 Pavel Butsykin
  2017-07-12 11:46 ` [Qemu-devel] [PATCH v5 1/4] qemu-img: add --shrink flag for resize Pavel Butsykin
@ 2017-07-12 11:46 ` Pavel Butsykin
  2017-07-12 14:45   ` Kevin Wolf
  2017-07-12 11:46 ` [Qemu-devel] [PATCH v5 3/4] qcow2: add shrink image support Pavel Butsykin
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 16+ messages in thread
From: Pavel Butsykin @ 2017-07-12 11:46 UTC (permalink / raw)
  To: qemu-block, qemu-devel; +Cc: kwolf, mreitz, eblake, armbru, den, pbutsykin

Whenever l2/refcount table clusters are discarded from the file we can
automatically drop unnecessary content of the cache tables. This reduces
the chance of eviction useful cache data and eliminates inconsistent data
in the cache with the data in the file.

Signed-off-by: Pavel Butsykin <pbutsykin@virtuozzo.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
---
 block/qcow2-cache.c    | 26 ++++++++++++++++++++++++++
 block/qcow2-refcount.c | 14 ++++++++++++++
 block/qcow2.h          |  3 +++
 3 files changed, 43 insertions(+)

diff --git a/block/qcow2-cache.c b/block/qcow2-cache.c
index 1d25147392..75746a7f43 100644
--- a/block/qcow2-cache.c
+++ b/block/qcow2-cache.c
@@ -411,3 +411,29 @@ void qcow2_cache_entry_mark_dirty(BlockDriverState *bs, Qcow2Cache *c,
     assert(c->entries[i].offset != 0);
     c->entries[i].dirty = true;
 }
+
+void *qcow2_cache_is_table_offset(BlockDriverState *bs, Qcow2Cache *c,
+                                  uint64_t offset)
+{
+    int i;
+
+    for (i = 0; i < c->size; i++) {
+        if (c->entries[i].offset == offset) {
+            return qcow2_cache_get_table_addr(bs, c, i);
+        }
+    }
+    return NULL;
+}
+
+void qcow2_cache_discard(BlockDriverState *bs, Qcow2Cache *c, void *table)
+{
+    int i = qcow2_cache_get_table_idx(bs, c, table);
+
+    assert(c->entries[i].ref == 0);
+
+    c->entries[i].offset = 0;
+    c->entries[i].lru_counter = 0;
+    c->entries[i].dirty = false;
+
+    qcow2_cache_table_release(bs, c, i, 1);
+}
diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
index c9b0dcb4f3..8050db4544 100644
--- a/block/qcow2-refcount.c
+++ b/block/qcow2-refcount.c
@@ -862,6 +862,20 @@ static int QEMU_WARN_UNUSED_RESULT update_refcount(BlockDriverState *bs,
         s->set_refcount(refcount_block, block_index, refcount);
 
         if (refcount == 0 && s->discard_passthrough[type]) {
+            void *table;
+
+            table = qcow2_cache_is_table_offset(bs, s->refcount_block_cache,
+                                                offset);
+            if (table != NULL) {
+                qcow2_cache_put(bs, s->refcount_block_cache, &refcount_block);
+                qcow2_cache_discard(bs, s->refcount_block_cache, table);
+            }
+
+            table = qcow2_cache_is_table_offset(bs, s->l2_table_cache, offset);
+            if (table != NULL) {
+                qcow2_cache_discard(bs, s->l2_table_cache, table);
+            }
+
             update_refcount_discard(bs, cluster_offset, s->cluster_size);
         }
     }
diff --git a/block/qcow2.h b/block/qcow2.h
index 96a8d43c17..52c374e9ed 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -649,6 +649,9 @@ int qcow2_cache_get(BlockDriverState *bs, Qcow2Cache *c, uint64_t offset,
 int qcow2_cache_get_empty(BlockDriverState *bs, Qcow2Cache *c, uint64_t offset,
     void **table);
 void qcow2_cache_put(BlockDriverState *bs, Qcow2Cache *c, void **table);
+void *qcow2_cache_is_table_offset(BlockDriverState *bs, Qcow2Cache *c,
+                                  uint64_t offset);
+void qcow2_cache_discard(BlockDriverState *bs, Qcow2Cache *c, void *table);
 
 /* qcow2-bitmap.c functions */
 int qcow2_check_bitmaps_refcounts(BlockDriverState *bs, BdrvCheckResult *res,
-- 
2.13.0

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [Qemu-devel] [PATCH v5 3/4] qcow2: add shrink image support
  2017-07-12 11:46 [Qemu-devel] [PATCH v5 0/4] Add shrink image for qcow2 Pavel Butsykin
  2017-07-12 11:46 ` [Qemu-devel] [PATCH v5 1/4] qemu-img: add --shrink flag for resize Pavel Butsykin
  2017-07-12 11:46 ` [Qemu-devel] [PATCH v5 2/4] qcow2: add qcow2_cache_discard Pavel Butsykin
@ 2017-07-12 11:46 ` Pavel Butsykin
  2017-07-12 14:52   ` Kevin Wolf
  2017-07-12 11:47 ` [Qemu-devel] [PATCH v5 4/4] qemu-iotests: add shrinking image test Pavel Butsykin
  2017-07-12 14:04 ` [Qemu-devel] [PATCH v5 0/4] Add shrink image for qcow2 Max Reitz
  4 siblings, 1 reply; 16+ messages in thread
From: Pavel Butsykin @ 2017-07-12 11:46 UTC (permalink / raw)
  To: qemu-block, qemu-devel; +Cc: kwolf, mreitz, eblake, armbru, den, pbutsykin

This patch add shrinking of the image file for qcow2. As a result, this allows
us to reduce the virtual image size and free up space on the disk without
copying the image. Image can be fragmented and shrink is done by punching holes
in the image file.

Signed-off-by: Pavel Butsykin <pbutsykin@virtuozzo.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
---
 block/qcow2-cluster.c  |  40 ++++++++++++++++++
 block/qcow2-refcount.c | 110 +++++++++++++++++++++++++++++++++++++++++++++++++
 block/qcow2.c          |  43 +++++++++++++++----
 block/qcow2.h          |  14 +++++++
 qapi/block-core.json   |   3 +-
 5 files changed, 200 insertions(+), 10 deletions(-)

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index f06c08f64c..518429c64b 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -32,6 +32,46 @@
 #include "qemu/bswap.h"
 #include "trace.h"
 
+int qcow2_shrink_l1_table(BlockDriverState *bs, uint64_t exact_size)
+{
+    BDRVQcow2State *s = bs->opaque;
+    int new_l1_size, i, ret;
+
+    if (exact_size >= s->l1_size) {
+        return 0;
+    }
+
+    new_l1_size = exact_size;
+
+#ifdef DEBUG_ALLOC2
+    fprintf(stderr, "shrink l1_table from %d to %d\n", s->l1_size, new_l1_size);
+#endif
+
+    BLKDBG_EVENT(bs->file, BLKDBG_L1_SHRINK_WRITE_TABLE);
+    ret = bdrv_pwrite_zeroes(bs->file, s->l1_table_offset +
+                                       sizeof(uint64_t) * new_l1_size,
+                             (s->l1_size - new_l1_size) * sizeof(uint64_t), 0);
+    if (ret < 0) {
+        return ret;
+    }
+
+    ret = bdrv_flush(bs->file->bs);
+    if (ret < 0) {
+        return ret;
+    }
+
+    BLKDBG_EVENT(bs->file, BLKDBG_L1_SHRINK_FREE_L2_CLUSTERS);
+    for (i = s->l1_size - 1; i > new_l1_size - 1; i--) {
+        if ((s->l1_table[i] & L1E_OFFSET_MASK) == 0) {
+            continue;
+        }
+        qcow2_free_clusters(bs, s->l1_table[i] & L1E_OFFSET_MASK,
+                            s->cluster_size, QCOW2_DISCARD_ALWAYS);
+        s->l1_table[i] = 0;
+    }
+    return 0;
+}
+
 int qcow2_grow_l1_table(BlockDriverState *bs, uint64_t min_size,
                         bool exact_size)
 {
diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
index 8050db4544..5c8d606d29 100644
--- a/block/qcow2-refcount.c
+++ b/block/qcow2-refcount.c
@@ -29,6 +29,7 @@
 #include "block/qcow2.h"
 #include "qemu/range.h"
 #include "qemu/bswap.h"
+#include "qemu/cutils.h"
 
 static int64_t alloc_clusters_noref(BlockDriverState *bs, uint64_t size);
 static int QEMU_WARN_UNUSED_RESULT update_refcount(BlockDriverState *bs,
@@ -3059,3 +3060,112 @@ done:
     qemu_vfree(new_refblock);
     return ret;
 }
+
+int qcow2_shrink_reftable(BlockDriverState *bs)
+{
+    BDRVQcow2State *s = bs->opaque;
+    uint64_t *reftable_tmp =
+        g_try_malloc(sizeof(uint64_t) * s->refcount_table_size);
+    int i, ret;
+
+    if (s->refcount_table_size && reftable_tmp == NULL) {
+        return -ENOMEM;
+    }
+
+    for (i = 0; i < s->refcount_table_size; i++) {
+        int64_t refblock_offs = s->refcount_table[i] & REFT_OFFSET_MASK;
+        void *refblock;
+        bool unused_block;
+
+        if (refblock_offs == 0) {
+            reftable_tmp[i] = 0;
+            continue;
+        }
+        ret = qcow2_cache_get(bs, s->refcount_block_cache, refblock_offs,
+                              &refblock);
+        if (ret < 0) {
+            goto out;
+        }
+
+        /* the refblock has own reference */
+        if (i == offset_to_reftable_index(s, refblock_offs)) {
+            uint64_t block_index = (refblock_offs >> s->cluster_bits) &
+                                   (s->refcount_block_size - 1);
+            uint64_t refcount = s->get_refcount(refblock, block_index);
+
+            s->set_refcount(refblock, block_index, 0);
+
+            unused_block = buffer_is_zero(refblock, s->cluster_size);
+
+            s->set_refcount(refblock, block_index, refcount);
+        } else {
+            unused_block = buffer_is_zero(refblock, s->cluster_size);
+        }
+        qcow2_cache_put(bs, s->refcount_block_cache, &refblock);
+
+        reftable_tmp[i] = unused_block ? 0 : cpu_to_be64(s->refcount_table[i]);
+    }
+
+    ret = bdrv_pwrite_sync(bs->file, s->refcount_table_offset, reftable_tmp,
+                           sizeof(uint64_t) * s->refcount_table_size);
+    if (ret < 0) {
+        goto out;
+    }
+
+    for (i = 0; i < s->refcount_table_size; i++) {
+        if (s->refcount_table[i] && !reftable_tmp[i]) {
+            uint64_t discard_offs = s->refcount_table[i] & REFT_OFFSET_MASK;
+            uint64_t refblock_offs = get_refblock_offset(s, discard_offs);
+            uint64_t cluster_index = discard_offs >> s->cluster_bits;
+            uint32_t block_index = cluster_index & (s->refcount_block_size - 1);
+            void *refblock;
+
+            assert(discard_offs != 0);
+
+            ret = qcow2_cache_get(bs, s->refcount_block_cache, refblock_offs,
+                                  &refblock);
+            if (ret < 0) {
+                goto out;
+            }
+
+            if (s->get_refcount(refblock, block_index) != 1) {
+                qcow2_signal_corruption(bs, true, -1, -1, "Invalid refcount:"
+                                        " refblock offset %#" PRIx64
+                                        ", reftable index %d"
+                                        ", block offset %#" PRIx64
+                                        ", refcount %#" PRIx64,
+                                        refblock_offs, i, discard_offs,
+                                        s->get_refcount(refblock, block_index));
+                qcow2_cache_put(bs, s->refcount_block_cache, &refblock);
+                ret = -EINVAL;
+                goto out;
+            }
+            s->set_refcount(refblock, block_index, 0);
+
+            qcow2_cache_entry_mark_dirty(bs, s->refcount_block_cache, refblock);
+
+            qcow2_cache_put(bs, s->refcount_block_cache, &refblock);
+
+            if (cluster_index < s->free_cluster_index) {
+                s->free_cluster_index = cluster_index;
+            }
+
+            refblock = qcow2_cache_is_table_offset(bs, s->refcount_block_cache,
+                                                   discard_offs);
+            if (refblock) {
+                /* discard refblock from the cache if refblock is cached */
+                qcow2_cache_discard(bs, s->refcount_block_cache, refblock);
+            }
+            update_refcount_discard(bs, discard_offs, s->cluster_size);
+            s->refcount_table[i] = 0;
+        }
+    }
+
+    if (!s->cache_discards) {
+        qcow2_process_discards(bs, ret);
+    }
+
+out:
+    g_free(reftable_tmp);
+    return ret;
+}
diff --git a/block/qcow2.c b/block/qcow2.c
index c144ea5620..bd281fdd04 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -3120,18 +3120,43 @@ static int qcow2_truncate(BlockDriverState *bs, int64_t offset,
     }
 
     old_length = bs->total_sectors * 512;
+    new_l1_size = size_to_l1(s, offset);
 
-    /* shrinking is currently not supported */
     if (offset < old_length) {
-        error_setg(errp, "qcow2 doesn't support shrinking images yet");
-        return -ENOTSUP;
-    }
+        if (prealloc != PREALLOC_MODE_OFF) {
+            error_setg(errp,
+                       "Preallocation can't be used for shrinking an image");
+            return -EINVAL;
+        }
 
-    new_l1_size = size_to_l1(s, offset);
-    ret = qcow2_grow_l1_table(bs, new_l1_size, true);
-    if (ret < 0) {
-        error_setg_errno(errp, -ret, "Failed to grow the L1 table");
-        return ret;
+        ret = qcow2_cluster_discard(bs, ROUND_UP(offset, s->cluster_size),
+                                    old_length - ROUND_UP(offset,
+                                                          s->cluster_size),
+                                    QCOW2_DISCARD_ALWAYS, true);
+        if (ret < 0) {
+            error_setg_errno(errp, -ret, "Failed to discard cropped clusters");
+            return ret;
+        }
+
+        ret = qcow2_shrink_l1_table(bs, new_l1_size);
+        if (ret < 0) {
+            error_setg_errno(errp, -ret,
+                             "Failed to reduce the number of L2 tables");
+            return ret;
+        }
+
+        ret = qcow2_shrink_reftable(bs);
+        if (ret < 0) {
+            error_setg_errno(errp, -ret,
+                             "Failed to discard unused refblocks");
+            return ret;
+        }
+    } else {
+        ret = qcow2_grow_l1_table(bs, new_l1_size, true);
+        if (ret < 0) {
+            error_setg_errno(errp, -ret, "Failed to grow the L1 table");
+            return ret;
+        }
     }
 
     switch (prealloc) {
diff --git a/block/qcow2.h b/block/qcow2.h
index 52c374e9ed..5a289a81e2 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -521,6 +521,18 @@ static inline uint64_t refcount_diff(uint64_t r1, uint64_t r2)
     return r1 > r2 ? r1 - r2 : r2 - r1;
 }
 
+static inline
+uint32_t offset_to_reftable_index(BDRVQcow2State *s, uint64_t offset)
+{
+    return offset >> (s->refcount_block_bits + s->cluster_bits);
+}
+
+static inline uint64_t get_refblock_offset(BDRVQcow2State *s, uint64_t offset)
+{
+    uint32_t index = offset_to_reftable_index(s, offset);
+    return s->refcount_table[index] & REFT_OFFSET_MASK;
+}
+
 /* qcow2.c functions */
 int qcow2_backing_read1(BlockDriverState *bs, QEMUIOVector *qiov,
                   int64_t sector_num, int nb_sectors);
@@ -584,10 +596,12 @@ int qcow2_inc_refcounts_imrt(BlockDriverState *bs, BdrvCheckResult *res,
 int qcow2_change_refcount_order(BlockDriverState *bs, int refcount_order,
                                 BlockDriverAmendStatusCB *status_cb,
                                 void *cb_opaque, Error **errp);
+int qcow2_shrink_reftable(BlockDriverState *bs);
 
 /* qcow2-cluster.c functions */
 int qcow2_grow_l1_table(BlockDriverState *bs, uint64_t min_size,
                         bool exact_size);
+int qcow2_shrink_l1_table(BlockDriverState *bs, uint64_t max_size);
 int qcow2_write_l1_entry(BlockDriverState *bs, int l1_index);
 int qcow2_decompress_cluster(BlockDriverState *bs, uint64_t cluster_offset);
 int qcow2_encrypt_sectors(BDRVQcow2State *s, int64_t sector_num,
diff --git a/qapi/block-core.json b/qapi/block-core.json
index c437aa50ef..99cef55b7c 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -2487,7 +2487,8 @@
             'cluster_alloc_bytes', 'cluster_free', 'flush_to_os',
             'flush_to_disk', 'pwritev_rmw_head', 'pwritev_rmw_after_head',
             'pwritev_rmw_tail', 'pwritev_rmw_after_tail', 'pwritev',
-            'pwritev_zero', 'pwritev_done', 'empty_image_prepare' ] }
+            'pwritev_zero', 'pwritev_done', 'empty_image_prepare',
+            'l1_shrink_write_table', 'l1_shrink_free_l2_clusters' ] }
 
 ##
 # @BlkdebugInjectErrorOptions:
-- 
2.13.0

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [Qemu-devel] [PATCH v5 4/4] qemu-iotests: add shrinking image test
  2017-07-12 11:46 [Qemu-devel] [PATCH v5 0/4] Add shrink image for qcow2 Pavel Butsykin
                   ` (2 preceding siblings ...)
  2017-07-12 11:46 ` [Qemu-devel] [PATCH v5 3/4] qcow2: add shrink image support Pavel Butsykin
@ 2017-07-12 11:47 ` Pavel Butsykin
  2017-07-12 14:04 ` [Qemu-devel] [PATCH v5 0/4] Add shrink image for qcow2 Max Reitz
  4 siblings, 0 replies; 16+ messages in thread
From: Pavel Butsykin @ 2017-07-12 11:47 UTC (permalink / raw)
  To: qemu-block, qemu-devel; +Cc: kwolf, mreitz, eblake, armbru, den, pbutsykin

Signed-off-by: Pavel Butsykin <pbutsykin@virtuozzo.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
---
 tests/qemu-iotests/163     | 170 +++++++++++++++++++++++++++++++++++++++++++++
 tests/qemu-iotests/163.out |   5 ++
 tests/qemu-iotests/group   |   1 +
 3 files changed, 176 insertions(+)
 create mode 100644 tests/qemu-iotests/163
 create mode 100644 tests/qemu-iotests/163.out

diff --git a/tests/qemu-iotests/163 b/tests/qemu-iotests/163
new file mode 100644
index 0000000000..403842354e
--- /dev/null
+++ b/tests/qemu-iotests/163
@@ -0,0 +1,170 @@
+#!/usr/bin/env python
+#
+# Tests for shrinking images
+#
+# Copyright (c) 2016-2017 Parallels International GmbH
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+#
+
+import os, random, iotests, struct, qcow2
+from iotests import qemu_img, qemu_io, image_size
+
+test_img = os.path.join(iotests.test_dir, 'test.img')
+check_img = os.path.join(iotests.test_dir, 'check.img')
+
+def size_to_int(str):
+    suff = ['B', 'K', 'M', 'G', 'T']
+    return int(str[:-1]) * 1024**suff.index(str[-1:])
+
+class ShrinkBaseClass(iotests.QMPTestCase):
+    image_len = '128M'
+    shrink_size = '10M'
+    chunk_size = '16M'
+    refcount_bits = '16'
+
+    def __qcow2_check(self, filename):
+        entry_bits = 3
+        entry_size = 1 << entry_bits
+        l1_mask = 0x00fffffffffffe00
+        div_roundup = lambda n, d: (n + d - 1) / d
+
+        def split_by_n(data, n):
+            for x in xrange(0, len(data), n):
+                yield struct.unpack('>Q', data[x:x + n])[0] & l1_mask
+
+        def check_l1_table(h, l1_data):
+            l1_list = list(split_by_n(l1_data, entry_size))
+            real_l1_size = div_roundup(h.size,
+                                       1 << (h.cluster_bits*2 - entry_size))
+            used, unused = l1_list[:real_l1_size], l1_list[real_l1_size:]
+
+            self.assertTrue(len(used) != 0, "Verifying l1 table content")
+            self.assertFalse(any(unused), "Verifying l1 table content")
+
+        def check_reftable(fd, h, reftable):
+            for offset in split_by_n(reftable, entry_size):
+                if offset != 0:
+                    fd.seek(offset)
+                    cluster = fd.read(1 << h.cluster_bits)
+                    self.assertTrue(any(cluster), "Verifying reftable content")
+
+        with open(filename, "rb") as fd:
+            h = qcow2.QcowHeader(fd)
+
+            fd.seek(h.l1_table_offset)
+            l1_table = fd.read(h.l1_size << entry_bits)
+
+            fd.seek(h.refcount_table_offset)
+            reftable = fd.read(h.refcount_table_clusters << h.cluster_bits)
+
+            check_l1_table(h, l1_table)
+            check_reftable(fd, h, reftable)
+
+    def __raw_check(self, filename):
+        pass
+
+    image_check = {
+        'qcow2' : __qcow2_check,
+        'raw' : __raw_check
+    }
+
+    def setUp(self):
+        if iotests.imgfmt == 'raw':
+            qemu_img('create', '-f', iotests.imgfmt, test_img, self.image_len)
+            qemu_img('create', '-f', iotests.imgfmt, check_img,
+                     self.shrink_size)
+        else:
+            qemu_img('create', '-f', iotests.imgfmt,
+                     '-o', 'cluster_size=' + self.cluster_size +
+                     ',refcount_bits=' + self.refcount_bits,
+                     test_img, self.image_len)
+            qemu_img('create', '-f', iotests.imgfmt,
+                     '-o', 'cluster_size=%s'% self.cluster_size,
+                     check_img, self.shrink_size)
+        qemu_io('-c', 'write -P 0xff 0 ' + self.shrink_size, check_img)
+
+    def tearDown(self):
+        os.remove(test_img)
+        os.remove(check_img)
+
+    def image_verify(self):
+        self.assertEqual(image_size(test_img), image_size(check_img),
+                         "Verifying image size")
+        self.image_check[iotests.imgfmt](self, test_img)
+
+        if iotests.imgfmt == 'raw':
+            return
+        self.assertEqual(qemu_img('check', test_img), 0,
+                         "Verifying image corruption")
+
+    def test_empty_image(self):
+        qemu_img('resize',  '-f', iotests.imgfmt, '--shrink', test_img,
+                 self.shrink_size)
+
+        self.assertEqual(
+            qemu_io('-c', 'read -P 0x00 %s'%self.shrink_size, test_img),
+            qemu_io('-c', 'read -P 0x00 %s'%self.shrink_size, check_img),
+            "Verifying image content")
+
+        self.image_verify()
+
+    def test_sequential_write(self):
+        for offs in range(0, size_to_int(self.image_len),
+                          size_to_int(self.chunk_size)):
+            qemu_io('-c', 'write -P 0xff %d %s' % (offs, self.chunk_size),
+                    test_img)
+
+        qemu_img('resize',  '-f', iotests.imgfmt, '--shrink', test_img,
+                 self.shrink_size)
+
+        self.assertEqual(qemu_img("compare", test_img, check_img), 0,
+                         "Verifying image content")
+
+        self.image_verify()
+
+    def test_random_write(self):
+        offs_list = range(0, size_to_int(self.image_len),
+                          size_to_int(self.chunk_size))
+        random.shuffle(offs_list)
+        for offs in offs_list:
+            qemu_io('-c', 'write -P 0xff %d %s' % (offs, self.chunk_size),
+                    test_img)
+
+        qemu_img('resize',  '-f', iotests.imgfmt, '--shrink', test_img,
+                 self.shrink_size)
+
+        self.assertEqual(qemu_img("compare", test_img, check_img), 0,
+                         "Verifying image content")
+
+        self.image_verify()
+
+class TestShrink512(ShrinkBaseClass):
+    image_len = '3M'
+    shrink_size = '1M'
+    chunk_size = '256K'
+    cluster_size = '512'
+    refcount_bits = '64'
+
+class TestShrink64K(ShrinkBaseClass):
+    cluster_size = '64K'
+
+class TestShrink1M(ShrinkBaseClass):
+    cluster_size = '1M'
+    refcount_bits = '1'
+
+ShrinkBaseClass = None
+
+if __name__ == '__main__':
+    iotests.main(supported_fmts=['raw', 'qcow2'])
diff --git a/tests/qemu-iotests/163.out b/tests/qemu-iotests/163.out
new file mode 100644
index 0000000000..dae404e278
--- /dev/null
+++ b/tests/qemu-iotests/163.out
@@ -0,0 +1,5 @@
+.........
+----------------------------------------------------------------------
+Ran 9 tests
+
+OK
diff --git a/tests/qemu-iotests/group b/tests/qemu-iotests/group
index 2aba585287..1d985c3b45 100644
--- a/tests/qemu-iotests/group
+++ b/tests/qemu-iotests/group
@@ -166,6 +166,7 @@
 159 rw auto quick
 160 rw auto quick
 162 auto quick
+163 rw auto quick
 165 rw auto quick
 170 rw auto quick
 171 rw auto quick
-- 
2.13.0

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [Qemu-devel] [PATCH v5 0/4] Add shrink image for qcow2
  2017-07-12 11:46 [Qemu-devel] [PATCH v5 0/4] Add shrink image for qcow2 Pavel Butsykin
                   ` (3 preceding siblings ...)
  2017-07-12 11:47 ` [Qemu-devel] [PATCH v5 4/4] qemu-iotests: add shrinking image test Pavel Butsykin
@ 2017-07-12 14:04 ` Max Reitz
  4 siblings, 0 replies; 16+ messages in thread
From: Max Reitz @ 2017-07-12 14:04 UTC (permalink / raw)
  To: Pavel Butsykin, qemu-block, qemu-devel; +Cc: kwolf, eblake, armbru, den

[-- Attachment #1: Type: text/plain, Size: 1452 bytes --]

On 2017-07-12 13:46, Pavel Butsykin wrote:
> This patch add shrinking of the image file for qcow2. As a result, this allows
> us to reduce the virtual image size and free up space on the disk without
> copying the image. Image can be fragmented and shrink is done by punching holes
> in the image file.
> 
> # ./qemu-img create -f qcow2 image.qcow2 4G
> Formatting 'image.qcow2', fmt=qcow2 size=4294967296 encryption=off cluster_size=65536 lazy_refcounts=off refcount_bits=16
> 
> # ./qemu-io -c "write -P 0x22 0 1G" image.qcow2
> wrote 1073741824/1073741824 bytes at offset 0
> 1 GiB, 1 ops; 0:00:04.59 (222.886 MiB/sec and 0.2177 ops/sec)
> 
> # ./qemu-img resize image.qcow2 512M
> warning: qemu-img: Shrinking an image will delete all data beyond the shrunken image's end. Before performing such an operation, make sure there is no important data there.
> error: qemu-img: Use the --shrink option to perform a shrink operation.
> 
> # ./qemu-img resize --shrink image.qcow2 128M
> Image resized.
> 
> # ./qemu-img info image.qcow2
> image: image.qcow2
> file format: qcow2
> virtual size: 128M (134217728 bytes)
> disk size: 128M
> cluster_size: 65536
> Format specific information:
>     compat: 1.1
>     lazy refcounts: false
>     refcount bits: 16
>     corrupt: false
> 
> # du -h image.qcow2 
> 129M    image.qcow2

Thanks, looks good.  Now we'll just have to wait for the warn_report()
series.

Max


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 498 bytes --]

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Qemu-devel] [PATCH v5 2/4] qcow2: add qcow2_cache_discard
  2017-07-12 11:46 ` [Qemu-devel] [PATCH v5 2/4] qcow2: add qcow2_cache_discard Pavel Butsykin
@ 2017-07-12 14:45   ` Kevin Wolf
  2017-07-13 17:18     ` Pavel Butsykin
  0 siblings, 1 reply; 16+ messages in thread
From: Kevin Wolf @ 2017-07-12 14:45 UTC (permalink / raw)
  To: Pavel Butsykin; +Cc: qemu-block, qemu-devel, mreitz, eblake, armbru, den

Am 12.07.2017 um 13:46 hat Pavel Butsykin geschrieben:
> Whenever l2/refcount table clusters are discarded from the file we can
> automatically drop unnecessary content of the cache tables. This reduces
> the chance of eviction useful cache data and eliminates inconsistent data
> in the cache with the data in the file.
> 
> Signed-off-by: Pavel Butsykin <pbutsykin@virtuozzo.com>
> Reviewed-by: Max Reitz <mreitz@redhat.com>
> ---
>  block/qcow2-cache.c    | 26 ++++++++++++++++++++++++++
>  block/qcow2-refcount.c | 14 ++++++++++++++
>  block/qcow2.h          |  3 +++
>  3 files changed, 43 insertions(+)
> 
> diff --git a/block/qcow2-cache.c b/block/qcow2-cache.c
> index 1d25147392..75746a7f43 100644
> --- a/block/qcow2-cache.c
> +++ b/block/qcow2-cache.c
> @@ -411,3 +411,29 @@ void qcow2_cache_entry_mark_dirty(BlockDriverState *bs, Qcow2Cache *c,
>      assert(c->entries[i].offset != 0);
>      c->entries[i].dirty = true;
>  }
> +
> +void *qcow2_cache_is_table_offset(BlockDriverState *bs, Qcow2Cache *c,
> +                                  uint64_t offset)
> +{
> +    int i;
> +
> +    for (i = 0; i < c->size; i++) {
> +        if (c->entries[i].offset == offset) {
> +            return qcow2_cache_get_table_addr(bs, c, i);
> +        }
> +    }
> +    return NULL;
> +}
> +
> +void qcow2_cache_discard(BlockDriverState *bs, Qcow2Cache *c, void *table)
> +{
> +    int i = qcow2_cache_get_table_idx(bs, c, table);
> +
> +    assert(c->entries[i].ref == 0);
> +
> +    c->entries[i].offset = 0;
> +    c->entries[i].lru_counter = 0;
> +    c->entries[i].dirty = false;
> +
> +    qcow2_cache_table_release(bs, c, i, 1);
> +}
> diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
> index c9b0dcb4f3..8050db4544 100644
> --- a/block/qcow2-refcount.c
> +++ b/block/qcow2-refcount.c
> @@ -862,6 +862,20 @@ static int QEMU_WARN_UNUSED_RESULT update_refcount(BlockDriverState *bs,
>          s->set_refcount(refcount_block, block_index, refcount);
>  
>          if (refcount == 0 && s->discard_passthrough[type]) {
> +            void *table;
> +
> +            table = qcow2_cache_is_table_offset(bs, s->refcount_block_cache,
> +                                                offset);
> +            if (table != NULL) {
> +                qcow2_cache_put(bs, s->refcount_block_cache, &refcount_block);
> +                qcow2_cache_discard(bs, s->refcount_block_cache, table);
> +            }
> +
> +            table = qcow2_cache_is_table_offset(bs, s->l2_table_cache, offset);
> +            if (table != NULL) {
> +                qcow2_cache_discard(bs, s->l2_table_cache, table);
> +            }
> +
>              update_refcount_discard(bs, cluster_offset, s->cluster_size);
>          }
>      }

This is not wrong, but wouldn't actually even refcount == 0 be enough as
a condition to evict the cluster from the cache? I don't think it really
matters whether we actually send a discard request or not for the image
file.

Kevin

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Qemu-devel] [PATCH v5 3/4] qcow2: add shrink image support
  2017-07-12 11:46 ` [Qemu-devel] [PATCH v5 3/4] qcow2: add shrink image support Pavel Butsykin
@ 2017-07-12 14:52   ` Kevin Wolf
  2017-07-12 16:58     ` Max Reitz
  0 siblings, 1 reply; 16+ messages in thread
From: Kevin Wolf @ 2017-07-12 14:52 UTC (permalink / raw)
  To: Pavel Butsykin; +Cc: qemu-block, qemu-devel, mreitz, eblake, armbru, den

Am 12.07.2017 um 13:46 hat Pavel Butsykin geschrieben:
> This patch add shrinking of the image file for qcow2. As a result, this allows
> us to reduce the virtual image size and free up space on the disk without
> copying the image. Image can be fragmented and shrink is done by punching holes
> in the image file.
> 
> Signed-off-by: Pavel Butsykin <pbutsykin@virtuozzo.com>
> Reviewed-by: Max Reitz <mreitz@redhat.com>
> ---
>  block/qcow2-cluster.c  |  40 ++++++++++++++++++
>  block/qcow2-refcount.c | 110 +++++++++++++++++++++++++++++++++++++++++++++++++
>  block/qcow2.c          |  43 +++++++++++++++----
>  block/qcow2.h          |  14 +++++++
>  qapi/block-core.json   |   3 +-
>  5 files changed, 200 insertions(+), 10 deletions(-)
> 
> diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
> index f06c08f64c..518429c64b 100644
> --- a/block/qcow2-cluster.c
> +++ b/block/qcow2-cluster.c
> @@ -32,6 +32,46 @@
>  #include "qemu/bswap.h"
>  #include "trace.h"
>  
> +int qcow2_shrink_l1_table(BlockDriverState *bs, uint64_t exact_size)
> +{
> +    BDRVQcow2State *s = bs->opaque;
> +    int new_l1_size, i, ret;
> +
> +    if (exact_size >= s->l1_size) {
> +        return 0;
> +    }
> +
> +    new_l1_size = exact_size;
> +
> +#ifdef DEBUG_ALLOC2
> +    fprintf(stderr, "shrink l1_table from %d to %d\n", s->l1_size, new_l1_size);
> +#endif
> +
> +    BLKDBG_EVENT(bs->file, BLKDBG_L1_SHRINK_WRITE_TABLE);
> +    ret = bdrv_pwrite_zeroes(bs->file, s->l1_table_offset +
> +                                       sizeof(uint64_t) * new_l1_size,
> +                             (s->l1_size - new_l1_size) * sizeof(uint64_t), 0);
> +    if (ret < 0) {
> +        return ret;
> +    }
> +
> +    ret = bdrv_flush(bs->file->bs);
> +    if (ret < 0) {
> +        return ret;
> +    }

If we have an error here (or after a partial bdrv_pwrite_zeroes()), we
have entries zeroed out on disk, but in memory we still have the
original L1 table.

Should the in-memory L1 table be zeroed first? Then we can't
accidentally reuse stale entries, but would have to allocate new ones,
which would get on-disk state and in-memory state back in sync again.

> +    BLKDBG_EVENT(bs->file, BLKDBG_L1_SHRINK_FREE_L2_CLUSTERS);
> +    for (i = s->l1_size - 1; i > new_l1_size - 1; i--) {
> +        if ((s->l1_table[i] & L1E_OFFSET_MASK) == 0) {
> +            continue;
> +        }
> +        qcow2_free_clusters(bs, s->l1_table[i] & L1E_OFFSET_MASK,
> +                            s->cluster_size, QCOW2_DISCARD_ALWAYS);
> +        s->l1_table[i] = 0;
> +    }
> +    return 0;
> +}
> +
>  int qcow2_grow_l1_table(BlockDriverState *bs, uint64_t min_size,
>                          bool exact_size)
>  {

I haven't checked qcow2_shrink_reftable() for similar kinds of problems,
I hope Max has.

Kevin

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Qemu-devel] [PATCH v5 3/4] qcow2: add shrink image support
  2017-07-12 14:52   ` Kevin Wolf
@ 2017-07-12 16:58     ` Max Reitz
  2017-07-13  8:41       ` Kevin Wolf
  0 siblings, 1 reply; 16+ messages in thread
From: Max Reitz @ 2017-07-12 16:58 UTC (permalink / raw)
  To: Kevin Wolf, Pavel Butsykin; +Cc: qemu-block, qemu-devel, eblake, armbru, den

[-- Attachment #1: Type: text/plain, Size: 3274 bytes --]

On 2017-07-12 16:52, Kevin Wolf wrote:
> Am 12.07.2017 um 13:46 hat Pavel Butsykin geschrieben:
>> This patch add shrinking of the image file for qcow2. As a result, this allows
>> us to reduce the virtual image size and free up space on the disk without
>> copying the image. Image can be fragmented and shrink is done by punching holes
>> in the image file.
>>
>> Signed-off-by: Pavel Butsykin <pbutsykin@virtuozzo.com>
>> Reviewed-by: Max Reitz <mreitz@redhat.com>
>> ---
>>  block/qcow2-cluster.c  |  40 ++++++++++++++++++
>>  block/qcow2-refcount.c | 110 +++++++++++++++++++++++++++++++++++++++++++++++++
>>  block/qcow2.c          |  43 +++++++++++++++----
>>  block/qcow2.h          |  14 +++++++
>>  qapi/block-core.json   |   3 +-
>>  5 files changed, 200 insertions(+), 10 deletions(-)
>>
>> diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
>> index f06c08f64c..518429c64b 100644
>> --- a/block/qcow2-cluster.c
>> +++ b/block/qcow2-cluster.c
>> @@ -32,6 +32,46 @@
>>  #include "qemu/bswap.h"
>>  #include "trace.h"
>>  
>> +int qcow2_shrink_l1_table(BlockDriverState *bs, uint64_t exact_size)
>> +{
>> +    BDRVQcow2State *s = bs->opaque;
>> +    int new_l1_size, i, ret;
>> +
>> +    if (exact_size >= s->l1_size) {
>> +        return 0;
>> +    }
>> +
>> +    new_l1_size = exact_size;
>> +
>> +#ifdef DEBUG_ALLOC2
>> +    fprintf(stderr, "shrink l1_table from %d to %d\n", s->l1_size, new_l1_size);
>> +#endif
>> +
>> +    BLKDBG_EVENT(bs->file, BLKDBG_L1_SHRINK_WRITE_TABLE);
>> +    ret = bdrv_pwrite_zeroes(bs->file, s->l1_table_offset +
>> +                                       sizeof(uint64_t) * new_l1_size,
>> +                             (s->l1_size - new_l1_size) * sizeof(uint64_t), 0);
>> +    if (ret < 0) {
>> +        return ret;
>> +    }
>> +
>> +    ret = bdrv_flush(bs->file->bs);
>> +    if (ret < 0) {
>> +        return ret;
>> +    }
> 
> If we have an error here (or after a partial bdrv_pwrite_zeroes()), we
> have entries zeroed out on disk, but in memory we still have the
> original L1 table.
> 
> Should the in-memory L1 table be zeroed first? Then we can't
> accidentally reuse stale entries, but would have to allocate new ones,
> which would get on-disk state and in-memory state back in sync again.

Yes, I thought of the same.  But this implies that the allocation is
able to modify the L1 table, and I find that unlikely if that
bdrv_flush() failed already...

So I concluded not to have an opinion on which order is better.

>> +    BLKDBG_EVENT(bs->file, BLKDBG_L1_SHRINK_FREE_L2_CLUSTERS);
>> +    for (i = s->l1_size - 1; i > new_l1_size - 1; i--) {
>> +        if ((s->l1_table[i] & L1E_OFFSET_MASK) == 0) {
>> +            continue;
>> +        }
>> +        qcow2_free_clusters(bs, s->l1_table[i] & L1E_OFFSET_MASK,
>> +                            s->cluster_size, QCOW2_DISCARD_ALWAYS);
>> +        s->l1_table[i] = 0;
>> +    }
>> +    return 0;
>> +}
>> +
>>  int qcow2_grow_l1_table(BlockDriverState *bs, uint64_t min_size,
>>                          bool exact_size)
>>  {
> 
> I haven't checked qcow2_shrink_reftable() for similar kinds of problems,
> I hope Max has.

Well, it's exactly the same there.

Max


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 498 bytes --]

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Qemu-devel] [PATCH v5 3/4] qcow2: add shrink image support
  2017-07-12 16:58     ` Max Reitz
@ 2017-07-13  8:41       ` Kevin Wolf
  2017-07-13 14:36         ` Max Reitz
  2017-07-13 17:18         ` Pavel Butsykin
  0 siblings, 2 replies; 16+ messages in thread
From: Kevin Wolf @ 2017-07-13  8:41 UTC (permalink / raw)
  To: Max Reitz; +Cc: Pavel Butsykin, qemu-block, qemu-devel, eblake, armbru, den

[-- Attachment #1: Type: text/plain, Size: 3889 bytes --]

Am 12.07.2017 um 18:58 hat Max Reitz geschrieben:
> On 2017-07-12 16:52, Kevin Wolf wrote:
> > Am 12.07.2017 um 13:46 hat Pavel Butsykin geschrieben:
> >> This patch add shrinking of the image file for qcow2. As a result, this allows
> >> us to reduce the virtual image size and free up space on the disk without
> >> copying the image. Image can be fragmented and shrink is done by punching holes
> >> in the image file.
> >>
> >> Signed-off-by: Pavel Butsykin <pbutsykin@virtuozzo.com>
> >> Reviewed-by: Max Reitz <mreitz@redhat.com>
> >> ---
> >>  block/qcow2-cluster.c  |  40 ++++++++++++++++++
> >>  block/qcow2-refcount.c | 110 +++++++++++++++++++++++++++++++++++++++++++++++++
> >>  block/qcow2.c          |  43 +++++++++++++++----
> >>  block/qcow2.h          |  14 +++++++
> >>  qapi/block-core.json   |   3 +-
> >>  5 files changed, 200 insertions(+), 10 deletions(-)
> >>
> >> diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
> >> index f06c08f64c..518429c64b 100644
> >> --- a/block/qcow2-cluster.c
> >> +++ b/block/qcow2-cluster.c
> >> @@ -32,6 +32,46 @@
> >>  #include "qemu/bswap.h"
> >>  #include "trace.h"
> >>  
> >> +int qcow2_shrink_l1_table(BlockDriverState *bs, uint64_t exact_size)
> >> +{
> >> +    BDRVQcow2State *s = bs->opaque;
> >> +    int new_l1_size, i, ret;
> >> +
> >> +    if (exact_size >= s->l1_size) {
> >> +        return 0;
> >> +    }
> >> +
> >> +    new_l1_size = exact_size;
> >> +
> >> +#ifdef DEBUG_ALLOC2
> >> +    fprintf(stderr, "shrink l1_table from %d to %d\n", s->l1_size, new_l1_size);
> >> +#endif
> >> +
> >> +    BLKDBG_EVENT(bs->file, BLKDBG_L1_SHRINK_WRITE_TABLE);
> >> +    ret = bdrv_pwrite_zeroes(bs->file, s->l1_table_offset +
> >> +                                       sizeof(uint64_t) * new_l1_size,
> >> +                             (s->l1_size - new_l1_size) * sizeof(uint64_t), 0);
> >> +    if (ret < 0) {
> >> +        return ret;
> >> +    }
> >> +
> >> +    ret = bdrv_flush(bs->file->bs);
> >> +    if (ret < 0) {
> >> +        return ret;
> >> +    }
> > 
> > If we have an error here (or after a partial bdrv_pwrite_zeroes()), we
> > have entries zeroed out on disk, but in memory we still have the
> > original L1 table.
> > 
> > Should the in-memory L1 table be zeroed first? Then we can't
> > accidentally reuse stale entries, but would have to allocate new ones,
> > which would get on-disk state and in-memory state back in sync again.
> 
> Yes, I thought of the same.  But this implies that the allocation is
> able to modify the L1 table, and I find that unlikely if that
> bdrv_flush() failed already...
> 
> So I concluded not to have an opinion on which order is better.

Well, let me ask the other way round: Is there any disadvantage in first
zeroing the in-memory table and then writing to the image?

If I have a choice between "always safe" and "not completely safe, but I
think it's unlikely to happen", especially in image formats, then I will
certainly take the "always safe".

> >> +    BLKDBG_EVENT(bs->file, BLKDBG_L1_SHRINK_FREE_L2_CLUSTERS);
> >> +    for (i = s->l1_size - 1; i > new_l1_size - 1; i--) {
> >> +        if ((s->l1_table[i] & L1E_OFFSET_MASK) == 0) {
> >> +            continue;
> >> +        }
> >> +        qcow2_free_clusters(bs, s->l1_table[i] & L1E_OFFSET_MASK,
> >> +                            s->cluster_size, QCOW2_DISCARD_ALWAYS);
> >> +        s->l1_table[i] = 0;
> >> +    }
> >> +    return 0;
> >> +}
> >> +
> >>  int qcow2_grow_l1_table(BlockDriverState *bs, uint64_t min_size,
> >>                          bool exact_size)
> >>  {
> > 
> > I haven't checked qcow2_shrink_reftable() for similar kinds of problems,
> > I hope Max has.
> 
> Well, it's exactly the same there.

Ok, so I'll object to this code without really having looked at it.

Kevin

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Qemu-devel] [PATCH v5 3/4] qcow2: add shrink image support
  2017-07-13  8:41       ` Kevin Wolf
@ 2017-07-13 14:36         ` Max Reitz
  2017-07-13 17:28           ` Pavel Butsykin
  2017-07-13 17:18         ` Pavel Butsykin
  1 sibling, 1 reply; 16+ messages in thread
From: Max Reitz @ 2017-07-13 14:36 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Pavel Butsykin, qemu-block, qemu-devel, eblake, armbru, den

[-- Attachment #1: Type: text/plain, Size: 4054 bytes --]

On 2017-07-13 10:41, Kevin Wolf wrote:
> Am 12.07.2017 um 18:58 hat Max Reitz geschrieben:
>> On 2017-07-12 16:52, Kevin Wolf wrote:
>>> Am 12.07.2017 um 13:46 hat Pavel Butsykin geschrieben:
>>>> This patch add shrinking of the image file for qcow2. As a result, this allows
>>>> us to reduce the virtual image size and free up space on the disk without
>>>> copying the image. Image can be fragmented and shrink is done by punching holes
>>>> in the image file.
>>>>
>>>> Signed-off-by: Pavel Butsykin <pbutsykin@virtuozzo.com>
>>>> Reviewed-by: Max Reitz <mreitz@redhat.com>
>>>> ---
>>>>  block/qcow2-cluster.c  |  40 ++++++++++++++++++
>>>>  block/qcow2-refcount.c | 110 +++++++++++++++++++++++++++++++++++++++++++++++++
>>>>  block/qcow2.c          |  43 +++++++++++++++----
>>>>  block/qcow2.h          |  14 +++++++
>>>>  qapi/block-core.json   |   3 +-
>>>>  5 files changed, 200 insertions(+), 10 deletions(-)
>>>>
>>>> diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
>>>> index f06c08f64c..518429c64b 100644
>>>> --- a/block/qcow2-cluster.c
>>>> +++ b/block/qcow2-cluster.c
>>>> @@ -32,6 +32,46 @@
>>>>  #include "qemu/bswap.h"
>>>>  #include "trace.h"
>>>>  
>>>> +int qcow2_shrink_l1_table(BlockDriverState *bs, uint64_t exact_size)
>>>> +{
>>>> +    BDRVQcow2State *s = bs->opaque;
>>>> +    int new_l1_size, i, ret;
>>>> +
>>>> +    if (exact_size >= s->l1_size) {
>>>> +        return 0;
>>>> +    }
>>>> +
>>>> +    new_l1_size = exact_size;
>>>> +
>>>> +#ifdef DEBUG_ALLOC2
>>>> +    fprintf(stderr, "shrink l1_table from %d to %d\n", s->l1_size, new_l1_size);
>>>> +#endif
>>>> +
>>>> +    BLKDBG_EVENT(bs->file, BLKDBG_L1_SHRINK_WRITE_TABLE);
>>>> +    ret = bdrv_pwrite_zeroes(bs->file, s->l1_table_offset +
>>>> +                                       sizeof(uint64_t) * new_l1_size,
>>>> +                             (s->l1_size - new_l1_size) * sizeof(uint64_t), 0);
>>>> +    if (ret < 0) {
>>>> +        return ret;
>>>> +    }
>>>> +
>>>> +    ret = bdrv_flush(bs->file->bs);
>>>> +    if (ret < 0) {
>>>> +        return ret;
>>>> +    }
>>>
>>> If we have an error here (or after a partial bdrv_pwrite_zeroes()), we
>>> have entries zeroed out on disk, but in memory we still have the
>>> original L1 table.
>>>
>>> Should the in-memory L1 table be zeroed first? Then we can't
>>> accidentally reuse stale entries, but would have to allocate new ones,
>>> which would get on-disk state and in-memory state back in sync again.
>>
>> Yes, I thought of the same.  But this implies that the allocation is
>> able to modify the L1 table, and I find that unlikely if that
>> bdrv_flush() failed already...
>>
>> So I concluded not to have an opinion on which order is better.
> 
> Well, let me ask the other way round: Is there any disadvantage in first
> zeroing the in-memory table and then writing to the image?

I was informed that the code would be harder to write. :-)

> If I have a choice between "always safe" and "not completely safe, but I
> think it's unlikely to happen", especially in image formats, then I will
> certainly take the "always safe".
> 
>>>> +    BLKDBG_EVENT(bs->file, BLKDBG_L1_SHRINK_FREE_L2_CLUSTERS);
>>>> +    for (i = s->l1_size - 1; i > new_l1_size - 1; i--) {
>>>> +        if ((s->l1_table[i] & L1E_OFFSET_MASK) == 0) {
>>>> +            continue;
>>>> +        }
>>>> +        qcow2_free_clusters(bs, s->l1_table[i] & L1E_OFFSET_MASK,
>>>> +                            s->cluster_size, QCOW2_DISCARD_ALWAYS);
>>>> +        s->l1_table[i] = 0;
>>>> +    }
>>>> +    return 0;
>>>> +}
>>>> +
>>>>  int qcow2_grow_l1_table(BlockDriverState *bs, uint64_t min_size,
>>>>                          bool exact_size)
>>>>  {
>>>
>>> I haven't checked qcow2_shrink_reftable() for similar kinds of problems,
>>> I hope Max has.
>>
>> Well, it's exactly the same there.
> 
> Ok, so I'll object to this code without really having looked at it.
I won't object to your objection. O:-)

Max


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 498 bytes --]

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Qemu-devel] [PATCH v5 2/4] qcow2: add qcow2_cache_discard
  2017-07-12 14:45   ` Kevin Wolf
@ 2017-07-13 17:18     ` Pavel Butsykin
  0 siblings, 0 replies; 16+ messages in thread
From: Pavel Butsykin @ 2017-07-13 17:18 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: qemu-block, qemu-devel, mreitz, eblake, armbru, den

On 12.07.2017 17:45, Kevin Wolf wrote:
> Am 12.07.2017 um 13:46 hat Pavel Butsykin geschrieben:
>> Whenever l2/refcount table clusters are discarded from the file we can
>> automatically drop unnecessary content of the cache tables. This reduces
>> the chance of eviction useful cache data and eliminates inconsistent data
>> in the cache with the data in the file.
>>
>> Signed-off-by: Pavel Butsykin <pbutsykin@virtuozzo.com>
>> Reviewed-by: Max Reitz <mreitz@redhat.com>
>> ---
>>   block/qcow2-cache.c    | 26 ++++++++++++++++++++++++++
>>   block/qcow2-refcount.c | 14 ++++++++++++++
>>   block/qcow2.h          |  3 +++
>>   3 files changed, 43 insertions(+)
>>
>> diff --git a/block/qcow2-cache.c b/block/qcow2-cache.c
>> index 1d25147392..75746a7f43 100644
>> --- a/block/qcow2-cache.c
>> +++ b/block/qcow2-cache.c
>> @@ -411,3 +411,29 @@ void qcow2_cache_entry_mark_dirty(BlockDriverState *bs, Qcow2Cache *c,
>>       assert(c->entries[i].offset != 0);
>>       c->entries[i].dirty = true;
>>   }
>> +
>> +void *qcow2_cache_is_table_offset(BlockDriverState *bs, Qcow2Cache *c,
>> +                                  uint64_t offset)
>> +{
>> +    int i;
>> +
>> +    for (i = 0; i < c->size; i++) {
>> +        if (c->entries[i].offset == offset) {
>> +            return qcow2_cache_get_table_addr(bs, c, i);
>> +        }
>> +    }
>> +    return NULL;
>> +}
>> +
>> +void qcow2_cache_discard(BlockDriverState *bs, Qcow2Cache *c, void *table)
>> +{
>> +    int i = qcow2_cache_get_table_idx(bs, c, table);
>> +
>> +    assert(c->entries[i].ref == 0);
>> +
>> +    c->entries[i].offset = 0;
>> +    c->entries[i].lru_counter = 0;
>> +    c->entries[i].dirty = false;
>> +
>> +    qcow2_cache_table_release(bs, c, i, 1);
>> +}
>> diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
>> index c9b0dcb4f3..8050db4544 100644
>> --- a/block/qcow2-refcount.c
>> +++ b/block/qcow2-refcount.c
>> @@ -862,6 +862,20 @@ static int QEMU_WARN_UNUSED_RESULT update_refcount(BlockDriverState *bs,
>>           s->set_refcount(refcount_block, block_index, refcount);
>>   
>>           if (refcount == 0 && s->discard_passthrough[type]) {
>> +            void *table;
>> +
>> +            table = qcow2_cache_is_table_offset(bs, s->refcount_block_cache,
>> +                                                offset);
>> +            if (table != NULL) {
>> +                qcow2_cache_put(bs, s->refcount_block_cache, &refcount_block);
>> +                qcow2_cache_discard(bs, s->refcount_block_cache, table);
>> +            }
>> +
>> +            table = qcow2_cache_is_table_offset(bs, s->l2_table_cache, offset);
>> +            if (table != NULL) {
>> +                qcow2_cache_discard(bs, s->l2_table_cache, table);
>> +            }
>> +
>>               update_refcount_discard(bs, cluster_offset, s->cluster_size);
>>           }
>>       }
> 
> This is not wrong, but wouldn't actually even refcount == 0 be enough as
> a condition to evict the cluster from the cache? I don't think it really
> matters whether we actually send a discard request or not for the image
> file.

Yes, you're right. Will fix.

> Kevin
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Qemu-devel] [PATCH v5 3/4] qcow2: add shrink image support
  2017-07-13  8:41       ` Kevin Wolf
  2017-07-13 14:36         ` Max Reitz
@ 2017-07-13 17:18         ` Pavel Butsykin
  2017-07-14  7:44           ` Kevin Wolf
  1 sibling, 1 reply; 16+ messages in thread
From: Pavel Butsykin @ 2017-07-13 17:18 UTC (permalink / raw)
  To: Kevin Wolf, Max Reitz; +Cc: qemu-block, qemu-devel, eblake, armbru, den

On 13.07.2017 11:41, Kevin Wolf wrote:
> Am 12.07.2017 um 18:58 hat Max Reitz geschrieben:
>> On 2017-07-12 16:52, Kevin Wolf wrote:
>>> Am 12.07.2017 um 13:46 hat Pavel Butsykin geschrieben:
>>>> This patch add shrinking of the image file for qcow2. As a result, this allows
>>>> us to reduce the virtual image size and free up space on the disk without
>>>> copying the image. Image can be fragmented and shrink is done by punching holes
>>>> in the image file.
>>>>
>>>> Signed-off-by: Pavel Butsykin <pbutsykin@virtuozzo.com>
>>>> Reviewed-by: Max Reitz <mreitz@redhat.com>
>>>> ---
>>>>   block/qcow2-cluster.c  |  40 ++++++++++++++++++
>>>>   block/qcow2-refcount.c | 110 +++++++++++++++++++++++++++++++++++++++++++++++++
>>>>   block/qcow2.c          |  43 +++++++++++++++----
>>>>   block/qcow2.h          |  14 +++++++
>>>>   qapi/block-core.json   |   3 +-
>>>>   5 files changed, 200 insertions(+), 10 deletions(-)
>>>>
>>>> diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
>>>> index f06c08f64c..518429c64b 100644
>>>> --- a/block/qcow2-cluster.c
>>>> +++ b/block/qcow2-cluster.c
>>>> @@ -32,6 +32,46 @@
>>>>   #include "qemu/bswap.h"
>>>>   #include "trace.h"
>>>>   
>>>> +int qcow2_shrink_l1_table(BlockDriverState *bs, uint64_t exact_size)
>>>> +{
>>>> +    BDRVQcow2State *s = bs->opaque;
>>>> +    int new_l1_size, i, ret;
>>>> +
>>>> +    if (exact_size >= s->l1_size) {
>>>> +        return 0;
>>>> +    }
>>>> +
>>>> +    new_l1_size = exact_size;
>>>> +
>>>> +#ifdef DEBUG_ALLOC2
>>>> +    fprintf(stderr, "shrink l1_table from %d to %d\n", s->l1_size, new_l1_size);
>>>> +#endif
>>>> +
>>>> +    BLKDBG_EVENT(bs->file, BLKDBG_L1_SHRINK_WRITE_TABLE);
>>>> +    ret = bdrv_pwrite_zeroes(bs->file, s->l1_table_offset +
>>>> +                                       sizeof(uint64_t) * new_l1_size,
>>>> +                             (s->l1_size - new_l1_size) * sizeof(uint64_t), 0);
>>>> +    if (ret < 0) {
>>>> +        return ret;
>>>> +    }
>>>> +
>>>> +    ret = bdrv_flush(bs->file->bs);
>>>> +    if (ret < 0) {
>>>> +        return ret;
>>>> +    }
>>>
>>> If we have an error here (or after a partial bdrv_pwrite_zeroes()), we
>>> have entries zeroed out on disk, but in memory we still have the
>>> original L1 table.
>>>
>>> Should the in-memory L1 table be zeroed first? Then we can't
>>> accidentally reuse stale entries, but would have to allocate new ones,
>>> which would get on-disk state and in-memory state back in sync again.
>>
>> Yes, I thought of the same.  But this implies that the allocation is
>> able to modify the L1 table, and I find that unlikely if that
>> bdrv_flush() failed already...
>>
>> So I concluded not to have an opinion on which order is better.
> 
> Well, let me ask the other way round: Is there any disadvantage in first
> zeroing the in-memory table and then writing to the image?

If bdrv_flush/drv_pwrite_zeroes function failed, the subsequent writes
to truncated area lead to allocation L2 tables. This implies two things:

1. We need call qcow2_free_clusters() after bdrv_flush/drv_pwrite_zeroes
anyway, otherwise it may lead to the situation when the l1 table will
have two identical offsets.

2. Old l2 blocks may be lost and will be dead weight for the image.

> If I have a choice between "always safe" and "not completely safe, but I
> think it's unlikely to happen", especially in image formats, then I will
> certainly take the "always safe".

In my understanding both cases are "unsafe", because both cases may lead
to inconsistent state between image and memory. When writing this code I
was looking at an existing approach in qcow2*.c to such kind of issues:
...
static int l2_allocate(BlockDriverState *bs, int l1_index, uint64_t **table)
...
     trace_qcow2_l2_allocate_write_l1(bs, l1_index);
     s->l1_table[l1_index] = l2_offset | QCOW_OFLAG_COPIED;
     ret = qcow2_write_l1_entry(bs, l1_index);
     if (ret < 0) {
         goto fail;
     }

     *table = l2_table;
     trace_qcow2_l2_allocate_done(bs, l1_index, 0);
     return 0;

fail:
     trace_qcow2_l2_allocate_done(bs, l1_index, ret);
     if (l2_table != NULL) {
         qcow2_cache_put(bs, s->l2_table_cache, (void**) table);
     }
     s->l1_table[l1_index] = old_l2_offset;
     if (l2_offset > 0) {
         qcow2_free_clusters(bs, l2_offset, s->l2_size * sizeof(uint64_t),
                             QCOW2_DISCARD_ALWAYS);
     }
     return ret;

Do I understand correctly? that if qcow2_write_l1_entr call fails
somewhere in the middle of bdrv_flush() then l1_table still contain the
old l2_offset, but the image may already contain the new l2_offset. We
can continue to use the image and write data, but in this case we lose
the data after reopening image.

So it seemed to me that in current conditions we can't escape from such
kind of problem completely. But I understand your desire to do better
even in little things, I agree that would be a little safer in the case
"first zeroing the in-memory table and then writing". So if this
solution suits all, let me write the next patch-set version :)

>>>> +    BLKDBG_EVENT(bs->file, BLKDBG_L1_SHRINK_FREE_L2_CLUSTERS);
>>>> +    for (i = s->l1_size - 1; i > new_l1_size - 1; i--) {
>>>> +        if ((s->l1_table[i] & L1E_OFFSET_MASK) == 0) {
>>>> +            continue;
>>>> +        }
>>>> +        qcow2_free_clusters(bs, s->l1_table[i] & L1E_OFFSET_MASK,
>>>> +                            s->cluster_size, QCOW2_DISCARD_ALWAYS);
>>>> +        s->l1_table[i] = 0;
>>>> +    }
>>>> +    return 0;
>>>> +}
>>>> +
>>>>   int qcow2_grow_l1_table(BlockDriverState *bs, uint64_t min_size,
>>>>                           bool exact_size)
>>>>   {
>>>
>>> I haven't checked qcow2_shrink_reftable() for similar kinds of problems,
>>> I hope Max has.
>>
>> Well, it's exactly the same there.
> 
> Ok, so I'll object to this code without really having looked at it.
> 
> Kevin
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Qemu-devel] [PATCH v5 3/4] qcow2: add shrink image support
  2017-07-13 14:36         ` Max Reitz
@ 2017-07-13 17:28           ` Pavel Butsykin
  2017-07-14  7:28             ` Kevin Wolf
  0 siblings, 1 reply; 16+ messages in thread
From: Pavel Butsykin @ 2017-07-13 17:28 UTC (permalink / raw)
  To: Max Reitz, Kevin Wolf; +Cc: qemu-block, qemu-devel, eblake, armbru, den

On 13.07.2017 17:36, Max Reitz wrote:
> On 2017-07-13 10:41, Kevin Wolf wrote:
>> Am 12.07.2017 um 18:58 hat Max Reitz geschrieben:
>>> On 2017-07-12 16:52, Kevin Wolf wrote:
>>>> Am 12.07.2017 um 13:46 hat Pavel Butsykin geschrieben:
>>>>> This patch add shrinking of the image file for qcow2. As a result, this allows
>>>>> us to reduce the virtual image size and free up space on the disk without
>>>>> copying the image. Image can be fragmented and shrink is done by punching holes
>>>>> in the image file.
>>>>>
>>>>> Signed-off-by: Pavel Butsykin <pbutsykin@virtuozzo.com>
>>>>> Reviewed-by: Max Reitz <mreitz@redhat.com>
>>>>> ---
>>>>>   block/qcow2-cluster.c  |  40 ++++++++++++++++++
>>>>>   block/qcow2-refcount.c | 110 +++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>   block/qcow2.c          |  43 +++++++++++++++----
>>>>>   block/qcow2.h          |  14 +++++++
>>>>>   qapi/block-core.json   |   3 +-
>>>>>   5 files changed, 200 insertions(+), 10 deletions(-)
>>>>>
>>>>> diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
>>>>> index f06c08f64c..518429c64b 100644
>>>>> --- a/block/qcow2-cluster.c
>>>>> +++ b/block/qcow2-cluster.c
>>>>> @@ -32,6 +32,46 @@
>>>>>   #include "qemu/bswap.h"
>>>>>   #include "trace.h"
>>>>>   
>>>>> +int qcow2_shrink_l1_table(BlockDriverState *bs, uint64_t exact_size)
>>>>> +{
>>>>> +    BDRVQcow2State *s = bs->opaque;
>>>>> +    int new_l1_size, i, ret;
>>>>> +
>>>>> +    if (exact_size >= s->l1_size) {
>>>>> +        return 0;
>>>>> +    }
>>>>> +
>>>>> +    new_l1_size = exact_size;
>>>>> +
>>>>> +#ifdef DEBUG_ALLOC2
>>>>> +    fprintf(stderr, "shrink l1_table from %d to %d\n", s->l1_size, new_l1_size);
>>>>> +#endif
>>>>> +
>>>>> +    BLKDBG_EVENT(bs->file, BLKDBG_L1_SHRINK_WRITE_TABLE);
>>>>> +    ret = bdrv_pwrite_zeroes(bs->file, s->l1_table_offset +
>>>>> +                                       sizeof(uint64_t) * new_l1_size,
>>>>> +                             (s->l1_size - new_l1_size) * sizeof(uint64_t), 0);
>>>>> +    if (ret < 0) {
>>>>> +        return ret;
>>>>> +    }
>>>>> +
>>>>> +    ret = bdrv_flush(bs->file->bs);
>>>>> +    if (ret < 0) {
>>>>> +        return ret;
>>>>> +    }
>>>>
>>>> If we have an error here (or after a partial bdrv_pwrite_zeroes()), we
>>>> have entries zeroed out on disk, but in memory we still have the
>>>> original L1 table.
>>>>
>>>> Should the in-memory L1 table be zeroed first? Then we can't
>>>> accidentally reuse stale entries, but would have to allocate new ones,
>>>> which would get on-disk state and in-memory state back in sync again.
>>>
>>> Yes, I thought of the same.  But this implies that the allocation is
>>> able to modify the L1 table, and I find that unlikely if that
>>> bdrv_flush() failed already...
>>>
>>> So I concluded not to have an opinion on which order is better.
>>
>> Well, let me ask the other way round: Is there any disadvantage in first
>> zeroing the in-memory table and then writing to the image?
> 
> I was informed that the code would be harder to write. :-)
> 
>> If I have a choice between "always safe" and "not completely safe, but I
>> think it's unlikely to happen", especially in image formats, then I will
>> certainly take the "always safe".
>>
>>>>> +    BLKDBG_EVENT(bs->file, BLKDBG_L1_SHRINK_FREE_L2_CLUSTERS);
>>>>> +    for (i = s->l1_size - 1; i > new_l1_size - 1; i--) {
>>>>> +        if ((s->l1_table[i] & L1E_OFFSET_MASK) == 0) {
>>>>> +            continue;
>>>>> +        }
>>>>> +        qcow2_free_clusters(bs, s->l1_table[i] & L1E_OFFSET_MASK,
>>>>> +                            s->cluster_size, QCOW2_DISCARD_ALWAYS);
>>>>> +        s->l1_table[i] = 0;
>>>>> +    }
>>>>> +    return 0;
>>>>> +}
>>>>> +
>>>>>   int qcow2_grow_l1_table(BlockDriverState *bs, uint64_t min_size,
>>>>>                           bool exact_size)
>>>>>   {
>>>>
>>>> I haven't checked qcow2_shrink_reftable() for similar kinds of problems,
>>>> I hope Max has.
>>>
>>> Well, it's exactly the same there.
>>
>> Ok, so I'll object to this code without really having looked at it.
> I won't object to your objection. O:-)

Kevin,

Can you help me to reduce the number of patch-set versions? :)
And look at the rest part of the series, thanks!

> Max
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Qemu-devel] [PATCH v5 3/4] qcow2: add shrink image support
  2017-07-13 17:28           ` Pavel Butsykin
@ 2017-07-14  7:28             ` Kevin Wolf
  0 siblings, 0 replies; 16+ messages in thread
From: Kevin Wolf @ 2017-07-14  7:28 UTC (permalink / raw)
  To: Pavel Butsykin; +Cc: Max Reitz, qemu-block, qemu-devel, eblake, armbru, den

Am 13.07.2017 um 19:28 hat Pavel Butsykin geschrieben:
> On 13.07.2017 17:36, Max Reitz wrote:
> >On 2017-07-13 10:41, Kevin Wolf wrote:
> >>Am 12.07.2017 um 18:58 hat Max Reitz geschrieben:
> >>>On 2017-07-12 16:52, Kevin Wolf wrote:
> >>>>Am 12.07.2017 um 13:46 hat Pavel Butsykin geschrieben:
> >>>>>This patch add shrinking of the image file for qcow2. As a result, this allows
> >>>>>us to reduce the virtual image size and free up space on the disk without
> >>>>>copying the image. Image can be fragmented and shrink is done by punching holes
> >>>>>in the image file.
> >>>>>
> >>>>>Signed-off-by: Pavel Butsykin <pbutsykin@virtuozzo.com>
> >>>>>Reviewed-by: Max Reitz <mreitz@redhat.com>
> >>>>>---
> >>>>>  block/qcow2-cluster.c  |  40 ++++++++++++++++++
> >>>>>  block/qcow2-refcount.c | 110 +++++++++++++++++++++++++++++++++++++++++++++++++
> >>>>>  block/qcow2.c          |  43 +++++++++++++++----
> >>>>>  block/qcow2.h          |  14 +++++++
> >>>>>  qapi/block-core.json   |   3 +-
> >>>>>  5 files changed, 200 insertions(+), 10 deletions(-)
> >>>>>
> >>>>>diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
> >>>>>index f06c08f64c..518429c64b 100644
> >>>>>--- a/block/qcow2-cluster.c
> >>>>>+++ b/block/qcow2-cluster.c
> >>>>>@@ -32,6 +32,46 @@
> >>>>>  #include "qemu/bswap.h"
> >>>>>  #include "trace.h"
> >>>>>+int qcow2_shrink_l1_table(BlockDriverState *bs, uint64_t exact_size)
> >>>>>+{
> >>>>>+    BDRVQcow2State *s = bs->opaque;
> >>>>>+    int new_l1_size, i, ret;
> >>>>>+
> >>>>>+    if (exact_size >= s->l1_size) {
> >>>>>+        return 0;
> >>>>>+    }
> >>>>>+
> >>>>>+    new_l1_size = exact_size;
> >>>>>+
> >>>>>+#ifdef DEBUG_ALLOC2
> >>>>>+    fprintf(stderr, "shrink l1_table from %d to %d\n", s->l1_size, new_l1_size);
> >>>>>+#endif
> >>>>>+
> >>>>>+    BLKDBG_EVENT(bs->file, BLKDBG_L1_SHRINK_WRITE_TABLE);
> >>>>>+    ret = bdrv_pwrite_zeroes(bs->file, s->l1_table_offset +
> >>>>>+                                       sizeof(uint64_t) * new_l1_size,
> >>>>>+                             (s->l1_size - new_l1_size) * sizeof(uint64_t), 0);
> >>>>>+    if (ret < 0) {
> >>>>>+        return ret;
> >>>>>+    }
> >>>>>+
> >>>>>+    ret = bdrv_flush(bs->file->bs);
> >>>>>+    if (ret < 0) {
> >>>>>+        return ret;
> >>>>>+    }
> >>>>
> >>>>If we have an error here (or after a partial bdrv_pwrite_zeroes()), we
> >>>>have entries zeroed out on disk, but in memory we still have the
> >>>>original L1 table.
> >>>>
> >>>>Should the in-memory L1 table be zeroed first? Then we can't
> >>>>accidentally reuse stale entries, but would have to allocate new ones,
> >>>>which would get on-disk state and in-memory state back in sync again.
> >>>
> >>>Yes, I thought of the same.  But this implies that the allocation is
> >>>able to modify the L1 table, and I find that unlikely if that
> >>>bdrv_flush() failed already...
> >>>
> >>>So I concluded not to have an opinion on which order is better.
> >>
> >>Well, let me ask the other way round: Is there any disadvantage in first
> >>zeroing the in-memory table and then writing to the image?
> >
> >I was informed that the code would be harder to write. :-)
> >
> >>If I have a choice between "always safe" and "not completely safe, but I
> >>think it's unlikely to happen", especially in image formats, then I will
> >>certainly take the "always safe".
> >>
> >>>>>+    BLKDBG_EVENT(bs->file, BLKDBG_L1_SHRINK_FREE_L2_CLUSTERS);
> >>>>>+    for (i = s->l1_size - 1; i > new_l1_size - 1; i--) {
> >>>>>+        if ((s->l1_table[i] & L1E_OFFSET_MASK) == 0) {
> >>>>>+            continue;
> >>>>>+        }
> >>>>>+        qcow2_free_clusters(bs, s->l1_table[i] & L1E_OFFSET_MASK,
> >>>>>+                            s->cluster_size, QCOW2_DISCARD_ALWAYS);
> >>>>>+        s->l1_table[i] = 0;
> >>>>>+    }
> >>>>>+    return 0;
> >>>>>+}
> >>>>>+
> >>>>>  int qcow2_grow_l1_table(BlockDriverState *bs, uint64_t min_size,
> >>>>>                          bool exact_size)
> >>>>>  {
> >>>>
> >>>>I haven't checked qcow2_shrink_reftable() for similar kinds of problems,
> >>>>I hope Max has.
> >>>
> >>>Well, it's exactly the same there.
> >>
> >>Ok, so I'll object to this code without really having looked at it.
> >I won't object to your objection. O:-)
> 
> Kevin,
> 
> Can you help me to reduce the number of patch-set versions? :)
> And look at the rest part of the series, thanks!

Patches 1 and 2 looked good to me. I didn't have a look at the test
case, but if it passes and Max reviewed it, it can't be too bad. ;-)

Kevin

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Qemu-devel] [PATCH v5 3/4] qcow2: add shrink image support
  2017-07-13 17:18         ` Pavel Butsykin
@ 2017-07-14  7:44           ` Kevin Wolf
  0 siblings, 0 replies; 16+ messages in thread
From: Kevin Wolf @ 2017-07-14  7:44 UTC (permalink / raw)
  To: Pavel Butsykin; +Cc: Max Reitz, qemu-block, qemu-devel, eblake, armbru, den

Am 13.07.2017 um 19:18 hat Pavel Butsykin geschrieben:
> On 13.07.2017 11:41, Kevin Wolf wrote:
> >Am 12.07.2017 um 18:58 hat Max Reitz geschrieben:
> >>On 2017-07-12 16:52, Kevin Wolf wrote:
> >>>Am 12.07.2017 um 13:46 hat Pavel Butsykin geschrieben:
> >>>>This patch add shrinking of the image file for qcow2. As a result, this allows
> >>>>us to reduce the virtual image size and free up space on the disk without
> >>>>copying the image. Image can be fragmented and shrink is done by punching holes
> >>>>in the image file.
> >>>>
> >>>>Signed-off-by: Pavel Butsykin <pbutsykin@virtuozzo.com>
> >>>>Reviewed-by: Max Reitz <mreitz@redhat.com>
> >>>>---
> >>>>  block/qcow2-cluster.c  |  40 ++++++++++++++++++
> >>>>  block/qcow2-refcount.c | 110 +++++++++++++++++++++++++++++++++++++++++++++++++
> >>>>  block/qcow2.c          |  43 +++++++++++++++----
> >>>>  block/qcow2.h          |  14 +++++++
> >>>>  qapi/block-core.json   |   3 +-
> >>>>  5 files changed, 200 insertions(+), 10 deletions(-)
> >>>>
> >>>>diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
> >>>>index f06c08f64c..518429c64b 100644
> >>>>--- a/block/qcow2-cluster.c
> >>>>+++ b/block/qcow2-cluster.c
> >>>>@@ -32,6 +32,46 @@
> >>>>  #include "qemu/bswap.h"
> >>>>  #include "trace.h"
> >>>>+int qcow2_shrink_l1_table(BlockDriverState *bs, uint64_t exact_size)
> >>>>+{
> >>>>+    BDRVQcow2State *s = bs->opaque;
> >>>>+    int new_l1_size, i, ret;
> >>>>+
> >>>>+    if (exact_size >= s->l1_size) {
> >>>>+        return 0;
> >>>>+    }
> >>>>+
> >>>>+    new_l1_size = exact_size;
> >>>>+
> >>>>+#ifdef DEBUG_ALLOC2
> >>>>+    fprintf(stderr, "shrink l1_table from %d to %d\n", s->l1_size, new_l1_size);
> >>>>+#endif
> >>>>+
> >>>>+    BLKDBG_EVENT(bs->file, BLKDBG_L1_SHRINK_WRITE_TABLE);
> >>>>+    ret = bdrv_pwrite_zeroes(bs->file, s->l1_table_offset +
> >>>>+                                       sizeof(uint64_t) * new_l1_size,
> >>>>+                             (s->l1_size - new_l1_size) * sizeof(uint64_t), 0);
> >>>>+    if (ret < 0) {
> >>>>+        return ret;
> >>>>+    }
> >>>>+
> >>>>+    ret = bdrv_flush(bs->file->bs);
> >>>>+    if (ret < 0) {
> >>>>+        return ret;
> >>>>+    }
> >>>
> >>>If we have an error here (or after a partial bdrv_pwrite_zeroes()), we
> >>>have entries zeroed out on disk, but in memory we still have the
> >>>original L1 table.
> >>>
> >>>Should the in-memory L1 table be zeroed first? Then we can't
> >>>accidentally reuse stale entries, but would have to allocate new ones,
> >>>which would get on-disk state and in-memory state back in sync again.
> >>
> >>Yes, I thought of the same.  But this implies that the allocation is
> >>able to modify the L1 table, and I find that unlikely if that
> >>bdrv_flush() failed already...
> >>
> >>So I concluded not to have an opinion on which order is better.
> >
> >Well, let me ask the other way round: Is there any disadvantage in first
> >zeroing the in-memory table and then writing to the image?
> 
> If bdrv_flush/drv_pwrite_zeroes function failed, the subsequent writes
> to truncated area lead to allocation L2 tables. This implies two things:
> 
> 1. We need call qcow2_free_clusters() after bdrv_flush/drv_pwrite_zeroes
> anyway, otherwise it may lead to the situation when the l1 table will
> have two identical offsets.

Yes, I'm only talking about zeroing the in-memory L1 table first, not
moving anything else.

> 2. Old l2 blocks may be lost and will be dead weight for the image.

Yes, without a journal, we can't avoid leaked clusters in certain error
cases. They are harmless, though, and 'qemu-img check' can detect and
repair them without any risk.

Leaked clusters are always better than actual image corruption, which is
what would potentially happen with your current order of operations.

> >If I have a choice between "always safe" and "not completely safe, but I
> >think it's unlikely to happen", especially in image formats, then I will
> >certainly take the "always safe".
> 
> In my understanding both cases are "unsafe", because both cases may lead
> to inconsistent state between image and memory.

Yes, we cannot guarantee full consistency of the image, but that doesn't
make it unsafe yet. I would only call things unsafe if we may end up
corrupting user data. This is the most important thing for an image
format to avoid. If at all possible, we need to write our code so that
no matter what happens, the worst thing we get is leaked clusters.

> When writing this code I was looking at an existing approach in
> qcow2*.c to such kind of issues:
> ...
> static int l2_allocate(BlockDriverState *bs, int l1_index, uint64_t **table)
> ...
>     trace_qcow2_l2_allocate_write_l1(bs, l1_index);
>     s->l1_table[l1_index] = l2_offset | QCOW_OFLAG_COPIED;
>     ret = qcow2_write_l1_entry(bs, l1_index);
>     if (ret < 0) {
>         goto fail;
>     }
> 
>     *table = l2_table;
>     trace_qcow2_l2_allocate_done(bs, l1_index, 0);
>     return 0;
> 
> fail:
>     trace_qcow2_l2_allocate_done(bs, l1_index, ret);
>     if (l2_table != NULL) {
>         qcow2_cache_put(bs, s->l2_table_cache, (void**) table);
>     }
>     s->l1_table[l1_index] = old_l2_offset;
>     if (l2_offset > 0) {
>         qcow2_free_clusters(bs, l2_offset, s->l2_size * sizeof(uint64_t),
>                             QCOW2_DISCARD_ALWAYS);
>     }
>     return ret;
> 
> Do I understand correctly? that if qcow2_write_l1_entr call fails
> somewhere in the middle of bdrv_flush() then l1_table still contain the
> old l2_offset, but the image may already contain the new l2_offset. We
> can continue to use the image and write data, but in this case we lose
> the data after reopening image.

You're right, if the error happens somewhere in the middle, we have a
serious problem there. :-(

I think the only way out in this case would be to write the new data at
a different position in the image file than the old data. (Do I hear
anyone say "journal"?)

> So it seemed to me that in current conditions we can't escape from such
> kind of problem completely. But I understand your desire to do better
> even in little things, I agree that would be a little safer in the case
> "first zeroing the in-memory table and then writing". So if this
> solution suits all, let me write the next patch-set version :)

Yay! :-)

Kevin

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2017-07-14  7:44 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2017-07-12 11:46 [Qemu-devel] [PATCH v5 0/4] Add shrink image for qcow2 Pavel Butsykin
2017-07-12 11:46 ` [Qemu-devel] [PATCH v5 1/4] qemu-img: add --shrink flag for resize Pavel Butsykin
2017-07-12 11:46 ` [Qemu-devel] [PATCH v5 2/4] qcow2: add qcow2_cache_discard Pavel Butsykin
2017-07-12 14:45   ` Kevin Wolf
2017-07-13 17:18     ` Pavel Butsykin
2017-07-12 11:46 ` [Qemu-devel] [PATCH v5 3/4] qcow2: add shrink image support Pavel Butsykin
2017-07-12 14:52   ` Kevin Wolf
2017-07-12 16:58     ` Max Reitz
2017-07-13  8:41       ` Kevin Wolf
2017-07-13 14:36         ` Max Reitz
2017-07-13 17:28           ` Pavel Butsykin
2017-07-14  7:28             ` Kevin Wolf
2017-07-13 17:18         ` Pavel Butsykin
2017-07-14  7:44           ` Kevin Wolf
2017-07-12 11:47 ` [Qemu-devel] [PATCH v5 4/4] qemu-iotests: add shrinking image test Pavel Butsykin
2017-07-12 14:04 ` [Qemu-devel] [PATCH v5 0/4] Add shrink image for qcow2 Max Reitz

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).