Re: [Qemu-devel] [PATCH v5 3/4] qcow2: add shrink image support

qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed

From: Pavel Butsykin <pbutsykin@virtuozzo.com>
To: Kevin Wolf <kwolf@redhat.com>, Max Reitz <mreitz@redhat.com>
Cc: qemu-block@nongnu.org, qemu-devel@nongnu.org, eblake@redhat.com,
	armbru@redhat.com, den@openvz.org
Subject: Re: [Qemu-devel] [PATCH v5 3/4] qcow2: add shrink image support
Date: Thu, 13 Jul 2017 20:18:59 +0300	[thread overview]
Message-ID: <e6d3c11d-a020-457c-d38f-4114e3023dc8@virtuozzo.com> (raw)
In-Reply-To: <20170713084126.GA4139@noname.redhat.com>

On 13.07.2017 11:41, Kevin Wolf wrote:
> Am 12.07.2017 um 18:58 hat Max Reitz geschrieben:
>> On 2017-07-12 16:52, Kevin Wolf wrote:
>>> Am 12.07.2017 um 13:46 hat Pavel Butsykin geschrieben:
>>>> This patch add shrinking of the image file for qcow2. As a result, this allows
>>>> us to reduce the virtual image size and free up space on the disk without
>>>> copying the image. Image can be fragmented and shrink is done by punching holes
>>>> in the image file.
>>>>
>>>> Signed-off-by: Pavel Butsykin <pbutsykin@virtuozzo.com>
>>>> Reviewed-by: Max Reitz <mreitz@redhat.com>
>>>> ---
>>>>   block/qcow2-cluster.c  |  40 ++++++++++++++++++
>>>>   block/qcow2-refcount.c | 110 +++++++++++++++++++++++++++++++++++++++++++++++++
>>>>   block/qcow2.c          |  43 +++++++++++++++----
>>>>   block/qcow2.h          |  14 +++++++
>>>>   qapi/block-core.json   |   3 +-
>>>>   5 files changed, 200 insertions(+), 10 deletions(-)
>>>>
>>>> diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
>>>> index f06c08f64c..518429c64b 100644
>>>> --- a/block/qcow2-cluster.c
>>>> +++ b/block/qcow2-cluster.c
>>>> @@ -32,6 +32,46 @@
>>>>   #include "qemu/bswap.h"
>>>>   #include "trace.h"
>>>>   
>>>> +int qcow2_shrink_l1_table(BlockDriverState *bs, uint64_t exact_size)
>>>> +{
>>>> +    BDRVQcow2State *s = bs->opaque;
>>>> +    int new_l1_size, i, ret;
>>>> +
>>>> +    if (exact_size >= s->l1_size) {
>>>> +        return 0;
>>>> +    }
>>>> +
>>>> +    new_l1_size = exact_size;
>>>> +
>>>> +#ifdef DEBUG_ALLOC2
>>>> +    fprintf(stderr, "shrink l1_table from %d to %d\n", s->l1_size, new_l1_size);
>>>> +#endif
>>>> +
>>>> +    BLKDBG_EVENT(bs->file, BLKDBG_L1_SHRINK_WRITE_TABLE);
>>>> +    ret = bdrv_pwrite_zeroes(bs->file, s->l1_table_offset +
>>>> +                                       sizeof(uint64_t) * new_l1_size,
>>>> +                             (s->l1_size - new_l1_size) * sizeof(uint64_t), 0);
>>>> +    if (ret < 0) {
>>>> +        return ret;
>>>> +    }
>>>> +
>>>> +    ret = bdrv_flush(bs->file->bs);
>>>> +    if (ret < 0) {
>>>> +        return ret;
>>>> +    }
>>>
>>> If we have an error here (or after a partial bdrv_pwrite_zeroes()), we
>>> have entries zeroed out on disk, but in memory we still have the
>>> original L1 table.
>>>
>>> Should the in-memory L1 table be zeroed first? Then we can't
>>> accidentally reuse stale entries, but would have to allocate new ones,
>>> which would get on-disk state and in-memory state back in sync again.
>>
>> Yes, I thought of the same.  But this implies that the allocation is
>> able to modify the L1 table, and I find that unlikely if that
>> bdrv_flush() failed already...
>>
>> So I concluded not to have an opinion on which order is better.
> 
> Well, let me ask the other way round: Is there any disadvantage in first
> zeroing the in-memory table and then writing to the image?

If bdrv_flush/drv_pwrite_zeroes function failed, the subsequent writes
to truncated area lead to allocation L2 tables. This implies two things:

1. We need call qcow2_free_clusters() after bdrv_flush/drv_pwrite_zeroes
anyway, otherwise it may lead to the situation when the l1 table will
have two identical offsets.

2. Old l2 blocks may be lost and will be dead weight for the image.

> If I have a choice between "always safe" and "not completely safe, but I
> think it's unlikely to happen", especially in image formats, then I will
> certainly take the "always safe".

In my understanding both cases are "unsafe", because both cases may lead
to inconsistent state between image and memory. When writing this code I
was looking at an existing approach in qcow2*.c to such kind of issues:
...
static int l2_allocate(BlockDriverState *bs, int l1_index, uint64_t **table)
...
     trace_qcow2_l2_allocate_write_l1(bs, l1_index);
     s->l1_table[l1_index] = l2_offset | QCOW_OFLAG_COPIED;
     ret = qcow2_write_l1_entry(bs, l1_index);
     if (ret < 0) {
         goto fail;
     }

     *table = l2_table;
     trace_qcow2_l2_allocate_done(bs, l1_index, 0);
     return 0;

fail:
     trace_qcow2_l2_allocate_done(bs, l1_index, ret);
     if (l2_table != NULL) {
         qcow2_cache_put(bs, s->l2_table_cache, (void**) table);
     }
     s->l1_table[l1_index] = old_l2_offset;
     if (l2_offset > 0) {
         qcow2_free_clusters(bs, l2_offset, s->l2_size * sizeof(uint64_t),
                             QCOW2_DISCARD_ALWAYS);
     }
     return ret;

Do I understand correctly? that if qcow2_write_l1_entr call fails
somewhere in the middle of bdrv_flush() then l1_table still contain the
old l2_offset, but the image may already contain the new l2_offset. We
can continue to use the image and write data, but in this case we lose
the data after reopening image.

So it seemed to me that in current conditions we can't escape from such
kind of problem completely. But I understand your desire to do better
even in little things, I agree that would be a little safer in the case
"first zeroing the in-memory table and then writing". So if this
solution suits all, let me write the next patch-set version :)

>>>> +    BLKDBG_EVENT(bs->file, BLKDBG_L1_SHRINK_FREE_L2_CLUSTERS);
>>>> +    for (i = s->l1_size - 1; i > new_l1_size - 1; i--) {
>>>> +        if ((s->l1_table[i] & L1E_OFFSET_MASK) == 0) {
>>>> +            continue;
>>>> +        }
>>>> +        qcow2_free_clusters(bs, s->l1_table[i] & L1E_OFFSET_MASK,
>>>> +                            s->cluster_size, QCOW2_DISCARD_ALWAYS);
>>>> +        s->l1_table[i] = 0;
>>>> +    }
>>>> +    return 0;
>>>> +}
>>>> +
>>>>   int qcow2_grow_l1_table(BlockDriverState *bs, uint64_t min_size,
>>>>                           bool exact_size)
>>>>   {
>>>
>>> I haven't checked qcow2_shrink_reftable() for similar kinds of problems,
>>> I hope Max has.
>>
>> Well, it's exactly the same there.
> 
> Ok, so I'll object to this code without really having looked at it.
> 
> Kevin
>

next prev parent reply	other threads:[~2017-07-13 17:20 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-07-12 11:46 [Qemu-devel] [PATCH v5 0/4] Add shrink image for qcow2 Pavel Butsykin
2017-07-12 11:46 ` [Qemu-devel] [PATCH v5 1/4] qemu-img: add --shrink flag for resize Pavel Butsykin
2017-07-12 11:46 ` [Qemu-devel] [PATCH v5 2/4] qcow2: add qcow2_cache_discard Pavel Butsykin
2017-07-12 14:45   ` Kevin Wolf
2017-07-13 17:18     ` Pavel Butsykin
2017-07-12 11:46 ` [Qemu-devel] [PATCH v5 3/4] qcow2: add shrink image support Pavel Butsykin
2017-07-12 14:52   ` Kevin Wolf
2017-07-12 16:58     ` Max Reitz
2017-07-13  8:41       ` Kevin Wolf
2017-07-13 14:36         ` Max Reitz
2017-07-13 17:28           ` Pavel Butsykin
2017-07-14  7:28             ` Kevin Wolf
2017-07-13 17:18         ` Pavel Butsykin [this message]
2017-07-14  7:44           ` Kevin Wolf
2017-07-12 11:47 ` [Qemu-devel] [PATCH v5 4/4] qemu-iotests: add shrinking image test Pavel Butsykin
2017-07-12 14:04 ` [Qemu-devel] [PATCH v5 0/4] Add shrink image for qcow2 Max Reitz

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=e6d3c11d-a020-457c-d38f-4114e3023dc8@virtuozzo.com \
    --to=pbutsykin@virtuozzo.com \
    --cc=armbru@redhat.com \
    --cc=den@openvz.org \
    --cc=eblake@redhat.com \
    --cc=kwolf@redhat.com \
    --cc=mreitz@redhat.com \
    --cc=qemu-block@nongnu.org \
    --cc=qemu-devel@nongnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).