qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
From: Eric Blake <eblake@redhat.com>
To: Max Reitz <mreitz@redhat.com>, qemu-devel@nongnu.org
Cc: Kevin Wolf <kwolf@redhat.com>, Peter Lieven <pl@kamp.de>,
	Stefan Hajnoczi <stefanha@redhat.com>
Subject: Re: [Qemu-devel] [PATCH 18/21] qcow2: Add function for refcount order amendment
Date: Wed, 12 Nov 2014 06:50:12 -0700	[thread overview]
Message-ID: <54636594.8010701@redhat.com> (raw)
In-Reply-To: <54632E96.4070806@redhat.com>

[-- Attachment #1: Type: text/plain, Size: 8880 bytes --]

On 11/12/2014 02:55 AM, Max Reitz wrote:
> On 2014-11-12 at 05:15, Eric Blake wrote:
>> On 11/10/2014 06:45 AM, Max Reitz wrote:
>>> Add a function qcow2_change_refcount_order() which allows changing the
>>> refcount order of a qcow2 image.
>>>
>>> Signed-off-by: Max Reitz <mreitz@redhat.com>
>>> ---
>>>   block/qcow2-refcount.c | 424
>>> +++++++++++++++++++++++++++++++++++++++++++++++++
>>>   block/qcow2.h          |   4 +
>>>   2 files changed, 428 insertions(+)
>> This is fairly big; you may want to get a second review from a
>> maintainer rather than blindly trusting me.
> 
> I didn't really see the point in splitting it up. Introducing the static
> helper functions first would not be very helpful either, so I thought
> what I'd do as a reviewer, and that was "apply the patch and just read
> through all the code". Splitting it into multiple patches would not have
> helped there (and I don't see how I could split this patch into logical
> changes, where at the start we have some inefficient but simple
> implementation and it gets better over time).

Yes, I agree that your patch is not divisible.  Big doesn't always mean
bad :)


>>> +                refcount = s->get_refcount(refblock, refblock_index);
>>> +                if (new_refcount_bits < 64 && refcount >>
>>> new_refcount_bits) {
>> Technically, this get_refcount() call is dead code on the second walk,
>> since the first walk already validated things, so you could push all of
>> this code...
>>
>>> +                    uint64_t offset;
>>> +
>>> +                    qcow2_cache_put(bs, s->refcount_block_cache,
>>> &refblock);
>>> +
>>> +                    offset = ((reftable_index <<
>>> s->refcount_block_bits)
>>> +                              + refblock_index) << s->cluster_bits;
>>> +
>>> +                    error_setg(errp, "Cannot decrease refcount entry
>>> width to "
>>> +                               "%i bits: Cluster at offset %#"
>>> PRIx64 " has a "
>>> +                               "refcount of %" PRIu64,
>>> new_refcount_bits,
>>> +                               offset, refcount);
>>> +                    return -EINVAL;
>>> +                }
>>> +
>>> +                if (new_set_refcount) {
>>> +                    new_set_refcount(new_refblock,
>>> new_refblock_index++, refcount);
>>> +                } else {
>> ...here, in the branch only run on the first walk.
> 
> Well, yes, but I wanted to keep this function as agnostic to what the
> caller wants to do with it as possible. I'd rather decide depending on
> whether index == 0 because that's a better way of discerning the first
> walk.

I was thinking a bit more about how to avoid the allocation corner case
bug, and wonder if three walks instead of 2 is the right solution, in
which case the first two walks are both allocations, and index==0 is no
longer a reliable witness of whether these checks are needed.  More on
that below.

> 
>>> +                    new_refblock_index++;
>>> +                }
>>> +                new_refblock_empty = new_refblock_empty && refcount
>>> == 0;
>> Worth condensing to 'new_refblock_empty &= !refcount'?  Maybe not.
> 
> I personally would find that harder to read.

No need to change it then.


>>> +                if (new_set_refcount) {
>>> +                    new_set_refcount(new_refblock,
>>> new_refblock_index++, 0);
>> Would it be worth guaranteeing that every new refblock is 0-initialized
>> when allocated, so that you can skip setting a refcount to 0?  This
>> question depends on the answer about block allocation asked at [4] above.
> 
> This function sets a value in the buffer new_refblock, not in the
> cluster on disk. Therefore, in order to be able to omit this call, we'd
> have to call a memset() with 0 on new_refblock after each call to
> operation(). I don't think it's worth it. This is more explicit and
> won't cost much performance.

Yeah, that's about the same conclusion I came to after finishing the
whole review, although I didn't state it very well (this was one of my
earlier comments in my non-linear review, that I didn't touch up later).


>>> +    /* The new_reftable_size is now valid and will not be changed
>>> anymore,
>>> +     * so we can now allocate the reftable */
>>> +    new_reftable_offset = qcow2_alloc_clusters(bs, new_reftable_size *
>>> +                                                   sizeof(uint64_t));
>> And here is your bug, that I hinted at with the mention of [3] above.
>> This allocation can potentially cause an overflow of the existing
>> reftable to occupy one more cluster.
> 
> An additional bug is that the new reftable may be referenced by an
> existing refblock which was completely empty, though (or at least
> referenced by part of an existing refblock which was to be turned into a
> new refblock which was then completely empty, and thus omitted from
> allocation).
> 
>> Remember my thought experiment
>> above, how an 8 megabyte image rolls from 1 to 2 clusters during the
>> course of allocating refblocks for the new table?  What if the original
>> image wasn't completely full, but things are perfectly sized with enough
>> free clusters, then all of the refblock allocations done during the
>> first walk will still fit, and it is only this final allocation of the
>> new reftable that will cause the rollover, at which point we've failed
>> to account for the new refblock size.  That is, I think I could craft an
>> image that would trigger either an assertion failure or an out-of-bounds
>> array access during the second walk.
> 
> You're completely right, this will be a pain to fix, though... The
> simplest way would probably to check whether the new_reftable_size
> should be increased due to this operation and if it did, rerun
> walk_over_reftable() with the alloc_refblock() function only allocating
> refblocks if the new reftable does not already point to a refblock for a
> certain index. This would be repeated until the new_reftable_size is
> constant. And the really simplest incarnation of this would be to have a
> flag whether any allocations were done and repeat until everything is fine.
> 
> Another way would be to somehow integrate the allocation of the new
> reftable into walk_over_reftable() and then to only mind the additional
> reftable entries.
> 
> But probably the first way is the correct one because due to
> reallocation of the old reftable, some intermediate refblocks which were
> empty before are now filled.
> 
> I'll have to craft some test images myself, not least to be able to
> include them in the iotest.

Overnight, I had been thinking about it (I don't know if it's a good or
a bad thing when a patch is so mentally engaging that it becomes the
thing on my mind).  My initial idea was to teach the walk function how
to start and stop at given limits, maybe by passing an address to
dereference as the end point.  Then do something like:

original_limit = s->refcount_table_size;
walk allocations with limits of 0, &original_limit
allocate a contiguous block of limit+1 clusters for the new reftable
walk allocations with limits of original_limit, &s->refcount_table_size
if limit+1 is too big after all (most of the time), then free the tail
walk refcount assignment with limits of 0, &s->refcount_table_size

Allocating limit+1 for the new refblock should always be safe.  In the
majority of cases, it overallocates; but as we are already having
several small holes in the image after freeing the original reftable
means it won't hurt to have one more cluster hole.  It's better to have
the algorithm be safe and waste a few clusters than to figure out how to
pack it into the most efficient space possible at the expense of more
code complexity.  And maybe someday we'll implement a cluster
defragmenter for packing an image down into no holes.  In the minority
of cases, the +1 should always be sufficient to cover any additional
allocation spillovers.

But you raise a good point that the original image may have holes on the
first walk, where allocating refblocks for the new table will turn those
holes into refcounts that must be picked up, so I think you are right
that you have to walk the ENTIRE image on the second pass, rather than
just the tail of the image where you stopped early on the first pass.

Good luck on coming up with a plan to tackle it.

>> Interesting patch.  Hope my review helps you prepare a better v2.
> 
> If everything else fails, I'll just split the amend stuff from this
> series. But I'll work it out somehow. And your review will definitely
> help, thanks a lot!

Glad to hear it.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 539 bytes --]

  reply	other threads:[~2014-11-12 13:50 UTC|newest]

Thread overview: 75+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-11-10 13:45 [Qemu-devel] [PATCH 00/21] qcow2: Support refcount orders != 4 Max Reitz
2014-11-10 13:45 ` [Qemu-devel] [PATCH 01/21] qcow2: Add two new fields to BDRVQcowState Max Reitz
2014-11-10 19:00   ` Eric Blake
2014-11-10 13:45 ` [Qemu-devel] [PATCH 02/21] qcow2: Add refcount_width to format-specific info Max Reitz
2014-11-10 19:06   ` Eric Blake
2014-11-11  8:11     ` Max Reitz
2014-11-11 15:49       ` Eric Blake
2014-11-10 13:45 ` [Qemu-devel] [PATCH 03/21] qcow2: Use 64 bits for refcount values Max Reitz
2014-11-10 20:59   ` Eric Blake
2014-11-11  8:12     ` Max Reitz
2014-11-11  9:22     ` Kevin Wolf
2014-11-11  9:25       ` Max Reitz
2014-11-11  9:26         ` Max Reitz
2014-11-11  9:36         ` Kevin Wolf
2014-11-10 13:45 ` [Qemu-devel] [PATCH 04/21] qcow2: Respect error in qcow2_alloc_bytes() Max Reitz
2014-11-10 21:05   ` Eric Blake
2014-11-10 13:45 ` [Qemu-devel] [PATCH 05/21] qcow2: Refcount overflow and qcow2_alloc_bytes() Max Reitz
2014-11-10 21:12   ` Eric Blake
2014-11-11  8:22     ` Max Reitz
2014-11-11 16:13       ` Eric Blake
2014-11-11 16:18         ` Max Reitz
2014-11-11 19:49           ` Eric Blake
2014-11-12  8:52             ` Max Reitz
2014-11-10 13:45 ` [Qemu-devel] [PATCH 06/21] qcow2: Helper function for refcount modification Max Reitz
2014-11-10 22:30   ` Eric Blake
2014-11-11  8:35     ` Max Reitz
2014-11-11  9:43       ` Max Reitz
2014-11-11 10:56       ` Max Reitz
2014-11-10 13:45 ` [Qemu-devel] [PATCH 07/21] qcow2: Helper for refcount array size calculation Max Reitz
2014-11-10 22:49   ` Eric Blake
2014-11-11  8:37     ` Max Reitz
2014-11-11 10:08       ` Max Reitz
2014-11-10 13:45 ` [Qemu-devel] [PATCH 08/21] qcow2: More helpers for refcount modification Max Reitz
2014-11-11  0:29   ` Eric Blake
2014-11-11  8:42     ` Max Reitz
2014-11-10 13:45 ` [Qemu-devel] [PATCH 09/21] qcow2: Open images with refcount order != 4 Max Reitz
2014-11-10 17:03   ` Eric Blake
2014-11-10 17:06     ` Max Reitz
2014-11-10 13:45 ` [Qemu-devel] [PATCH 10/21] qcow2: refcount_order parameter for qcow2_create2 Max Reitz
2014-11-11  5:40   ` Eric Blake
2014-11-11  8:48     ` Max Reitz
2014-11-10 13:45 ` [Qemu-devel] [PATCH 11/21] iotests: Prepare for refcount_width option Max Reitz
2014-11-11 17:57   ` Eric Blake
2014-11-12  8:41     ` Max Reitz
2014-11-10 13:45 ` [Qemu-devel] [PATCH 12/21] qcow2: Allow creation with refcount order != 4 Max Reitz
2014-11-11 18:05   ` Eric Blake
2014-11-12  8:47     ` Max Reitz
2014-11-10 13:45 ` [Qemu-devel] [PATCH 13/21] block: Add opaque value to the amend CB Max Reitz
2014-11-11 18:08   ` Eric Blake
2014-11-10 13:45 ` [Qemu-devel] [PATCH 14/21] qcow2: Use error_report() in qcow2_amend_options() Max Reitz
2014-11-11 18:11   ` Eric Blake
2014-11-12  8:47     ` Max Reitz
2014-11-10 13:45 ` [Qemu-devel] [PATCH 15/21] qcow2: Use abort() instead of assert(false) Max Reitz
2014-11-11 18:12   ` Eric Blake
2014-11-12  8:48     ` Max Reitz
2014-11-10 13:45 ` [Qemu-devel] [PATCH 16/21] qcow2: Split upgrade/downgrade paths for amend Max Reitz
2014-11-11 18:14   ` Eric Blake
2014-11-10 13:45 ` [Qemu-devel] [PATCH 17/21] qcow2: Use intermediate helper CB " Max Reitz
2014-11-11 21:05   ` Eric Blake
2014-11-12  9:10     ` Max Reitz
2014-11-10 13:45 ` [Qemu-devel] [PATCH 18/21] qcow2: Add function for refcount order amendment Max Reitz
2014-11-12  4:15   ` Eric Blake
2014-11-12  9:55     ` Max Reitz
2014-11-12 13:50       ` Eric Blake [this message]
2014-11-12 14:19   ` Eric Blake
2014-11-12 14:21     ` Max Reitz
2014-11-12 17:45       ` Max Reitz
2014-11-12 20:21         ` Eric Blake
2014-11-10 13:45 ` [Qemu-devel] [PATCH 19/21] qcow2: Invoke refcount order amendment function Max Reitz
2014-11-12  4:36   ` Eric Blake
2014-11-10 13:45 ` [Qemu-devel] [PATCH 20/21] qcow2: Point to amend function in check Max Reitz
2014-11-12  4:38   ` Eric Blake
2014-11-10 13:45 ` [Qemu-devel] [PATCH 21/21] iotests: Add test for different refcount widths Max Reitz
2014-11-11 19:53   ` Eric Blake
2014-11-12  8:58     ` Max Reitz

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=54636594.8010701@redhat.com \
    --to=eblake@redhat.com \
    --cc=kwolf@redhat.com \
    --cc=mreitz@redhat.com \
    --cc=pl@kamp.de \
    --cc=qemu-devel@nongnu.org \
    --cc=stefanha@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).