Re: [Qemu-devel] [PATCH 2/2] qcow2: Avoid memory over-allocation on compressed images

qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed

From: John Snow <jsnow@redhat.com>
To: Eric Blake <eblake@redhat.com>, Alberto Garcia <berto@igalia.com>,
	qemu-devel@nongnu.org
Cc: Kevin Wolf <kwolf@redhat.com>, qemu-block@nongnu.org, mreitz@redhat.com
Subject: Re: [Qemu-devel] [PATCH 2/2] qcow2: Avoid memory over-allocation on compressed images
Date: Wed, 21 Feb 2018 13:32:34 -0500	[thread overview]
Message-ID: <83a5559e-1e15-ae68-7507-14778ad6e7f9@redhat.com> (raw)
In-Reply-To: <e2247007-78c1-9f34-f0c9-cc689111f895@redhat.com>



On 02/21/2018 10:59 AM, Eric Blake wrote:
> On 02/21/2018 09:00 AM, Eric Blake wrote:
>> On 02/21/2018 04:04 AM, Alberto Garcia wrote:
>>> On Tue 20 Feb 2018 11:24:59 PM CET, Eric Blake wrote:
>>>
>>> I was also preparing a patch to change this, but you arrived first :-)
>>>
>>>> So, it's time to cut back on the waste.  A compressed cluster
>>>> will NEVER occupy more than an uncompressed cluster
> 
> 
>> And here, csize can only get smaller than the length picked by
>> nb_csectors.  Therefore, csize is GUARANTEED to be <= c->sector_size.
>>
>>>
>>> - Solution a: check that csize < s->cluster_size and return an error if
>>>    it's not. However! although QEMU won't produce an image with a
>>>    compressed cluster that is larger than the uncompressed one, the
>>> qcow2
>>>    on-disk format in principle allows for that, so arguably we should
>>>    accept it.
>>
>> No, the qcow2 on-disk format does not have enough bits reserved for
>> that.  It is impossible to store an inflated cluster, because you
>> don't have enough bits to record it.
> 
> Okay, the spec is WRONG, compared to our code base.
> 
>>
>> That said, we MAY have a bug, more likely to be visible in 512-byte
>> clusters but possible on any size.  While the 'number sectors' field
>> IS sufficient for any compressed cluster starting at offset 0 in the
>> cluster, we run into issues if the starting offset is later in the
>> cluster.  That is, even though the compressed length of a 512-byte
>> cluster is <= 512 (or we wouldn't compress it), if we start at offset
>> 510, we NEED to read the next cluster to get the rest of the
>> compressed stream - but at 512-byte clusters, there are 0 bits
>> reserved for 'number sectors'.  Per the above calculations with the
>> example offset of 510, nb_csectors would be 1 (it can't be anything
>> else for a 512-byte cluster image), and csize would then be 2 bytes,
>> which is insufficient for reading back enough data to reconstruct the
>> cluster.
> 
> In fact, here's a demonstration of a discrepancy, where qemu-img and
> John's qcheck tool [1] disagree about the validity of an image created
> by qemu (although it may just be that qcheck hasn't yet learned about
> compressed clusters):
> 
> [1]https://github.com/jnsnow/qcheck
> 

I wouldn't trust my tool's ability to understand compressed clusters :)

I didn't get very far, though I did run across the fact that there
appeared to be a discrepancy between the spec and the actual
implementation, IIRC.

It looked like you came to the same conclusion when you stepped through
it manually.

> $ f=12345678
> $ f=$f$f$f$f # 32
> $ f=$f$f$f$f # 128
> $ f=$f$f$f$f # 512
> $ f=$f$f$f$f # 2k
> $ f=$f$f$f$f # 8k
> $ f=$f$f$f$f # 32k
> $ f=$f$f$f$f # 128k
> $ printf "$f" > data
> $ qemu-img convert -c -f raw -O qcow2 -o cluster_size=512 \
>   data data.qcow2
> $ qemu-img check data.qcow2
> No errors were found on the image.
> 256/256 = 100.00% allocated, 100.00% fragmented, 100.00% compressed
> clusters
> Image end offset: 18944
> $ ./qcheck data.qcow2
> ...
> == L2 Tables ==
> 
> == L2 cluster l1[0] : 0x0000000000000800 ==
> Corrupt entries: 63; Non-zero entries: 64; Corrupt:Non-zero ratio: 0.984375
> == L2 cluster l1[1] : 0x0000000000000e00 ==
> Corrupt entries: 63; Non-zero entries: 64; Corrupt:Non-zero ratio: 0.984375
> == L2 cluster l1[2] : 0x0000000000001400 ==
> Corrupt entries: 63; Non-zero entries: 64; Corrupt:Non-zero ratio: 0.984375
> == L2 cluster l1[3] : 0x0000000000001a00 ==
> Corrupt entries: 63; Non-zero entries: 64; Corrupt:Non-zero ratio: 0.984375
> L2 tables: Could not complete analysis, 257 problems found
> 
> 
> == Reference Count Analysis ==
> 
> Refcount analysis: 00 vacant clusters
> Refcount analysis: 04 leaked clusters
> Refcount analysis: 00 ghost clusters
> Refcount analysis: 04 miscounted clusters
> Refcount analysis: 8 problems found
> 
> 
> == Cluster Counts ==
> 
> Metadata: 0x1000
> Data: 0x800
> Leaked: 0x800
> Vacant: 0x0
> total: 0x2000
> qcheck: 73 problems found
> 
> 
>>
>> Not true.  It is (cluster_bits - 9) or (cluster_size / 512). 
>> Remember, x = 62 - (cluster_bits - 8); for a 512-byte cluster, x =
>> 61.  The 'number sectors' field is then bits x+1 - 61 (but you can't
>> have a bitfield occupying bit 62 upto 61; especially since bit 62 is
>> the bit for compressed cluster).
> 
> So instead of blindly reading the spec, I'm now going to single-stepping
> through the 'qemu-img convert' command line above, to see what REALLY
> happens:
> 
> Line numbers from commit a6e0344fa0
> $ gdb --args ./qemu-img convert -c -f raw -O qcow2 -o cluster_size=512
> data data.qcow2
> ...
> (gdb) b qcow2_alloc_bytes
> Breakpoint 1 at 0x57610: file block/qcow2-refcount.c, line 1052.
> (gdb) r
> Thread 1 "qemu-img" hit Breakpoint 1, qcow2_alloc_bytes (
>     bs=bs@entry=0x555555d87f50, size=size@entry=15)
>     at block/qcow2-refcount.c:1052
> 1052    {
> (gdb)
> 
> So we are compressing 512 bytes down to 15 every time, which means that
> after 34 clusters compressed, we should be at offset 510.  Let's resume
> debugging:
> 
> (gdb) c 34
> Will ignore next 33 crossings of breakpoint 1.  Continuing.
> [Thread 0x7fffe3cfe700 (LWP 32229) exited]
> [New Thread 0x7fffe3cfe700 (LWP 32300)]
> [New Thread 0x7fffe25ed700 (LWP 32301)]
> 
> Thread 1 "qemu-img" hit Breakpoint 1, qcow2_alloc_bytes (
>     bs=bs@entry=0x555555d87f50, size=size@entry=15)
>     at block/qcow2-refcount.c:1052
> 1052    {
> (gdb) n
> 1053        BDRVQcow2State *s = bs->opaque;
> (gdb)
> 1058        BLKDBG_EVENT(bs->file, BLKDBG_CLUSTER_ALLOC_BYTES);
> (gdb)
> 1059        assert(size > 0 && size <= s->cluster_size);
> (gdb) p s->free_byte_offset
> $2 = 3070
> (gdb) p 3070%512
> $3 = 510
> ...
> (gdb)
> 1076        free_in_cluster = s->cluster_size - offset_into_cluster(s,
> offset);
> (gdb)
> 1078            if (!offset || free_in_cluster < size) {
> (gdb) p free_in_cluster
> $4 = 2
> 1079                int64_t new_cluster = alloc_clusters_noref(bs,
> s->cluster_size);
> (gdb)
> 1080                if (new_cluster < 0) {
> (gdb)
> 1084                if (new_cluster == 0) {
> (gdb)
> 1091                if (!offset || ROUND_UP(offset, s->cluster_size) !=
> new_cluster) {
> (gdb)
> 1095                    free_in_cluster += s->cluster_size;
> (gdb)
> 1099            assert(offset);
> 
> so we got a contiguous cluster, and our goal is to let the caller bleed
> the compressed cluster into to the tail of the current sector and into
> the head of the next cluster.  Continuing:
> 
> (gdb) fin
> Run till exit from #0  qcow2_alloc_bytes (bs=bs@entry=0x555555d87f50,
>     size=size@entry=15) at block/qcow2-refcount.c:1118
> [Thread 0x7fffe25ed700 (LWP 32301) exited]
> [Thread 0x7fffe3cfe700 (LWP 32300) exited]
> qcow2_alloc_compressed_cluster_offset (bs=bs@entry=0x555555d87f50,
>     offset=offset@entry=17408, compressed_size=compressed_size@entry=15)
>     at block/qcow2-cluster.c:768
> 768        if (cluster_offset < 0) {
> Value returned is $5 = 3070
> 
> (gdb) n
> 773        nb_csectors = ((cluster_offset + compressed_size - 1) >> 9) -
> (gdb)
> 774                      (cluster_offset >> 9);
> (gdb)
> 773        nb_csectors = ((cluster_offset + compressed_size - 1) >> 9) -
> (gdb)
> 777                          ((uint64_t)nb_csectors << s->csize_shift);
> (gdb) l
> 772   
> 773        nb_csectors = ((cluster_offset + compressed_size - 1) >> 9) -
> 774                      (cluster_offset >> 9);
> 775   
> 776        cluster_offset |= QCOW_OFLAG_COMPRESSED |
> 777                          ((uint64_t)nb_csectors << s->csize_shift);
> 778   
> 779        /* update L2 table */
> 780   
> 781        /* compressed clusters never have the copied flag */
> (gdb) p nb_csectors
> $6 = 1
> (gdb) p s->csize_shift
> $7 = 61
> (gdb) p/x cluster_offset
> $8 = 0xbfe
> (gdb) n
> 776        cluster_offset |= QCOW_OFLAG_COMPRESSED |
> (gdb)
> 783        BLKDBG_EVENT(bs->file, BLKDBG_L2_UPDATE_COMPRESSED);
> (gdb) p/x cluster_offset
> $9 = 0x6000000000000bfe
> 
> Where is s->csize_shift initialized?  In qcow2_do_open():
> 
>     s->csize_shift = (62 - (s->cluster_bits - 8));
>     s->csize_mask = (1 << (s->cluster_bits - 8)) - 1;
>     s->cluster_offset_mask = (1LL << s->csize_shift) - 1;
> 
> Revisiting the wording in the spec:
> 
> Compressed Clusters Descriptor (x = 62 - (cluster_bits - 8)):
> 
>     Bit  0 -  x:    Host cluster offset. This is usually _not_ aligned to a
>                     cluster boundary!
> 
>        x+1 - 61:    Compressed size of the images in sectors of 512 bytes
> 
> which says bits 0-61 are the host cluster offset, and 62-61 is the
> number of sectors.  But our code sets s->csize_shift to divide this
> differently, at 0-60 and 61-61.  Which means your earlier claim that
> there are enough 'number sector' bits to allow for up to 2*cluster_size
> as the size of the compressed stream (rather than my claim of exactly
> cluster_size) is right, and other implementations CAN inflate a cluster
> (if we don't document otherwise), and that even if they DON'T inflate,
> they can STILL cause a read larger than a cluster size when the offset
> is near the tail of one sector (most likely at 512-byte clusters, but
> remotely possible at other cluster sizes as well).
> 

-- 
—js

next prev parent reply	other threads:[~2018-02-21 18:33 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-02-20 22:24 [Qemu-devel] [PATCH 0/2] qcow2: minor compression improvements Eric Blake
2018-02-20 22:24 ` [Qemu-devel] [PATCH 1/2] qcow2: Prefer byte-based calls into bs->file Eric Blake
2018-02-21  9:42   ` Alberto Garcia
2018-02-20 22:24 ` [Qemu-devel] [PATCH 2/2] qcow2: Avoid memory over-allocation on compressed images Eric Blake
2018-02-21 10:04   ` Alberto Garcia
2018-02-21 15:00     ` Eric Blake
2018-02-21 15:22       ` Alberto Garcia
2018-02-21 15:59       ` Eric Blake
2018-02-21 18:32         ` John Snow [this message]
2018-02-21 16:51   ` Kevin Wolf
2018-02-21 16:59     ` Eric Blake
2018-02-21 17:39       ` Kevin Wolf
2018-02-21 18:32         ` Eric Blake
2018-02-21 18:48           ` Kevin Wolf
2018-02-22 13:57       ` Alberto Garcia

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=83a5559e-1e15-ae68-7507-14778ad6e7f9@redhat.com \
    --to=jsnow@redhat.com \
    --cc=berto@igalia.com \
    --cc=eblake@redhat.com \
    --cc=kwolf@redhat.com \
    --cc=mreitz@redhat.com \
    --cc=qemu-block@nongnu.org \
    --cc=qemu-devel@nongnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).