From: John Snow <jsnow@redhat.com>
To: Eric Blake <eblake@redhat.com>, Alberto Garcia <berto@igalia.com>,
qemu-devel@nongnu.org
Cc: Kevin Wolf <kwolf@redhat.com>, qemu-block@nongnu.org, mreitz@redhat.com
Subject: Re: [Qemu-devel] [PATCH 2/2] qcow2: Avoid memory over-allocation on compressed images
Date: Wed, 21 Feb 2018 13:32:34 -0500 [thread overview]
Message-ID: <83a5559e-1e15-ae68-7507-14778ad6e7f9@redhat.com> (raw)
In-Reply-To: <e2247007-78c1-9f34-f0c9-cc689111f895@redhat.com>
On 02/21/2018 10:59 AM, Eric Blake wrote:
> On 02/21/2018 09:00 AM, Eric Blake wrote:
>> On 02/21/2018 04:04 AM, Alberto Garcia wrote:
>>> On Tue 20 Feb 2018 11:24:59 PM CET, Eric Blake wrote:
>>>
>>> I was also preparing a patch to change this, but you arrived first :-)
>>>
>>>> So, it's time to cut back on the waste. A compressed cluster
>>>> will NEVER occupy more than an uncompressed cluster
>
>
>> And here, csize can only get smaller than the length picked by
>> nb_csectors. Therefore, csize is GUARANTEED to be <= c->sector_size.
>>
>>>
>>> - Solution a: check that csize < s->cluster_size and return an error if
>>> it's not. However! although QEMU won't produce an image with a
>>> compressed cluster that is larger than the uncompressed one, the
>>> qcow2
>>> on-disk format in principle allows for that, so arguably we should
>>> accept it.
>>
>> No, the qcow2 on-disk format does not have enough bits reserved for
>> that. It is impossible to store an inflated cluster, because you
>> don't have enough bits to record it.
>
> Okay, the spec is WRONG, compared to our code base.
>
>>
>> That said, we MAY have a bug, more likely to be visible in 512-byte
>> clusters but possible on any size. While the 'number sectors' field
>> IS sufficient for any compressed cluster starting at offset 0 in the
>> cluster, we run into issues if the starting offset is later in the
>> cluster. That is, even though the compressed length of a 512-byte
>> cluster is <= 512 (or we wouldn't compress it), if we start at offset
>> 510, we NEED to read the next cluster to get the rest of the
>> compressed stream - but at 512-byte clusters, there are 0 bits
>> reserved for 'number sectors'. Per the above calculations with the
>> example offset of 510, nb_csectors would be 1 (it can't be anything
>> else for a 512-byte cluster image), and csize would then be 2 bytes,
>> which is insufficient for reading back enough data to reconstruct the
>> cluster.
>
> In fact, here's a demonstration of a discrepancy, where qemu-img and
> John's qcheck tool [1] disagree about the validity of an image created
> by qemu (although it may just be that qcheck hasn't yet learned about
> compressed clusters):
>
> [1]https://github.com/jnsnow/qcheck
>
I wouldn't trust my tool's ability to understand compressed clusters :)
I didn't get very far, though I did run across the fact that there
appeared to be a discrepancy between the spec and the actual
implementation, IIRC.
It looked like you came to the same conclusion when you stepped through
it manually.
> $ f=12345678
> $ f=$f$f$f$f # 32
> $ f=$f$f$f$f # 128
> $ f=$f$f$f$f # 512
> $ f=$f$f$f$f # 2k
> $ f=$f$f$f$f # 8k
> $ f=$f$f$f$f # 32k
> $ f=$f$f$f$f # 128k
> $ printf "$f" > data
> $ qemu-img convert -c -f raw -O qcow2 -o cluster_size=512 \
> data data.qcow2
> $ qemu-img check data.qcow2
> No errors were found on the image.
> 256/256 = 100.00% allocated, 100.00% fragmented, 100.00% compressed
> clusters
> Image end offset: 18944
> $ ./qcheck data.qcow2
> ...
> == L2 Tables ==
>
> == L2 cluster l1[0] : 0x0000000000000800 ==
> Corrupt entries: 63; Non-zero entries: 64; Corrupt:Non-zero ratio: 0.984375
> == L2 cluster l1[1] : 0x0000000000000e00 ==
> Corrupt entries: 63; Non-zero entries: 64; Corrupt:Non-zero ratio: 0.984375
> == L2 cluster l1[2] : 0x0000000000001400 ==
> Corrupt entries: 63; Non-zero entries: 64; Corrupt:Non-zero ratio: 0.984375
> == L2 cluster l1[3] : 0x0000000000001a00 ==
> Corrupt entries: 63; Non-zero entries: 64; Corrupt:Non-zero ratio: 0.984375
> L2 tables: Could not complete analysis, 257 problems found
>
>
> == Reference Count Analysis ==
>
> Refcount analysis: 00 vacant clusters
> Refcount analysis: 04 leaked clusters
> Refcount analysis: 00 ghost clusters
> Refcount analysis: 04 miscounted clusters
> Refcount analysis: 8 problems found
>
>
> == Cluster Counts ==
>
> Metadata: 0x1000
> Data: 0x800
> Leaked: 0x800
> Vacant: 0x0
> total: 0x2000
> qcheck: 73 problems found
>
>
>>
>> Not true. It is (cluster_bits - 9) or (cluster_size / 512).
>> Remember, x = 62 - (cluster_bits - 8); for a 512-byte cluster, x =
>> 61. The 'number sectors' field is then bits x+1 - 61 (but you can't
>> have a bitfield occupying bit 62 upto 61; especially since bit 62 is
>> the bit for compressed cluster).
>
> So instead of blindly reading the spec, I'm now going to single-stepping
> through the 'qemu-img convert' command line above, to see what REALLY
> happens:
>
> Line numbers from commit a6e0344fa0
> $ gdb --args ./qemu-img convert -c -f raw -O qcow2 -o cluster_size=512
> data data.qcow2
> ...
> (gdb) b qcow2_alloc_bytes
> Breakpoint 1 at 0x57610: file block/qcow2-refcount.c, line 1052.
> (gdb) r
> Thread 1 "qemu-img" hit Breakpoint 1, qcow2_alloc_bytes (
> bs=bs@entry=0x555555d87f50, size=size@entry=15)
> at block/qcow2-refcount.c:1052
> 1052 {
> (gdb)
>
> So we are compressing 512 bytes down to 15 every time, which means that
> after 34 clusters compressed, we should be at offset 510. Let's resume
> debugging:
>
> (gdb) c 34
> Will ignore next 33 crossings of breakpoint 1. Continuing.
> [Thread 0x7fffe3cfe700 (LWP 32229) exited]
> [New Thread 0x7fffe3cfe700 (LWP 32300)]
> [New Thread 0x7fffe25ed700 (LWP 32301)]
>
> Thread 1 "qemu-img" hit Breakpoint 1, qcow2_alloc_bytes (
> bs=bs@entry=0x555555d87f50, size=size@entry=15)
> at block/qcow2-refcount.c:1052
> 1052 {
> (gdb) n
> 1053 BDRVQcow2State *s = bs->opaque;
> (gdb)
> 1058 BLKDBG_EVENT(bs->file, BLKDBG_CLUSTER_ALLOC_BYTES);
> (gdb)
> 1059 assert(size > 0 && size <= s->cluster_size);
> (gdb) p s->free_byte_offset
> $2 = 3070
> (gdb) p 3070%512
> $3 = 510
> ...
> (gdb)
> 1076 free_in_cluster = s->cluster_size - offset_into_cluster(s,
> offset);
> (gdb)
> 1078 if (!offset || free_in_cluster < size) {
> (gdb) p free_in_cluster
> $4 = 2
> 1079 int64_t new_cluster = alloc_clusters_noref(bs,
> s->cluster_size);
> (gdb)
> 1080 if (new_cluster < 0) {
> (gdb)
> 1084 if (new_cluster == 0) {
> (gdb)
> 1091 if (!offset || ROUND_UP(offset, s->cluster_size) !=
> new_cluster) {
> (gdb)
> 1095 free_in_cluster += s->cluster_size;
> (gdb)
> 1099 assert(offset);
>
> so we got a contiguous cluster, and our goal is to let the caller bleed
> the compressed cluster into to the tail of the current sector and into
> the head of the next cluster. Continuing:
>
> (gdb) fin
> Run till exit from #0 qcow2_alloc_bytes (bs=bs@entry=0x555555d87f50,
> size=size@entry=15) at block/qcow2-refcount.c:1118
> [Thread 0x7fffe25ed700 (LWP 32301) exited]
> [Thread 0x7fffe3cfe700 (LWP 32300) exited]
> qcow2_alloc_compressed_cluster_offset (bs=bs@entry=0x555555d87f50,
> offset=offset@entry=17408, compressed_size=compressed_size@entry=15)
> at block/qcow2-cluster.c:768
> 768 if (cluster_offset < 0) {
> Value returned is $5 = 3070
>
> (gdb) n
> 773 nb_csectors = ((cluster_offset + compressed_size - 1) >> 9) -
> (gdb)
> 774 (cluster_offset >> 9);
> (gdb)
> 773 nb_csectors = ((cluster_offset + compressed_size - 1) >> 9) -
> (gdb)
> 777 ((uint64_t)nb_csectors << s->csize_shift);
> (gdb) l
> 772
> 773 nb_csectors = ((cluster_offset + compressed_size - 1) >> 9) -
> 774 (cluster_offset >> 9);
> 775
> 776 cluster_offset |= QCOW_OFLAG_COMPRESSED |
> 777 ((uint64_t)nb_csectors << s->csize_shift);
> 778
> 779 /* update L2 table */
> 780
> 781 /* compressed clusters never have the copied flag */
> (gdb) p nb_csectors
> $6 = 1
> (gdb) p s->csize_shift
> $7 = 61
> (gdb) p/x cluster_offset
> $8 = 0xbfe
> (gdb) n
> 776 cluster_offset |= QCOW_OFLAG_COMPRESSED |
> (gdb)
> 783 BLKDBG_EVENT(bs->file, BLKDBG_L2_UPDATE_COMPRESSED);
> (gdb) p/x cluster_offset
> $9 = 0x6000000000000bfe
>
> Where is s->csize_shift initialized? In qcow2_do_open():
>
> s->csize_shift = (62 - (s->cluster_bits - 8));
> s->csize_mask = (1 << (s->cluster_bits - 8)) - 1;
> s->cluster_offset_mask = (1LL << s->csize_shift) - 1;
>
> Revisiting the wording in the spec:
>
> Compressed Clusters Descriptor (x = 62 - (cluster_bits - 8)):
>
> Bit 0 - x: Host cluster offset. This is usually _not_ aligned to a
> cluster boundary!
>
> x+1 - 61: Compressed size of the images in sectors of 512 bytes
>
> which says bits 0-61 are the host cluster offset, and 62-61 is the
> number of sectors. But our code sets s->csize_shift to divide this
> differently, at 0-60 and 61-61. Which means your earlier claim that
> there are enough 'number sector' bits to allow for up to 2*cluster_size
> as the size of the compressed stream (rather than my claim of exactly
> cluster_size) is right, and other implementations CAN inflate a cluster
> (if we don't document otherwise), and that even if they DON'T inflate,
> they can STILL cause a read larger than a cluster size when the offset
> is near the tail of one sector (most likely at 512-byte clusters, but
> remotely possible at other cluster sizes as well).
>
--
—js
next prev parent reply other threads:[~2018-02-21 18:33 UTC|newest]
Thread overview: 15+ messages / expand[flat|nested] mbox.gz Atom feed top
2018-02-20 22:24 [Qemu-devel] [PATCH 0/2] qcow2: minor compression improvements Eric Blake
2018-02-20 22:24 ` [Qemu-devel] [PATCH 1/2] qcow2: Prefer byte-based calls into bs->file Eric Blake
2018-02-21 9:42 ` Alberto Garcia
2018-02-20 22:24 ` [Qemu-devel] [PATCH 2/2] qcow2: Avoid memory over-allocation on compressed images Eric Blake
2018-02-21 10:04 ` Alberto Garcia
2018-02-21 15:00 ` Eric Blake
2018-02-21 15:22 ` Alberto Garcia
2018-02-21 15:59 ` Eric Blake
2018-02-21 18:32 ` John Snow [this message]
2018-02-21 16:51 ` Kevin Wolf
2018-02-21 16:59 ` Eric Blake
2018-02-21 17:39 ` Kevin Wolf
2018-02-21 18:32 ` Eric Blake
2018-02-21 18:48 ` Kevin Wolf
2018-02-22 13:57 ` Alberto Garcia
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=83a5559e-1e15-ae68-7507-14778ad6e7f9@redhat.com \
--to=jsnow@redhat.com \
--cc=berto@igalia.com \
--cc=eblake@redhat.com \
--cc=kwolf@redhat.com \
--cc=mreitz@redhat.com \
--cc=qemu-block@nongnu.org \
--cc=qemu-devel@nongnu.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).