From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([2001:4830:134:3::10]:36063) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1eoZCI-00021P-Ll for qemu-devel@nongnu.org; Wed, 21 Feb 2018 13:33:02 -0500 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1eoZCC-0006gA-F0 for qemu-devel@nongnu.org; Wed, 21 Feb 2018 13:32:54 -0500 References: <20180220222459.8461-1-eblake@redhat.com> <20180220222459.8461-3-eblake@redhat.com> <37fa8b17-116b-ea53-6f65-68412fdb813c@redhat.com> From: John Snow Message-ID: <83a5559e-1e15-ae68-7507-14778ad6e7f9@redhat.com> Date: Wed, 21 Feb 2018 13:32:34 -0500 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: quoted-printable Subject: Re: [Qemu-devel] [PATCH 2/2] qcow2: Avoid memory over-allocation on compressed images List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Eric Blake , Alberto Garcia , qemu-devel@nongnu.org Cc: Kevin Wolf , qemu-block@nongnu.org, mreitz@redhat.com On 02/21/2018 10:59 AM, Eric Blake wrote: > On 02/21/2018 09:00 AM, Eric Blake wrote: >> On 02/21/2018 04:04 AM, Alberto Garcia wrote: >>> On Tue 20 Feb 2018 11:24:59 PM CET, Eric Blake wrote: >>> >>> I was also preparing a patch to change this, but you arrived first :-= ) >>> >>>> So, it's time to cut back on the waste.=C2=A0 A compressed cluster >>>> will NEVER occupy more than an uncompressed cluster >=20 >=20 >> And here, csize can only get smaller than the length picked by >> nb_csectors.=C2=A0 Therefore, csize is GUARANTEED to be <=3D c->sector= _size. >> >>> >>> - Solution a: check that csize < s->cluster_size and return an error = if >>> =C2=A0=C2=A0 it's not. However! although QEMU won't produce an image = with a >>> =C2=A0=C2=A0 compressed cluster that is larger than the uncompressed = one, the >>> qcow2 >>> =C2=A0=C2=A0 on-disk format in principle allows for that, so arguably= we should >>> =C2=A0=C2=A0 accept it. >> >> No, the qcow2 on-disk format does not have enough bits reserved for >> that.=C2=A0 It is impossible to store an inflated cluster, because you >> don't have enough bits to record it. >=20 > Okay, the spec is WRONG, compared to our code base. >=20 >> >> That said, we MAY have a bug, more likely to be visible in 512-byte >> clusters but possible on any size.=C2=A0 While the 'number sectors' fi= eld >> IS sufficient for any compressed cluster starting at offset 0 in the >> cluster, we run into issues if the starting offset is later in the >> cluster.=C2=A0 That is, even though the compressed length of a 512-byt= e >> cluster is <=3D 512 (or we wouldn't compress it), if we start at offse= t >> 510, we NEED to read the next cluster to get the rest of the >> compressed stream - but at 512-byte clusters, there are 0 bits >> reserved for 'number sectors'.=C2=A0 Per the above calculations with t= he >> example offset of 510, nb_csectors would be 1 (it can't be anything >> else for a 512-byte cluster image), and csize would then be 2 bytes, >> which is insufficient for reading back enough data to reconstruct the >> cluster. >=20 > In fact, here's a demonstration of a discrepancy, where qemu-img and > John's qcheck tool [1] disagree about the validity of an image created > by qemu (although it may just be that qcheck hasn't yet learned about > compressed clusters): >=20 > [1]https://github.com/jnsnow/qcheck >=20 I wouldn't trust my tool's ability to understand compressed clusters :) I didn't get very far, though I did run across the fact that there appeared to be a discrepancy between the spec and the actual implementation, IIRC. It looked like you came to the same conclusion when you stepped through it manually. > $ f=3D12345678 > $ f=3D$f$f$f$f # 32 > $ f=3D$f$f$f$f # 128 > $ f=3D$f$f$f$f # 512 > $ f=3D$f$f$f$f # 2k > $ f=3D$f$f$f$f # 8k > $ f=3D$f$f$f$f # 32k > $ f=3D$f$f$f$f # 128k > $ printf "$f" > data > $ qemu-img convert -c -f raw -O qcow2 -o cluster_size=3D512 \ > =C2=A0 data data.qcow2 > $ qemu-img check data.qcow2 > No errors were found on the image. > 256/256 =3D 100.00% allocated, 100.00% fragmented, 100.00% compressed > clusters > Image end offset: 18944 > $ ./qcheck data.qcow2 > ... > =3D=3D L2 Tables =3D=3D >=20 > =3D=3D L2 cluster l1[0] : 0x0000000000000800 =3D=3D > Corrupt entries: 63; Non-zero entries: 64; Corrupt:Non-zero ratio: 0.98= 4375 > =3D=3D L2 cluster l1[1] : 0x0000000000000e00 =3D=3D > Corrupt entries: 63; Non-zero entries: 64; Corrupt:Non-zero ratio: 0.98= 4375 > =3D=3D L2 cluster l1[2] : 0x0000000000001400 =3D=3D > Corrupt entries: 63; Non-zero entries: 64; Corrupt:Non-zero ratio: 0.98= 4375 > =3D=3D L2 cluster l1[3] : 0x0000000000001a00 =3D=3D > Corrupt entries: 63; Non-zero entries: 64; Corrupt:Non-zero ratio: 0.98= 4375 > L2 tables: Could not complete analysis, 257 problems found >=20 >=20 > =3D=3D Reference Count Analysis =3D=3D >=20 > Refcount analysis: 00 vacant clusters > Refcount analysis: 04 leaked clusters > Refcount analysis: 00 ghost clusters > Refcount analysis: 04 miscounted clusters > Refcount analysis: 8 problems found >=20 >=20 > =3D=3D Cluster Counts =3D=3D >=20 > Metadata: 0x1000 > Data: 0x800 > Leaked: 0x800 > Vacant: 0x0 > total: 0x2000 > qcheck: 73 problems found >=20 >=20 >> >> Not true.=C2=A0 It is (cluster_bits - 9) or (cluster_size / 512).=C2=A0 >> Remember, x =3D 62 - (cluster_bits - 8); for a 512-byte cluster, x =3D >> 61.=C2=A0 The 'number sectors' field is then bits x+1 - 61 (but you ca= n't >> have a bitfield occupying bit 62 upto 61; especially since bit 62 is >> the bit for compressed cluster). >=20 > So instead of blindly reading the spec, I'm now going to single-steppin= g > through the 'qemu-img convert' command line above, to see what REALLY > happens: >=20 > Line numbers from commit a6e0344fa0 > $ gdb --args ./qemu-img convert -c -f raw -O qcow2 -o cluster_size=3D51= 2 > data data.qcow2 > ... > (gdb) b qcow2_alloc_bytes > Breakpoint 1 at 0x57610: file block/qcow2-refcount.c, line 1052. > (gdb) r > Thread 1 "qemu-img" hit Breakpoint 1, qcow2_alloc_bytes ( > =C2=A0=C2=A0=C2=A0 bs=3Dbs@entry=3D0x555555d87f50, size=3Dsize@entry=3D= 15) > =C2=A0=C2=A0=C2=A0 at block/qcow2-refcount.c:1052 > 1052=C2=A0=C2=A0=C2=A0 { > (gdb) >=20 > So we are compressing 512 bytes down to 15 every time, which means that > after 34 clusters compressed, we should be at offset 510.=C2=A0 Let's r= esume > debugging: >=20 > (gdb) c 34 > Will ignore next 33 crossings of breakpoint 1.=C2=A0 Continuing. > [Thread 0x7fffe3cfe700 (LWP 32229) exited] > [New Thread 0x7fffe3cfe700 (LWP 32300)] > [New Thread 0x7fffe25ed700 (LWP 32301)] >=20 > Thread 1 "qemu-img" hit Breakpoint 1, qcow2_alloc_bytes ( > =C2=A0=C2=A0=C2=A0 bs=3Dbs@entry=3D0x555555d87f50, size=3Dsize@entry=3D= 15) > =C2=A0=C2=A0=C2=A0 at block/qcow2-refcount.c:1052 > 1052=C2=A0=C2=A0=C2=A0 { > (gdb) n > 1053=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 BDRVQcow2State *s =3D bs= ->opaque; > (gdb) > 1058=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 BLKDBG_EVENT(bs->file, B= LKDBG_CLUSTER_ALLOC_BYTES); > (gdb) > 1059=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 assert(size > 0 && size = <=3D s->cluster_size); > (gdb) p s->free_byte_offset > $2 =3D 3070 > (gdb) p 3070%512 > $3 =3D 510 > ... > (gdb) > 1076=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 free_in_cluster =3D s->c= luster_size - offset_into_cluster(s, > offset); > (gdb) > 1078=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 = if (!offset || free_in_cluster < size) { > (gdb) p free_in_cluster > $4 =3D 2 > 1079=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0 int64_t new_cluster =3D alloc_clusters_noref(bs, > s->cluster_size); > (gdb) > 1080=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0 if (new_cluster < 0) { > (gdb) > 1084=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0 if (new_cluster =3D=3D 0) { > (gdb) > 1091=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0 if (!offset || ROUND_UP(offset, s->cluster_size) !=3D > new_cluster) { > (gdb) > 1095=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 free_in_cluster +=3D s->clu= ster_size; > (gdb) > 1099=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 = assert(offset); >=20 > so we got a contiguous cluster, and our goal is to let the caller bleed > the compressed cluster into to the tail of the current sector and into > the head of the next cluster.=C2=A0 Continuing: >=20 > (gdb) fin > Run till exit from #0=C2=A0 qcow2_alloc_bytes (bs=3Dbs@entry=3D0x555555= d87f50, > =C2=A0=C2=A0=C2=A0 size=3Dsize@entry=3D15) at block/qcow2-refcount.c:11= 18 > [Thread 0x7fffe25ed700 (LWP 32301) exited] > [Thread 0x7fffe3cfe700 (LWP 32300) exited] > qcow2_alloc_compressed_cluster_offset (bs=3Dbs@entry=3D0x555555d87f50, > =C2=A0=C2=A0=C2=A0 offset=3Doffset@entry=3D17408, compressed_size=3Dcom= pressed_size@entry=3D15) > =C2=A0=C2=A0=C2=A0 at block/qcow2-cluster.c:768 > 768=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 if (cluster_offset < 0) { > Value returned is $5 =3D 3070 >=20 > (gdb) n > 773=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 nb_csectors =3D ((cluster= _offset + compressed_size - 1) >> 9) - > (gdb) > 774=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 (cluster_offset= >> 9); > (gdb) > 773=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 nb_csectors =3D ((cluster= _offset + compressed_size - 1) >> 9) - > (gdb) > 777=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0 ((uint64_t)nb_csectors << s->csize_shift); > (gdb) l > 772=C2=A0=C2=A0=C2=A0 > 773=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 nb_csectors =3D ((cluster= _offset + compressed_size - 1) >> 9) - > 774=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 (cluster_offset= >> 9); > 775=C2=A0=C2=A0=C2=A0 > 776=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 cluster_offset |=3D QCOW_= OFLAG_COMPRESSED | > 777=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0 ((uint64_t)nb_csectors << s->csize_shift); > 778=C2=A0=C2=A0=C2=A0 > 779=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 /* update L2 table */ > 780=C2=A0=C2=A0=C2=A0 > 781=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 /* compressed clusters ne= ver have the copied flag */ > (gdb) p nb_csectors > $6 =3D 1 > (gdb) p s->csize_shift > $7 =3D 61 > (gdb) p/x cluster_offset > $8 =3D 0xbfe > (gdb) n > 776=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 cluster_offset |=3D QCOW_= OFLAG_COMPRESSED | > (gdb) > 783=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 BLKDBG_EVENT(bs->file, BL= KDBG_L2_UPDATE_COMPRESSED); > (gdb) p/x cluster_offset > $9 =3D 0x6000000000000bfe >=20 > Where is s->csize_shift initialized?=C2=A0 In qcow2_do_open(): >=20 > =C2=A0=C2=A0=C2=A0 s->csize_shift =3D (62 - (s->cluster_bits - 8)); > =C2=A0=C2=A0=C2=A0 s->csize_mask =3D (1 << (s->cluster_bits - 8)) - 1; > =C2=A0=C2=A0=C2=A0 s->cluster_offset_mask =3D (1LL << s->csize_shift) -= 1; >=20 > Revisiting the wording in the spec: >=20 > Compressed Clusters Descriptor (x =3D 62 - (cluster_bits - 8)): >=20 > =C2=A0=C2=A0=C2=A0 Bit=C2=A0 0 -=C2=A0 x:=C2=A0=C2=A0=C2=A0 Host cluste= r offset. This is usually _not_ aligned to a > =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 cluster boundary! >=20 > =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 x+1 - 61:=C2=A0=C2=A0=C2=A0 Compre= ssed size of the images in sectors of 512 bytes >=20 > which says bits 0-61 are the host cluster offset, and 62-61 is the > number of sectors.=C2=A0 But our code sets s->csize_shift to divide thi= s > differently, at 0-60 and 61-61.=C2=A0 Which means your earlier claim th= at > there are enough 'number sector' bits to allow for up to 2*cluster_size > as the size of the compressed stream (rather than my claim of exactly > cluster_size) is right, and other implementations CAN inflate a cluster > (if we don't document otherwise), and that even if they DON'T inflate, > they can STILL cause a read larger than a cluster size when the offset > is near the tail of one sector (most likely at 512-byte clusters, but > remotely possible at other cluster sizes as well). >=20 --=20 =E2=80=94js