From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from mailman by lists.gnu.org with tmda-scanned (Exim 4.43)
	id 1O6Osh-00054u-G2
	for qemu-devel@nongnu.org; Mon, 26 Apr 2010 10:01:51 -0400
Received: from [140.186.70.92] (port=35428 helo=eggs.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43) id 1O6Osf-000548-4r
	for qemu-devel@nongnu.org; Mon, 26 Apr 2010 10:01:50 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.69)
	(envelope-from <stefanha@gmail.com>) id 1O6Osc-0005R9-I5
	for qemu-devel@nongnu.org; Mon, 26 Apr 2010 10:01:48 -0400
Received: from mail-bw0-f209.google.com ([209.85.218.209]:52576)
	by eggs.gnu.org with esmtp (Exim 4.69)
	(envelope-from <stefanha@gmail.com>) id 1O6Osc-0005Qi-9h
	for qemu-devel@nongnu.org; Mon, 26 Apr 2010 10:01:46 -0400
Received: by bwz1 with SMTP id 1so15869467bwz.2
	for <qemu-devel@nongnu.org>; Mon, 26 Apr 2010 07:01:44 -0700 (PDT)
MIME-Version: 1.0
In-Reply-To: <4BD57D31.6080508@redhat.com>
References: <1272096733-6070-1-git-send-email-stefanha@linux.vnet.ibm.com>
	<1272096733-6070-2-git-send-email-stefanha@linux.vnet.ibm.com>
	<4BD57D31.6080508@redhat.com>
Date: Mon, 26 Apr 2010 15:01:42 +0100
Message-ID: <i2sfbd9d3991004260701t82497a3z1c61edbdfd8dbf55@mail.gmail.com>
Subject: Re: [Qemu-devel] Re: [PATCH 2/2] qcow2: Implement bdrv_truncate() for
	growing images
From: Stefan Hajnoczi <stefanha@gmail.com>
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
List-Id: qemu-devel.nongnu.org
List-Unsubscribe: <http://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.gnu.org/pipermail/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <http://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: Kevin Wolf <kwolf@redhat.com>
Cc: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>, qemu-devel@nongnu.org

On Mon, Apr 26, 2010 at 12:46 PM, Kevin Wolf <kwolf@redhat.com> wrote:
> Am 24.04.2010 10:12, schrieb Stefan Hajnoczi:
>> This patch adds the ability to grow qcow2 images in-place using
>> bdrv_truncate(). =A0This enables qemu-img resize command support for
>> qcow2.
>>
>> Snapshots are not supported and bdrv_truncate() will return -ENOTSUP.
>> The notion of resizing an image with snapshots could lead to confusion:
>> users may expect snapshots to remain unchanged, but this is not possible
>> with the current qcow2 on-disk format where the header.size field is
>> global instead of per-snapshot. =A0Others may expect snapshots to change
>> size along with the current image data. =A0I think it is safest to not
>> support snapshots and perhaps add behavior later if there is a
>> consensus.
>
> I think it would make most sense if we kept the disk size per snapshot
> (so your first option). The snapshot header is extensible, so it should
> be possible.
>
> Just disabling it is okay with me, too, but in that case this patch is
> much more complicated than it needs to be: If there are no snapshots,
> there is no vmstate area.

Good point, things are simplified greatly if there is no vm state
data.  I haven't read the save/load vm yet and failed to make the
connection here ;).

I will eliminate most of the code as you explained and send a v2
patch.  I have responded below because I hope to learn more by
discussing the comments with you, but most of the issues will not
apply to a slimmed down v2 so feel free to ignore if you are busy.

>> Backing images continue to work. =A0If the image is now larger than its
>> backing image, zeroes are read when accessing beyond the end of the
>> backing image.
>>
>> Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
>> ---
>> This applies to kevin/block.
>>
>> =A0block/qcow2-cluster.c =A0| =A0 64 +++++++++++++++++++++++++++++++++++=
+++---------
>> =A0block/qcow2-snapshot.c | =A0 =A02 +-
>> =A0block/qcow2.c =A0 =A0 =A0 =A0 =A0| =A0 43 +++++++++++++++++++++++++++=
++---
>> =A0block/qcow2.h =A0 =A0 =A0 =A0 =A0| =A0 =A09 ++++++-
>> =A04 files changed, 99 insertions(+), 19 deletions(-)
>>
>> diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
>> index c11680d..20c8426 100644
>> --- a/block/qcow2-cluster.c
>> +++ b/block/qcow2-cluster.c
>> @@ -28,30 +28,39 @@
>> =A0#include "block_int.h"
>> =A0#include "block/qcow2.h"
>>
>> -int qcow2_grow_l1_table(BlockDriverState *bs, int min_size)
>> +/*
>> + * qcow2_grow_l1_table_common
>> + *
>> + * Grows the L1 table and updates the header on disk.
>> + *
>> + * Setting new_l1_vm_state_index to s->l1_vm_state_index grows the vm s=
tate
>> + * area.
>> + *
>> + * Setting new_l1_vm_state_index to the new end of image grows the imag=
e data
>> + * area.
>
> I don't understand this distinction. I'm sure you describe something
> trivial here, but it sounds like magic.
>
> There's really not much difference between data and vmstate clusters and
> callers don't know on which of both they are working.

The new_l1_vm_state_index argument is used to position the vm state
data in the new L1 table.  If we're growing the L1 table to make room
for more vm state data, then new_l1_vm_state_index should keep its old
value (s->l1_vm_state_index).  If we're growing the image data and
need to move the vm state out of the way, then new_l1_vm_state_index
should be recalculated for the new image size.

The function allocates the new L1 table and performs two separate
copies: one of the image data L1 entries and one for the vm state L1
entries.  The new_l1_vm_state_index argument controls where the vm
state data entries are copied; it provides the ability the move the vm
state data to make room for new image data entries.

>> + *
>> + * Returns 0 on success, -errno in failure case.
>> + */
>> +static int qcow2_grow_l1_table_common(BlockDriverState *bs,
>> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =
=A0 =A0int new_l1_vm_state_index,
>> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =
=A0 =A0int new_l1_size)
>> =A0{
>> =A0 =A0 =A0BDRVQcowState *s =3D bs->opaque;
>> - =A0 =A0int new_l1_size, new_l1_size2, ret, i;
>> + =A0 =A0int new_l1_size2, ret, i;
>> =A0 =A0 =A0uint64_t *new_l1_table;
>> =A0 =A0 =A0int64_t new_l1_table_offset;
>> =A0 =A0 =A0uint8_t data[12];
>>
>> - =A0 =A0new_l1_size =3D s->l1_size;
>> - =A0 =A0if (min_size <=3D new_l1_size)
>> - =A0 =A0 =A0 =A0return 0;
>> - =A0 =A0if (new_l1_size =3D=3D 0) {
>> - =A0 =A0 =A0 =A0new_l1_size =3D 1;
>> - =A0 =A0}
>> - =A0 =A0while (min_size > new_l1_size) {
>> - =A0 =A0 =A0 =A0new_l1_size =3D (new_l1_size * 3 + 1) / 2;
>> - =A0 =A0}
>> =A0#ifdef DEBUG_ALLOC2
>> =A0 =A0 =A0printf("grow l1_table from %d to %d\n", s->l1_size, new_l1_si=
ze);
>> =A0#endif
>>
>> =A0 =A0 =A0new_l1_size2 =3D sizeof(uint64_t) * new_l1_size;
>> =A0 =A0 =A0new_l1_table =3D qemu_mallocz(align_offset(new_l1_size2, 512)=
);
>> - =A0 =A0memcpy(new_l1_table, s->l1_table, s->l1_size * sizeof(uint64_t)=
);
>> + =A0 =A0memcpy(new_l1_table, s->l1_table, s->l1_vm_state_index * sizeof=
(uint64_t));
>
> This change smells like a buffer overflow. It both can read beyond the
> end of the old L1 table and can overflow the new one if the L1 table
> hasn't yet reached its full size.
>
> I think this is the main problem of this patch: You seem to assume that
> the L1 table is created with the full size needed to cover all of the
> data area. Which seems to be true at least for images created by
> qemu-img, but I don't think we can rely on it (nor do I think we should,
> because it introduces a special case where there is none).
>
> I rather consider this preallocation a performance optimization. If we
> stop treating it as such, we would need to refuse opening any images
> that don't fulfill this requirement.

>>From qcow_open():

/* read the level 1 table */
s->l1_size =3D header.l1_size;
shift =3D s->cluster_bits + s->l2_bits;
s->l1_vm_state_index =3D (header.size + (1LL << shift) - 1) >> shift;
/* the L1 table must contain at least enough entries to put
   header.size bytes */
if (s->l1_size < s->l1_vm_state_index)
    goto fail;

Images where L1 size is not at least l1_vm_state_index cannot be
opened by QEMU.  Right now the special case exists and perhaps some
qcow2 code already relies on this assumption.

>> + =A0 =A0memcpy(&new_l1_table[new_l1_vm_state_index],
>> + =A0 =A0 =A0 =A0 =A0 &s->l1_table[s->l1_vm_state_index],
>> + =A0 =A0 =A0 =A0 =A0 (s->l1_size - s->l1_vm_state_index) * sizeof(uint6=
4_t));
>>
>> =A0 =A0 =A0/* write new table (align to cluster) */
>> =A0 =A0 =A0BLKDBG_EVENT(bs->file, BLKDBG_L1_GROW_ALLOC_TABLE);
>> @@ -83,6 +92,7 @@ int qcow2_grow_l1_table(BlockDriverState *bs, int min_=
size)
>> =A0 =A0 =A0s->l1_table_offset =3D new_l1_table_offset;
>> =A0 =A0 =A0s->l1_table =3D new_l1_table;
>> =A0 =A0 =A0s->l1_size =3D new_l1_size;
>> + =A0 =A0s->l1_vm_state_index =3D new_l1_vm_state_index;
>> =A0 =A0 =A0return 0;
>> =A0 fail:
>> =A0 =A0 =A0qemu_free(new_l1_table);
>> @@ -90,6 +100,34 @@ int qcow2_grow_l1_table(BlockDriverState *bs, int mi=
n_size)
>> =A0 =A0 =A0return ret < 0 ? ret : -EIO;
>> =A0}
>>
>> +int qcow2_grow_l1_vm_state(BlockDriverState *bs, int min_size)
>> +{
>> + =A0 =A0BDRVQcowState *s =3D bs->opaque;
>> + =A0 =A0int new_l1_size;
>> +
>> + =A0 =A0new_l1_size =3D s->l1_size;
>> + =A0 =A0if (min_size <=3D new_l1_size)
>> + =A0 =A0 =A0 =A0return 0;
>> + =A0 =A0if (new_l1_size =3D=3D 0) {
>> + =A0 =A0 =A0 =A0new_l1_size =3D 1;
>> + =A0 =A0}
>> + =A0 =A0while (min_size > new_l1_size) {
>> + =A0 =A0 =A0 =A0new_l1_size =3D (new_l1_size * 3 + 1) / 2;
>> + =A0 =A0}
>> +
>> + =A0 =A0return qcow2_grow_l1_table_common(bs, s->l1_vm_state_index, new=
_l1_size);
>> +}
>> +
>> +int qcow2_grow_l1_image_data(BlockDriverState *bs, int new_l1_size)
>> +{
>> + =A0 =A0BDRVQcowState *s =3D bs->opaque;
>> + =A0 =A0int new_l1_vm_state_index;
>> +
>> + =A0 =A0new_l1_vm_state_index =3D new_l1_size;
>> + =A0 =A0new_l1_size +=3D s->l1_size - s->l1_vm_state_index;
>
> Same assumption as above. This might turn into a negative number added
> to new_l1_size.
>
>> + =A0 =A0return qcow2_grow_l1_table_common(bs, new_l1_vm_state_index, ne=
w_l1_size);
>> +}
>> +
>> =A0void qcow2_l2_cache_reset(BlockDriverState *bs)
>> =A0{
>> =A0 =A0 =A0BDRVQcowState *s =3D bs->opaque;
>> @@ -524,7 +562,7 @@ static int get_cluster_table(BlockDriverState *bs, u=
int64_t offset,
>>
>> =A0 =A0 =A0l1_index =3D offset >> (s->l2_bits + s->cluster_bits);
>> =A0 =A0 =A0if (l1_index >=3D s->l1_size) {qcow2_grow_l1_vm_state(
>> - =A0 =A0 =A0 =A0ret =3D qcow2_grow_l1_table(bs, l1_index + 1);
>> + =A0 =A0 =A0 =A0ret =3D qcow2_grow_l1_vm_state(bs, l1_index + 1);
>
> Once you agree that get_cluster_table can't know if it's working on data
> or vmstate, qcow2_grow_l1_vm_state is a misnomer at least.
>
>> =A0 =A0 =A0 =A0 =A0if (ret < 0) {
>> =A0 =A0 =A0 =A0 =A0 =A0 =A0return ret;
>> =A0 =A0 =A0 =A0 =A0}
>> diff --git a/block/qcow2-snapshot.c b/block/qcow2-snapshot.c
>> index 2a21c17..7f0d810 100644
>> --- a/block/qcow2-snapshot.c
>> +++ b/block/qcow2-snapshot.c
>> @@ -326,7 +326,7 @@ int qcow2_snapshot_goto(BlockDriverState *bs, const =
char *snapshot_id)
>> =A0 =A0 =A0if (qcow2_update_snapshot_refcount(bs, s->l1_table_offset, s-=
>l1_size, -1) < 0)
>> =A0 =A0 =A0 =A0 =A0goto fail;
>>
>> - =A0 =A0if (qcow2_grow_l1_table(bs, sn->l1_size) < 0)
>> + =A0 =A0if (qcow2_grow_l1_vm_state(bs, sn->l1_size) < 0)
>> =A0 =A0 =A0 =A0 =A0goto fail;
>>
>> =A0 =A0 =A0s->l1_size =3D sn->l1_size;
>> diff --git a/block/qcow2.c b/block/qcow2.c
>> index 4a7ab66..ab622a2 100644
>> --- a/block/qcow2.c
>> +++ b/block/qcow2.c
>> @@ -140,7 +140,7 @@ static int qcow_read_extensions(BlockDriverState *bs=
, uint64_t start_offset,
>> =A0static int qcow_open(BlockDriverState *bs, int flags)
>> =A0{
>> =A0 =A0 =A0BDRVQcowState *s =3D bs->opaque;
>> - =A0 =A0int len, i, shift;
>> + =A0 =A0int len, i;
>> =A0 =A0 =A0QCowHeader header;
>> =A0 =A0 =A0uint64_t ext_end;
>>
>> @@ -188,8 +188,7 @@ static int qcow_open(BlockDriverState *bs, int flags=
)
>>
>> =A0 =A0 =A0/* read the level 1 table */
>> =A0 =A0 =A0s->l1_size =3D header.l1_size;
>> - =A0 =A0shift =3D s->cluster_bits + s->l2_bits;
>> - =A0 =A0s->l1_vm_state_index =3D (header.size + (1LL << shift) - 1) >> =
shift;
>> + =A0 =A0s->l1_vm_state_index =3D size_to_l1(s, header.size);
>> =A0 =A0 =A0/* the L1 table must contain at least enough entries to put
>> =A0 =A0 =A0 =A0 header.size bytes */
>> =A0 =A0 =A0if (s->l1_size < s->l1_vm_state_index)
>> @@ -1095,6 +1094,40 @@ static int qcow_make_empty(BlockDriverState *bs)
>> =A0 =A0 =A0return 0;
>> =A0}
>>
>> +static int qcow2_truncate(BlockDriverState *bs, int64_t offset)
>> +{
>> + =A0 =A0BDRVQcowState *s =3D bs->opaque;
>> + =A0 =A0int ret, new_l1_size;
>> +
>> + =A0 =A0if (offset & 511) {
>> + =A0 =A0 =A0 =A0return -EINVAL;
>> + =A0 =A0}
>> +
>> + =A0 =A0/* cannot proceed if image has snapshots */
>> + =A0 =A0if (s->nb_snapshots) {
>> + =A0 =A0 =A0 =A0return -ENOTSUP;
>> + =A0 =A0}
>> +
>> + =A0 =A0/* shrinking is currently not supported */
>> + =A0 =A0if (offset < bs->total_sectors << BDRV_SECTOR_BITS) {
>> + =A0 =A0 =A0 =A0return -ENOTSUP;
>> + =A0 =A0}
>> +
>> + =A0 =A0new_l1_size =3D size_to_l1(s, offset);
>> + =A0 =A0ret =3D qcow2_grow_l1_image_data(bs, new_l1_size);
>> + =A0 =A0if (ret < 0) {
>> + =A0 =A0 =A0 =A0return ret;
>> + =A0 =A0}
>
> Instead of the confusing changes above you could just increase the L1
> table size using the old function and keep the data/vmstate thing local
> to qcow2_truncate (or maybe better a qcow2_move_vmstate(bs, offset)
> which internally grows the L1 table as needed)
>
> Actually, I think this is not that different from your patch (you called
> the almost same function qcow2_grow_l1_image_data and avoided the normal
> calculation of the next L1 table size for some reason), but probably a
> lot less confusing.

I like the qcow2_move_vmstate() name.  It is clearer than
qcow2_grow_l1_image_data().

If I understand correctly, the next L1 table size calculation tries to
grow the table in steps large enough so that the grow operation does
not need to be performed frequently.  This makes sense when appending
arbitrary vm state data, but the truncate operation knows the exact
new size of the image and doesn't need to speculatively allocate more
L1 table space.

>> +
>> + =A0 =A0/* write updated header.size */
>> + =A0 =A0offset =3D cpu_to_be64(offset);
>> + =A0 =A0if (bdrv_pwrite(bs->file, offsetof(QCowHeader, size), &offset,
>> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0sizeof(uint64_t)) !=3D sizeof(u=
int64_t)) {
>> + =A0 =A0 =A0 =A0return -EIO;
>> + =A0 =A0}
>
> The vmstate_offset field needs to be updated somewhere, too. In my
> suggestion this would be in qcow2_move_vmstate.

Which field (I can't find a vmstate_offset field)?  The patch updates
s->l1_vm_state_index in qcow2_grow_l1_table_common(), is that the one
you mean?

Stefan