From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from eggs.gnu.org ([140.186.70.92]:33075)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <freddy77@gmail.com>) id 1QbUkh-00035j-2E
	for qemu-devel@nongnu.org; Tue, 28 Jun 2011 05:38:40 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <freddy77@gmail.com>) id 1QbUkf-00022V-3u
	for qemu-devel@nongnu.org; Tue, 28 Jun 2011 05:38:38 -0400
Received: from mail-yx0-f173.google.com ([209.85.213.173]:51588)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <freddy77@gmail.com>) id 1QbUke-00022N-P3
	for qemu-devel@nongnu.org; Tue, 28 Jun 2011 05:38:36 -0400
Received: by yxt3 with SMTP id 3so8685yxt.4
	for <qemu-devel@nongnu.org>; Tue, 28 Jun 2011 02:38:35 -0700 (PDT)
MIME-Version: 1.0
In-Reply-To: <1309187514-26562-1-git-send-email-kwolf@redhat.com>
References: <1309187514-26562-1-git-send-email-kwolf@redhat.com>
Date: Tue, 28 Jun 2011 11:38:35 +0200
Message-ID: <BANLkTikeT_xsD-jE-V0GCXVE0U-a0AKemA@mail.gmail.com>
From: Frediano Ziglio <freddy77@gmail.com>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
Subject: Re: [Qemu-devel] [RFC PATCH v2] Specification for qcow2 version 3
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: Kevin Wolf <kwolf@redhat.com>
Cc: ctang@us.ibm.com, stefanha@gmail.com, hch@lst.de, qemu-devel@nongnu.org, avi@redhat.com

2011/6/27 Kevin Wolf <kwolf@redhat.com>:
> This is the second draft for what I think could be added when we increase qcow2's
> version number to 3. This includes points that have been made by several people
> over the past few months. We're probably not going to implement this next week,
> but I think it's important to get discussions started early, so here it is.
>
> Changes implemented in this RFC:
>
> - Added compatible/incompatible/auto-clear feature bits plus an optional
>   feature name table to allow useful error messages even if an older version
>   doesn't know some feature at all.
>
> - Added a dirty flag which tells that the refcount may not be accurate ("QED
>   mode"). This means that we can save writes to the refcount table with
>   cache=writethrough, but isn't really useful otherwise since Qcow2Cache.
>
> - Configurable refcount width. If you don't want to use internal snapshots,
>   make refcounts one bit and save cache space and I/O.
>
> - Added subclusters. This separate the COW size (one subcluster, I'm thinking
>   of 64k default size here) from the allocation size (one cluster, 2M). Less
>   fragmentation, less metadata, but still reasonable COW granularity.
>
>   This also allows to preallocate clusters, but none of their subclusters. You
>   can have an image that is like raw + COW metadata, and you can also
>   preallocate metadata for images with backing files.
>
> - Zero cluster flags. This allows discard even with a backing file that doesn't
>   contain zeros. It is also useful for copy-on-read/image streaming, as you'll
>   want to keep sparseness without accessing the remote image for an unallocated
>   cluster all the time.
>
> - Fixed internal snapshot metadata to use 64 bit VM state size. You can't save
>   a snapshot of a VM with >= 4 GB RAM today.
>
> Possible future additions:
>
> - Add per-L2-table dirty flag to L1?
> - Add per-refcount-block full flag to refcount table?

Hi,
  thinking about image format improvements, I would add:

- GUID for image and backing file
- relative path for backing file

This would help with finding images in a distributed environment or
when files are moved, e.g. gfs/nfs/ocfs mounted at different mount
points, or a backing file used as a template in a different images
directory that is later moved somewhere else. Also, with a GUID a
higher-level management layer could maintain a GUID <-> image file db.
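
As a minimal sketch of the relative-path idea (the function name and
the example paths are made up for illustration, and this is not how any
existing qemu code resolves backing files), the backing file name could
be interpreted relative to the directory of the image that refers to it:

```python
import posixpath


def resolve_backing_path(image_path, backing_path):
    """Resolve a backing file name stored in an image header.

    Hypothetical helper: absolute names are used as-is, relative names
    are interpreted relative to the directory containing the image.
    """
    if posixpath.isabs(backing_path):
        return backing_path
    image_dir = posixpath.dirname(image_path)
    return posixpath.normpath(posixpath.join(image_dir, backing_path))
```

With this rule, the same image directory tree keeps working when it is
mounted at /mnt/a on one host and /srv/b on another, as long as the
template directory moves together with the images directory.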

I was also thinking about a "backing file length" field to support
resizing, but this can probably be implemented with zero clusters.
Assume you have an image of 5 GB and create a new image with the first
image as its backing file. Now resize the second image from 5 GB to
3 GB, then resize it again (after some work) to 10 GB: the part from
3 GB to 5 GB should not be read from the backing file.
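
The resize example can be sketched as follows. This is only a toy
model under stated assumptions: the Overlay class, its field names and
the 1 GB "cluster" size are invented to keep the numbers small, and all
sizes are assumed to be cluster-aligned. When the image grows, the
clusters in the grown range that the backing file still covers are
marked as zero clusters so that reads never fall through to stale
backing data:

```python
GB = 1 << 30
CLUSTER = GB  # unrealistically large, chosen only to keep the example readable


class Overlay:
    """Toy model of an image with a backing file (hypothetical names)."""

    def __init__(self, size, backing_size):
        self.size = size
        self.backing_size = backing_size
        # cluster index -> "data" or "zero"; a missing key means unallocated
        self.table = {}

    def resize(self, new_size):
        if new_size > self.size:
            # Grown range that overlaps the backing file: mark as zero clusters.
            first = self.size // CLUSTER
            covered_end = min(new_size, self.backing_size)
            end = -(-covered_end // CLUSTER)  # ceiling division
            for idx in range(first, end):
                self.table[idx] = "zero"
        else:
            # Shrinking: drop mappings that lie beyond the new end.
            for idx in [i for i in self.table if i * CLUSTER >= new_size]:
                del self.table[idx]
        self.size = new_size

    def source(self, offset):
        """Return where a read at this offset is served from."""
        if offset >= self.size:
            raise ValueError("read beyond end of image")
        state = self.table.get(offset // CLUSTER)
        if state is not None:
            return state
        return "backing" if offset < self.backing_size else "zero"
```

After shrinking a 5 GB overlay to 3 GB and growing it to 10 GB, a read
at 4 GB is served from a zero cluster, while a read at 1 GB still goes
to the backing file.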

Also a bit in the L2 offset to say "there is no L2 table" because all
clusters covered by that L2 table are contiguous, so we avoid the L2
table entirely. Obviously this requires an optimization step to detect
or create such a condition.
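
A rough sketch of how a lookup could use such a bit. The flag name
NO_L2_TABLE and its bit position are pure assumptions for illustration
and are not part of any qcow2 specification, and the tables are plain
Python lists and dicts rather than on-disk structures:

```python
CLUSTER_BITS = 16                    # 64k clusters
CLUSTER_SIZE = 1 << CLUSTER_BITS
L2_BITS = CLUSTER_BITS - 3           # 8-byte entries, so 8192 entries per L2 table
NO_L2_TABLE = 1 << 62                # hypothetical flag bit in the L1 entry


def lookup(l1_table, l2_tables, guest_offset):
    """Map a guest offset to a host offset.

    l1_table:  list of L1 entries (offset, optionally with NO_L2_TABLE set)
    l2_tables: dict mapping an L2 table offset to a list of host cluster offsets
    """
    cluster_index = guest_offset >> CLUSTER_BITS
    l1_index = cluster_index >> L2_BITS
    l2_index = cluster_index & ((1 << L2_BITS) - 1)
    in_cluster = guest_offset & (CLUSTER_SIZE - 1)

    entry = l1_table[l1_index]
    base = entry & ~NO_L2_TABLE
    if entry & NO_L2_TABLE:
        # All clusters are contiguous: base is the first data cluster itself.
        return base + l2_index * CLUSTER_SIZE + in_cluster
    # Normal case: one extra table access to find the data cluster.
    return l2_tables[base][l2_index] + in_cluster
```

In the flagged case the host offset is computed arithmetically, which
saves both the L2 table read and the cache space it would occupy.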

For check, perhaps it would be helpful to save not only a flag but also
a size up to which the data are OK (for instance already allocated and
with refcounts saved correctly).

A possible optimization for refcounts would be to initialize them to 1
instead of 0. When clusters are allocated at end-of-file this would not
require a refcount change, and it would be easy to check the file size
to see which clusters are marked as allocated but not present.
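
The refcount idea can be illustrated with a small in-memory model. The
class and method names are invented for this sketch, and it covers only
a single refcount block:

```python
class Image:
    """Toy model: one refcount block whose entries start at 1, not 0."""

    def __init__(self, clusters_per_block):
        self.refcounts = [1] * clusters_per_block  # freshly created block
        self.file_clusters = 0                     # file length in clusters
        self.refcount_writes = 0                   # metadata updates performed

    def alloc_at_eof(self):
        # Growing the file is enough: the refcount is already 1.
        idx = self.file_clusters
        self.file_clusters += 1
        return idx

    def is_allocated(self, idx):
        # A refcount of 1 beyond end-of-file means "marked but not present".
        return idx < self.file_clusters and self.refcounts[idx] > 0

    def free(self, idx):
        # Freeing (or sharing via snapshots) still needs a refcount update.
        self.refcounts[idx] -= 1
        self.refcount_writes += 1
```

Allocating at end-of-file performs no refcount write in this model, and
comparing a cluster index with the file length distinguishes clusters
that are really in use from those that are only pre-marked.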

Fields for sectors and heads to support old CHS systems?

This mail sounds quite strange to me; I thought qed would be the future
of qcow2, but I must be really wrong.

I think a big limit of the current qed and qcow2 implementations is the
serialization of metadata updates (qcow2 uses synchronous operations
while qed uses a queue). I used the bonnie++ program to test speed, and
performance when allocating data is about 15-20% of that on already
allocated data. I'm working (in the little spare time I have) on
improving it. VirtualBox and ESX use large clusters (1 MB) to mitigate
the allocation/metadata problem. Perhaps raising the default cluster
size would help change the widespread idea of bad qemu I/O performance.

Regards
  Frediano Ziglio