Message-ID: <48BE5B26.6040707@suse.de>
Date: Wed, 03 Sep 2008 11:38:46 +0200
From: Kevin Wolf
Subject: Re: [Qemu-devel][PATCH,RFC] Zero cluster dedup
To: Shahar Frank
Cc: Laurent Vivier, qemu-devel@nongnu.org

Shahar Frank wrote:
> Hi All,
>
> The following patch implements a zero cluster dedup feature for the
> qcow2 image format.
>
> The incentive for this feature is the image COW "inflation" problem
> (i.e. COW image growth). The causes of COW image pollution are many;
> for example, Windows with NTFS does not reuse recently de-allocated
> space and thereby pollutes many blocks. A naïve solution would be to
> identify de-allocated space and de-allocate it from the image block
> space. The problems with this approach are that 1) it requires a
> Windows-side interface to the image, and 2) there is no de-allocate
> verb for the images.
>
> The suggested solution is simple:
> 1) Use Windows-side cleanup/wipe utilities such as "Erase", "Free Wipe
>    Wizard" or "Disk Redactor" (or any other free/non-free wipe utility)
>    to periodically wipe free space and clean up temporaries. Most
>    utilities can be used to write *zeros* on the wiped blocks.
> 2) Make qcow2 identify zero cluster writes and use them as
>    de-allocation hints (or a "make hole" verb). To implement such a
>    feature with minimal effort and minimal changes, I suggest using the
>    internal COW mechanism of qcow2 to create a shared zero cluster. To
>    avoid image format changes, this cluster is created on demand and
>    can then be shared by all zero writes. A non-zero write on the zero
>    cluster will cause a normal COW operation (similar to the shared
>    kernel zero page).
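
In other words, the fast path in point 2 only triggers for a write that
is cluster aligned, covers at least a full cluster, and is entirely
zero; everything else goes down the normal allocating/COW write path.
Roughly like this (illustration only, not the patch code; the helper
name is made up, it ignores the global zero_dedup switch and uses the
patch's is_not_zero() helper below):

static int can_dedup_write(BDRVQcowState *s, int64_t sector_num,
                           const uint8_t *buf, int nb_sectors)
{
    if (sector_num & (s->cluster_sectors - 1))
        return 0;                        /* not cluster aligned */
    if (nb_sectors < s->cluster_sectors)
        return 0;                        /* doesn't cover a full cluster */
    return !is_not_zero(buf, s->cluster_size);  /* all zeros? */
}

If this returns true, the L2 entry is simply pointed at the shared zero
cluster instead of allocating a data cluster.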
>
> The following patch also suggests an initial generic dedup
> infrastructure to allow future offline/batch/online dedup features.
> Such dedup features can handle many other causes of image inflation,
> such as Windows updates, application updates, etc.
>
> An additional included change is a small improvement to the
> alloc_clusters_noref function. The original version implemented a very
> simple first-fit algorithm, with a single pointer to remember the last
> position reached while searching for free space. The main problem with
> the original function is that this pointer scheme is not optimized for
> different allocation sizes. If, for example, you allocate a large space
> of 64k (8 clusters), the pointer will skip many smaller free spaces, so
> that any future allocation, even for one cluster, will not be able to
> use the skipped space. The included improvement is very simple: it uses
> two pointers, one for single cluster allocations and one for bigger
> allocations. This may still be too simple, but it should handle the
> single cluster allocation case (which should be very common).

This is a different issue and needs to go in a separate patch, IMHO.

> I tested this feature with some instrumentation that I added to
> qemu-img to display the logical to physical mapping. I will post it
> very soon.
>
> Limitations: this version is very naïve. Only aligned full zero
> clusters are supported. There is much room for improvement, to detect
> partial zero writes on zero clusters, and to detect that the result of
> a partial write is a full zero cluster.
>
> Shahar
>
> Signed-off-by: Shahar Frank
>
> diff --git a/block-qcow2.c b/block-qcow2.c
> index b9f1688..ca3faf1 100644
> --- a/block-qcow2.c
> +++ b/block-qcow2.c
> @@ -65,6 +65,14 @@
>  #define offsetof(type, field) ((size_t) &((type *)0)->field)
>  #endif
>
> +int zero_dedup = 1;

This value is never changed, only checked in an if condition. You
probably want to allow the user to set it (or drop it). Would
enable_zero_dedup be a better name?

> +
> +#if defined(DEBUG_ALLOC) || defined(DEBUG_ALLOC2)
> +#define DEBUG(msg, args...) fprintf(stderr, "(%d) %s: " msg "\n", getpid(), __FUNCTION__, ##args)
> +#else
> +#define DEBUG(msg, args...)
> +#endif
> +
>  typedef struct QCowHeader {
>      uint32_t magic;
>      uint32_t version;
> @@ -140,8 +148,10 @@ typedef struct BDRVQcowState {
>      uint32_t refcount_table_size;
>      uint64_t refcount_block_cache_offset;
>      uint16_t *refcount_block_cache;
> -    int64_t free_cluster_index;
> +    int64_t free_cluster_index1;
> +    int64_t free_cluster_index2;

These names are not exactly intuitive. At least you should immediately
see which one is for big chunks and which one for single clusters.
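
For example (just a naming suggestion, pick whatever you like):

    int64_t free_cluster_index_single;  /* first-fit pointer for single-cluster allocations */
    int64_t free_cluster_index_multi;   /* first-fit pointer for multi-cluster allocations */

would make it obvious at the point of use which pointer is which.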
>      int64_t free_byte_offset;
> +    uint64_t zero_cluster;
>
>      uint32_t crypt_method; /* current crypt method, 0 if no key yet */
>      uint32_t crypt_method_header;
> @@ -171,6 +181,8 @@ static int64_t alloc_clusters(BlockDriverState *bs, int64_t size);
>  static int64_t alloc_bytes(BlockDriverState *bs, int size);
>  static void free_clusters(BlockDriverState *bs,
>                            int64_t offset, int64_t size);
> +static int64_t alloc_clusters_noref(BlockDriverState *bs, int64_t size);
> +
>  #ifdef DEBUG_ALLOC
>  static void check_refcounts(BlockDriverState *bs);
>  #endif
> @@ -1123,20 +1135,19 @@ static int qcow_read(BlockDriverState *bs, int64_t sector_num,
>      return 0;
>  }
>
> -static int qcow_write(BlockDriverState *bs, int64_t sector_num,
> +static int write_cluster(BlockDriverState *bs, int64_t sector_num, uint64_t cluster_offset,
>                        const uint8_t *buf, int nb_sectors)
>  {
>      BDRVQcowState *s = bs->opaque;
> -    int ret, index_in_cluster, n;
> -    uint64_t cluster_offset;
> -    int n_end;
> +    int index_in_cluster = sector_num & (s->cluster_sectors - 1);
> +    int n_end = index_in_cluster + nb_sectors;
> +    int n = nb_sectors;
> +    int ret;
>
> -    while (nb_sectors > 0) {
> -        index_in_cluster = sector_num & (s->cluster_sectors - 1);
> -        n_end = index_in_cluster + nb_sectors;
> -        if (s->crypt_method &&
> -            n_end > QCOW_MAX_CRYPT_CLUSTERS * s->cluster_sectors)
> +    if (s->crypt_method && n_end > QCOW_MAX_CRYPT_CLUSTERS * s->cluster_sectors)
>              n_end = QCOW_MAX_CRYPT_CLUSTERS * s->cluster_sectors;
> +
> +    if (!cluster_offset)
>          cluster_offset = alloc_cluster_offset(bs, sector_num << 9,
>                                                index_in_cluster,
>                                                n_end, &n);
> @@ -1152,6 +1163,89 @@ static int qcow_write(BlockDriverState *bs, int64_t sector_num,
>          }
>          if (ret != n * 512)
>              return -1;
> +    return 0;
> +}

What is this change good for? I think qcow_write is small enough and
doesn't need further splitting.

> +
> +static int is_not_zero(const uint8_t *buf, int len)
> +{
> +    int i;
> +    len >>= 2;
> +    for(i = 0;i < len; i++) {
> +        if (((uint32_t *)buf)[i] != 0)
> +            return 1;
> +    }
> +    return 0;
> +}

Why negative logic? I'd prefer buf_is_zero(). Also, could probably be
an inline function.
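
That is, something like this (untested):

static inline int buf_is_zero(const uint8_t *buf, int len)
{
    int i;

    /* same word-wise scan as above, just with positive logic */
    for (i = 0; i < len / 4; i++) {
        if (((const uint32_t *)buf)[i] != 0)
            return 0;
    }
    return 1;
}

so that callers can write if (buf_is_zero(...)) instead of
if (!is_not_zero(...)).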
> +
> +static int qcow_remap(BlockDriverState *bs, uint64_t loffset, uint64_t poffset)
> +{
> +    uint64_t l2_offset, *l2_table;
> +    BDRVQcowState *s = bs->opaque;
> +    uint64_t ooffset;
> +    int l2_index, ret;
> +
> +    ret = get_cluster_table(bs, loffset, &l2_table, &l2_offset, &l2_index);
> +    if (ret == 0)
> +        return 0;
> +
> +    ooffset = be64_to_cpu(l2_table[l2_index]);
> +    if (ooffset == poffset)
> +        return 1; /* nothing to do, already mapped to poffset */
> +
> +    if (ooffset) {
> +        DEBUG("free old offset %llx (%llx)", ooffset, ooffset & ~QCOW_OFLAG_COPIED);
> +        free_any_clusters(bs, ooffset & ~QCOW_OFLAG_COPIED, 1);
> +    }
> +
> +    /* update L2 table */
> +    l2_table[l2_index] = cpu_to_be64(poffset);
> +
> +    if (bdrv_pwrite(s->hd,
> +                    l2_offset + l2_index * sizeof(uint64_t),
> +                    l2_table + l2_index, sizeof(uint64_t)) != sizeof(uint64_t))
> +        return 0;
> +
> +    update_refcount(bs, poffset, s->cluster_size, 1);

Might be better for robustness to update the refcount first and then
write the L2 table.

> +    return 1;
> +}
> +
> +static int qcow_dedup(BlockDriverState *bs, int64_t sector_num,
> +                      const uint8_t *buf, int *nb_sectors)
> +{
> +    BDRVQcowState *s = bs->opaque;
> +    int cluster = s->cluster_sectors;

You're using this merely as an abbreviation for s->cluster_sectors, not
to have different semantics. While it may have saved you typing a few
characters, everybody knows what s->cluster_sectors means, but has to
look up what cluster might be here.

> +    uint64_t cluster_offset = sector_num << 9, poffset;
> +
> +    DEBUG("%s 0x%llx", bs->filename, cluster_offset);
> +    if (!zero_dedup || (sector_num & (s->cluster_sectors - 1)) ||
> +        *nb_sectors < cluster || is_not_zero(buf, s->cluster_size))
> +        return 0; /* cannot/should not dedup */

Why do you check the value of nb_sectors? It's never used and only an
output parameter.

> +
> +    DEBUG("%s zero dedup 0x%llx", bs->filename, cluster_offset);
> +    *nb_sectors = cluster; /* allocate exactly one cluster */
> +
> +    if (!s->zero_cluster) {
> +        if (!(poffset = alloc_clusters_noref(bs, s->cluster_size)) ||
> +            write_cluster(bs, sector_num, poffset, buf, cluster) < 0)
> +            return 0;
> +        s->zero_cluster = poffset;
> +    }

I'm not sure how one could best avoid this, but you are accumulating
zero clusters because you create a new one for each VM run.

> +
> +    return qcow_remap(bs, cluster_offset, s->zero_cluster);
> +}
> +
> +static int qcow_write(BlockDriverState *bs, int64_t sector_num,
> +                      const uint8_t *buf, int nb_sectors)
> +{
> +    BDRVQcowState *s = bs->opaque;
> +    int n;
> +
> +    while (nb_sectors > 0) {
> +        n = nb_sectors;
> +        if (qcow_dedup(bs, sector_num, buf, &n) <= 0 &&
> +            write_cluster(bs, sector_num, sector_num << 9, buf, n) < 0)
> +            return -1;
> +
>          nb_sectors -= n;
>          sector_num += n;
>          buf += n * 512;

So these are the three lines which made you split up the function? I'd
prefer a nested if over this && usage.
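
Untested, but I find this easier to read:

    if (qcow_dedup(bs, sector_num, buf, &n) <= 0) {
        /* not a dedupable zero cluster, take the normal write path */
        if (write_cluster(bs, sector_num, sector_num << 9, buf, n) < 0)
            return -1;
    }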
> @@ -1298,6 +1392,7 @@ static void qcow_aio_write_cb(void *opaque, int ret)
>
>      acb->hd_aiocb = NULL;
>
> +    DEBUG("<<");

Left-over debug message.

>      if (ret < 0) {
>      fail:
>          acb->common.cb(acb->common.opaque, ret);
> @@ -1305,6 +1400,7 @@ static void qcow_aio_write_cb(void *opaque, int ret)
>          return;
>      }
>
> +    do {
>      acb->nb_sectors -= acb->n;
>      acb->sector_num += acb->n;
>      acb->buf += acb->n * 512;

Indentation?

> @@ -1315,6 +1411,13 @@ static void qcow_aio_write_cb(void *opaque, int ret)
>          qemu_aio_release(acb);
>          return;
>      }
> +        acb->n = acb->nb_sectors;
> +    } while ((ret = qcow_dedup(bs, acb->sector_num, acb->buf, &acb->n)) > 0);

I'm not sure this loop is the right approach. When you get a big write
request for zeros, what you are doing is to split it up into clusters
and to change and write the L2 table to disk for each single cluster.
This is different from the other read/write functions, which try to use
chunks as big as possible to minimize the overhead.

> +
> +    if (ret < 0) {
> +        ret = -EIO;
> +        goto fail;
> +    }
>
>      index_in_cluster = acb->sector_num & (s->cluster_sectors - 1);
>      n_end = index_in_cluster + acb->nb_sectors;
> @@ -2172,25 +2275,33 @@ static int64_t alloc_clusters_noref(BlockDriverState *bs, int64_t size)
>  {
>      BDRVQcowState *s = bs->opaque;
>      int i, nb_clusters;
> +    uint64_t free_cluster_index = s->free_cluster_index1;
>
>      nb_clusters = (size + s->cluster_size - 1) >> s->cluster_bits;
> +    if (nb_clusters > 1)
> +        free_cluster_index = s->free_cluster_index2;
> +
>      for(;;) {
> -        if (get_refcount(bs, s->free_cluster_index) == 0) {
> -            s->free_cluster_index++;
> +        if (get_refcount(bs, free_cluster_index) == 0) {
> +            free_cluster_index++;
>              for(i = 1; i < nb_clusters; i++) {
> -                if (get_refcount(bs, s->free_cluster_index) != 0)
> +                if (get_refcount(bs, free_cluster_index) != 0)
>                      goto not_found;
> -                s->free_cluster_index++;
> +                free_cluster_index++;
>              }
>  #ifdef DEBUG_ALLOC2
>              printf("alloc_clusters: size=%lld -> %lld\n",
>                     size,
> -                   (s->free_cluster_index - nb_clusters) << s->cluster_bits);
> +                   (free_cluster_index - nb_clusters) << s->cluster_bits);
>  #endif
> -            return (s->free_cluster_index - nb_clusters) << s->cluster_bits;
> +            if (nb_clusters > 1)
> +                s->free_cluster_index2 = free_cluster_index;
> +            else
> +                s->free_cluster_index1 = free_cluster_index;
> +            return (free_cluster_index - nb_clusters) << s->cluster_bits;
>          } else {
>          not_found:
> -            s->free_cluster_index++;
> +            free_cluster_index++;
>          }
>      }
>  }
> @@ -2374,8 +2485,13 @@ static int update_cluster_refcount(BlockDriverState *bs,
>      refcount += addend;
>      if (refcount < 0 || refcount > 0xffff)
>          return -EINVAL;
> -    if (refcount == 0 && cluster_index < s->free_cluster_index) {
> -        s->free_cluster_index = cluster_index;
> +    if (refcount == 0) {
> +        if (cluster_index < s->free_cluster_index1)
> +            s->free_cluster_index1 = cluster_index;
> +        if (cluster_index < s->free_cluster_index2)
> +            s->free_cluster_index2 = cluster_index;
> +        if (cluster_index == s->zero_cluster)
> +            s->zero_cluster = 0;
>      }
>      s->refcount_block_cache[block_index] = cpu_to_be16(refcount);

As said before, this last part belongs in a separate patch, so I didn't
review it now.

And then, one general thing: recently, Laurent has done a great job
adding comments to the functions dealing with clusters. I think it
wouldn't hurt to add some comments to the new functions you introduce,
too.

Kevin