From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <4997502D.1080401@codemonkey.ws>
Date: Sat, 14 Feb 2009 17:13:49 -0600
From: Anthony Liguori
MIME-Version: 1.0
Subject: Re: [Qemu-devel] [PATCH] Revert block-qcow2.c to kvm-72 version due to corruption reports
References: <4988AD96.6090308@codemonkey.ws> <20090213084023.GA1020@kos.to> <20090213163043.GJ18471@shareable.org> <4995A723.9010208@codemonkey.ws> <20090213190419.GB20328@shareable.org>
In-Reply-To: <20090213190419.GB20328@shareable.org>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Reply-To: qemu-devel@nongnu.org
List-Id: qemu-devel.nongnu.org
To: qemu-devel@nongnu.org

Jamie Lokier wrote:
> Anthony Liguori wrote:
>
>>> Simply reverting the qcow2 code appears to fix those problems, so it
>>> needn't hold up cutting a release. That's what I recommend.
>>>
>> Send some patches.
>>
>
> I did already.
>
> Here it is again. This should fix my bug and Marc's bug, according to
> his report that reverting qcow2.c fixes it.

Well, such a large reversion is a bad idea. Can you git bisect to the
actual changeset that introduced the bug you see? You're effectively
reverting a very large number of changes, whereas only one is likely
causing your problem.

Regards,

Anthony Liguori

> -- Jamie
>
>
> Subject: Revert block-qcow2.c to kvm-72 version due to corruption reports
>
> This fixes two kinds of qcow2 corruption observed in kvm-83 (actually
> kvm-73 and later), from three bug reports.
>
>
> Bug 1: Windows 2000 guests complain of corrupt registry.
>
> Many Windows 2000 guests which boot and run fine in kvm-72 fail with
> a blue screen indicating file corruption errors in kvm-73 through to
> kvm-83 (the latest), and succeed if we replace block-qcow2.c with the
> version from kvm-72.
>
> The blue screen appears towards the end of the boot sequence, and
> shows only briefly before rebooting. It says:
>
>     STOP: c0000218 (Registry File Failure)
>     The registry cannot load the hive (file):
>     \SystemRoot\System32\Config\SOFTWARE
>     or its log or alternate.
>     It is corrupt, absent, or not writable.
>
>     Beginning dump of physical memory
>     Physical memory dump complete. Contact your system administrator or
>     technical support [...?]
>
> This is narrowed down to the difference in block-qcow2.c between
> kvm-72 and kvm-73 (not -83). From kvm-73 to kvm-83, there have been
> more changes to block-qcow2.c, but the observed corruption still occurs.
>
> The bug isn't evident when only reading. When using "qemu-img
> convert" to convert a qcow2 file to a raw file, the broken and fixed
> versions of block-qcow2.c produce the same raw file.
> Also, when using "-snapshot" with qemu, the blue screen doesn't occur.
>
> This bug was observed by Jamie Lokier and
> confirmed for multiple Windows 2000 guests by Marc Bevand.
>
>
> Bug 2: Windows 2003 guests complain of corrupt registry.
>
> According to
> http://sourceforge.net/tracker/?func=detail&atid=893831&aid=2001452&group_id=180599
>
> Windows 2003 32-bit guests randomly spew disk corruption messages
> like this:
>
>     Windows – Registry Hive Recovered
>     Registry hive (file): SOFTWARE was corrupted and it has
>     been recovered. Some data might have been lost.
>
> and
>
>     The system cannot log on due to the following error:
>     Unable to complete the requested operation because of
>     either a catastrophic media failure or a data structure
>     corruption on the disk.
>
> This bug was reported by gerdwachs and
> confirmed by Marc Bevand, noting:
>
>     kvm-73+ also causes some of my Windows 2003 guests to exhibit this
>     exact registry corruption error. [...] This bug is also fixed by
>     reverting block-qcow2.c to the version from kvm-72.
>
> Worryingly, gerdwachs' bug report says it's for kvm-70, implying this
> patch may not fix all the Windows 2003 guest corruption problems.
>
> At least Marc says his observed problem goes away with kvm-72's qcow2.
>
>
> Bug 3: Corruption of the qcow2 index rendering the file unusable.
>
> Marc Bevand writes:
>
>     I tested kvm-81 and kvm-83 as well (can't test kvm-80 or older because
>     of the qcow2 performance regression caused by the default writethrough
>     caching policy) but it randomly triggers an even worse bug: the moment
>     I shut down a guest by typing "quit" in the monitor, it sometimes
>     overwrites the first 4kB of the disk image with mostly NUL bytes (!),
>     which completely destroys it. I am familiar with the qcow2 format, and
>     apparently this 4kB block is an L2 table with most entries
>     set to zero. I have had to restore at least 6 or 7 disk images from
>     backup after occurrences of that bug. My intuition tells me this may be
>     the qcow2 code trying to allocate a cluster to write a new L2 table,
>     but not noticing the allocation failed (represented by a 0 offset),
>     and writing the L2 table at that 0 offset, overwriting the qcow2
>     header.
>
> Fortunately this bug is also fixed by running kvm-75 with
> block-qcow2.c reverted to its kvm-72 version.
>
> Basically, qcow2 in kvm-73 or newer is completely unreliable.
>
>
> Reverting block-qcow2.c to the version in kvm-72 appears to fix the
> corruption symptoms reported by Marc and Jamie, although gerdwachs'
> related bug is against kvm-70, so it may not fix that.
>
> Unfortunately this reverts some optimisations, but fixing corruption
> is more important until the new code is reliable.
>
> This patch reverts block-qcow2.c in kvm-83 to the version in kvm-72,
> except that the "cache=writeback" default performance tweak is retained
> and there's no need to define "offsetof".
>
> Signed-off-by: Jamie Lokier
>
>
> --- kvm-83-real/qemu/block-qcow2.c	2009-01-13 13:29:42.000000000 +0000
> +++ kvm-83/qemu/block-qcow2.c	2009-02-13 18:51:12.000000000 +0000
> @@ -52,8 +52,6 @@
>  #define QCOW_CRYPT_NONE 0
>  #define QCOW_CRYPT_AES 1
>
> -#define QCOW_MAX_CRYPT_CLUSTERS 32
> -
>  /* indicate that the refcount of the referenced cluster is exactly one.
*/ > #define QCOW_OFLAG_COPIED (1LL << 63) > /* indicate that the cluster is compressed (they never have the copied flag) */ > @@ -269,8 +267,7 @@ > if (!s->cluster_cache) > goto fail; > /* one more sector for decompressed data alignment */ > - s->cluster_data = qemu_malloc(QCOW_MAX_CRYPT_CLUSTERS * s->cluster_size > - + 512); > + s->cluster_data = qemu_malloc(s->cluster_size + 512); > if (!s->cluster_data) > goto fail; > s->cluster_cache_offset = -1; > @@ -437,7 +434,8 @@ > int new_l1_size, new_l1_size2, ret, i; > uint64_t *new_l1_table; > uint64_t new_l1_table_offset; > - uint8_t data[12]; > + uint64_t data64; > + uint32_t data32; > > new_l1_size = s->l1_size; > if (min_size <= new_l1_size) > @@ -467,10 +465,13 @@ > new_l1_table[i] = be64_to_cpu(new_l1_table[i]); > > /* set new table */ > - cpu_to_be32w((uint32_t*)data, new_l1_size); > - cpu_to_be64w((uint64_t*)(data + 4), new_l1_table_offset); > - if (bdrv_pwrite(s->hd, offsetof(QCowHeader, l1_size), data, > - sizeof(data)) != sizeof(data)) > + data64 = cpu_to_be64(new_l1_table_offset); > + if (bdrv_pwrite(s->hd, offsetof(QCowHeader, l1_table_offset), > + &data64, sizeof(data64)) != sizeof(data64)) > + goto fail; > + data32 = cpu_to_be32(new_l1_size); > + if (bdrv_pwrite(s->hd, offsetof(QCowHeader, l1_size), > + &data32, sizeof(data32)) != sizeof(data32)) > goto fail; > qemu_free(s->l1_table); > free_clusters(bs, s->l1_table_offset, s->l1_size * sizeof(uint64_t)); > @@ -483,549 +484,169 @@ > return -EIO; > } > > -/* > - * seek_l2_table > +/* 'allocate' is: > * > - * seek l2_offset in the l2_cache table > - * if not found, return NULL, > - * if found, > - * increments the l2 cache hit count of the entry, > - * if counter overflow, divide by two all counters > - * return the pointer to the l2 cache entry > + * 0 not to allocate. > * > - */ > - > -static uint64_t *seek_l2_table(BDRVQcowState *s, uint64_t l2_offset) > -{ > - int i, j; > - > - for(i = 0; i < L2_CACHE_SIZE; i++) { > - if (l2_offset == s->l2_cache_offsets[i]) { > - /* increment the hit count */ > - if (++s->l2_cache_counts[i] == 0xffffffff) { > - for(j = 0; j < L2_CACHE_SIZE; j++) { > - s->l2_cache_counts[j] >>= 1; > - } > - } > - return s->l2_cache + (i << s->l2_bits); > - } > - } > - return NULL; > -} > - > -/* > - * l2_load > + * 1 to allocate a normal cluster (for sector indexes 'n_start' to > + * 'n_end') > * > - * Loads a L2 table into memory. If the table is in the cache, the cache > - * is used; otherwise the L2 table is loaded from the image file. > + * 2 to allocate a compressed cluster of size > + * 'compressed_size'. 'compressed_size' must be > 0 and < > + * cluster_size > * > - * Returns a pointer to the L2 table on success, or NULL if the read from > - * the image file failed. > + * return 0 if not allocated. 
> */ > - > -static uint64_t *l2_load(BlockDriverState *bs, uint64_t l2_offset) > -{ > - BDRVQcowState *s = bs->opaque; > - int min_index; > - uint64_t *l2_table; > - > - /* seek if the table for the given offset is in the cache */ > - > - l2_table = seek_l2_table(s, l2_offset); > - if (l2_table != NULL) > - return l2_table; > - > - /* not found: load a new entry in the least used one */ > - > - min_index = l2_cache_new_entry(bs); > - l2_table = s->l2_cache + (min_index << s->l2_bits); > - if (bdrv_pread(s->hd, l2_offset, l2_table, s->l2_size * sizeof(uint64_t)) != > - s->l2_size * sizeof(uint64_t)) > - return NULL; > - s->l2_cache_offsets[min_index] = l2_offset; > - s->l2_cache_counts[min_index] = 1; > - > - return l2_table; > -} > - > -/* > - * l2_allocate > - * > - * Allocate a new l2 entry in the file. If l1_index points to an already > - * used entry in the L2 table (i.e. we are doing a copy on write for the L2 > - * table) copy the contents of the old L2 table into the newly allocated one. > - * Otherwise the new table is initialized with zeros. > - * > - */ > - > -static uint64_t *l2_allocate(BlockDriverState *bs, int l1_index) > -{ > - BDRVQcowState *s = bs->opaque; > - int min_index; > - uint64_t old_l2_offset, tmp; > - uint64_t *l2_table, l2_offset; > - > - old_l2_offset = s->l1_table[l1_index]; > - > - /* allocate a new l2 entry */ > - > - l2_offset = alloc_clusters(bs, s->l2_size * sizeof(uint64_t)); > - > - /* update the L1 entry */ > - > - s->l1_table[l1_index] = l2_offset | QCOW_OFLAG_COPIED; > - > - tmp = cpu_to_be64(l2_offset | QCOW_OFLAG_COPIED); > - if (bdrv_pwrite(s->hd, s->l1_table_offset + l1_index * sizeof(tmp), > - &tmp, sizeof(tmp)) != sizeof(tmp)) > - return NULL; > - > - /* allocate a new entry in the l2 cache */ > - > - min_index = l2_cache_new_entry(bs); > - l2_table = s->l2_cache + (min_index << s->l2_bits); > - > - if (old_l2_offset == 0) { > - /* if there was no old l2 table, clear the new table */ > - memset(l2_table, 0, s->l2_size * sizeof(uint64_t)); > - } else { > - /* if there was an old l2 table, read it from the disk */ > - if (bdrv_pread(s->hd, old_l2_offset, > - l2_table, s->l2_size * sizeof(uint64_t)) != > - s->l2_size * sizeof(uint64_t)) > - return NULL; > - } > - /* write the l2 table to the file */ > - if (bdrv_pwrite(s->hd, l2_offset, > - l2_table, s->l2_size * sizeof(uint64_t)) != > - s->l2_size * sizeof(uint64_t)) > - return NULL; > - > - /* update the l2 cache entry */ > - > - s->l2_cache_offsets[min_index] = l2_offset; > - s->l2_cache_counts[min_index] = 1; > - > - return l2_table; > -} > - > -static int size_to_clusters(BDRVQcowState *s, int64_t size) > -{ > - return (size + (s->cluster_size - 1)) >> s->cluster_bits; > -} > - > -static int count_contiguous_clusters(uint64_t nb_clusters, int cluster_size, > - uint64_t *l2_table, uint64_t start, uint64_t mask) > -{ > - int i; > - uint64_t offset = be64_to_cpu(l2_table[0]) & ~mask; > - > - if (!offset) > - return 0; > - > - for (i = start; i < start + nb_clusters; i++) > - if (offset + i * cluster_size != (be64_to_cpu(l2_table[i]) & ~mask)) > - break; > - > - return (i - start); > -} > - > -static int count_contiguous_free_clusters(uint64_t nb_clusters, uint64_t *l2_table) > -{ > - int i = 0; > - > - while(nb_clusters-- && l2_table[i] == 0) > - i++; > - > - return i; > -} > - > -/* > - * get_cluster_offset > - * > - * For a given offset of the disk image, return cluster offset in > - * qcow2 file. 
> - * > - * on entry, *num is the number of contiguous clusters we'd like to > - * access following offset. > - * > - * on exit, *num is the number of contiguous clusters we can read. > - * > - * Return 1, if the offset is found > - * Return 0, otherwise. > - * > - */ > - > static uint64_t get_cluster_offset(BlockDriverState *bs, > - uint64_t offset, int *num) > -{ > - BDRVQcowState *s = bs->opaque; > - int l1_index, l2_index; > - uint64_t l2_offset, *l2_table, cluster_offset; > - int l1_bits, c; > - int index_in_cluster, nb_available, nb_needed, nb_clusters; > - > - index_in_cluster = (offset >> 9) & (s->cluster_sectors - 1); > - nb_needed = *num + index_in_cluster; > - > - l1_bits = s->l2_bits + s->cluster_bits; > - > - /* compute how many bytes there are between the offset and > - * the end of the l1 entry > - */ > - > - nb_available = (1 << l1_bits) - (offset & ((1 << l1_bits) - 1)); > - > - /* compute the number of available sectors */ > - > - nb_available = (nb_available >> 9) + index_in_cluster; > - > - cluster_offset = 0; > - > - /* seek the the l2 offset in the l1 table */ > - > - l1_index = offset >> l1_bits; > - if (l1_index >= s->l1_size) > - goto out; > - > - l2_offset = s->l1_table[l1_index]; > - > - /* seek the l2 table of the given l2 offset */ > - > - if (!l2_offset) > - goto out; > - > - /* load the l2 table in memory */ > - > - l2_offset &= ~QCOW_OFLAG_COPIED; > - l2_table = l2_load(bs, l2_offset); > - if (l2_table == NULL) > - return 0; > - > - /* find the cluster offset for the given disk offset */ > - > - l2_index = (offset >> s->cluster_bits) & (s->l2_size - 1); > - cluster_offset = be64_to_cpu(l2_table[l2_index]); > - nb_clusters = size_to_clusters(s, nb_needed << 9); > - > - if (!cluster_offset) { > - /* how many empty clusters ? */ > - c = count_contiguous_free_clusters(nb_clusters, &l2_table[l2_index]); > - } else { > - /* how many allocated clusters ? */ > - c = count_contiguous_clusters(nb_clusters, s->cluster_size, > - &l2_table[l2_index], 0, QCOW_OFLAG_COPIED); > - } > - > - nb_available = (c * s->cluster_sectors); > -out: > - if (nb_available > nb_needed) > - nb_available = nb_needed; > - > - *num = nb_available - index_in_cluster; > - > - return cluster_offset & ~QCOW_OFLAG_COPIED; > -} > - > -/* > - * free_any_clusters > - * > - * free clusters according to its type: compressed or not > - * > - */ > - > -static void free_any_clusters(BlockDriverState *bs, > - uint64_t cluster_offset, int nb_clusters) > -{ > - BDRVQcowState *s = bs->opaque; > - > - /* free the cluster */ > - > - if (cluster_offset & QCOW_OFLAG_COMPRESSED) { > - int nb_csectors; > - nb_csectors = ((cluster_offset >> s->csize_shift) & > - s->csize_mask) + 1; > - free_clusters(bs, (cluster_offset & s->cluster_offset_mask) & ~511, > - nb_csectors * 512); > - return; > - } > - > - free_clusters(bs, cluster_offset, nb_clusters << s->cluster_bits); > - > - return; > -} > - > -/* > - * get_cluster_table > - * > - * for a given disk offset, load (and allocate if needed) > - * the l2 table. > - * > - * the l2 table offset in the qcow2 file and the cluster index > - * in the l2 table are given to the caller. 
> - * > - */ > - > -static int get_cluster_table(BlockDriverState *bs, uint64_t offset, > - uint64_t **new_l2_table, > - uint64_t *new_l2_offset, > - int *new_l2_index) > + uint64_t offset, int allocate, > + int compressed_size, > + int n_start, int n_end) > { > BDRVQcowState *s = bs->opaque; > - int l1_index, l2_index, ret; > - uint64_t l2_offset, *l2_table; > - > - /* seek the the l2 offset in the l1 table */ > + int min_index, i, j, l1_index, l2_index, ret; > + uint64_t l2_offset, *l2_table, cluster_offset, tmp, old_l2_offset; > > l1_index = offset >> (s->l2_bits + s->cluster_bits); > if (l1_index >= s->l1_size) { > - ret = grow_l1_table(bs, l1_index + 1); > - if (ret < 0) > + /* outside l1 table is allowed: we grow the table if needed */ > + if (!allocate) > + return 0; > + if (grow_l1_table(bs, l1_index + 1) < 0) > return 0; > } > l2_offset = s->l1_table[l1_index]; > + if (!l2_offset) { > + if (!allocate) > + return 0; > + l2_allocate: > + old_l2_offset = l2_offset; > + /* allocate a new l2 entry */ > + l2_offset = alloc_clusters(bs, s->l2_size * sizeof(uint64_t)); > + /* update the L1 entry */ > + s->l1_table[l1_index] = l2_offset | QCOW_OFLAG_COPIED; > + tmp = cpu_to_be64(l2_offset | QCOW_OFLAG_COPIED); > + if (bdrv_pwrite(s->hd, s->l1_table_offset + l1_index * sizeof(tmp), > + &tmp, sizeof(tmp)) != sizeof(tmp)) > + return 0; > + min_index = l2_cache_new_entry(bs); > + l2_table = s->l2_cache + (min_index << s->l2_bits); > > - /* seek the l2 table of the given l2 offset */ > - > - if (l2_offset & QCOW_OFLAG_COPIED) { > - /* load the l2 table in memory */ > - l2_offset &= ~QCOW_OFLAG_COPIED; > - l2_table = l2_load(bs, l2_offset); > - if (l2_table == NULL) > + if (old_l2_offset == 0) { > + memset(l2_table, 0, s->l2_size * sizeof(uint64_t)); > + } else { > + if (bdrv_pread(s->hd, old_l2_offset, > + l2_table, s->l2_size * sizeof(uint64_t)) != > + s->l2_size * sizeof(uint64_t)) > + return 0; > + } > + if (bdrv_pwrite(s->hd, l2_offset, > + l2_table, s->l2_size * sizeof(uint64_t)) != > + s->l2_size * sizeof(uint64_t)) > return 0; > } else { > - if (l2_offset) > - free_clusters(bs, l2_offset, s->l2_size * sizeof(uint64_t)); > - l2_table = l2_allocate(bs, l1_index); > - if (l2_table == NULL) > + if (!(l2_offset & QCOW_OFLAG_COPIED)) { > + if (allocate) { > + free_clusters(bs, l2_offset, s->l2_size * sizeof(uint64_t)); > + goto l2_allocate; > + } > + } else { > + l2_offset &= ~QCOW_OFLAG_COPIED; > + } > + for(i = 0; i < L2_CACHE_SIZE; i++) { > + if (l2_offset == s->l2_cache_offsets[i]) { > + /* increment the hit count */ > + if (++s->l2_cache_counts[i] == 0xffffffff) { > + for(j = 0; j < L2_CACHE_SIZE; j++) { > + s->l2_cache_counts[j] >>= 1; > + } > + } > + l2_table = s->l2_cache + (i << s->l2_bits); > + goto found; > + } > + } > + /* not found: load a new entry in the least used one */ > + min_index = l2_cache_new_entry(bs); > + l2_table = s->l2_cache + (min_index << s->l2_bits); > + if (bdrv_pread(s->hd, l2_offset, l2_table, s->l2_size * sizeof(uint64_t)) != > + s->l2_size * sizeof(uint64_t)) > return 0; > - l2_offset = s->l1_table[l1_index] & ~QCOW_OFLAG_COPIED; > } > - > - /* find the cluster offset for the given disk offset */ > - > + s->l2_cache_offsets[min_index] = l2_offset; > + s->l2_cache_counts[min_index] = 1; > + found: > l2_index = (offset >> s->cluster_bits) & (s->l2_size - 1); > - > - *new_l2_table = l2_table; > - *new_l2_offset = l2_offset; > - *new_l2_index = l2_index; > - > - return 1; > -} > - > -/* > - * alloc_compressed_cluster_offset > - * > - * For a given offset of the 
disk image, return cluster offset in > - * qcow2 file. > - * > - * If the offset is not found, allocate a new compressed cluster. > - * > - * Return the cluster offset if successful, > - * Return 0, otherwise. > - * > - */ > - > -static uint64_t alloc_compressed_cluster_offset(BlockDriverState *bs, > - uint64_t offset, > - int compressed_size) > -{ > - BDRVQcowState *s = bs->opaque; > - int l2_index, ret; > - uint64_t l2_offset, *l2_table, cluster_offset; > - int nb_csectors; > - > - ret = get_cluster_table(bs, offset, &l2_table, &l2_offset, &l2_index); > - if (ret == 0) > - return 0; > - > cluster_offset = be64_to_cpu(l2_table[l2_index]); > - if (cluster_offset & QCOW_OFLAG_COPIED) > - return cluster_offset & ~QCOW_OFLAG_COPIED; > - > - if (cluster_offset) > - free_any_clusters(bs, cluster_offset, 1); > - > - cluster_offset = alloc_bytes(bs, compressed_size); > - nb_csectors = ((cluster_offset + compressed_size - 1) >> 9) - > - (cluster_offset >> 9); > - > - cluster_offset |= QCOW_OFLAG_COMPRESSED | > - ((uint64_t)nb_csectors << s->csize_shift); > - > - /* update L2 table */ > - > - /* compressed clusters never have the copied flag */ > - > - l2_table[l2_index] = cpu_to_be64(cluster_offset); > - if (bdrv_pwrite(s->hd, > - l2_offset + l2_index * sizeof(uint64_t), > - l2_table + l2_index, > - sizeof(uint64_t)) != sizeof(uint64_t)) > - return 0; > - > - return cluster_offset; > -} > - > -typedef struct QCowL2Meta > -{ > - uint64_t offset; > - int n_start; > - int nb_available; > - int nb_clusters; > -} QCowL2Meta; > - > -static int alloc_cluster_link_l2(BlockDriverState *bs, uint64_t cluster_offset, > - QCowL2Meta *m) > -{ > - BDRVQcowState *s = bs->opaque; > - int i, j = 0, l2_index, ret; > - uint64_t *old_cluster, start_sect, l2_offset, *l2_table; > - > - if (m->nb_clusters == 0) > - return 0; > - > - if (!(old_cluster = qemu_malloc(m->nb_clusters * sizeof(uint64_t)))) > - return -ENOMEM; > - > - /* copy content of unmodified sectors */ > - start_sect = (m->offset & ~(s->cluster_size - 1)) >> 9; > - if (m->n_start) { > - ret = copy_sectors(bs, start_sect, cluster_offset, 0, m->n_start); > - if (ret < 0) > - goto err; > + if (!cluster_offset) { > + if (!allocate) > + return cluster_offset; > + } else if (!(cluster_offset & QCOW_OFLAG_COPIED)) { > + if (!allocate) > + return cluster_offset; > + /* free the cluster */ > + if (cluster_offset & QCOW_OFLAG_COMPRESSED) { > + int nb_csectors; > + nb_csectors = ((cluster_offset >> s->csize_shift) & > + s->csize_mask) + 1; > + free_clusters(bs, (cluster_offset & s->cluster_offset_mask) & ~511, > + nb_csectors * 512); > + } else { > + free_clusters(bs, cluster_offset, s->cluster_size); > + } > + } else { > + cluster_offset &= ~QCOW_OFLAG_COPIED; > + return cluster_offset; > } > - > - if (m->nb_available & (s->cluster_sectors - 1)) { > - uint64_t end = m->nb_available & ~(uint64_t)(s->cluster_sectors - 1); > - ret = copy_sectors(bs, start_sect + end, cluster_offset + (end << 9), > - m->nb_available - end, s->cluster_sectors); > - if (ret < 0) > - goto err; > + if (allocate == 1) { > + /* allocate a new cluster */ > + cluster_offset = alloc_clusters(bs, s->cluster_size); > + > + /* we must initialize the cluster content which won't be > + written */ > + if ((n_end - n_start) < s->cluster_sectors) { > + uint64_t start_sect; > + > + start_sect = (offset & ~(s->cluster_size - 1)) >> 9; > + ret = copy_sectors(bs, start_sect, > + cluster_offset, 0, n_start); > + if (ret < 0) > + return 0; > + ret = copy_sectors(bs, start_sect, > + cluster_offset, n_end, 
s->cluster_sectors); > + if (ret < 0) > + return 0; > + } > + tmp = cpu_to_be64(cluster_offset | QCOW_OFLAG_COPIED); > + } else { > + int nb_csectors; > + cluster_offset = alloc_bytes(bs, compressed_size); > + nb_csectors = ((cluster_offset + compressed_size - 1) >> 9) - > + (cluster_offset >> 9); > + cluster_offset |= QCOW_OFLAG_COMPRESSED | > + ((uint64_t)nb_csectors << s->csize_shift); > + /* compressed clusters never have the copied flag */ > + tmp = cpu_to_be64(cluster_offset); > } > - > - ret = -EIO; > /* update L2 table */ > - if (!get_cluster_table(bs, m->offset, &l2_table, &l2_offset, &l2_index)) > - goto err; > - > - for (i = 0; i < m->nb_clusters; i++) { > - if(l2_table[l2_index + i] != 0) > - old_cluster[j++] = l2_table[l2_index + i]; > - > - l2_table[l2_index + i] = cpu_to_be64((cluster_offset + > - (i << s->cluster_bits)) | QCOW_OFLAG_COPIED); > - } > - > - if (bdrv_pwrite(s->hd, l2_offset + l2_index * sizeof(uint64_t), > - l2_table + l2_index, m->nb_clusters * sizeof(uint64_t)) != > - m->nb_clusters * sizeof(uint64_t)) > - goto err; > - > - for (i = 0; i < j; i++) > - free_any_clusters(bs, old_cluster[i], 1); > - > - ret = 0; > -err: > - qemu_free(old_cluster); > - return ret; > - } > - > -/* > - * alloc_cluster_offset > - * > - * For a given offset of the disk image, return cluster offset in > - * qcow2 file. > - * > - * If the offset is not found, allocate a new cluster. > - * > - * Return the cluster offset if successful, > - * Return 0, otherwise. > - * > - */ > - > -static uint64_t alloc_cluster_offset(BlockDriverState *bs, > - uint64_t offset, > - int n_start, int n_end, > - int *num, QCowL2Meta *m) > -{ > - BDRVQcowState *s = bs->opaque; > - int l2_index, ret; > - uint64_t l2_offset, *l2_table, cluster_offset; > - int nb_clusters, i = 0; > - > - ret = get_cluster_table(bs, offset, &l2_table, &l2_offset, &l2_index); > - if (ret == 0) > + l2_table[l2_index] = tmp; > + if (bdrv_pwrite(s->hd, > + l2_offset + l2_index * sizeof(tmp), &tmp, sizeof(tmp)) != sizeof(tmp)) > return 0; > - > - nb_clusters = size_to_clusters(s, n_end << 9); > - > - nb_clusters = MIN(nb_clusters, s->l2_size - l2_index); > - > - cluster_offset = be64_to_cpu(l2_table[l2_index]); > - > - /* We keep all QCOW_OFLAG_COPIED clusters */ > - > - if (cluster_offset & QCOW_OFLAG_COPIED) { > - nb_clusters = count_contiguous_clusters(nb_clusters, s->cluster_size, > - &l2_table[l2_index], 0, 0); > - > - cluster_offset &= ~QCOW_OFLAG_COPIED; > - m->nb_clusters = 0; > - > - goto out; > - } > - > - /* for the moment, multiple compressed clusters are not managed */ > - > - if (cluster_offset & QCOW_OFLAG_COMPRESSED) > - nb_clusters = 1; > - > - /* how many available clusters ? 
*/ > - > - while (i < nb_clusters) { > - i += count_contiguous_clusters(nb_clusters - i, s->cluster_size, > - &l2_table[l2_index], i, 0); > - > - if(be64_to_cpu(l2_table[l2_index + i])) > - break; > - > - i += count_contiguous_free_clusters(nb_clusters - i, > - &l2_table[l2_index + i]); > - > - cluster_offset = be64_to_cpu(l2_table[l2_index + i]); > - > - if ((cluster_offset & QCOW_OFLAG_COPIED) || > - (cluster_offset & QCOW_OFLAG_COMPRESSED)) > - break; > - } > - nb_clusters = i; > - > - /* allocate a new cluster */ > - > - cluster_offset = alloc_clusters(bs, nb_clusters * s->cluster_size); > - > - /* save info needed for meta data update */ > - m->offset = offset; > - m->n_start = n_start; > - m->nb_clusters = nb_clusters; > - > -out: > - m->nb_available = MIN(nb_clusters << (s->cluster_bits - 9), n_end); > - > - *num = m->nb_available - n_start; > - > return cluster_offset; > } > > static int qcow_is_allocated(BlockDriverState *bs, int64_t sector_num, > int nb_sectors, int *pnum) > { > + BDRVQcowState *s = bs->opaque; > + int index_in_cluster, n; > uint64_t cluster_offset; > > - *pnum = nb_sectors; > - cluster_offset = get_cluster_offset(bs, sector_num << 9, pnum); > - > + cluster_offset = get_cluster_offset(bs, sector_num << 9, 0, 0, 0, 0); > + index_in_cluster = sector_num & (s->cluster_sectors - 1); > + n = s->cluster_sectors - index_in_cluster; > + if (n > nb_sectors) > + n = nb_sectors; > + *pnum = n; > return (cluster_offset != 0); > } > > @@ -1102,9 +723,11 @@ > uint64_t cluster_offset; > > while (nb_sectors > 0) { > - n = nb_sectors; > - cluster_offset = get_cluster_offset(bs, sector_num << 9, &n); > + cluster_offset = get_cluster_offset(bs, sector_num << 9, 0, 0, 0, 0); > index_in_cluster = sector_num & (s->cluster_sectors - 1); > + n = s->cluster_sectors - index_in_cluster; > + if (n > nb_sectors) > + n = nb_sectors; > if (!cluster_offset) { > if (bs->backing_hd) { > /* read from the base image */ > @@ -1143,18 +766,15 @@ > BDRVQcowState *s = bs->opaque; > int ret, index_in_cluster, n; > uint64_t cluster_offset; > - int n_end; > - QCowL2Meta l2meta; > > while (nb_sectors > 0) { > index_in_cluster = sector_num & (s->cluster_sectors - 1); > - n_end = index_in_cluster + nb_sectors; > - if (s->crypt_method && > - n_end > QCOW_MAX_CRYPT_CLUSTERS * s->cluster_sectors) > - n_end = QCOW_MAX_CRYPT_CLUSTERS * s->cluster_sectors; > - cluster_offset = alloc_cluster_offset(bs, sector_num << 9, > - index_in_cluster, > - n_end, &n, &l2meta); > + n = s->cluster_sectors - index_in_cluster; > + if (n > nb_sectors) > + n = nb_sectors; > + cluster_offset = get_cluster_offset(bs, sector_num << 9, 1, 0, > + index_in_cluster, > + index_in_cluster + n); > if (!cluster_offset) > return -1; > if (s->crypt_method) { > @@ -1165,10 +785,8 @@ > } else { > ret = bdrv_pwrite(s->hd, cluster_offset + index_in_cluster * 512, buf, n * 512); > } > - if (ret != n * 512 || alloc_cluster_link_l2(bs, cluster_offset, &l2meta) < 0) { > - free_any_clusters(bs, cluster_offset, l2meta.nb_clusters); > + if (ret != n * 512) > return -1; > - } > nb_sectors -= n; > sector_num += n; > buf += n * 512; > @@ -1186,33 +804,8 @@ > uint64_t cluster_offset; > uint8_t *cluster_data; > BlockDriverAIOCB *hd_aiocb; > - QEMUBH *bh; > - QCowL2Meta l2meta; > } QCowAIOCB; > > -static void qcow_aio_read_cb(void *opaque, int ret); > -static void qcow_aio_read_bh(void *opaque) > -{ > - QCowAIOCB *acb = opaque; > - qemu_bh_delete(acb->bh); > - acb->bh = NULL; > - qcow_aio_read_cb(opaque, 0); > -} > - > -static int qcow_schedule_bh(QEMUBHFunc 
*cb, QCowAIOCB *acb) > -{ > - if (acb->bh) > - return -EIO; > - > - acb->bh = qemu_bh_new(cb, acb); > - if (!acb->bh) > - return -EIO; > - > - qemu_bh_schedule(acb->bh); > - > - return 0; > -} > - > static void qcow_aio_read_cb(void *opaque, int ret) > { > QCowAIOCB *acb = opaque; > @@ -1222,12 +815,13 @@ > > acb->hd_aiocb = NULL; > if (ret < 0) { > -fail: > + fail: > acb->common.cb(acb->common.opaque, ret); > qemu_aio_release(acb); > return; > } > > + redo: > /* post process the read buffer */ > if (!acb->cluster_offset) { > /* nothing to do */ > @@ -1253,9 +847,12 @@ > } > > /* prepare next AIO request */ > - acb->n = acb->nb_sectors; > - acb->cluster_offset = get_cluster_offset(bs, acb->sector_num << 9, &acb->n); > + acb->cluster_offset = get_cluster_offset(bs, acb->sector_num << 9, > + 0, 0, 0, 0); > index_in_cluster = acb->sector_num & (s->cluster_sectors - 1); > + acb->n = s->cluster_sectors - index_in_cluster; > + if (acb->n > acb->nb_sectors) > + acb->n = acb->nb_sectors; > > if (!acb->cluster_offset) { > if (bs->backing_hd) { > @@ -1268,16 +865,12 @@ > if (acb->hd_aiocb == NULL) > goto fail; > } else { > - ret = qcow_schedule_bh(qcow_aio_read_bh, acb); > - if (ret < 0) > - goto fail; > + goto redo; > } > } else { > /* Note: in this case, no need to wait */ > memset(acb->buf, 0, 512 * acb->n); > - ret = qcow_schedule_bh(qcow_aio_read_bh, acb); > - if (ret < 0) > - goto fail; > + goto redo; > } > } else if (acb->cluster_offset & QCOW_OFLAG_COMPRESSED) { > /* add AIO support for compressed blocks ? */ > @@ -1285,9 +878,7 @@ > goto fail; > memcpy(acb->buf, > s->cluster_cache + index_in_cluster * 512, 512 * acb->n); > - ret = qcow_schedule_bh(qcow_aio_read_bh, acb); > - if (ret < 0) > - goto fail; > + goto redo; > } else { > if ((acb->cluster_offset & 511) != 0) { > ret = -EIO; > @@ -1316,7 +907,6 @@ > acb->nb_sectors = nb_sectors; > acb->n = 0; > acb->cluster_offset = 0; > - acb->l2meta.nb_clusters = 0; > return acb; > } > > @@ -1340,8 +930,8 @@ > BlockDriverState *bs = acb->common.bs; > BDRVQcowState *s = bs->opaque; > int index_in_cluster; > + uint64_t cluster_offset; > const uint8_t *src_buf; > - int n_end; > > acb->hd_aiocb = NULL; > > @@ -1352,11 +942,6 @@ > return; > } > > - if (alloc_cluster_link_l2(bs, acb->cluster_offset, &acb->l2meta) < 0) { > - free_any_clusters(bs, acb->cluster_offset, acb->l2meta.nb_clusters); > - goto fail; > - } > - > acb->nb_sectors -= acb->n; > acb->sector_num += acb->n; > acb->buf += acb->n * 512; > @@ -1369,22 +954,19 @@ > } > > index_in_cluster = acb->sector_num & (s->cluster_sectors - 1); > - n_end = index_in_cluster + acb->nb_sectors; > - if (s->crypt_method && > - n_end > QCOW_MAX_CRYPT_CLUSTERS * s->cluster_sectors) > - n_end = QCOW_MAX_CRYPT_CLUSTERS * s->cluster_sectors; > - > - acb->cluster_offset = alloc_cluster_offset(bs, acb->sector_num << 9, > - index_in_cluster, > - n_end, &acb->n, &acb->l2meta); > - if (!acb->cluster_offset || (acb->cluster_offset & 511) != 0) { > + acb->n = s->cluster_sectors - index_in_cluster; > + if (acb->n > acb->nb_sectors) > + acb->n = acb->nb_sectors; > + cluster_offset = get_cluster_offset(bs, acb->sector_num << 9, 1, 0, > + index_in_cluster, > + index_in_cluster + acb->n); > + if (!cluster_offset || (cluster_offset & 511) != 0) { > ret = -EIO; > goto fail; > } > if (s->crypt_method) { > if (!acb->cluster_data) { > - acb->cluster_data = qemu_mallocz(QCOW_MAX_CRYPT_CLUSTERS * > - s->cluster_size); > + acb->cluster_data = qemu_mallocz(s->cluster_size); > if (!acb->cluster_data) { > ret = -ENOMEM; > goto fail; > 
@@ -1397,7 +979,7 @@ > src_buf = acb->buf; > } > acb->hd_aiocb = bdrv_aio_write(s->hd, > - (acb->cluster_offset >> 9) + index_in_cluster, > + (cluster_offset >> 9) + index_in_cluster, > src_buf, acb->n, > qcow_aio_write_cb, acb); > if (acb->hd_aiocb == NULL) > @@ -1571,7 +1153,7 @@ > > memset(s->l1_table, 0, l1_length); > if (bdrv_pwrite(s->hd, s->l1_table_offset, s->l1_table, l1_length) < 0) > - return -1; > + return -1; > ret = bdrv_truncate(s->hd, s->l1_table_offset + l1_length); > if (ret < 0) > return ret; > @@ -1637,10 +1219,8 @@ > /* could not compress: write normal cluster */ > qcow_write(bs, sector_num, buf, s->cluster_sectors); > } else { > - cluster_offset = alloc_compressed_cluster_offset(bs, sector_num << 9, > - out_len); > - if (!cluster_offset) > - return -1; > + cluster_offset = get_cluster_offset(bs, sector_num << 9, 2, > + out_len, 0, 0); > cluster_offset &= s->cluster_offset_mask; > if (bdrv_pwrite(s->hd, cluster_offset, out_buf, out_len) != out_len) { > qemu_free(out_buf); > @@ -2225,19 +1805,26 @@ > BDRVQcowState *s = bs->opaque; > int i, nb_clusters; > > - nb_clusters = size_to_clusters(s, size); > -retry: > - for(i = 0; i < nb_clusters; i++) { > - int64_t i = s->free_cluster_index++; > - if (get_refcount(bs, i) != 0) > - goto retry; > - } > + nb_clusters = (size + s->cluster_size - 1) >> s->cluster_bits; > + for(;;) { > + if (get_refcount(bs, s->free_cluster_index) == 0) { > + s->free_cluster_index++; > + for(i = 1; i < nb_clusters; i++) { > + if (get_refcount(bs, s->free_cluster_index) != 0) > + goto not_found; > + s->free_cluster_index++; > + } > #ifdef DEBUG_ALLOC2 > - printf("alloc_clusters: size=%lld -> %lld\n", > - size, > - (s->free_cluster_index - nb_clusters) << s->cluster_bits); > + printf("alloc_clusters: size=%lld -> %lld\n", > + size, > + (s->free_cluster_index - nb_clusters) << s->cluster_bits); > #endif > - return (s->free_cluster_index - nb_clusters) << s->cluster_bits; > + return (s->free_cluster_index - nb_clusters) << s->cluster_bits; > + } else { > + not_found: > + s->free_cluster_index++; > + } > + } > } > > static int64_t alloc_clusters(BlockDriverState *bs, int64_t size) > @@ -2301,7 +1888,8 @@ > int new_table_size, new_table_size2, refcount_table_clusters, i, ret; > uint64_t *new_table; > int64_t table_offset; > - uint8_t data[12]; > + uint64_t data64; > + uint32_t data32; > int old_table_size; > int64_t old_table_offset; > > @@ -2340,10 +1928,13 @@ > for(i = 0; i < s->refcount_table_size; i++) > be64_to_cpus(&new_table[i]); > > - cpu_to_be64w((uint64_t*)data, table_offset); > - cpu_to_be32w((uint32_t*)(data + 8), refcount_table_clusters); > + data64 = cpu_to_be64(table_offset); > if (bdrv_pwrite(s->hd, offsetof(QCowHeader, refcount_table_offset), > - data, sizeof(data)) != sizeof(data)) > + &data64, sizeof(data64)) != sizeof(data64)) > + goto fail; > + data32 = cpu_to_be32(refcount_table_clusters); > + if (bdrv_pwrite(s->hd, offsetof(QCowHeader, refcount_table_clusters), > + &data32, sizeof(data32)) != sizeof(data32)) > goto fail; > qemu_free(s->refcount_table); > old_table_offset = s->refcount_table_offset; > @@ -2572,7 +2163,7 @@ > uint16_t *refcount_table; > > size = bdrv_getlength(s->hd); > - nb_clusters = size_to_clusters(s, size); > + nb_clusters = (size + s->cluster_size - 1) >> s->cluster_bits; > refcount_table = qemu_mallocz(nb_clusters * sizeof(uint16_t)); > > /* header */ > @@ -2624,7 +2215,7 @@ > int refcount; > > size = bdrv_getlength(s->hd); > - nb_clusters = size_to_clusters(s, size); > + nb_clusters = (size + s->cluster_size - 
1) >> s->cluster_bits; > for(k = 0; k < nb_clusters;) { > k1 = k; > refcount = get_refcount(bs, k); > > >