qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
* [Qemu-devel] Race condition in VMDK (QCOW*) formats.
@ 2007-01-16 13:59 Igor Lvovsky
  2007-01-16 19:35 ` Fabrice Bellard
  2007-01-16 19:35 ` Fabrice Bellard
  0 siblings, 2 replies; 8+ messages in thread
From: Igor Lvovsky @ 2007-01-16 13:59 UTC (permalink / raw)
  To: qemu-devel

[-- Attachment #1: Type: text/plain, Size: 1602 bytes --]

 Hi all,

I have doubt about the race condition during the write operation on snapshot.

I think the problem exists in VMDK and QCOW* formats (I didn't checked the others). 

 

The example from the block_vmdk.c.

 

static int vmdk_write(BlockDriverState *bs, int64_t sector_num, 

                     const uint8_t *buf, int nb_sectors)

{

    BDRVVmdkState *s = bs->opaque;

    int ret, index_in_cluster, n;

    uint64_t cluster_offset;

 

    while (nb_sectors > 0) {

        index_in_cluster = sector_num & (s->cluster_sectors - 1);

        n = s->cluster_sectors - index_in_cluster;

        if (n > nb_sectors)

            n = nb_sectors;

        cluster_offset = get_cluster_offset(bs, sector_num << 9, 1);

        if (!cluster_offset)

            return -1;

        lseek(s->fd, cluster_offset + index_in_cluster * 512, SEEK_SET);

        ret = write(s->fd, buf, n * 512);

        if (ret != n * 512)

            return -1;

        nb_sectors -= n;

        sector_num += n;

        buf += n * 512;

    }

    return 0;

}

 

The get_cluster_offset(…) routine update the L2 table of the metadata and return the cluster_offset.

After that the vmdk_write(…) routine actually write the grain at right place.

So, we have timing hole here.

 

Assume, VM that perform write operation will be destroyed at this moment.

So, we have corrupted image (with updated L2 table, but without the grain itself).

 

            Regards,

                        Igor Lvovsky

 

 

 

 

 


[-- Attachment #2: Type: text/html, Size: 9018 bytes --]

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [Qemu-devel] Race condition in VMDK (QCOW*) formats.
  2007-01-16 13:59 [Qemu-devel] Race condition in VMDK (QCOW*) formats Igor Lvovsky
@ 2007-01-16 19:35 ` Fabrice Bellard
  2007-01-16 19:35 ` Fabrice Bellard
  1 sibling, 0 replies; 8+ messages in thread
From: Fabrice Bellard @ 2007-01-16 19:35 UTC (permalink / raw)
  To: qemu-devel

Well, it was never said that the QCOW* code was safe if you interrupted 
QEMU at some point.

But I agree that it could be safer to write the sector first and update 
the links after. It could be interesting to analyze the QCOW2 snapshots 
handling too (what if QEMU is stopped during the creation of a snapshot ?).

Regards,

Fabrice.

Igor Lvovsky wrote:
> 
> 
>  Hi all,
> 
> I have doubt about the race condition during the *write operation on 
> snapshot*.
> 
> I think the problem exists in VMDK and QCOW* formats (I didn't checked 
> the others).
> 
>  
> 
> The example from the block_vmdk.c.
> 
>  
> 
> static int vmdk_write(BlockDriverState *bs, int64_t sector_num,
> 
>                      const uint8_t *buf, int nb_sectors)
> 
> {
> 
>     BDRVVmdkState *s = bs->opaque;
> 
>     int ret, index_in_cluster, n;
> 
>     uint64_t cluster_offset;
> 
>  
> 
>     while (nb_sectors > 0) {
> 
>         index_in_cluster = sector_num & (s->cluster_sectors - 1);
> 
>         n = s->cluster_sectors - index_in_cluster;
> 
>         if (n > nb_sectors)
> 
>             n = nb_sectors;
> 
>         cluster_offset = get_cluster_offset(bs, sector_num << 9, 1);
> 
>         if (!cluster_offset)
> 
>             return -1;
> 
>         lseek(s->fd, cluster_offset + index_in_cluster * 512, SEEK_SET);
> 
>         ret = write(s->fd, buf, n * 512);
> 
>         if (ret != n * 512)
> 
>             return -1;
> 
>         nb_sectors -= n;
> 
>         sector_num += n;
> 
>         buf += n * 512;
> 
>     }
> 
>     return 0;
> 
> }
> 
>  
> 
> The /get_cluster_offset(…)/ routine update the L2 table of the metadata 
> and return the /cluster_offset. /
> 
> After that the /vmdk_write(…)/ routine/ /actually write the grain at 
> right place.
> 
> So, we have timing hole here.
> 
>  
> 
> Assume, VM that perform write operation will be destroyed at this moment.
> 
> So, we have corrupted image (with updated L2 table, but without the 
> grain itself).
> 
>  
> 
>             Regards,
> 
>                         Igor Lvovsky
> 
>  
> 
>  
> 
>  
> 
>  
> 
>  
> 
> 
> ------------------------------------------------------------------------
> 
> _______________________________________________
> Qemu-devel mailing list
> Qemu-devel@nongnu.org
> http://lists.nongnu.org/mailman/listinfo/qemu-devel

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [Qemu-devel] Race condition in VMDK (QCOW*) formats.
  2007-01-16 13:59 [Qemu-devel] Race condition in VMDK (QCOW*) formats Igor Lvovsky
  2007-01-16 19:35 ` Fabrice Bellard
@ 2007-01-16 19:35 ` Fabrice Bellard
  2007-05-13 11:13   ` [Qemu-devel] [PATCH] Fix a race condition and non-leaf images growing in VMDK chains Igor Lvovsky
  1 sibling, 1 reply; 8+ messages in thread
From: Fabrice Bellard @ 2007-01-16 19:35 UTC (permalink / raw)
  To: qemu-devel

Well, it was never said that the QCOW* code was safe if you interrupted 
QEMU at some point.

But I agree that it could be safer to write the sector first and update 
the links after. It could be interesting to analyze the QCOW2 snapshots 
handling too (what if QEMU is stopped during the creation of a snapshot ?).

Regards,

Fabrice.

Igor Lvovsky wrote:
> 
> 
>  Hi all,
> 
> I have doubt about the race condition during the *write operation on 
> snapshot*.
> 
> I think the problem exists in VMDK and QCOW* formats (I didn't checked 
> the others).
> 
>  
> 
> The example from the block_vmdk.c.
> 
>  
> 
> static int vmdk_write(BlockDriverState *bs, int64_t sector_num,
> 
>                      const uint8_t *buf, int nb_sectors)
> 
> {
> 
>     BDRVVmdkState *s = bs->opaque;
> 
>     int ret, index_in_cluster, n;
> 
>     uint64_t cluster_offset;
> 
>  
> 
>     while (nb_sectors > 0) {
> 
>         index_in_cluster = sector_num & (s->cluster_sectors - 1);
> 
>         n = s->cluster_sectors - index_in_cluster;
> 
>         if (n > nb_sectors)
> 
>             n = nb_sectors;
> 
>         cluster_offset = get_cluster_offset(bs, sector_num << 9, 1);
> 
>         if (!cluster_offset)
> 
>             return -1;
> 
>         lseek(s->fd, cluster_offset + index_in_cluster * 512, SEEK_SET);
> 
>         ret = write(s->fd, buf, n * 512);
> 
>         if (ret != n * 512)
> 
>             return -1;
> 
>         nb_sectors -= n;
> 
>         sector_num += n;
> 
>         buf += n * 512;
> 
>     }
> 
>     return 0;
> 
> }
> 
>  
> 
> The /get_cluster_offset(…)/ routine update the L2 table of the metadata 
> and return the /cluster_offset. /
> 
> After that the /vmdk_write(…)/ routine/ /actually write the grain at 
> right place.
> 
> So, we have timing hole here.
> 
>  
> 
> Assume, VM that perform write operation will be destroyed at this moment.
> 
> So, we have corrupted image (with updated L2 table, but without the 
> grain itself).
> 
>  
> 
>             Regards,
> 
>                         Igor Lvovsky
> 
>  
> 
>  
> 
>  
> 
>  
> 
>  
> 
> 
> ------------------------------------------------------------------------
> 
> _______________________________________________
> Qemu-devel mailing list
> Qemu-devel@nongnu.org
> http://lists.nongnu.org/mailman/listinfo/qemu-devel

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Qemu-devel] [PATCH] Fix a race condition and non-leaf images growing in VMDK chains.
  2007-01-16 19:35 ` Fabrice Bellard
@ 2007-05-13 11:13   ` Igor Lvovsky
  2007-05-17 13:54     ` Igor Lvovsky
  0 siblings, 1 reply; 8+ messages in thread
From: Igor Lvovsky @ 2007-05-13 11:13 UTC (permalink / raw)
  To: qemu-devel

[-- Attachment #1: Type: text/plain, Size: 4084 bytes --]

 Hi,
In this patch I fixed two issues:
1. A race condition during write operations on snapshots.
	Now we write the grain of data first and the L2 metadata after.
	So, the snapshot will stay correct if the VM will be destroyed in the 	middle of the write. 
2. Non-leaf images growing during writes.
	Assume we have snapshots chain (Base->Snap1->Snap2->...->Leaf) and we      	run a VM with the latest image of this chain (leaf image).
	We have a problem with non-leaf images growing in the snapshot-chain    	(most noticeable when the VM performs aggressive writes).
	It's an incorrect behavior according to VMDK spec.
      For every write operation into an unknown offset, the active image 	must query its ancestors for this offset, and if exists in any of them 	perform a read-from-ancestor/modify/write-to-active the whole grain of 	that offset.
	The problem happened upon read-from-ancestor/modify/write-to-active 	where the ancestor was 2 or more generations above the active (leaf) 	image (not a direct parent), as its direct child was modified.
      
	Fixed by always write to the 'active' (leaf) image.

		Regards, 
			Igor Lvovsky
	 
	



-----Original Message-----
From: qemu-devel-bounces+igor.lvovsky=qumranet.com@nongnu.org [mailto:qemu-devel-bounces+igor.lvovsky=qumranet.com@nongnu.org] On Behalf Of Fabrice Bellard
Sent: Tuesday, January 16, 2007 9:36 PM
To: qemu-devel@nongnu.org
Subject: Re: [Qemu-devel] Race condition in VMDK (QCOW*) formats.

Well, it was never said that the QCOW* code was safe if you interrupted 
QEMU at some point.

But I agree that it could be safer to write the sector first and update 
the links after. It could be interesting to analyze the QCOW2 snapshots 
handling too (what if QEMU is stopped during the creation of a snapshot ?).

Regards,

Fabrice.

Igor Lvovsky wrote:
> 
> 
>  Hi all,
> 
> I have doubt about the race condition during the *write operation on 
> snapshot*.
> 
> I think the problem exists in VMDK and QCOW* formats (I didn't checked 
> the others).
> 
>  
> 
> The example from the block_vmdk.c.
> 
>  
> 
> static int vmdk_write(BlockDriverState *bs, int64_t sector_num,
> 
>                      const uint8_t *buf, int nb_sectors)
> 
> {
> 
>     BDRVVmdkState *s = bs->opaque;
> 
>     int ret, index_in_cluster, n;
> 
>     uint64_t cluster_offset;
> 
>  
> 
>     while (nb_sectors > 0) {
> 
>         index_in_cluster = sector_num & (s->cluster_sectors - 1);
> 
>         n = s->cluster_sectors - index_in_cluster;
> 
>         if (n > nb_sectors)
> 
>             n = nb_sectors;
> 
>         cluster_offset = get_cluster_offset(bs, sector_num << 9, 1);
> 
>         if (!cluster_offset)
> 
>             return -1;
> 
>         lseek(s->fd, cluster_offset + index_in_cluster * 512, SEEK_SET);
> 
>         ret = write(s->fd, buf, n * 512);
> 
>         if (ret != n * 512)
> 
>             return -1;
> 
>         nb_sectors -= n;
> 
>         sector_num += n;
> 
>         buf += n * 512;
> 
>     }
> 
>     return 0;
> 
> }
> 
>  
> 
> The /get_cluster_offset(…)/ routine update the L2 table of the metadata 
> and return the /cluster_offset. /
> 
> After that the /vmdk_write(…)/ routine/ /actually write the grain at 
> right place.
> 
> So, we have timing hole here.
> 
>  
> 
> Assume, VM that perform write operation will be destroyed at this moment.
> 
> So, we have corrupted image (with updated L2 table, but without the 
> grain itself).
> 
>  
> 
>             Regards,
> 
>                         Igor Lvovsky
> 
>  
> 
>  
> 
>  
> 
>  
> 
>  
> 
> 
> ------------------------------------------------------------------------
> 
> _______________________________________________
> Qemu-devel mailing list
> Qemu-devel@nongnu.org
> http://lists.nongnu.org/mailman/listinfo/qemu-devel




_______________________________________________
Qemu-devel mailing list
Qemu-devel@nongnu.org
http://lists.nongnu.org/mailman/listinfo/qemu-devel

[-- Attachment #2: block-vmdk.diff --]
[-- Type: application/octet-stream, Size: 9478 bytes --]

Index: block-vmdk.c
===================================================================
RCS file: /sources/qemu/qemu/block-vmdk.c,v
retrieving revision 1.10
diff -u -r1.10 block-vmdk.c
--- block-vmdk.c	24 Jan 2007 21:05:24 -0000	1.10
+++ block-vmdk.c	13 May 2007 09:01:13 -0000
@@ -75,8 +75,24 @@
 
     unsigned int cluster_sectors;
     uint32_t parent_cid;
+    int child;   // =1, if have a child
 } BDRVVmdkState;
 
+typedef struct VmdkMetaData {
+    uint32_t offset;
+    unsigned int l1_index;
+    unsigned int l2_index;
+    unsigned int l2_offset;
+    int valid;
+} VmdkMetaData;
+
+typedef struct ActiveBDRVState{
+    BlockDriverState *hd;            // active image handler
+    uint64_t cluster_offset;         // current write offset
+}ActiveBDRVState;
+ActiveBDRVState activeBDRV;
+
+
 static int vmdk_probe(const uint8_t *buf, int buf_size, const char *filename)
 {
     uint32_t magic;
@@ -305,7 +321,7 @@
         bdrv_close(bs->backing_hd);
 }
 
-
+int parent_open = 0;
 static int vmdk_parent_open(BlockDriverState *bs, const char * filename)
 {
     BDRVVmdkState *s = bs->opaque;
@@ -339,7 +355,8 @@
             bdrv_close(s->hd);
             return -1;
         }
-        if (bdrv_open(s->hd->backing_hd, parent_img_name, 0) < 0)
+        parent_open = 1;
+        if (bdrv_open(s->hd->backing_hd, parent_img_name, BDRV_O_RDONLY) < 0)
             goto failure;
     }
 
@@ -352,6 +369,11 @@
     uint32_t magic;
     int l1_size, i, ret;
 
+    if (parent_open)
+        // Parent must be open as RO.
+        flags = BDRV_O_RDONLY;  
+    fprintf(stderr, "(VMDK) image open: flags=0x%x filename=%s\n", flags, bs->filename);
+
     ret = bdrv_file_open(&s->hd, filename, flags);
     if (ret < 0)
         return ret;
@@ -386,6 +408,11 @@
             / s->l1_entry_sectors;
         s->l1_table_offset = le64_to_cpu(header.rgd_offset) << 9;
         s->l1_backup_table_offset = le64_to_cpu(header.gd_offset) << 9;
+         
+        if (parent_open)
+            s->child = 1;  // parent
+        else
+            s->child = 0;  // child
 
         // try to open parent images, if exist
         if (vmdk_parent_open(bs, filename) != 0)
@@ -430,8 +457,7 @@
     return -1;
 }
 
-static uint64_t get_cluster_offset(BlockDriverState *bs, uint64_t offset, int allocate);
-
+static uint64_t get_cluster_offset(BlockDriverState *bs, VmdkMetaData *m_data, uint64_t offset, int allocate);
 static int get_whole_cluster(BlockDriverState *bs, uint64_t cluster_offset,
                              uint64_t offset, int allocate)
 {
@@ -446,19 +472,43 @@
 
         if (!vmdk_is_cid_valid(bs))
             return -1;
-        parent_cluster_offset = get_cluster_offset(s->hd->backing_hd, offset, allocate);
-        if (bdrv_pread(ps->hd, parent_cluster_offset, whole_grain, ps->cluster_sectors*512) != 
-                                                                            ps->cluster_sectors*512)
-            return -1;
 
-        if (bdrv_pwrite(s->hd, cluster_offset << 9, whole_grain, sizeof(whole_grain)) != 
-                                                                            sizeof(whole_grain))
+        parent_cluster_offset = get_cluster_offset(s->hd->backing_hd, NULL, offset, allocate);
+
+        if (parent_cluster_offset) {
+            BDRVVmdkState *act_s = activeBDRV.hd->opaque;
+
+            if (bdrv_pread(ps->hd, parent_cluster_offset, whole_grain, ps->cluster_sectors*512) != ps->cluster_sectors*512)
+                return -1;
+
+            //Write grain only into the active image
+            if (bdrv_pwrite(act_s->hd, activeBDRV.cluster_offset << 9, whole_grain, sizeof(whole_grain)) != sizeof(whole_grain))
+                return -1;
+        }
+    }
+    return 0;
+}
+
+static int vmdk_L2update(BlockDriverState *bs, VmdkMetaData *m_data)
+{
+    BDRVVmdkState *s = bs->opaque;
+
+    /* update L2 table */
+    if (bdrv_pwrite(s->hd, ((int64_t)m_data->l2_offset * 512) + (m_data->l2_index * sizeof(m_data->offset)), 
+                    &(m_data->offset), sizeof(m_data->offset)) != sizeof(m_data->offset))
+        return -1;
+    /* update backup L2 table */
+    if (s->l1_backup_table_offset != 0) {
+        m_data->l2_offset = s->l1_backup_table[m_data->l1_index];
+        if (bdrv_pwrite(s->hd, ((int64_t)m_data->l2_offset * 512) + (m_data->l2_index * sizeof(m_data->offset)), 
+                        &(m_data->offset), sizeof(m_data->offset)) != sizeof(m_data->offset))
             return -1;
     }
+
     return 0;
 }
 
-static uint64_t get_cluster_offset(BlockDriverState *bs,
+static uint64_t get_cluster_offset(BlockDriverState *bs, VmdkMetaData *m_data,
                                    uint64_t offset, int allocate)
 {
     BDRVVmdkState *s = bs->opaque;
@@ -466,7 +516,11 @@
     int min_index, i, j;
     uint32_t min_count, *l2_table, tmp;
     uint64_t cluster_offset;
-    
+    int status;
+
+    if (m_data)
+        m_data->valid = 0;
+
     l1_index = (offset >> 9) / s->l1_entry_sectors;
     if (l1_index >= s->l1_size)
         return 0;
@@ -504,32 +558,45 @@
  found:
     l2_index = ((offset >> 9) / s->cluster_sectors) % s->l2_size;
     cluster_offset = le32_to_cpu(l2_table[l2_index]);
+
     if (!cluster_offset) {
         struct stat file_buf;
 
         if (!allocate)
             return 0;
-        stat(s->hd->filename, &file_buf);
-        cluster_offset = file_buf.st_size;
-        bdrv_truncate(s->hd, cluster_offset + (s->cluster_sectors << 9));
-
-        cluster_offset >>= 9;
-        /* update L2 table */
-        tmp = cpu_to_le32(cluster_offset);
-        l2_table[l2_index] = tmp;
-        if (bdrv_pwrite(s->hd, ((int64_t)l2_offset * 512) + (l2_index * sizeof(tmp)), 
-                        &tmp, sizeof(tmp)) != sizeof(tmp))
-            return 0;
-        /* update backup L2 table */
-        if (s->l1_backup_table_offset != 0) {
-            l2_offset = s->l1_backup_table[l1_index];
-            if (bdrv_pwrite(s->hd, ((int64_t)l2_offset * 512) + (l2_index * sizeof(tmp)), 
-                            &tmp, sizeof(tmp)) != sizeof(tmp))
+        // Avoid the L2 tables update for the images that have snapshots.
+        if (!s->child) {
+            status = stat(s->hd->filename, &file_buf);
+            if (status == -1) {
+                fprintf(stderr, "(VMDK) Fail file stat: filename =%s size=0x%lx errno=%s\n", 
+                                s->hd->filename, (uint64_t)file_buf.st_size, strerror(errno));
                 return 0;
-        }
+            }
+            cluster_offset = file_buf.st_size;
+            bdrv_truncate(s->hd, cluster_offset + (s->cluster_sectors << 9));
 
+            cluster_offset >>= 9;
+            tmp = cpu_to_le32(cluster_offset);
+            l2_table[l2_index] = tmp;
+            // Save the active image state
+            activeBDRV.cluster_offset = cluster_offset;
+            activeBDRV.hd = bs;
+        }
+        /* First of all we write grain itself, to avoid race condition
+         * that may to corrupt the image.
+         * This problem may occur because of insufficient space on host disk
+         * or inappropriate VM shutdown.
+         */
         if (get_whole_cluster(bs, cluster_offset, offset, allocate) == -1)
             return 0;
+
+        if (m_data) {
+            m_data->offset = tmp;
+            m_data->l1_index = l1_index;
+            m_data->l2_index = l2_index;
+            m_data->l2_offset = l2_offset;
+            m_data->valid = 1;
+        }
     }
     cluster_offset <<= 9;
     return cluster_offset;
@@ -542,7 +609,7 @@
     int index_in_cluster, n;
     uint64_t cluster_offset;
 
-    cluster_offset = get_cluster_offset(bs, sector_num << 9, 0);
+    cluster_offset = get_cluster_offset(bs, NULL, sector_num << 9, 0);
     index_in_cluster = sector_num % s->cluster_sectors;
     n = s->cluster_sectors - index_in_cluster;
     if (n > nb_sectors)
@@ -559,7 +626,7 @@
     uint64_t cluster_offset;
 
     while (nb_sectors > 0) {
-        cluster_offset = get_cluster_offset(bs, sector_num << 9, 0);
+        cluster_offset = get_cluster_offset(bs, NULL, sector_num << 9, 0);
         index_in_cluster = sector_num % s->cluster_sectors;
         n = s->cluster_sectors - index_in_cluster;
         if (n > nb_sectors)
@@ -590,20 +657,32 @@
                      const uint8_t *buf, int nb_sectors)
 {
     BDRVVmdkState *s = bs->opaque;
+    VmdkMetaData m_data;
     int index_in_cluster, n;
     uint64_t cluster_offset;
     static int cid_update = 0;
 
+    if (sector_num > bs->total_sectors) {
+        fprintf(stderr, "(VMDK) Wrong offset: sector_num=0x%lx total_sectors=0x%lx\n", sector_num, bs->total_sectors);
+        return -1;
+    }
+
     while (nb_sectors > 0) {
         index_in_cluster = sector_num & (s->cluster_sectors - 1);
         n = s->cluster_sectors - index_in_cluster;
         if (n > nb_sectors)
             n = nb_sectors;
-        cluster_offset = get_cluster_offset(bs, sector_num << 9, 1);
+        cluster_offset = get_cluster_offset(bs, &m_data, sector_num << 9, 1);
         if (!cluster_offset)
             return -1;
+
         if (bdrv_pwrite(s->hd, cluster_offset + index_in_cluster * 512, buf, n * 512) != n * 512)
             return -1;
+        if (m_data.valid) {
+            /* update L2 tables */
+            if (vmdk_L2update(bs, &m_data) == -1) 
+                return -1;
+        }
         nb_sectors -= n;
         sector_num += n;
         buf += n * 512;

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Qemu-devel] [PATCH] Fix a race condition and non-leaf images growing in VMDK chains.
  2007-05-13 11:13   ` [Qemu-devel] [PATCH] Fix a race condition and non-leaf images growing in VMDK chains Igor Lvovsky
@ 2007-05-17 13:54     ` Igor Lvovsky
  2007-05-19 21:39       ` Thiemo Seufer
  0 siblings, 1 reply; 8+ messages in thread
From: Igor Lvovsky @ 2007-05-17 13:54 UTC (permalink / raw)
  To: qemu-devel

[-- Attachment #1: Type: text/plain, Size: 4562 bytes --]


	Hi,
The bug was in my last patch.
This is a new one.
I'll very appreciate any comments.

    Regards,
		Igor Lvovsky




-----Original Message-----
From: qemu-devel-bounces+igor.lvovsky=qumranet.com@nongnu.org [mailto:qemu-devel-bounces+igor.lvovsky=qumranet.com@nongnu.org] On Behalf Of Igor Lvovsky
Sent: Sunday, May 13, 2007 2:13 PM
To: qemu-devel@nongnu.org
Subject: [Qemu-devel] [PATCH] Fix a race condition and non-leaf imagesgrowing in VMDK chains. 

 Hi,
In this patch I fixed two issues:
1. A race condition during write operations on snapshots.
	Now we write the grain of data first and the L2 metadata after.
	So, the snapshot will stay correct if the VM will be destroyed in the 	middle of the write. 
2. Non-leaf images growing during writes.
	Assume we have snapshots chain (Base->Snap1->Snap2->...->Leaf) and we      	run a VM with the latest image of this chain (leaf image).
	We have a problem with non-leaf images growing in the snapshot-chain    	(most noticeable when the VM performs aggressive writes).
	It's an incorrect behavior according to VMDK spec.
      For every write operation into an unknown offset, the active image 	must query its ancestors for this offset, and if exists in any of them 	perform a read-from-ancestor/modify/write-to-active the whole grain of 	that offset.
	The problem happened upon read-from-ancestor/modify/write-to-active 	where the ancestor was 2 or more generations above the active (leaf) 	image (not a direct parent), as its direct child was modified.
      
	Fixed by always write to the 'active' (leaf) image.

		Regards, 
			Igor Lvovsky
	 
	



-----Original Message-----
From: qemu-devel-bounces+igor.lvovsky=qumranet.com@nongnu.org [mailto:qemu-devel-bounces+igor.lvovsky=qumranet.com@nongnu.org] On Behalf Of Fabrice Bellard
Sent: Tuesday, January 16, 2007 9:36 PM
To: qemu-devel@nongnu.org
Subject: Re: [Qemu-devel] Race condition in VMDK (QCOW*) formats.

Well, it was never said that the QCOW* code was safe if you interrupted 
QEMU at some point.

But I agree that it could be safer to write the sector first and update 
the links after. It could be interesting to analyze the QCOW2 snapshots 
handling too (what if QEMU is stopped during the creation of a snapshot ?).

Regards,

Fabrice.

Igor Lvovsky wrote:
> 
> 
>  Hi all,
> 
> I have doubt about the race condition during the *write operation on 
> snapshot*.
> 
> I think the problem exists in VMDK and QCOW* formats (I didn't checked 
> the others).
> 
>  
> 
> The example from the block_vmdk.c.
> 
>  
> 
> static int vmdk_write(BlockDriverState *bs, int64_t sector_num,
> 
>                      const uint8_t *buf, int nb_sectors)
> 
> {
> 
>     BDRVVmdkState *s = bs->opaque;
> 
>     int ret, index_in_cluster, n;
> 
>     uint64_t cluster_offset;
> 
>  
> 
>     while (nb_sectors > 0) {
> 
>         index_in_cluster = sector_num & (s->cluster_sectors - 1);
> 
>         n = s->cluster_sectors - index_in_cluster;
> 
>         if (n > nb_sectors)
> 
>             n = nb_sectors;
> 
>         cluster_offset = get_cluster_offset(bs, sector_num << 9, 1);
> 
>         if (!cluster_offset)
> 
>             return -1;
> 
>         lseek(s->fd, cluster_offset + index_in_cluster * 512, SEEK_SET);
> 
>         ret = write(s->fd, buf, n * 512);
> 
>         if (ret != n * 512)
> 
>             return -1;
> 
>         nb_sectors -= n;
> 
>         sector_num += n;
> 
>         buf += n * 512;
> 
>     }
> 
>     return 0;
> 
> }
> 
>  
> 
> The /get_cluster_offset(…)/ routine update the L2 table of the metadata 
> and return the /cluster_offset. /
> 
> After that the /vmdk_write(…)/ routine/ /actually write the grain at 
> right place.
> 
> So, we have timing hole here.
> 
>  
> 
> Assume, VM that perform write operation will be destroyed at this moment.
> 
> So, we have corrupted image (with updated L2 table, but without the 
> grain itself).
> 
>  
> 
>             Regards,
> 
>                         Igor Lvovsky
> 
>  
> 
>  
> 
>  
> 
>  
> 
>  
> 
> 
> ------------------------------------------------------------------------
> 
> _______________________________________________
> Qemu-devel mailing list
> Qemu-devel@nongnu.org
> http://lists.nongnu.org/mailman/listinfo/qemu-devel




_______________________________________________
Qemu-devel mailing list
Qemu-devel@nongnu.org
http://lists.nongnu.org/mailman/listinfo/qemu-devel

[-- Attachment #2: block-vmdk.diff --]
[-- Type: application/octet-stream, Size: 9520 bytes --]

Index: block-vmdk.c
===================================================================
RCS file: /sources/qemu/qemu/block-vmdk.c,v
retrieving revision 1.10
diff -u -r1.10 block-vmdk.c
--- block-vmdk.c	24 Jan 2007 21:05:24 -0000	1.10
+++ block-vmdk.c	17 May 2007 13:40:44 -0000
@@ -75,8 +75,24 @@
 
     unsigned int cluster_sectors;
     uint32_t parent_cid;
+    int child;   // =1, if have a child
 } BDRVVmdkState;
 
+typedef struct VmdkMetaData {
+    uint32_t offset;
+    unsigned int l1_index;
+    unsigned int l2_index;
+    unsigned int l2_offset;
+    int valid;
+} VmdkMetaData;
+
+typedef struct ActiveBDRVState{
+    BlockDriverState *hd;            // active image handler
+    uint64_t cluster_offset;         // current write offset
+}ActiveBDRVState;
+ActiveBDRVState activeBDRV;
+
+
 static int vmdk_probe(const uint8_t *buf, int buf_size, const char *filename)
 {
     uint32_t magic;
@@ -305,7 +321,7 @@
         bdrv_close(bs->backing_hd);
 }
 
-
+int parent_open = 0;
 static int vmdk_parent_open(BlockDriverState *bs, const char * filename)
 {
     BDRVVmdkState *s = bs->opaque;
@@ -339,8 +355,10 @@
             bdrv_close(s->hd);
             return -1;
         }
-        if (bdrv_open(s->hd->backing_hd, parent_img_name, 0) < 0)
+        parent_open = 1;
+        if (bdrv_open(s->hd->backing_hd, parent_img_name, BDRV_O_RDONLY) < 0)
             goto failure;
+        parent_open = 0;
     }
 
     return 0;
@@ -352,6 +370,11 @@
     uint32_t magic;
     int l1_size, i, ret;
 
+    if (parent_open)
+        // Parent must be open as RO.
+        flags = BDRV_O_RDONLY;  
+    fprintf(stderr, "(VMDK) image open: flags=0x%x filename=%s\n", flags, bs->filename);
+
     ret = bdrv_file_open(&s->hd, filename, flags);
     if (ret < 0)
         return ret;
@@ -386,6 +409,11 @@
             / s->l1_entry_sectors;
         s->l1_table_offset = le64_to_cpu(header.rgd_offset) << 9;
         s->l1_backup_table_offset = le64_to_cpu(header.gd_offset) << 9;
+         
+        if (parent_open)
+            s->child = 1;  // parent
+        else
+            s->child = 0;  // child
 
         // try to open parent images, if exist
         if (vmdk_parent_open(bs, filename) != 0)
@@ -430,8 +458,7 @@
     return -1;
 }
 
-static uint64_t get_cluster_offset(BlockDriverState *bs, uint64_t offset, int allocate);
-
+static uint64_t get_cluster_offset(BlockDriverState *bs, VmdkMetaData *m_data, uint64_t offset, int allocate);
 static int get_whole_cluster(BlockDriverState *bs, uint64_t cluster_offset,
                              uint64_t offset, int allocate)
 {
@@ -446,19 +473,43 @@
 
         if (!vmdk_is_cid_valid(bs))
             return -1;
-        parent_cluster_offset = get_cluster_offset(s->hd->backing_hd, offset, allocate);
-        if (bdrv_pread(ps->hd, parent_cluster_offset, whole_grain, ps->cluster_sectors*512) != 
-                                                                            ps->cluster_sectors*512)
-            return -1;
 
-        if (bdrv_pwrite(s->hd, cluster_offset << 9, whole_grain, sizeof(whole_grain)) != 
-                                                                            sizeof(whole_grain))
+        parent_cluster_offset = get_cluster_offset(s->hd->backing_hd, NULL, offset, allocate);
+
+        if (parent_cluster_offset) {
+            BDRVVmdkState *act_s = activeBDRV.hd->opaque;
+
+            if (bdrv_pread(ps->hd, parent_cluster_offset, whole_grain, ps->cluster_sectors*512) != ps->cluster_sectors*512)
+                return -1;
+
+            //Write grain only into the active image
+            if (bdrv_pwrite(act_s->hd, activeBDRV.cluster_offset << 9, whole_grain, sizeof(whole_grain)) != sizeof(whole_grain))
+                return -1;
+        }
+    }
+    return 0;
+}
+
+static int vmdk_L2update(BlockDriverState *bs, VmdkMetaData *m_data)
+{
+    BDRVVmdkState *s = bs->opaque;
+
+    /* update L2 table */
+    if (bdrv_pwrite(s->hd, ((int64_t)m_data->l2_offset * 512) + (m_data->l2_index * sizeof(m_data->offset)), 
+                    &(m_data->offset), sizeof(m_data->offset)) != sizeof(m_data->offset))
+        return -1;
+    /* update backup L2 table */
+    if (s->l1_backup_table_offset != 0) {
+        m_data->l2_offset = s->l1_backup_table[m_data->l1_index];
+        if (bdrv_pwrite(s->hd, ((int64_t)m_data->l2_offset * 512) + (m_data->l2_index * sizeof(m_data->offset)), 
+                        &(m_data->offset), sizeof(m_data->offset)) != sizeof(m_data->offset))
             return -1;
     }
+
     return 0;
 }
 
-static uint64_t get_cluster_offset(BlockDriverState *bs,
+static uint64_t get_cluster_offset(BlockDriverState *bs, VmdkMetaData *m_data,
                                    uint64_t offset, int allocate)
 {
     BDRVVmdkState *s = bs->opaque;
@@ -466,7 +517,11 @@
     int min_index, i, j;
     uint32_t min_count, *l2_table, tmp;
     uint64_t cluster_offset;
-    
+    int status;
+
+    if (m_data)
+        m_data->valid = 0;
+
     l1_index = (offset >> 9) / s->l1_entry_sectors;
     if (l1_index >= s->l1_size)
         return 0;
@@ -504,32 +559,45 @@
  found:
     l2_index = ((offset >> 9) / s->cluster_sectors) % s->l2_size;
     cluster_offset = le32_to_cpu(l2_table[l2_index]);
+
     if (!cluster_offset) {
         struct stat file_buf;
 
         if (!allocate)
             return 0;
-        stat(s->hd->filename, &file_buf);
-        cluster_offset = file_buf.st_size;
-        bdrv_truncate(s->hd, cluster_offset + (s->cluster_sectors << 9));
-
-        cluster_offset >>= 9;
-        /* update L2 table */
-        tmp = cpu_to_le32(cluster_offset);
-        l2_table[l2_index] = tmp;
-        if (bdrv_pwrite(s->hd, ((int64_t)l2_offset * 512) + (l2_index * sizeof(tmp)), 
-                        &tmp, sizeof(tmp)) != sizeof(tmp))
-            return 0;
-        /* update backup L2 table */
-        if (s->l1_backup_table_offset != 0) {
-            l2_offset = s->l1_backup_table[l1_index];
-            if (bdrv_pwrite(s->hd, ((int64_t)l2_offset * 512) + (l2_index * sizeof(tmp)), 
-                            &tmp, sizeof(tmp)) != sizeof(tmp))
+        // Avoid the L2 tables update for the images that have snapshots.
+        if (!s->child) {
+            status = stat(s->hd->filename, &file_buf);
+            if (status == -1) {
+                fprintf(stderr, "(VMDK) Fail file stat: filename =%s size=0x%lx errno=%s\n", 
+                                s->hd->filename, (uint64_t)file_buf.st_size, strerror(errno));
                 return 0;
-        }
+            }
+            cluster_offset = file_buf.st_size;
+            bdrv_truncate(s->hd, cluster_offset + (s->cluster_sectors << 9));
 
+            cluster_offset >>= 9;
+            tmp = cpu_to_le32(cluster_offset);
+            l2_table[l2_index] = tmp;
+            // Save the active image state
+            activeBDRV.cluster_offset = cluster_offset;
+            activeBDRV.hd = bs;
+        }
+        /* First of all we write grain itself, to avoid race condition
+         * that may to corrupt the image.
+         * This problem may occur because of insufficient space on host disk
+         * or inappropriate VM shutdown.
+         */
         if (get_whole_cluster(bs, cluster_offset, offset, allocate) == -1)
             return 0;
+
+        if (m_data) {
+            m_data->offset = tmp;
+            m_data->l1_index = l1_index;
+            m_data->l2_index = l2_index;
+            m_data->l2_offset = l2_offset;
+            m_data->valid = 1;
+        }
     }
     cluster_offset <<= 9;
     return cluster_offset;
@@ -542,7 +610,7 @@
     int index_in_cluster, n;
     uint64_t cluster_offset;
 
-    cluster_offset = get_cluster_offset(bs, sector_num << 9, 0);
+    cluster_offset = get_cluster_offset(bs, NULL, sector_num << 9, 0);
     index_in_cluster = sector_num % s->cluster_sectors;
     n = s->cluster_sectors - index_in_cluster;
     if (n > nb_sectors)
@@ -559,7 +627,7 @@
     uint64_t cluster_offset;
 
     while (nb_sectors > 0) {
-        cluster_offset = get_cluster_offset(bs, sector_num << 9, 0);
+        cluster_offset = get_cluster_offset(bs, NULL, sector_num << 9, 0);
         index_in_cluster = sector_num % s->cluster_sectors;
         n = s->cluster_sectors - index_in_cluster;
         if (n > nb_sectors)
@@ -590,20 +658,32 @@
                      const uint8_t *buf, int nb_sectors)
 {
     BDRVVmdkState *s = bs->opaque;
+    VmdkMetaData m_data;
     int index_in_cluster, n;
     uint64_t cluster_offset;
     static int cid_update = 0;
 
+    if (sector_num > bs->total_sectors) {
+        fprintf(stderr, "(VMDK) Wrong offset: sector_num=0x%lx total_sectors=0x%lx\n", sector_num, bs->total_sectors);
+        return -1;
+    }
+
     while (nb_sectors > 0) {
         index_in_cluster = sector_num & (s->cluster_sectors - 1);
         n = s->cluster_sectors - index_in_cluster;
         if (n > nb_sectors)
             n = nb_sectors;
-        cluster_offset = get_cluster_offset(bs, sector_num << 9, 1);
+        cluster_offset = get_cluster_offset(bs, &m_data, sector_num << 9, 1);
         if (!cluster_offset)
             return -1;
+
         if (bdrv_pwrite(s->hd, cluster_offset + index_in_cluster * 512, buf, n * 512) != n * 512)
             return -1;
+        if (m_data.valid) {
+            /* update L2 tables */
+            if (vmdk_L2update(bs, &m_data) == -1) 
+                return -1;
+        }
         nb_sectors -= n;
         sector_num += n;
         buf += n * 512;

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [Qemu-devel] [PATCH] Fix a race condition and non-leaf images growing in VMDK chains.
  2007-05-17 13:54     ` Igor Lvovsky
@ 2007-05-19 21:39       ` Thiemo Seufer
  2007-05-19 21:42         ` Thiemo Seufer
  0 siblings, 1 reply; 8+ messages in thread
From: Thiemo Seufer @ 2007-05-19 21:39 UTC (permalink / raw)
  To: Igor Lvovsky; +Cc: qemu-devel

[-- Attachment #1: Type: text/plain, Size: 448 bytes --]

Igor Lvovsky wrote:
> 
> 	Hi,
> The bug was in my last patch.
> This is a new one.
> I'll very appreciate any comments.

Appended is a variant which gets rid of the confusing "child" variable
which means "is_parent", and the duplicate of it (parent_open).

There was also an uninitialized variable "tmp" which was stored as
offset in m_data.

I quite possibly missed some bits which broke the patch, please test if
it still works for you.


Thiemo

[-- Attachment #2: block-vdmk.diff --]
[-- Type: text/x-diff, Size: 9259 bytes --]

Index: block-vmdk.c
===================================================================
--- block-vmdk.c.orig	2007-01-30 20:37:51.000000000 +0000
+++ block-vmdk.c	2007-05-19 22:32:01.000000000 +0100
@@ -75,8 +75,25 @@
 
     unsigned int cluster_sectors;
     uint32_t parent_cid;
+    int is_parent;
 } BDRVVmdkState;
 
+typedef struct VmdkMetaData {
+    uint32_t offset;
+    unsigned int l1_index;
+    unsigned int l2_index;
+    unsigned int l2_offset;
+    int valid;
+} VmdkMetaData;
+
+typedef struct ActiveBDRVState{
+    BlockDriverState *hd;            // active image handler
+    uint64_t cluster_offset;         // current write offset
+}ActiveBDRVState;
+
+static ActiveBDRVState activeBDRV;
+
+
 static int vmdk_probe(const uint8_t *buf, int buf_size, const char *filename)
 {
     uint32_t magic;
@@ -305,7 +322,6 @@
         bdrv_close(bs->backing_hd);
 }
 
-
 static int vmdk_parent_open(BlockDriverState *bs, const char * filename)
 {
     BDRVVmdkState *s = bs->opaque;
@@ -336,11 +352,14 @@
         s->hd->backing_hd = bdrv_new("");
         if (!s->hd->backing_hd) {
             failure:
+            s->is_parent = 0;
             bdrv_close(s->hd);
             return -1;
         }
-        if (bdrv_open(s->hd->backing_hd, parent_img_name, 0) < 0)
+        s->is_parent = 1;
+        if (bdrv_open(s->hd->backing_hd, parent_img_name, BDRV_O_RDONLY) < 0)
             goto failure;
+        s->is_parent = 0;
     }
 
     return 0;
@@ -352,6 +371,11 @@
     uint32_t magic;
     int l1_size, i, ret;
 
+    if (s->is_parent)
+        // Parent must be opened as RO.
+        flags = BDRV_O_RDONLY;
+    fprintf(stderr, "(VMDK) image open: flags=0x%x filename=%s\n", flags, bs->filename);
+
     ret = bdrv_file_open(&s->hd, filename, flags);
     if (ret < 0)
         return ret;
@@ -430,7 +454,8 @@
     return -1;
 }
 
-static uint64_t get_cluster_offset(BlockDriverState *bs, uint64_t offset, int allocate);
+static uint64_t get_cluster_offset(BlockDriverState *bs, VmdkMetaData *m_data,
+                                   uint64_t offset, int allocate);
 
 static int get_whole_cluster(BlockDriverState *bs, uint64_t cluster_offset,
                              uint64_t offset, int allocate)
@@ -446,27 +471,55 @@
 
         if (!vmdk_is_cid_valid(bs))
             return -1;
-        parent_cluster_offset = get_cluster_offset(s->hd->backing_hd, offset, allocate);
-        if (bdrv_pread(ps->hd, parent_cluster_offset, whole_grain, ps->cluster_sectors*512) != 
-                                                                            ps->cluster_sectors*512)
-            return -1;
 
-        if (bdrv_pwrite(s->hd, cluster_offset << 9, whole_grain, sizeof(whole_grain)) != 
-                                                                            sizeof(whole_grain))
+        parent_cluster_offset = get_cluster_offset(s->hd->backing_hd, NULL, offset, allocate);
+
+        if (parent_cluster_offset) {
+            BDRVVmdkState *act_s = activeBDRV.hd->opaque;
+
+            if (bdrv_pread(ps->hd, parent_cluster_offset, whole_grain, ps->cluster_sectors*512) != ps->cluster_sectors*512)
+                return -1;
+
+            //Write grain only into the active image
+            if (bdrv_pwrite(act_s->hd, activeBDRV.cluster_offset << 9, whole_grain, sizeof(whole_grain)) != sizeof(whole_grain))
+                return -1;
+        }
+    }
+    return 0;
+}
+
+static int vmdk_L2update(BlockDriverState *bs, VmdkMetaData *m_data)
+{
+    BDRVVmdkState *s = bs->opaque;
+
+    /* update L2 table */
+    if (bdrv_pwrite(s->hd, ((int64_t)m_data->l2_offset * 512) + (m_data->l2_index * sizeof(m_data->offset)),
+                    &(m_data->offset), sizeof(m_data->offset)) != sizeof(m_data->offset))
+        return -1;
+    /* update backup L2 table */
+    if (s->l1_backup_table_offset != 0) {
+        m_data->l2_offset = s->l1_backup_table[m_data->l1_index];
+        if (bdrv_pwrite(s->hd, ((int64_t)m_data->l2_offset * 512) + (m_data->l2_index * sizeof(m_data->offset)),
+                        &(m_data->offset), sizeof(m_data->offset)) != sizeof(m_data->offset))
             return -1;
     }
+
     return 0;
 }
 
-static uint64_t get_cluster_offset(BlockDriverState *bs,
+static uint64_t get_cluster_offset(BlockDriverState *bs, VmdkMetaData *m_data,
                                    uint64_t offset, int allocate)
 {
     BDRVVmdkState *s = bs->opaque;
     unsigned int l1_index, l2_offset, l2_index;
     int min_index, i, j;
-    uint32_t min_count, *l2_table, tmp;
+    uint32_t min_count, *l2_table, tmp = 0;
     uint64_t cluster_offset;
-    
+    int status;
+
+    if (m_data)
+        m_data->valid = 0;
+
     l1_index = (offset >> 9) / s->l1_entry_sectors;
     if (l1_index >= s->l1_size)
         return 0;
@@ -504,32 +557,45 @@
  found:
     l2_index = ((offset >> 9) / s->cluster_sectors) % s->l2_size;
     cluster_offset = le32_to_cpu(l2_table[l2_index]);
+
     if (!cluster_offset) {
         struct stat file_buf;
 
         if (!allocate)
             return 0;
-        stat(s->hd->filename, &file_buf);
-        cluster_offset = file_buf.st_size;
-        bdrv_truncate(s->hd, cluster_offset + (s->cluster_sectors << 9));
-
-        cluster_offset >>= 9;
-        /* update L2 table */
-        tmp = cpu_to_le32(cluster_offset);
-        l2_table[l2_index] = tmp;
-        if (bdrv_pwrite(s->hd, ((int64_t)l2_offset * 512) + (l2_index * sizeof(tmp)), 
-                        &tmp, sizeof(tmp)) != sizeof(tmp))
-            return 0;
-        /* update backup L2 table */
-        if (s->l1_backup_table_offset != 0) {
-            l2_offset = s->l1_backup_table[l1_index];
-            if (bdrv_pwrite(s->hd, ((int64_t)l2_offset * 512) + (l2_index * sizeof(tmp)), 
-                            &tmp, sizeof(tmp)) != sizeof(tmp))
+        // Avoid the L2 tables update for the images that have snapshots.
+        if (!s->is_parent) {
+            status = stat(s->hd->filename, &file_buf);
+            if (status == -1) {
+                fprintf(stderr, "(VMDK) Fail file stat: filename =%s size=0x%lx errno=%s\n",
+                                s->hd->filename, (uint64_t)file_buf.st_size, strerror(errno));
                 return 0;
-        }
+            }
+            cluster_offset = file_buf.st_size;
+            bdrv_truncate(s->hd, cluster_offset + (s->cluster_sectors << 9));
 
+            cluster_offset >>= 9;
+            tmp = cpu_to_le32(cluster_offset);
+            l2_table[l2_index] = tmp;
+            // Save the active image state
+            activeBDRV.cluster_offset = cluster_offset;
+            activeBDRV.hd = bs;
+        }
+        /* First of all we write grain itself, to avoid race condition
+         * that may to corrupt the image.
+         * This problem may occur because of insufficient space on host disk
+         * or inappropriate VM shutdown.
+         */
         if (get_whole_cluster(bs, cluster_offset, offset, allocate) == -1)
             return 0;
+
+        if (m_data) {
+            m_data->offset = tmp;
+            m_data->l1_index = l1_index;
+            m_data->l2_index = l2_index;
+            m_data->l2_offset = l2_offset;
+            m_data->valid = 1;
+        }
     }
     cluster_offset <<= 9;
     return cluster_offset;
@@ -542,7 +608,7 @@
     int index_in_cluster, n;
     uint64_t cluster_offset;
 
-    cluster_offset = get_cluster_offset(bs, sector_num << 9, 0);
+    cluster_offset = get_cluster_offset(bs, NULL, sector_num << 9, 0);
     index_in_cluster = sector_num % s->cluster_sectors;
     n = s->cluster_sectors - index_in_cluster;
     if (n > nb_sectors)
@@ -559,7 +625,7 @@
     uint64_t cluster_offset;
 
     while (nb_sectors > 0) {
-        cluster_offset = get_cluster_offset(bs, sector_num << 9, 0);
+        cluster_offset = get_cluster_offset(bs, NULL, sector_num << 9, 0);
         index_in_cluster = sector_num % s->cluster_sectors;
         n = s->cluster_sectors - index_in_cluster;
         if (n > nb_sectors)
@@ -590,20 +656,34 @@
                      const uint8_t *buf, int nb_sectors)
 {
     BDRVVmdkState *s = bs->opaque;
+    VmdkMetaData m_data;
     int index_in_cluster, n;
     uint64_t cluster_offset;
     static int cid_update = 0;
 
+    if (sector_num > bs->total_sectors) {
+        fprintf(stderr,
+                "(VMDK) Wrong offset: sector_num=0x%lx total_sectors=0x%lx\n",
+                sector_num, bs->total_sectors);
+        return -1;
+    }
+
     while (nb_sectors > 0) {
         index_in_cluster = sector_num & (s->cluster_sectors - 1);
         n = s->cluster_sectors - index_in_cluster;
         if (n > nb_sectors)
             n = nb_sectors;
-        cluster_offset = get_cluster_offset(bs, sector_num << 9, 1);
+        cluster_offset = get_cluster_offset(bs, &m_data, sector_num << 9, 1);
         if (!cluster_offset)
             return -1;
+
         if (bdrv_pwrite(s->hd, cluster_offset + index_in_cluster * 512, buf, n * 512) != n * 512)
             return -1;
+        if (m_data.valid) {
+            /* update L2 tables */
+            if (vmdk_L2update(bs, &m_data) == -1)
+                return -1;
+        }
         nb_sectors -= n;
         sector_num += n;
         buf += n * 512;

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [Qemu-devel] [PATCH] Fix a race condition and non-leaf images growing in VMDK chains.
  2007-05-19 21:39       ` Thiemo Seufer
@ 2007-05-19 21:42         ` Thiemo Seufer
  2007-05-20 11:40           ` Igor Lvovsky
  0 siblings, 1 reply; 8+ messages in thread
From: Thiemo Seufer @ 2007-05-19 21:42 UTC (permalink / raw)
  To: Igor Lvovsky; +Cc: qemu-devel

[-- Attachment #1: Type: text/plain, Size: 558 bytes --]

Thiemo Seufer wrote:
> Igor Lvovsky wrote:
> > 
> > 	Hi,
> > The bug was in my last patch.
> > This is a new one.
> > I'll very appreciate any comments.
> 
> Appended is a variant which gets rid of the confusing "child" variable
> which means "is_parent", and the duplicate of it (parent_open).
> 
> There was also an uninitialized variable "tmp" which was stored as
> offset in m_data.
> 
> I quite possibly missed some bits which broke the patch, please test if
> it still works for you.

Next try, this time with the latest version of the patch.


Thiemo

[-- Attachment #2: lvovsky.block-vdmk.diff --]
[-- Type: text/x-diff, Size: 9259 bytes --]

Index: block-vmdk.c
===================================================================
--- block-vmdk.c.orig	2007-01-30 20:37:51.000000000 +0000
+++ block-vmdk.c	2007-05-19 22:32:01.000000000 +0100
@@ -75,8 +75,25 @@
 
     unsigned int cluster_sectors;
     uint32_t parent_cid;
+    int is_parent;
 } BDRVVmdkState;
 
+typedef struct VmdkMetaData {
+    uint32_t offset;
+    unsigned int l1_index;
+    unsigned int l2_index;
+    unsigned int l2_offset;
+    int valid;
+} VmdkMetaData;
+
+typedef struct ActiveBDRVState{
+    BlockDriverState *hd;            // active image handler
+    uint64_t cluster_offset;         // current write offset
+}ActiveBDRVState;
+
+static ActiveBDRVState activeBDRV;
+
+
 static int vmdk_probe(const uint8_t *buf, int buf_size, const char *filename)
 {
     uint32_t magic;
@@ -305,7 +322,6 @@
         bdrv_close(bs->backing_hd);
 }
 
-
 static int vmdk_parent_open(BlockDriverState *bs, const char * filename)
 {
     BDRVVmdkState *s = bs->opaque;
@@ -336,11 +352,14 @@
         s->hd->backing_hd = bdrv_new("");
         if (!s->hd->backing_hd) {
             failure:
+            s->is_parent = 0;
             bdrv_close(s->hd);
             return -1;
         }
-        if (bdrv_open(s->hd->backing_hd, parent_img_name, 0) < 0)
+        s->is_parent = 1;
+        if (bdrv_open(s->hd->backing_hd, parent_img_name, BDRV_O_RDONLY) < 0)
             goto failure;
+        s->is_parent = 0;
     }
 
     return 0;
@@ -352,6 +371,11 @@
     uint32_t magic;
     int l1_size, i, ret;
 
+    if (s->is_parent)
+        // Parent must be opened as RO.
+        flags = BDRV_O_RDONLY;
+    fprintf(stderr, "(VMDK) image open: flags=0x%x filename=%s\n", flags, bs->filename);
+
     ret = bdrv_file_open(&s->hd, filename, flags);
     if (ret < 0)
         return ret;
@@ -430,7 +454,8 @@
     return -1;
 }
 
-static uint64_t get_cluster_offset(BlockDriverState *bs, uint64_t offset, int allocate);
+static uint64_t get_cluster_offset(BlockDriverState *bs, VmdkMetaData *m_data,
+                                   uint64_t offset, int allocate);
 
 static int get_whole_cluster(BlockDriverState *bs, uint64_t cluster_offset,
                              uint64_t offset, int allocate)
@@ -446,27 +471,55 @@
 
         if (!vmdk_is_cid_valid(bs))
             return -1;
-        parent_cluster_offset = get_cluster_offset(s->hd->backing_hd, offset, allocate);
-        if (bdrv_pread(ps->hd, parent_cluster_offset, whole_grain, ps->cluster_sectors*512) != 
-                                                                            ps->cluster_sectors*512)
-            return -1;
 
-        if (bdrv_pwrite(s->hd, cluster_offset << 9, whole_grain, sizeof(whole_grain)) != 
-                                                                            sizeof(whole_grain))
+        parent_cluster_offset = get_cluster_offset(s->hd->backing_hd, NULL, offset, allocate);
+
+        if (parent_cluster_offset) {
+            BDRVVmdkState *act_s = activeBDRV.hd->opaque;
+
+            if (bdrv_pread(ps->hd, parent_cluster_offset, whole_grain, ps->cluster_sectors*512) != ps->cluster_sectors*512)
+                return -1;
+
+            //Write grain only into the active image
+            if (bdrv_pwrite(act_s->hd, activeBDRV.cluster_offset << 9, whole_grain, sizeof(whole_grain)) != sizeof(whole_grain))
+                return -1;
+        }
+    }
+    return 0;
+}
+
+static int vmdk_L2update(BlockDriverState *bs, VmdkMetaData *m_data)
+{
+    BDRVVmdkState *s = bs->opaque;
+
+    /* update L2 table */
+    if (bdrv_pwrite(s->hd, ((int64_t)m_data->l2_offset * 512) + (m_data->l2_index * sizeof(m_data->offset)),
+                    &(m_data->offset), sizeof(m_data->offset)) != sizeof(m_data->offset))
+        return -1;
+    /* update backup L2 table */
+    if (s->l1_backup_table_offset != 0) {
+        m_data->l2_offset = s->l1_backup_table[m_data->l1_index];
+        if (bdrv_pwrite(s->hd, ((int64_t)m_data->l2_offset * 512) + (m_data->l2_index * sizeof(m_data->offset)),
+                        &(m_data->offset), sizeof(m_data->offset)) != sizeof(m_data->offset))
             return -1;
     }
+
     return 0;
 }
 
-static uint64_t get_cluster_offset(BlockDriverState *bs,
+static uint64_t get_cluster_offset(BlockDriverState *bs, VmdkMetaData *m_data,
                                    uint64_t offset, int allocate)
 {
     BDRVVmdkState *s = bs->opaque;
     unsigned int l1_index, l2_offset, l2_index;
     int min_index, i, j;
-    uint32_t min_count, *l2_table, tmp;
+    uint32_t min_count, *l2_table, tmp = 0;
     uint64_t cluster_offset;
-    
+    int status;
+
+    if (m_data)
+        m_data->valid = 0;
+
     l1_index = (offset >> 9) / s->l1_entry_sectors;
     if (l1_index >= s->l1_size)
         return 0;
@@ -504,32 +557,45 @@
  found:
     l2_index = ((offset >> 9) / s->cluster_sectors) % s->l2_size;
     cluster_offset = le32_to_cpu(l2_table[l2_index]);
+
     if (!cluster_offset) {
         struct stat file_buf;
 
         if (!allocate)
             return 0;
-        stat(s->hd->filename, &file_buf);
-        cluster_offset = file_buf.st_size;
-        bdrv_truncate(s->hd, cluster_offset + (s->cluster_sectors << 9));
-
-        cluster_offset >>= 9;
-        /* update L2 table */
-        tmp = cpu_to_le32(cluster_offset);
-        l2_table[l2_index] = tmp;
-        if (bdrv_pwrite(s->hd, ((int64_t)l2_offset * 512) + (l2_index * sizeof(tmp)), 
-                        &tmp, sizeof(tmp)) != sizeof(tmp))
-            return 0;
-        /* update backup L2 table */
-        if (s->l1_backup_table_offset != 0) {
-            l2_offset = s->l1_backup_table[l1_index];
-            if (bdrv_pwrite(s->hd, ((int64_t)l2_offset * 512) + (l2_index * sizeof(tmp)), 
-                            &tmp, sizeof(tmp)) != sizeof(tmp))
+        // Avoid the L2 tables update for the images that have snapshots.
+        if (!s->is_parent) {
+            status = stat(s->hd->filename, &file_buf);
+            if (status == -1) {
+                fprintf(stderr, "(VMDK) Fail file stat: filename =%s size=0x%lx errno=%s\n",
+                                s->hd->filename, (uint64_t)file_buf.st_size, strerror(errno));
                 return 0;
-        }
+            }
+            cluster_offset = file_buf.st_size;
+            bdrv_truncate(s->hd, cluster_offset + (s->cluster_sectors << 9));
 
+            cluster_offset >>= 9;
+            tmp = cpu_to_le32(cluster_offset);
+            l2_table[l2_index] = tmp;
+            // Save the active image state
+            activeBDRV.cluster_offset = cluster_offset;
+            activeBDRV.hd = bs;
+        }
+        /* First of all we write grain itself, to avoid race condition
+         * that may to corrupt the image.
+         * This problem may occur because of insufficient space on host disk
+         * or inappropriate VM shutdown.
+         */
         if (get_whole_cluster(bs, cluster_offset, offset, allocate) == -1)
             return 0;
+
+        if (m_data) {
+            m_data->offset = tmp;
+            m_data->l1_index = l1_index;
+            m_data->l2_index = l2_index;
+            m_data->l2_offset = l2_offset;
+            m_data->valid = 1;
+        }
     }
     cluster_offset <<= 9;
     return cluster_offset;
@@ -542,7 +608,7 @@
     int index_in_cluster, n;
     uint64_t cluster_offset;
 
-    cluster_offset = get_cluster_offset(bs, sector_num << 9, 0);
+    cluster_offset = get_cluster_offset(bs, NULL, sector_num << 9, 0);
     index_in_cluster = sector_num % s->cluster_sectors;
     n = s->cluster_sectors - index_in_cluster;
     if (n > nb_sectors)
@@ -559,7 +625,7 @@
     uint64_t cluster_offset;
 
     while (nb_sectors > 0) {
-        cluster_offset = get_cluster_offset(bs, sector_num << 9, 0);
+        cluster_offset = get_cluster_offset(bs, NULL, sector_num << 9, 0);
         index_in_cluster = sector_num % s->cluster_sectors;
         n = s->cluster_sectors - index_in_cluster;
         if (n > nb_sectors)
@@ -590,20 +656,34 @@
                      const uint8_t *buf, int nb_sectors)
 {
     BDRVVmdkState *s = bs->opaque;
+    VmdkMetaData m_data;
     int index_in_cluster, n;
     uint64_t cluster_offset;
     static int cid_update = 0;
 
+    if (sector_num > bs->total_sectors) {
+        fprintf(stderr,
+                "(VMDK) Wrong offset: sector_num=0x%lx total_sectors=0x%lx\n",
+                sector_num, bs->total_sectors);
+        return -1;
+    }
+
     while (nb_sectors > 0) {
         index_in_cluster = sector_num & (s->cluster_sectors - 1);
         n = s->cluster_sectors - index_in_cluster;
         if (n > nb_sectors)
             n = nb_sectors;
-        cluster_offset = get_cluster_offset(bs, sector_num << 9, 1);
+        cluster_offset = get_cluster_offset(bs, &m_data, sector_num << 9, 1);
         if (!cluster_offset)
             return -1;
+
         if (bdrv_pwrite(s->hd, cluster_offset + index_in_cluster * 512, buf, n * 512) != n * 512)
             return -1;
+        if (m_data.valid) {
+            /* update L2 tables */
+            if (vmdk_L2update(bs, &m_data) == -1)
+                return -1;
+        }
         nb_sectors -= n;
         sector_num += n;
         buf += n * 512;

^ permalink raw reply	[flat|nested] 8+ messages in thread

* RE: [Qemu-devel] [PATCH] Fix a race condition and non-leaf images growing in VMDK chains.
  2007-05-19 21:42         ` Thiemo Seufer
@ 2007-05-20 11:40           ` Igor Lvovsky
  0 siblings, 0 replies; 8+ messages in thread
From: Igor Lvovsky @ 2007-05-20 11:40 UTC (permalink / raw)
  To: Thiemo Seufer; +Cc: qemu-devel

[-- Attachment #1: Type: text/plain, Size: 1722 bytes --]

	Hi Thiemo,
Thank you for comments.
You probably right about the "child" -> "is_parent" renaming,
But the "parent_open" variable (I know, the name is not the best. you can rename it) is not a duplicate of the "is_parent".

The only one reason for existence of "parent_open" it's set/reset the "is_parent".
I think we can't use the "is_parent" instead, because we need to know at every moment if it's a "parent" or "leaf" image.
I mean, if we'll use "is_parent" instead of "parent_open":


+  s->is_parent = 1;
+  if (bdrv_open(s->hd->backing_hd, parent_img_name, BDRV_O_RDONLY) < 0)
         goto failure;
+  s->is_parent = 0;

We'll reset it after the file opening. It's mean, later during the "get_cluster_offset()" we'll not be able to know if it's a "parent" or
"leaf" image.

So, I attach a new patch with you comments.

			Regards,
				Igor Lvovsky



-----Original Message-----
From: Thiemo Seufer [mailto:ths@networkno.de] 
Sent: Sunday, May 20, 2007 12:42 AM
To: Igor Lvovsky
Cc: qemu-devel@nongnu.org
Subject: Re: [Qemu-devel] [PATCH] Fix a race condition and non-leaf images growing in VMDK chains.

Thiemo Seufer wrote:
> Igor Lvovsky wrote:
> > 
> > 	Hi,
> > The bug was in my last patch.
> > This is a new one.
> > I'll very appreciate any comments.
> 
> Appended is a variant which gets rid of the confusing "child" variable
> which means "is_parent", and the duplicate of it (parent_open).
> 
> There was also an uninitialized variable "tmp" which was stored as
> offset in m_data.
> 
> I quite possibly missed some bits which broke the patch, please test if
> it still works for you.

Next try, this time with the latest version of the patch.


Thiemo

[-- Attachment #2: block-vmdk-patch.diff --]
[-- Type: application/octet-stream, Size: 9635 bytes --]

Index: block-vmdk.c
===================================================================
RCS file: /sources/qemu/qemu/block-vmdk.c,v
retrieving revision 1.10
diff -u -r1.10 block-vmdk.c
--- block-vmdk.c	24 Jan 2007 21:05:24 -0000	1.10
+++ block-vmdk.c	20 May 2007 10:54:28 -0000
@@ -75,8 +75,25 @@
 
     unsigned int cluster_sectors;
     uint32_t parent_cid;
+    int is_parent;
 } BDRVVmdkState;
 
+typedef struct VmdkMetaData {
+    uint32_t offset;
+    unsigned int l1_index;
+    unsigned int l2_index;
+    unsigned int l2_offset;
+    int valid;
+} VmdkMetaData;
+
+typedef struct ActiveBDRVState{
+    BlockDriverState *hd;            // active image handler
+    uint64_t cluster_offset;         // current write offset
+}ActiveBDRVState;
+
+static ActiveBDRVState activeBDRV;
+
+
 static int vmdk_probe(const uint8_t *buf, int buf_size, const char *filename)
 {
     uint32_t magic;
@@ -305,7 +322,7 @@
         bdrv_close(bs->backing_hd);
 }
 
-
+int parent_open = 0;
 static int vmdk_parent_open(BlockDriverState *bs, const char * filename)
 {
     BDRVVmdkState *s = bs->opaque;
@@ -339,8 +356,10 @@
             bdrv_close(s->hd);
             return -1;
         }
-        if (bdrv_open(s->hd->backing_hd, parent_img_name, 0) < 0)
+        parent_open = 1;
+        if (bdrv_open(s->hd->backing_hd, parent_img_name, BDRV_O_RDONLY) < 0)
             goto failure;
+        parent_open = 0;
     }
 
     return 0;
@@ -352,6 +371,11 @@
     uint32_t magic;
     int l1_size, i, ret;
 
+    if (parent_open)
+        // Parent must be opened as RO.
+        flags = BDRV_O_RDONLY;
+    fprintf(stderr, "(VMDK) image open: flags=0x%x filename=%s\n", flags, bs->filename);
+
     ret = bdrv_file_open(&s->hd, filename, flags);
     if (ret < 0)
         return ret;
@@ -386,6 +410,11 @@
             / s->l1_entry_sectors;
         s->l1_table_offset = le64_to_cpu(header.rgd_offset) << 9;
         s->l1_backup_table_offset = le64_to_cpu(header.gd_offset) << 9;
+         
+        if (parent_open)
+            s->is_parent = 1;
+        else
+            s->is_parent = 0;
 
         // try to open parent images, if exist
         if (vmdk_parent_open(bs, filename) != 0)
@@ -430,7 +459,8 @@
     return -1;
 }
 
-static uint64_t get_cluster_offset(BlockDriverState *bs, uint64_t offset, int allocate);
+static uint64_t get_cluster_offset(BlockDriverState *bs, VmdkMetaData *m_data,
+                                   uint64_t offset, int allocate);
 
 static int get_whole_cluster(BlockDriverState *bs, uint64_t cluster_offset,
                              uint64_t offset, int allocate)
@@ -446,27 +476,55 @@
 
         if (!vmdk_is_cid_valid(bs))
             return -1;
-        parent_cluster_offset = get_cluster_offset(s->hd->backing_hd, offset, allocate);
-        if (bdrv_pread(ps->hd, parent_cluster_offset, whole_grain, ps->cluster_sectors*512) != 
-                                                                            ps->cluster_sectors*512)
-            return -1;
 
-        if (bdrv_pwrite(s->hd, cluster_offset << 9, whole_grain, sizeof(whole_grain)) != 
-                                                                            sizeof(whole_grain))
+        parent_cluster_offset = get_cluster_offset(s->hd->backing_hd, NULL, offset, allocate);
+
+        if (parent_cluster_offset) {
+            BDRVVmdkState *act_s = activeBDRV.hd->opaque;
+
+            if (bdrv_pread(ps->hd, parent_cluster_offset, whole_grain, ps->cluster_sectors*512) != ps->cluster_sectors*512)
+                return -1;
+
+            //Write grain only into the active image
+            if (bdrv_pwrite(act_s->hd, activeBDRV.cluster_offset << 9, whole_grain, sizeof(whole_grain)) != sizeof(whole_grain))
+                return -1;
+        }
+    }
+    return 0;
+}
+
+static int vmdk_L2update(BlockDriverState *bs, VmdkMetaData *m_data)
+{
+    BDRVVmdkState *s = bs->opaque;
+
+    /* update L2 table */
+    if (bdrv_pwrite(s->hd, ((int64_t)m_data->l2_offset * 512) + (m_data->l2_index * sizeof(m_data->offset)),
+                    &(m_data->offset), sizeof(m_data->offset)) != sizeof(m_data->offset))
+        return -1;
+    /* update backup L2 table */
+    if (s->l1_backup_table_offset != 0) {
+        m_data->l2_offset = s->l1_backup_table[m_data->l1_index];
+        if (bdrv_pwrite(s->hd, ((int64_t)m_data->l2_offset * 512) + (m_data->l2_index * sizeof(m_data->offset)),
+                        &(m_data->offset), sizeof(m_data->offset)) != sizeof(m_data->offset))
             return -1;
     }
+
     return 0;
 }
 
-static uint64_t get_cluster_offset(BlockDriverState *bs,
+static uint64_t get_cluster_offset(BlockDriverState *bs, VmdkMetaData *m_data,
                                    uint64_t offset, int allocate)
 {
     BDRVVmdkState *s = bs->opaque;
     unsigned int l1_index, l2_offset, l2_index;
     int min_index, i, j;
-    uint32_t min_count, *l2_table, tmp;
+    uint32_t min_count, *l2_table, tmp = 0;
     uint64_t cluster_offset;
-    
+    int status;
+
+    if (m_data)
+        m_data->valid = 0;
+
     l1_index = (offset >> 9) / s->l1_entry_sectors;
     if (l1_index >= s->l1_size)
         return 0;
@@ -504,32 +562,45 @@
  found:
     l2_index = ((offset >> 9) / s->cluster_sectors) % s->l2_size;
     cluster_offset = le32_to_cpu(l2_table[l2_index]);
+
     if (!cluster_offset) {
         struct stat file_buf;
 
         if (!allocate)
             return 0;
-        stat(s->hd->filename, &file_buf);
-        cluster_offset = file_buf.st_size;
-        bdrv_truncate(s->hd, cluster_offset + (s->cluster_sectors << 9));
-
-        cluster_offset >>= 9;
-        /* update L2 table */
-        tmp = cpu_to_le32(cluster_offset);
-        l2_table[l2_index] = tmp;
-        if (bdrv_pwrite(s->hd, ((int64_t)l2_offset * 512) + (l2_index * sizeof(tmp)), 
-                        &tmp, sizeof(tmp)) != sizeof(tmp))
-            return 0;
-        /* update backup L2 table */
-        if (s->l1_backup_table_offset != 0) {
-            l2_offset = s->l1_backup_table[l1_index];
-            if (bdrv_pwrite(s->hd, ((int64_t)l2_offset * 512) + (l2_index * sizeof(tmp)), 
-                            &tmp, sizeof(tmp)) != sizeof(tmp))
+        // Avoid the L2 tables update for the images that have snapshots.
+        if (!s->is_parent) {
+            status = stat(s->hd->filename, &file_buf);
+            if (status == -1) {
+                fprintf(stderr, "(VMDK) Fail file stat: filename =%s size=0x%lx errno=%s\n",
+                                s->hd->filename, (uint64_t)file_buf.st_size, strerror(errno));
                 return 0;
-        }
+            }
+            cluster_offset = file_buf.st_size;
+            bdrv_truncate(s->hd, cluster_offset + (s->cluster_sectors << 9));
 
+            cluster_offset >>= 9;
+            tmp = cpu_to_le32(cluster_offset);
+            l2_table[l2_index] = tmp;
+            // Save the active image state
+            activeBDRV.cluster_offset = cluster_offset;
+            activeBDRV.hd = bs;
+        }
+        /* First of all we write grain itself, to avoid race condition
+         * that may to corrupt the image.
+         * This problem may occur because of insufficient space on host disk
+         * or inappropriate VM shutdown.
+         */
         if (get_whole_cluster(bs, cluster_offset, offset, allocate) == -1)
             return 0;
+
+        if (m_data) {
+            m_data->offset = tmp;
+            m_data->l1_index = l1_index;
+            m_data->l2_index = l2_index;
+            m_data->l2_offset = l2_offset;
+            m_data->valid = 1;
+        }
     }
     cluster_offset <<= 9;
     return cluster_offset;
@@ -542,7 +613,7 @@
     int index_in_cluster, n;
     uint64_t cluster_offset;
 
-    cluster_offset = get_cluster_offset(bs, sector_num << 9, 0);
+    cluster_offset = get_cluster_offset(bs, NULL, sector_num << 9, 0);
     index_in_cluster = sector_num % s->cluster_sectors;
     n = s->cluster_sectors - index_in_cluster;
     if (n > nb_sectors)
@@ -559,7 +630,7 @@
     uint64_t cluster_offset;
 
     while (nb_sectors > 0) {
-        cluster_offset = get_cluster_offset(bs, sector_num << 9, 0);
+        cluster_offset = get_cluster_offset(bs, NULL, sector_num << 9, 0);
         index_in_cluster = sector_num % s->cluster_sectors;
         n = s->cluster_sectors - index_in_cluster;
         if (n > nb_sectors)
@@ -590,20 +661,34 @@
                      const uint8_t *buf, int nb_sectors)
 {
     BDRVVmdkState *s = bs->opaque;
+    VmdkMetaData m_data;
     int index_in_cluster, n;
     uint64_t cluster_offset;
     static int cid_update = 0;
 
+    if (sector_num > bs->total_sectors) {
+        fprintf(stderr,
+                "(VMDK) Wrong offset: sector_num=0x%lx total_sectors=0x%lx\n",
+                sector_num, bs->total_sectors);
+        return -1;
+    }
+
     while (nb_sectors > 0) {
         index_in_cluster = sector_num & (s->cluster_sectors - 1);
         n = s->cluster_sectors - index_in_cluster;
         if (n > nb_sectors)
             n = nb_sectors;
-        cluster_offset = get_cluster_offset(bs, sector_num << 9, 1);
+        cluster_offset = get_cluster_offset(bs, &m_data, sector_num << 9, 1);
         if (!cluster_offset)
             return -1;
+
         if (bdrv_pwrite(s->hd, cluster_offset + index_in_cluster * 512, buf, n * 512) != n * 512)
             return -1;
+        if (m_data.valid) {
+            /* update L2 tables */
+            if (vmdk_L2update(bs, &m_data) == -1)
+                return -1;
+        }
         nb_sectors -= n;
         sector_num += n;
         buf += n * 512;

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2007-05-20 11:38 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-01-16 13:59 [Qemu-devel] Race condition in VMDK (QCOW*) formats Igor Lvovsky
2007-01-16 19:35 ` Fabrice Bellard
2007-01-16 19:35 ` Fabrice Bellard
2007-05-13 11:13   ` [Qemu-devel] [PATCH] Fix a race condition and non-leaf images growing in VMDK chains Igor Lvovsky
2007-05-17 13:54     ` Igor Lvovsky
2007-05-19 21:39       ` Thiemo Seufer
2007-05-19 21:42         ` Thiemo Seufer
2007-05-20 11:40           ` Igor Lvovsky

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).