* [Qemu-devel] Race condition in VMDK (QCOW*) formats.
@ 2007-01-16 13:59 Igor Lvovsky
  2007-01-16 19:35 ` Fabrice Bellard
  0 siblings, 2 replies; 8+ messages in thread
From: Igor Lvovsky @ 2007-01-16 13:59 UTC (permalink / raw)
To: qemu-devel

Hi all,

I suspect a race condition during write operations on snapshots. I think
the problem exists in the VMDK and QCOW* formats (I have not checked the
others).

Here is the example from block-vmdk.c:

static int vmdk_write(BlockDriverState *bs, int64_t sector_num,
                      const uint8_t *buf, int nb_sectors)
{
    BDRVVmdkState *s = bs->opaque;
    int ret, index_in_cluster, n;
    uint64_t cluster_offset;

    while (nb_sectors > 0) {
        index_in_cluster = sector_num & (s->cluster_sectors - 1);
        n = s->cluster_sectors - index_in_cluster;
        if (n > nb_sectors)
            n = nb_sectors;
        cluster_offset = get_cluster_offset(bs, sector_num << 9, 1);
        if (!cluster_offset)
            return -1;
        lseek(s->fd, cluster_offset + index_in_cluster * 512, SEEK_SET);
        ret = write(s->fd, buf, n * 512);
        if (ret != n * 512)
            return -1;
        nb_sectors -= n;
        sector_num += n;
        buf += n * 512;
    }
    return 0;
}

The get_cluster_offset(...) routine updates the L2 table of the metadata
and returns the cluster_offset; only after that does vmdk_write(...)
actually write the grain to its place. So there is a timing hole here: if
the VM performing the write is destroyed at exactly that moment, we are
left with a corrupted image (an updated L2 table, but without the grain
itself).

Regards,
Igor Lvovsky
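To make the window concrete, here is a minimal, self-contained sketch of the ordering problem. It is a toy model, not QEMU code: all names, sizes and the "crash" flag below are invented for illustration. The "image" is an in-memory L2 table plus a data area, and the simulated crash hits between publishing the L2 entry and writing the grain, which leaves exactly the dangling-entry corruption described above.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define GRAIN_SIZE 4096
#define L2_ENTRIES 16

static uint32_t l2_table[L2_ENTRIES];               /* 0 means "unallocated" */
static uint8_t  data_area[L2_ENTRIES][GRAIN_SIZE];  /* stand-in for the file body */
static int      crash_before_data_write = 1;        /* simulate the VM being destroyed */

static int metadata_first_write(int l2_index, const uint8_t *grain)
{
    /* step 1: allocate the cluster and publish it in the L2 table */
    l2_table[l2_index] = (uint32_t)(l2_index + 1);  /* pretend cluster offset */

    /* --- crash window: the L2 entry exists, the grain does not --- */
    if (crash_before_data_write)
        return -1;                                  /* "VM destroyed" here */

    /* step 2: write the grain itself */
    memcpy(data_area[l2_index], grain, GRAIN_SIZE);
    return 0;
}

int main(void)
{
    uint8_t grain[GRAIN_SIZE];

    memset(grain, 0xab, sizeof(grain));
    metadata_first_write(3, grain);

    if (l2_table[3] != 0 && data_area[3][0] != 0xab)
        printf("L2 entry 3 is published but its grain was never written\n");
    return 0;
}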
* Re: [Qemu-devel] Race condition in VMDK (QCOW*) formats. 2007-01-16 13:59 [Qemu-devel] Race condition in VMDK (QCOW*) formats Igor Lvovsky 2007-01-16 19:35 ` Fabrice Bellard @ 2007-01-16 19:35 ` Fabrice Bellard 2007-05-13 11:13 ` [Qemu-devel] [PATCH] Fix a race condition and non-leaf images growing in VMDK chains Igor Lvovsky 1 sibling, 1 reply; 8+ messages in thread From: Fabrice Bellard @ 2007-01-16 19:35 UTC (permalink / raw) To: qemu-devel Well, it was never said that the QCOW* code was safe if you interrupted QEMU at some point. But I agree that it could be safer to write the sector first and update the links after. It could be interesting to analyze the QCOW2 snapshots handling too (what if QEMU is stopped during the creation of a snapshot ?). Regards, Fabrice. Igor Lvovsky wrote: > > > Hi all, > > I have doubt about the race condition during the *write operation on > snapshot*. > > I think the problem exists in VMDK and QCOW* formats (I didn't checked > the others). > > > > The example from the block_vmdk.c. > > > > static int vmdk_write(BlockDriverState *bs, int64_t sector_num, > > const uint8_t *buf, int nb_sectors) > > { > > BDRVVmdkState *s = bs->opaque; > > int ret, index_in_cluster, n; > > uint64_t cluster_offset; > > > > while (nb_sectors > 0) { > > index_in_cluster = sector_num & (s->cluster_sectors - 1); > > n = s->cluster_sectors - index_in_cluster; > > if (n > nb_sectors) > > n = nb_sectors; > > cluster_offset = get_cluster_offset(bs, sector_num << 9, 1); > > if (!cluster_offset) > > return -1; > > lseek(s->fd, cluster_offset + index_in_cluster * 512, SEEK_SET); > > ret = write(s->fd, buf, n * 512); > > if (ret != n * 512) > > return -1; > > nb_sectors -= n; > > sector_num += n; > > buf += n * 512; > > } > > return 0; > > } > > > > The /get_cluster_offset(…)/ routine update the L2 table of the metadata > and return the /cluster_offset. / > > After that the /vmdk_write(…)/ routine/ /actually write the grain at > right place. > > So, we have timing hole here. > > > > Assume, VM that perform write operation will be destroyed at this moment. > > So, we have corrupted image (with updated L2 table, but without the > grain itself). > > > > Regards, > > Igor Lvovsky > > > > > > > > > > > > > ------------------------------------------------------------------------ > > _______________________________________________ > Qemu-devel mailing list > Qemu-devel@nongnu.org > http://lists.nongnu.org/mailman/listinfo/qemu-devel ^ permalink raw reply [flat|nested] 8+ messages in thread
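A minimal sketch of the safer ordering Fabrice suggests, assuming plain POSIX pwrite()/fsync() on a raw file descriptor as a stand-in for QEMU's block layer; the struct, its fields and the helper name are invented, and endianness conversion and partial-write retries are left out. The grain is made durable before the L2 entry that points at it is published, so an interruption can at worst leak an orphaned grain rather than leave a dangling L2 entry.

#define _XOPEN_SOURCE 700
#include <stdint.h>
#include <unistd.h>

struct vmdk_like {
    int      fd;
    uint64_t l2_offset;         /* byte offset of the L2 table in the image file */
    uint64_t backup_l2_offset;  /* 0 if the format keeps no backup table */
};

int write_grain_then_metadata(struct vmdk_like *s,
                              uint64_t cluster_offset,  /* byte offset of the grain */
                              uint32_t l2_index,
                              const uint8_t *grain, size_t grain_len)
{
    /* on-disk L2 entries hold the sector number (little-endian in the real
     * format; the byte swapping is omitted in this sketch) */
    uint32_t entry = (uint32_t)(cluster_offset >> 9);

    /* 1. data first: write the grain itself and make it durable */
    if (pwrite(s->fd, grain, grain_len, (off_t)cluster_offset) != (ssize_t)grain_len)
        return -1;
    if (fsync(s->fd) < 0)
        return -1;

    /* 2. only now publish the L2 entry that points at the grain ... */
    if (pwrite(s->fd, &entry, sizeof(entry),
               (off_t)(s->l2_offset + l2_index * sizeof(entry))) != sizeof(entry))
        return -1;

    /* 3. ... and its backup copy, if the format keeps one */
    if (s->backup_l2_offset != 0 &&
        pwrite(s->fd, &entry, sizeof(entry),
               (off_t)(s->backup_l2_offset + l2_index * sizeof(entry))) != sizeof(entry))
        return -1;

    return 0;
}

The patch posted later in this thread takes the same approach inside the driver: the grain is written with bdrv_pwrite() first, and the L2 tables are updated afterwards by a separate vmdk_L2update() helper.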
* [Qemu-devel] [PATCH] Fix a race condition and non-leaf images growing in VMDK chains.
  2007-01-16 19:35 ` Fabrice Bellard
@ 2007-05-13 11:13   ` Igor Lvovsky
  2007-05-17 13:54     ` Igor Lvovsky
  0 siblings, 1 reply; 8+ messages in thread
From: Igor Lvovsky @ 2007-05-13 11:13 UTC (permalink / raw)
To: qemu-devel

Hi,

In this patch I fixed two issues:

1. A race condition during write operations on snapshots.
   Now we write the grain of data first and the L2 metadata afterwards,
   so the snapshot stays consistent even if the VM is destroyed in the
   middle of the write.

2. Non-leaf images growing during writes.
   Assume we have a snapshot chain (Base->Snap1->Snap2->...->Leaf) and we
   run a VM with the latest image of this chain (the leaf image). The
   non-leaf images in the chain grow during writes (most noticeable when
   the VM performs aggressive writes), which is incorrect behaviour
   according to the VMDK spec. For every write to an offset it does not
   yet hold, the active image must query its ancestors for that offset
   and, if one of them holds it, perform a read-from-ancestor / modify /
   write-to-active of the whole grain containing that offset. The problem
   occurred when the ancestor holding the grain was two or more
   generations above the active (leaf) image rather than its direct
   parent: the direct child of that ancestor was modified instead.
   Fixed by always writing to the active (leaf) image.

Regards,
Igor Lvovsky

-----Original Message-----
From: qemu-devel-bounces+igor.lvovsky=qumranet.com@nongnu.org
[mailto:qemu-devel-bounces+igor.lvovsky=qumranet.com@nongnu.org] On Behalf Of Fabrice Bellard
Sent: Tuesday, January 16, 2007 9:36 PM
To: qemu-devel@nongnu.org
Subject: Re: [Qemu-devel] Race condition in VMDK (QCOW*) formats.

Well, it was never said that the QCOW* code was safe if you interrupted
QEMU at some point. But I agree that it could be safer to write the
sector first and update the links after.

It could be interesting to analyze the QCOW2 snapshots handling too (what
if QEMU is stopped during the creation of a snapshot ?).

Regards,

Fabrice.

Igor Lvovsky wrote:
> Hi all,
>
> I have doubt about the race condition during the *write operation on
> snapshot*.
>
> I think the problem exists in VMDK and QCOW* formats (I didn't checked
> the others).
>
> The example from the block_vmdk.c.
>
> static int vmdk_write(BlockDriverState *bs, int64_t sector_num,
>                       const uint8_t *buf, int nb_sectors)
> {
>     BDRVVmdkState *s = bs->opaque;
>     int ret, index_in_cluster, n;
>     uint64_t cluster_offset;
>
>     while (nb_sectors > 0) {
>         index_in_cluster = sector_num & (s->cluster_sectors - 1);
>         n = s->cluster_sectors - index_in_cluster;
>         if (n > nb_sectors)
>             n = nb_sectors;
>         cluster_offset = get_cluster_offset(bs, sector_num << 9, 1);
>         if (!cluster_offset)
>             return -1;
>         lseek(s->fd, cluster_offset + index_in_cluster * 512, SEEK_SET);
>         ret = write(s->fd, buf, n * 512);
>         if (ret != n * 512)
>             return -1;
>         nb_sectors -= n;
>         sector_num += n;
>         buf += n * 512;
>     }
>     return 0;
> }
>
> The /get_cluster_offset(…)/ routine update the L2 table of the metadata
> and return the /cluster_offset. /
>
> After that the /vmdk_write(…)/ routine actually write the grain at
> right place.
>
> So, we have timing hole here.
>
> Assume, VM that perform write operation will be destroyed at this moment.
>
> So, we have corrupted image (with updated L2 table, but without the
> grain itself).
> > > > Regards, > > Igor Lvovsky > > > > > > > > > > > > > ------------------------------------------------------------------------ > > _______________________________________________ > Qemu-devel mailing list > Qemu-devel@nongnu.org > http://lists.nongnu.org/mailman/listinfo/qemu-devel _______________________________________________ Qemu-devel mailing list Qemu-devel@nongnu.org http://lists.nongnu.org/mailman/listinfo/qemu-devel [-- Attachment #2: block-vmdk.diff --] [-- Type: application/octet-stream, Size: 9478 bytes --] Index: block-vmdk.c =================================================================== RCS file: /sources/qemu/qemu/block-vmdk.c,v retrieving revision 1.10 diff -u -r1.10 block-vmdk.c --- block-vmdk.c 24 Jan 2007 21:05:24 -0000 1.10 +++ block-vmdk.c 13 May 2007 09:01:13 -0000 @@ -75,8 +75,24 @@ unsigned int cluster_sectors; uint32_t parent_cid; + int child; // =1, if have a child } BDRVVmdkState; +typedef struct VmdkMetaData { + uint32_t offset; + unsigned int l1_index; + unsigned int l2_index; + unsigned int l2_offset; + int valid; +} VmdkMetaData; + +typedef struct ActiveBDRVState{ + BlockDriverState *hd; // active image handler + uint64_t cluster_offset; // current write offset +}ActiveBDRVState; +ActiveBDRVState activeBDRV; + + static int vmdk_probe(const uint8_t *buf, int buf_size, const char *filename) { uint32_t magic; @@ -305,7 +321,7 @@ bdrv_close(bs->backing_hd); } - +int parent_open = 0; static int vmdk_parent_open(BlockDriverState *bs, const char * filename) { BDRVVmdkState *s = bs->opaque; @@ -339,7 +355,8 @@ bdrv_close(s->hd); return -1; } - if (bdrv_open(s->hd->backing_hd, parent_img_name, 0) < 0) + parent_open = 1; + if (bdrv_open(s->hd->backing_hd, parent_img_name, BDRV_O_RDONLY) < 0) goto failure; } @@ -352,6 +369,11 @@ uint32_t magic; int l1_size, i, ret; + if (parent_open) + // Parent must be open as RO. 
+ flags = BDRV_O_RDONLY; + fprintf(stderr, "(VMDK) image open: flags=0x%x filename=%s\n", flags, bs->filename); + ret = bdrv_file_open(&s->hd, filename, flags); if (ret < 0) return ret; @@ -386,6 +408,11 @@ / s->l1_entry_sectors; s->l1_table_offset = le64_to_cpu(header.rgd_offset) << 9; s->l1_backup_table_offset = le64_to_cpu(header.gd_offset) << 9; + + if (parent_open) + s->child = 1; // parent + else + s->child = 0; // child // try to open parent images, if exist if (vmdk_parent_open(bs, filename) != 0) @@ -430,8 +457,7 @@ return -1; } -static uint64_t get_cluster_offset(BlockDriverState *bs, uint64_t offset, int allocate); - +static uint64_t get_cluster_offset(BlockDriverState *bs, VmdkMetaData *m_data, uint64_t offset, int allocate); static int get_whole_cluster(BlockDriverState *bs, uint64_t cluster_offset, uint64_t offset, int allocate) { @@ -446,19 +472,43 @@ if (!vmdk_is_cid_valid(bs)) return -1; - parent_cluster_offset = get_cluster_offset(s->hd->backing_hd, offset, allocate); - if (bdrv_pread(ps->hd, parent_cluster_offset, whole_grain, ps->cluster_sectors*512) != - ps->cluster_sectors*512) - return -1; - if (bdrv_pwrite(s->hd, cluster_offset << 9, whole_grain, sizeof(whole_grain)) != - sizeof(whole_grain)) + parent_cluster_offset = get_cluster_offset(s->hd->backing_hd, NULL, offset, allocate); + + if (parent_cluster_offset) { + BDRVVmdkState *act_s = activeBDRV.hd->opaque; + + if (bdrv_pread(ps->hd, parent_cluster_offset, whole_grain, ps->cluster_sectors*512) != ps->cluster_sectors*512) + return -1; + + //Write grain only into the active image + if (bdrv_pwrite(act_s->hd, activeBDRV.cluster_offset << 9, whole_grain, sizeof(whole_grain)) != sizeof(whole_grain)) + return -1; + } + } + return 0; +} + +static int vmdk_L2update(BlockDriverState *bs, VmdkMetaData *m_data) +{ + BDRVVmdkState *s = bs->opaque; + + /* update L2 table */ + if (bdrv_pwrite(s->hd, ((int64_t)m_data->l2_offset * 512) + (m_data->l2_index * sizeof(m_data->offset)), + &(m_data->offset), sizeof(m_data->offset)) != sizeof(m_data->offset)) + return -1; + /* update backup L2 table */ + if (s->l1_backup_table_offset != 0) { + m_data->l2_offset = s->l1_backup_table[m_data->l1_index]; + if (bdrv_pwrite(s->hd, ((int64_t)m_data->l2_offset * 512) + (m_data->l2_index * sizeof(m_data->offset)), + &(m_data->offset), sizeof(m_data->offset)) != sizeof(m_data->offset)) return -1; } + return 0; } -static uint64_t get_cluster_offset(BlockDriverState *bs, +static uint64_t get_cluster_offset(BlockDriverState *bs, VmdkMetaData *m_data, uint64_t offset, int allocate) { BDRVVmdkState *s = bs->opaque; @@ -466,7 +516,11 @@ int min_index, i, j; uint32_t min_count, *l2_table, tmp; uint64_t cluster_offset; - + int status; + + if (m_data) + m_data->valid = 0; + l1_index = (offset >> 9) / s->l1_entry_sectors; if (l1_index >= s->l1_size) return 0; @@ -504,32 +558,45 @@ found: l2_index = ((offset >> 9) / s->cluster_sectors) % s->l2_size; cluster_offset = le32_to_cpu(l2_table[l2_index]); + if (!cluster_offset) { struct stat file_buf; if (!allocate) return 0; - stat(s->hd->filename, &file_buf); - cluster_offset = file_buf.st_size; - bdrv_truncate(s->hd, cluster_offset + (s->cluster_sectors << 9)); - - cluster_offset >>= 9; - /* update L2 table */ - tmp = cpu_to_le32(cluster_offset); - l2_table[l2_index] = tmp; - if (bdrv_pwrite(s->hd, ((int64_t)l2_offset * 512) + (l2_index * sizeof(tmp)), - &tmp, sizeof(tmp)) != sizeof(tmp)) - return 0; - /* update backup L2 table */ - if (s->l1_backup_table_offset != 0) { - l2_offset = 
s->l1_backup_table[l1_index]; - if (bdrv_pwrite(s->hd, ((int64_t)l2_offset * 512) + (l2_index * sizeof(tmp)), - &tmp, sizeof(tmp)) != sizeof(tmp)) + // Avoid the L2 tables update for the images that have snapshots. + if (!s->child) { + status = stat(s->hd->filename, &file_buf); + if (status == -1) { + fprintf(stderr, "(VMDK) Fail file stat: filename =%s size=0x%lx errno=%s\n", + s->hd->filename, (uint64_t)file_buf.st_size, strerror(errno)); return 0; - } + } + cluster_offset = file_buf.st_size; + bdrv_truncate(s->hd, cluster_offset + (s->cluster_sectors << 9)); + cluster_offset >>= 9; + tmp = cpu_to_le32(cluster_offset); + l2_table[l2_index] = tmp; + // Save the active image state + activeBDRV.cluster_offset = cluster_offset; + activeBDRV.hd = bs; + } + /* First of all we write grain itself, to avoid race condition + * that may to corrupt the image. + * This problem may occur because of insufficient space on host disk + * or inappropriate VM shutdown. + */ if (get_whole_cluster(bs, cluster_offset, offset, allocate) == -1) return 0; + + if (m_data) { + m_data->offset = tmp; + m_data->l1_index = l1_index; + m_data->l2_index = l2_index; + m_data->l2_offset = l2_offset; + m_data->valid = 1; + } } cluster_offset <<= 9; return cluster_offset; @@ -542,7 +609,7 @@ int index_in_cluster, n; uint64_t cluster_offset; - cluster_offset = get_cluster_offset(bs, sector_num << 9, 0); + cluster_offset = get_cluster_offset(bs, NULL, sector_num << 9, 0); index_in_cluster = sector_num % s->cluster_sectors; n = s->cluster_sectors - index_in_cluster; if (n > nb_sectors) @@ -559,7 +626,7 @@ uint64_t cluster_offset; while (nb_sectors > 0) { - cluster_offset = get_cluster_offset(bs, sector_num << 9, 0); + cluster_offset = get_cluster_offset(bs, NULL, sector_num << 9, 0); index_in_cluster = sector_num % s->cluster_sectors; n = s->cluster_sectors - index_in_cluster; if (n > nb_sectors) @@ -590,20 +657,32 @@ const uint8_t *buf, int nb_sectors) { BDRVVmdkState *s = bs->opaque; + VmdkMetaData m_data; int index_in_cluster, n; uint64_t cluster_offset; static int cid_update = 0; + if (sector_num > bs->total_sectors) { + fprintf(stderr, "(VMDK) Wrong offset: sector_num=0x%lx total_sectors=0x%lx\n", sector_num, bs->total_sectors); + return -1; + } + while (nb_sectors > 0) { index_in_cluster = sector_num & (s->cluster_sectors - 1); n = s->cluster_sectors - index_in_cluster; if (n > nb_sectors) n = nb_sectors; - cluster_offset = get_cluster_offset(bs, sector_num << 9, 1); + cluster_offset = get_cluster_offset(bs, &m_data, sector_num << 9, 1); if (!cluster_offset) return -1; + if (bdrv_pwrite(s->hd, cluster_offset + index_in_cluster * 512, buf, n * 512) != n * 512) return -1; + if (m_data.valid) { + /* update L2 tables */ + if (vmdk_L2update(bs, &m_data) == -1) + return -1; + } nb_sectors -= n; sector_num += n; buf += n * 512; ^ permalink raw reply [flat|nested] 8+ messages in thread
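As a companion to the description of the second issue, here is a rough sketch of the grain-level copy-on-write behaviour the patch is after. The struct, the read_grain/write_grain callbacks and the constants are invented stand-ins rather than the QEMU block API, and the helper assumes the write does not cross a grain boundary. On a write to a grain the leaf does not yet hold, the whole grain is copied from the nearest ancestor that has it, the guest's data is merged in, and the result is written to the leaf only, so the ancestors never grow.

#include <stdint.h>
#include <string.h>

#define GRAIN_BYTES (64 * 512)   /* e.g. 64-sector grains */

struct image {
    struct image *parent;        /* NULL for the base image */
    /* returns 1 and fills grain[] if this image holds the grain at grain_start */
    int (*read_grain)(struct image *img, uint64_t grain_start, uint8_t *grain);
    /* allocates (if needed) and writes the grain in this image */
    int (*write_grain)(struct image *img, uint64_t grain_start, const uint8_t *grain);
};

/* Apply a guest write of len bytes at offset to the leaf of a snapshot chain. */
int cow_write(struct image *leaf, uint64_t offset, const uint8_t *buf, size_t len)
{
    uint64_t grain_start = offset & ~(uint64_t)(GRAIN_BYTES - 1);
    uint8_t grain[GRAIN_BYTES];
    struct image *img = leaf;

    /* 1. find the most recent copy of the grain anywhere in the chain */
    while (img != NULL && !img->read_grain(img, grain_start, grain))
        img = img->parent;
    if (img == NULL)
        memset(grain, 0, sizeof(grain));   /* never written: reads back as zeroes */

    /* 2. merge the guest's data into the grain */
    memcpy(grain + (offset - grain_start), buf, len);

    /* 3. write the result into the leaf only; ancestors stay untouched */
    return leaf->write_grain(leaf, grain_start, grain);
}

In the patch itself this corresponds to get_whole_cluster() reading the grain from the backing file and writing it only into the active image tracked by activeBDRV.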
* [Qemu-devel] [PATCH] Fix a race condition and non-leaf images growing in VMDK chains. 2007-05-13 11:13 ` [Qemu-devel] [PATCH] Fix a race condition and non-leaf images growing in VMDK chains Igor Lvovsky @ 2007-05-17 13:54 ` Igor Lvovsky 2007-05-19 21:39 ` Thiemo Seufer 0 siblings, 1 reply; 8+ messages in thread From: Igor Lvovsky @ 2007-05-17 13:54 UTC (permalink / raw) To: qemu-devel [-- Attachment #1: Type: text/plain, Size: 4562 bytes --] Hi, The bug was in my last patch. This is a new one. I'll very appreciate any comments. Regards, Igor Lvovsky -----Original Message----- From: qemu-devel-bounces+igor.lvovsky=qumranet.com@nongnu.org [mailto:qemu-devel-bounces+igor.lvovsky=qumranet.com@nongnu.org] On Behalf Of Igor Lvovsky Sent: Sunday, May 13, 2007 2:13 PM To: qemu-devel@nongnu.org Subject: [Qemu-devel] [PATCH] Fix a race condition and non-leaf imagesgrowing in VMDK chains. Hi, In this patch I fixed two issues: 1. A race condition during write operations on snapshots. Now we write the grain of data first and the L2 metadata after. So, the snapshot will stay correct if the VM will be destroyed in the middle of the write. 2. Non-leaf images growing during writes. Assume we have snapshots chain (Base->Snap1->Snap2->...->Leaf) and we run a VM with the latest image of this chain (leaf image). We have a problem with non-leaf images growing in the snapshot-chain (most noticeable when the VM performs aggressive writes). It's an incorrect behavior according to VMDK spec. For every write operation into an unknown offset, the active image must query its ancestors for this offset, and if exists in any of them perform a read-from-ancestor/modify/write-to-active the whole grain of that offset. The problem happened upon read-from-ancestor/modify/write-to-active where the ancestor was 2 or more generations above the active (leaf) image (not a direct parent), as its direct child was modified. Fixed by always write to the 'active' (leaf) image. Regards, Igor Lvovsky -----Original Message----- From: qemu-devel-bounces+igor.lvovsky=qumranet.com@nongnu.org [mailto:qemu-devel-bounces+igor.lvovsky=qumranet.com@nongnu.org] On Behalf Of Fabrice Bellard Sent: Tuesday, January 16, 2007 9:36 PM To: qemu-devel@nongnu.org Subject: Re: [Qemu-devel] Race condition in VMDK (QCOW*) formats. Well, it was never said that the QCOW* code was safe if you interrupted QEMU at some point. But I agree that it could be safer to write the sector first and update the links after. It could be interesting to analyze the QCOW2 snapshots handling too (what if QEMU is stopped during the creation of a snapshot ?). Regards, Fabrice. Igor Lvovsky wrote: > > > Hi all, > > I have doubt about the race condition during the *write operation on > snapshot*. > > I think the problem exists in VMDK and QCOW* formats (I didn't checked > the others). > > > > The example from the block_vmdk.c. 
> > > > static int vmdk_write(BlockDriverState *bs, int64_t sector_num, > > const uint8_t *buf, int nb_sectors) > > { > > BDRVVmdkState *s = bs->opaque; > > int ret, index_in_cluster, n; > > uint64_t cluster_offset; > > > > while (nb_sectors > 0) { > > index_in_cluster = sector_num & (s->cluster_sectors - 1); > > n = s->cluster_sectors - index_in_cluster; > > if (n > nb_sectors) > > n = nb_sectors; > > cluster_offset = get_cluster_offset(bs, sector_num << 9, 1); > > if (!cluster_offset) > > return -1; > > lseek(s->fd, cluster_offset + index_in_cluster * 512, SEEK_SET); > > ret = write(s->fd, buf, n * 512); > > if (ret != n * 512) > > return -1; > > nb_sectors -= n; > > sector_num += n; > > buf += n * 512; > > } > > return 0; > > } > > > > The /get_cluster_offset(…)/ routine update the L2 table of the metadata > and return the /cluster_offset. / > > After that the /vmdk_write(…)/ routine/ /actually write the grain at > right place. > > So, we have timing hole here. > > > > Assume, VM that perform write operation will be destroyed at this moment. > > So, we have corrupted image (with updated L2 table, but without the > grain itself). > > > > Regards, > > Igor Lvovsky > > > > > > > > > > > > > ------------------------------------------------------------------------ > > _______________________________________________ > Qemu-devel mailing list > Qemu-devel@nongnu.org > http://lists.nongnu.org/mailman/listinfo/qemu-devel _______________________________________________ Qemu-devel mailing list Qemu-devel@nongnu.org http://lists.nongnu.org/mailman/listinfo/qemu-devel [-- Attachment #2: block-vmdk.diff --] [-- Type: application/octet-stream, Size: 9520 bytes --] Index: block-vmdk.c =================================================================== RCS file: /sources/qemu/qemu/block-vmdk.c,v retrieving revision 1.10 diff -u -r1.10 block-vmdk.c --- block-vmdk.c 24 Jan 2007 21:05:24 -0000 1.10 +++ block-vmdk.c 17 May 2007 13:40:44 -0000 @@ -75,8 +75,24 @@ unsigned int cluster_sectors; uint32_t parent_cid; + int child; // =1, if have a child } BDRVVmdkState; +typedef struct VmdkMetaData { + uint32_t offset; + unsigned int l1_index; + unsigned int l2_index; + unsigned int l2_offset; + int valid; +} VmdkMetaData; + +typedef struct ActiveBDRVState{ + BlockDriverState *hd; // active image handler + uint64_t cluster_offset; // current write offset +}ActiveBDRVState; +ActiveBDRVState activeBDRV; + + static int vmdk_probe(const uint8_t *buf, int buf_size, const char *filename) { uint32_t magic; @@ -305,7 +321,7 @@ bdrv_close(bs->backing_hd); } - +int parent_open = 0; static int vmdk_parent_open(BlockDriverState *bs, const char * filename) { BDRVVmdkState *s = bs->opaque; @@ -339,8 +355,10 @@ bdrv_close(s->hd); return -1; } - if (bdrv_open(s->hd->backing_hd, parent_img_name, 0) < 0) + parent_open = 1; + if (bdrv_open(s->hd->backing_hd, parent_img_name, BDRV_O_RDONLY) < 0) goto failure; + parent_open = 0; } return 0; @@ -352,6 +370,11 @@ uint32_t magic; int l1_size, i, ret; + if (parent_open) + // Parent must be open as RO. 
+ flags = BDRV_O_RDONLY; + fprintf(stderr, "(VMDK) image open: flags=0x%x filename=%s\n", flags, bs->filename); + ret = bdrv_file_open(&s->hd, filename, flags); if (ret < 0) return ret; @@ -386,6 +409,11 @@ / s->l1_entry_sectors; s->l1_table_offset = le64_to_cpu(header.rgd_offset) << 9; s->l1_backup_table_offset = le64_to_cpu(header.gd_offset) << 9; + + if (parent_open) + s->child = 1; // parent + else + s->child = 0; // child // try to open parent images, if exist if (vmdk_parent_open(bs, filename) != 0) @@ -430,8 +458,7 @@ return -1; } -static uint64_t get_cluster_offset(BlockDriverState *bs, uint64_t offset, int allocate); - +static uint64_t get_cluster_offset(BlockDriverState *bs, VmdkMetaData *m_data, uint64_t offset, int allocate); static int get_whole_cluster(BlockDriverState *bs, uint64_t cluster_offset, uint64_t offset, int allocate) { @@ -446,19 +473,43 @@ if (!vmdk_is_cid_valid(bs)) return -1; - parent_cluster_offset = get_cluster_offset(s->hd->backing_hd, offset, allocate); - if (bdrv_pread(ps->hd, parent_cluster_offset, whole_grain, ps->cluster_sectors*512) != - ps->cluster_sectors*512) - return -1; - if (bdrv_pwrite(s->hd, cluster_offset << 9, whole_grain, sizeof(whole_grain)) != - sizeof(whole_grain)) + parent_cluster_offset = get_cluster_offset(s->hd->backing_hd, NULL, offset, allocate); + + if (parent_cluster_offset) { + BDRVVmdkState *act_s = activeBDRV.hd->opaque; + + if (bdrv_pread(ps->hd, parent_cluster_offset, whole_grain, ps->cluster_sectors*512) != ps->cluster_sectors*512) + return -1; + + //Write grain only into the active image + if (bdrv_pwrite(act_s->hd, activeBDRV.cluster_offset << 9, whole_grain, sizeof(whole_grain)) != sizeof(whole_grain)) + return -1; + } + } + return 0; +} + +static int vmdk_L2update(BlockDriverState *bs, VmdkMetaData *m_data) +{ + BDRVVmdkState *s = bs->opaque; + + /* update L2 table */ + if (bdrv_pwrite(s->hd, ((int64_t)m_data->l2_offset * 512) + (m_data->l2_index * sizeof(m_data->offset)), + &(m_data->offset), sizeof(m_data->offset)) != sizeof(m_data->offset)) + return -1; + /* update backup L2 table */ + if (s->l1_backup_table_offset != 0) { + m_data->l2_offset = s->l1_backup_table[m_data->l1_index]; + if (bdrv_pwrite(s->hd, ((int64_t)m_data->l2_offset * 512) + (m_data->l2_index * sizeof(m_data->offset)), + &(m_data->offset), sizeof(m_data->offset)) != sizeof(m_data->offset)) return -1; } + return 0; } -static uint64_t get_cluster_offset(BlockDriverState *bs, +static uint64_t get_cluster_offset(BlockDriverState *bs, VmdkMetaData *m_data, uint64_t offset, int allocate) { BDRVVmdkState *s = bs->opaque; @@ -466,7 +517,11 @@ int min_index, i, j; uint32_t min_count, *l2_table, tmp; uint64_t cluster_offset; - + int status; + + if (m_data) + m_data->valid = 0; + l1_index = (offset >> 9) / s->l1_entry_sectors; if (l1_index >= s->l1_size) return 0; @@ -504,32 +559,45 @@ found: l2_index = ((offset >> 9) / s->cluster_sectors) % s->l2_size; cluster_offset = le32_to_cpu(l2_table[l2_index]); + if (!cluster_offset) { struct stat file_buf; if (!allocate) return 0; - stat(s->hd->filename, &file_buf); - cluster_offset = file_buf.st_size; - bdrv_truncate(s->hd, cluster_offset + (s->cluster_sectors << 9)); - - cluster_offset >>= 9; - /* update L2 table */ - tmp = cpu_to_le32(cluster_offset); - l2_table[l2_index] = tmp; - if (bdrv_pwrite(s->hd, ((int64_t)l2_offset * 512) + (l2_index * sizeof(tmp)), - &tmp, sizeof(tmp)) != sizeof(tmp)) - return 0; - /* update backup L2 table */ - if (s->l1_backup_table_offset != 0) { - l2_offset = 
s->l1_backup_table[l1_index]; - if (bdrv_pwrite(s->hd, ((int64_t)l2_offset * 512) + (l2_index * sizeof(tmp)), - &tmp, sizeof(tmp)) != sizeof(tmp)) + // Avoid the L2 tables update for the images that have snapshots. + if (!s->child) { + status = stat(s->hd->filename, &file_buf); + if (status == -1) { + fprintf(stderr, "(VMDK) Fail file stat: filename =%s size=0x%lx errno=%s\n", + s->hd->filename, (uint64_t)file_buf.st_size, strerror(errno)); return 0; - } + } + cluster_offset = file_buf.st_size; + bdrv_truncate(s->hd, cluster_offset + (s->cluster_sectors << 9)); + cluster_offset >>= 9; + tmp = cpu_to_le32(cluster_offset); + l2_table[l2_index] = tmp; + // Save the active image state + activeBDRV.cluster_offset = cluster_offset; + activeBDRV.hd = bs; + } + /* First of all we write grain itself, to avoid race condition + * that may to corrupt the image. + * This problem may occur because of insufficient space on host disk + * or inappropriate VM shutdown. + */ if (get_whole_cluster(bs, cluster_offset, offset, allocate) == -1) return 0; + + if (m_data) { + m_data->offset = tmp; + m_data->l1_index = l1_index; + m_data->l2_index = l2_index; + m_data->l2_offset = l2_offset; + m_data->valid = 1; + } } cluster_offset <<= 9; return cluster_offset; @@ -542,7 +610,7 @@ int index_in_cluster, n; uint64_t cluster_offset; - cluster_offset = get_cluster_offset(bs, sector_num << 9, 0); + cluster_offset = get_cluster_offset(bs, NULL, sector_num << 9, 0); index_in_cluster = sector_num % s->cluster_sectors; n = s->cluster_sectors - index_in_cluster; if (n > nb_sectors) @@ -559,7 +627,7 @@ uint64_t cluster_offset; while (nb_sectors > 0) { - cluster_offset = get_cluster_offset(bs, sector_num << 9, 0); + cluster_offset = get_cluster_offset(bs, NULL, sector_num << 9, 0); index_in_cluster = sector_num % s->cluster_sectors; n = s->cluster_sectors - index_in_cluster; if (n > nb_sectors) @@ -590,20 +658,32 @@ const uint8_t *buf, int nb_sectors) { BDRVVmdkState *s = bs->opaque; + VmdkMetaData m_data; int index_in_cluster, n; uint64_t cluster_offset; static int cid_update = 0; + if (sector_num > bs->total_sectors) { + fprintf(stderr, "(VMDK) Wrong offset: sector_num=0x%lx total_sectors=0x%lx\n", sector_num, bs->total_sectors); + return -1; + } + while (nb_sectors > 0) { index_in_cluster = sector_num & (s->cluster_sectors - 1); n = s->cluster_sectors - index_in_cluster; if (n > nb_sectors) n = nb_sectors; - cluster_offset = get_cluster_offset(bs, sector_num << 9, 1); + cluster_offset = get_cluster_offset(bs, &m_data, sector_num << 9, 1); if (!cluster_offset) return -1; + if (bdrv_pwrite(s->hd, cluster_offset + index_in_cluster * 512, buf, n * 512) != n * 512) return -1; + if (m_data.valid) { + /* update L2 tables */ + if (vmdk_L2update(bs, &m_data) == -1) + return -1; + } nb_sectors -= n; sector_num += n; buf += n * 512; ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [Qemu-devel] [PATCH] Fix a race condition and non-leaf images growing in VMDK chains. 2007-05-17 13:54 ` Igor Lvovsky @ 2007-05-19 21:39 ` Thiemo Seufer 2007-05-19 21:42 ` Thiemo Seufer 0 siblings, 1 reply; 8+ messages in thread From: Thiemo Seufer @ 2007-05-19 21:39 UTC (permalink / raw) To: Igor Lvovsky; +Cc: qemu-devel [-- Attachment #1: Type: text/plain, Size: 448 bytes --] Igor Lvovsky wrote: > > Hi, > The bug was in my last patch. > This is a new one. > I'll very appreciate any comments. Appended is a variant which gets rid of the confusing "child" variable which means "is_parent", and the duplicate of it (parent_open). There was also an uninitialized variable "tmp" which was stored as offset in m_data. I quite possibly missed some bits which broke the patch, please test if it still works for you. Thiemo [-- Attachment #2: block-vdmk.diff --] [-- Type: text/x-diff, Size: 9259 bytes --] Index: block-vmdk.c =================================================================== --- block-vmdk.c.orig 2007-01-30 20:37:51.000000000 +0000 +++ block-vmdk.c 2007-05-19 22:32:01.000000000 +0100 @@ -75,8 +75,25 @@ unsigned int cluster_sectors; uint32_t parent_cid; + int is_parent; } BDRVVmdkState; +typedef struct VmdkMetaData { + uint32_t offset; + unsigned int l1_index; + unsigned int l2_index; + unsigned int l2_offset; + int valid; +} VmdkMetaData; + +typedef struct ActiveBDRVState{ + BlockDriverState *hd; // active image handler + uint64_t cluster_offset; // current write offset +}ActiveBDRVState; + +static ActiveBDRVState activeBDRV; + + static int vmdk_probe(const uint8_t *buf, int buf_size, const char *filename) { uint32_t magic; @@ -305,7 +322,6 @@ bdrv_close(bs->backing_hd); } - static int vmdk_parent_open(BlockDriverState *bs, const char * filename) { BDRVVmdkState *s = bs->opaque; @@ -336,11 +352,14 @@ s->hd->backing_hd = bdrv_new(""); if (!s->hd->backing_hd) { failure: + s->is_parent = 0; bdrv_close(s->hd); return -1; } - if (bdrv_open(s->hd->backing_hd, parent_img_name, 0) < 0) + s->is_parent = 1; + if (bdrv_open(s->hd->backing_hd, parent_img_name, BDRV_O_RDONLY) < 0) goto failure; + s->is_parent = 0; } return 0; @@ -352,6 +371,11 @@ uint32_t magic; int l1_size, i, ret; + if (s->is_parent) + // Parent must be opened as RO. 
+ flags = BDRV_O_RDONLY; + fprintf(stderr, "(VMDK) image open: flags=0x%x filename=%s\n", flags, bs->filename); + ret = bdrv_file_open(&s->hd, filename, flags); if (ret < 0) return ret; @@ -430,7 +454,8 @@ return -1; } -static uint64_t get_cluster_offset(BlockDriverState *bs, uint64_t offset, int allocate); +static uint64_t get_cluster_offset(BlockDriverState *bs, VmdkMetaData *m_data, + uint64_t offset, int allocate); static int get_whole_cluster(BlockDriverState *bs, uint64_t cluster_offset, uint64_t offset, int allocate) @@ -446,27 +471,55 @@ if (!vmdk_is_cid_valid(bs)) return -1; - parent_cluster_offset = get_cluster_offset(s->hd->backing_hd, offset, allocate); - if (bdrv_pread(ps->hd, parent_cluster_offset, whole_grain, ps->cluster_sectors*512) != - ps->cluster_sectors*512) - return -1; - if (bdrv_pwrite(s->hd, cluster_offset << 9, whole_grain, sizeof(whole_grain)) != - sizeof(whole_grain)) + parent_cluster_offset = get_cluster_offset(s->hd->backing_hd, NULL, offset, allocate); + + if (parent_cluster_offset) { + BDRVVmdkState *act_s = activeBDRV.hd->opaque; + + if (bdrv_pread(ps->hd, parent_cluster_offset, whole_grain, ps->cluster_sectors*512) != ps->cluster_sectors*512) + return -1; + + //Write grain only into the active image + if (bdrv_pwrite(act_s->hd, activeBDRV.cluster_offset << 9, whole_grain, sizeof(whole_grain)) != sizeof(whole_grain)) + return -1; + } + } + return 0; +} + +static int vmdk_L2update(BlockDriverState *bs, VmdkMetaData *m_data) +{ + BDRVVmdkState *s = bs->opaque; + + /* update L2 table */ + if (bdrv_pwrite(s->hd, ((int64_t)m_data->l2_offset * 512) + (m_data->l2_index * sizeof(m_data->offset)), + &(m_data->offset), sizeof(m_data->offset)) != sizeof(m_data->offset)) + return -1; + /* update backup L2 table */ + if (s->l1_backup_table_offset != 0) { + m_data->l2_offset = s->l1_backup_table[m_data->l1_index]; + if (bdrv_pwrite(s->hd, ((int64_t)m_data->l2_offset * 512) + (m_data->l2_index * sizeof(m_data->offset)), + &(m_data->offset), sizeof(m_data->offset)) != sizeof(m_data->offset)) return -1; } + return 0; } -static uint64_t get_cluster_offset(BlockDriverState *bs, +static uint64_t get_cluster_offset(BlockDriverState *bs, VmdkMetaData *m_data, uint64_t offset, int allocate) { BDRVVmdkState *s = bs->opaque; unsigned int l1_index, l2_offset, l2_index; int min_index, i, j; - uint32_t min_count, *l2_table, tmp; + uint32_t min_count, *l2_table, tmp = 0; uint64_t cluster_offset; - + int status; + + if (m_data) + m_data->valid = 0; + l1_index = (offset >> 9) / s->l1_entry_sectors; if (l1_index >= s->l1_size) return 0; @@ -504,32 +557,45 @@ found: l2_index = ((offset >> 9) / s->cluster_sectors) % s->l2_size; cluster_offset = le32_to_cpu(l2_table[l2_index]); + if (!cluster_offset) { struct stat file_buf; if (!allocate) return 0; - stat(s->hd->filename, &file_buf); - cluster_offset = file_buf.st_size; - bdrv_truncate(s->hd, cluster_offset + (s->cluster_sectors << 9)); - - cluster_offset >>= 9; - /* update L2 table */ - tmp = cpu_to_le32(cluster_offset); - l2_table[l2_index] = tmp; - if (bdrv_pwrite(s->hd, ((int64_t)l2_offset * 512) + (l2_index * sizeof(tmp)), - &tmp, sizeof(tmp)) != sizeof(tmp)) - return 0; - /* update backup L2 table */ - if (s->l1_backup_table_offset != 0) { - l2_offset = s->l1_backup_table[l1_index]; - if (bdrv_pwrite(s->hd, ((int64_t)l2_offset * 512) + (l2_index * sizeof(tmp)), - &tmp, sizeof(tmp)) != sizeof(tmp)) + // Avoid the L2 tables update for the images that have snapshots. 
+ if (!s->is_parent) { + status = stat(s->hd->filename, &file_buf); + if (status == -1) { + fprintf(stderr, "(VMDK) Fail file stat: filename =%s size=0x%lx errno=%s\n", + s->hd->filename, (uint64_t)file_buf.st_size, strerror(errno)); return 0; - } + } + cluster_offset = file_buf.st_size; + bdrv_truncate(s->hd, cluster_offset + (s->cluster_sectors << 9)); + cluster_offset >>= 9; + tmp = cpu_to_le32(cluster_offset); + l2_table[l2_index] = tmp; + // Save the active image state + activeBDRV.cluster_offset = cluster_offset; + activeBDRV.hd = bs; + } + /* First of all we write grain itself, to avoid race condition + * that may to corrupt the image. + * This problem may occur because of insufficient space on host disk + * or inappropriate VM shutdown. + */ if (get_whole_cluster(bs, cluster_offset, offset, allocate) == -1) return 0; + + if (m_data) { + m_data->offset = tmp; + m_data->l1_index = l1_index; + m_data->l2_index = l2_index; + m_data->l2_offset = l2_offset; + m_data->valid = 1; + } } cluster_offset <<= 9; return cluster_offset; @@ -542,7 +608,7 @@ int index_in_cluster, n; uint64_t cluster_offset; - cluster_offset = get_cluster_offset(bs, sector_num << 9, 0); + cluster_offset = get_cluster_offset(bs, NULL, sector_num << 9, 0); index_in_cluster = sector_num % s->cluster_sectors; n = s->cluster_sectors - index_in_cluster; if (n > nb_sectors) @@ -559,7 +625,7 @@ uint64_t cluster_offset; while (nb_sectors > 0) { - cluster_offset = get_cluster_offset(bs, sector_num << 9, 0); + cluster_offset = get_cluster_offset(bs, NULL, sector_num << 9, 0); index_in_cluster = sector_num % s->cluster_sectors; n = s->cluster_sectors - index_in_cluster; if (n > nb_sectors) @@ -590,20 +656,34 @@ const uint8_t *buf, int nb_sectors) { BDRVVmdkState *s = bs->opaque; + VmdkMetaData m_data; int index_in_cluster, n; uint64_t cluster_offset; static int cid_update = 0; + if (sector_num > bs->total_sectors) { + fprintf(stderr, + "(VMDK) Wrong offset: sector_num=0x%lx total_sectors=0x%lx\n", + sector_num, bs->total_sectors); + return -1; + } + while (nb_sectors > 0) { index_in_cluster = sector_num & (s->cluster_sectors - 1); n = s->cluster_sectors - index_in_cluster; if (n > nb_sectors) n = nb_sectors; - cluster_offset = get_cluster_offset(bs, sector_num << 9, 1); + cluster_offset = get_cluster_offset(bs, &m_data, sector_num << 9, 1); if (!cluster_offset) return -1; + if (bdrv_pwrite(s->hd, cluster_offset + index_in_cluster * 512, buf, n * 512) != n * 512) return -1; + if (m_data.valid) { + /* update L2 tables */ + if (vmdk_L2update(bs, &m_data) == -1) + return -1; + } nb_sectors -= n; sector_num += n; buf += n * 512; ^ permalink raw reply [flat|nested] 8+ messages in thread
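On the uninitialized "tmp" Thiemo mentions, here is a reduced sketch of the control flow of the patched get_cluster_offset(); only the shape is kept, the bodies are stubs and the values are made up. When is_parent is set, the allocation branch that assigns tmp is skipped, yet tmp is still copied into m_data->offset. In the patch the parent lookups pass a NULL m_data, so the garbage may never actually be stored, but initialising tmp to 0 removes the latent hazard.

#include <stdint.h>

struct meta {
    uint32_t offset;
    int valid;
};

uint64_t lookup_shape(int is_parent, int allocate, struct meta *m_data)
{
    uint32_t tmp = 0;             /* the fix: without "= 0" this may be stack garbage */
    uint64_t cluster_offset = 0;  /* pretend the L2 entry was empty */

    if (m_data)
        m_data->valid = 0;

    if (!cluster_offset) {
        if (!allocate)
            return 0;
        if (!is_parent) {
            cluster_offset = 0x20000 >> 9;   /* pretend we allocated a cluster */
            tmp = (uint32_t)cluster_offset;
        }
        if (m_data) {
            m_data->offset = tmp;  /* tmp is read even when the branch above was skipped */
            m_data->valid = 1;
        }
    }
    return cluster_offset << 9;
}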
* Re: [Qemu-devel] [PATCH] Fix a race condition and non-leaf images growing in VMDK chains. 2007-05-19 21:39 ` Thiemo Seufer @ 2007-05-19 21:42 ` Thiemo Seufer 2007-05-20 11:40 ` Igor Lvovsky 0 siblings, 1 reply; 8+ messages in thread From: Thiemo Seufer @ 2007-05-19 21:42 UTC (permalink / raw) To: Igor Lvovsky; +Cc: qemu-devel [-- Attachment #1: Type: text/plain, Size: 558 bytes --] Thiemo Seufer wrote: > Igor Lvovsky wrote: > > > > Hi, > > The bug was in my last patch. > > This is a new one. > > I'll very appreciate any comments. > > Appended is a variant which gets rid of the confusing "child" variable > which means "is_parent", and the duplicate of it (parent_open). > > There was also an uninitialized variable "tmp" which was stored as > offset in m_data. > > I quite possibly missed some bits which broke the patch, please test if > it still works for you. Next try, this time with the latest version of the patch. Thiemo [-- Attachment #2: lvovsky.block-vdmk.diff --] [-- Type: text/x-diff, Size: 9259 bytes --] Index: block-vmdk.c =================================================================== --- block-vmdk.c.orig 2007-01-30 20:37:51.000000000 +0000 +++ block-vmdk.c 2007-05-19 22:32:01.000000000 +0100 @@ -75,8 +75,25 @@ unsigned int cluster_sectors; uint32_t parent_cid; + int is_parent; } BDRVVmdkState; +typedef struct VmdkMetaData { + uint32_t offset; + unsigned int l1_index; + unsigned int l2_index; + unsigned int l2_offset; + int valid; +} VmdkMetaData; + +typedef struct ActiveBDRVState{ + BlockDriverState *hd; // active image handler + uint64_t cluster_offset; // current write offset +}ActiveBDRVState; + +static ActiveBDRVState activeBDRV; + + static int vmdk_probe(const uint8_t *buf, int buf_size, const char *filename) { uint32_t magic; @@ -305,7 +322,6 @@ bdrv_close(bs->backing_hd); } - static int vmdk_parent_open(BlockDriverState *bs, const char * filename) { BDRVVmdkState *s = bs->opaque; @@ -336,11 +352,14 @@ s->hd->backing_hd = bdrv_new(""); if (!s->hd->backing_hd) { failure: + s->is_parent = 0; bdrv_close(s->hd); return -1; } - if (bdrv_open(s->hd->backing_hd, parent_img_name, 0) < 0) + s->is_parent = 1; + if (bdrv_open(s->hd->backing_hd, parent_img_name, BDRV_O_RDONLY) < 0) goto failure; + s->is_parent = 0; } return 0; @@ -352,6 +371,11 @@ uint32_t magic; int l1_size, i, ret; + if (s->is_parent) + // Parent must be opened as RO. 
+ flags = BDRV_O_RDONLY; + fprintf(stderr, "(VMDK) image open: flags=0x%x filename=%s\n", flags, bs->filename); + ret = bdrv_file_open(&s->hd, filename, flags); if (ret < 0) return ret; @@ -430,7 +454,8 @@ return -1; } -static uint64_t get_cluster_offset(BlockDriverState *bs, uint64_t offset, int allocate); +static uint64_t get_cluster_offset(BlockDriverState *bs, VmdkMetaData *m_data, + uint64_t offset, int allocate); static int get_whole_cluster(BlockDriverState *bs, uint64_t cluster_offset, uint64_t offset, int allocate) @@ -446,27 +471,55 @@ if (!vmdk_is_cid_valid(bs)) return -1; - parent_cluster_offset = get_cluster_offset(s->hd->backing_hd, offset, allocate); - if (bdrv_pread(ps->hd, parent_cluster_offset, whole_grain, ps->cluster_sectors*512) != - ps->cluster_sectors*512) - return -1; - if (bdrv_pwrite(s->hd, cluster_offset << 9, whole_grain, sizeof(whole_grain)) != - sizeof(whole_grain)) + parent_cluster_offset = get_cluster_offset(s->hd->backing_hd, NULL, offset, allocate); + + if (parent_cluster_offset) { + BDRVVmdkState *act_s = activeBDRV.hd->opaque; + + if (bdrv_pread(ps->hd, parent_cluster_offset, whole_grain, ps->cluster_sectors*512) != ps->cluster_sectors*512) + return -1; + + //Write grain only into the active image + if (bdrv_pwrite(act_s->hd, activeBDRV.cluster_offset << 9, whole_grain, sizeof(whole_grain)) != sizeof(whole_grain)) + return -1; + } + } + return 0; +} + +static int vmdk_L2update(BlockDriverState *bs, VmdkMetaData *m_data) +{ + BDRVVmdkState *s = bs->opaque; + + /* update L2 table */ + if (bdrv_pwrite(s->hd, ((int64_t)m_data->l2_offset * 512) + (m_data->l2_index * sizeof(m_data->offset)), + &(m_data->offset), sizeof(m_data->offset)) != sizeof(m_data->offset)) + return -1; + /* update backup L2 table */ + if (s->l1_backup_table_offset != 0) { + m_data->l2_offset = s->l1_backup_table[m_data->l1_index]; + if (bdrv_pwrite(s->hd, ((int64_t)m_data->l2_offset * 512) + (m_data->l2_index * sizeof(m_data->offset)), + &(m_data->offset), sizeof(m_data->offset)) != sizeof(m_data->offset)) return -1; } + return 0; } -static uint64_t get_cluster_offset(BlockDriverState *bs, +static uint64_t get_cluster_offset(BlockDriverState *bs, VmdkMetaData *m_data, uint64_t offset, int allocate) { BDRVVmdkState *s = bs->opaque; unsigned int l1_index, l2_offset, l2_index; int min_index, i, j; - uint32_t min_count, *l2_table, tmp; + uint32_t min_count, *l2_table, tmp = 0; uint64_t cluster_offset; - + int status; + + if (m_data) + m_data->valid = 0; + l1_index = (offset >> 9) / s->l1_entry_sectors; if (l1_index >= s->l1_size) return 0; @@ -504,32 +557,45 @@ found: l2_index = ((offset >> 9) / s->cluster_sectors) % s->l2_size; cluster_offset = le32_to_cpu(l2_table[l2_index]); + if (!cluster_offset) { struct stat file_buf; if (!allocate) return 0; - stat(s->hd->filename, &file_buf); - cluster_offset = file_buf.st_size; - bdrv_truncate(s->hd, cluster_offset + (s->cluster_sectors << 9)); - - cluster_offset >>= 9; - /* update L2 table */ - tmp = cpu_to_le32(cluster_offset); - l2_table[l2_index] = tmp; - if (bdrv_pwrite(s->hd, ((int64_t)l2_offset * 512) + (l2_index * sizeof(tmp)), - &tmp, sizeof(tmp)) != sizeof(tmp)) - return 0; - /* update backup L2 table */ - if (s->l1_backup_table_offset != 0) { - l2_offset = s->l1_backup_table[l1_index]; - if (bdrv_pwrite(s->hd, ((int64_t)l2_offset * 512) + (l2_index * sizeof(tmp)), - &tmp, sizeof(tmp)) != sizeof(tmp)) + // Avoid the L2 tables update for the images that have snapshots. 
+ if (!s->is_parent) { + status = stat(s->hd->filename, &file_buf); + if (status == -1) { + fprintf(stderr, "(VMDK) Fail file stat: filename =%s size=0x%lx errno=%s\n", + s->hd->filename, (uint64_t)file_buf.st_size, strerror(errno)); return 0; - } + } + cluster_offset = file_buf.st_size; + bdrv_truncate(s->hd, cluster_offset + (s->cluster_sectors << 9)); + cluster_offset >>= 9; + tmp = cpu_to_le32(cluster_offset); + l2_table[l2_index] = tmp; + // Save the active image state + activeBDRV.cluster_offset = cluster_offset; + activeBDRV.hd = bs; + } + /* First of all we write grain itself, to avoid race condition + * that may to corrupt the image. + * This problem may occur because of insufficient space on host disk + * or inappropriate VM shutdown. + */ if (get_whole_cluster(bs, cluster_offset, offset, allocate) == -1) return 0; + + if (m_data) { + m_data->offset = tmp; + m_data->l1_index = l1_index; + m_data->l2_index = l2_index; + m_data->l2_offset = l2_offset; + m_data->valid = 1; + } } cluster_offset <<= 9; return cluster_offset; @@ -542,7 +608,7 @@ int index_in_cluster, n; uint64_t cluster_offset; - cluster_offset = get_cluster_offset(bs, sector_num << 9, 0); + cluster_offset = get_cluster_offset(bs, NULL, sector_num << 9, 0); index_in_cluster = sector_num % s->cluster_sectors; n = s->cluster_sectors - index_in_cluster; if (n > nb_sectors) @@ -559,7 +625,7 @@ uint64_t cluster_offset; while (nb_sectors > 0) { - cluster_offset = get_cluster_offset(bs, sector_num << 9, 0); + cluster_offset = get_cluster_offset(bs, NULL, sector_num << 9, 0); index_in_cluster = sector_num % s->cluster_sectors; n = s->cluster_sectors - index_in_cluster; if (n > nb_sectors) @@ -590,20 +656,34 @@ const uint8_t *buf, int nb_sectors) { BDRVVmdkState *s = bs->opaque; + VmdkMetaData m_data; int index_in_cluster, n; uint64_t cluster_offset; static int cid_update = 0; + if (sector_num > bs->total_sectors) { + fprintf(stderr, + "(VMDK) Wrong offset: sector_num=0x%lx total_sectors=0x%lx\n", + sector_num, bs->total_sectors); + return -1; + } + while (nb_sectors > 0) { index_in_cluster = sector_num & (s->cluster_sectors - 1); n = s->cluster_sectors - index_in_cluster; if (n > nb_sectors) n = nb_sectors; - cluster_offset = get_cluster_offset(bs, sector_num << 9, 1); + cluster_offset = get_cluster_offset(bs, &m_data, sector_num << 9, 1); if (!cluster_offset) return -1; + if (bdrv_pwrite(s->hd, cluster_offset + index_in_cluster * 512, buf, n * 512) != n * 512) return -1; + if (m_data.valid) { + /* update L2 tables */ + if (vmdk_L2update(bs, &m_data) == -1) + return -1; + } nb_sectors -= n; sector_num += n; buf += n * 512; ^ permalink raw reply [flat|nested] 8+ messages in thread
* RE: [Qemu-devel] [PATCH] Fix a race condition and non-leaf images growing in VMDK chains.
  2007-05-19 21:42 ` Thiemo Seufer
@ 2007-05-20 11:40   ` Igor Lvovsky
  0 siblings, 0 replies; 8+ messages in thread
From: Igor Lvovsky @ 2007-05-20 11:40 UTC (permalink / raw)
To: Thiemo Seufer; +Cc: qemu-devel

Hi Thiemo,

Thank you for the comments. You are probably right about the "child" ->
"is_parent" renaming, but the "parent_open" variable (I know the name is
not the best, feel free to rename it) is not a duplicate of "is_parent".
The only reason "parent_open" exists is to set/reset "is_parent". I don't
think we can use "is_parent" alone instead, because we need to know at
every moment whether an image is a "parent" or the "leaf". I mean, if we
used "is_parent" instead of "parent_open":

+        s->is_parent = 1;
+        if (bdrv_open(s->hd->backing_hd, parent_img_name, BDRV_O_RDONLY) < 0)
             goto failure;
+        s->is_parent = 0;

we would reset it right after the file is opened, which means that later,
in get_cluster_offset(), we could no longer tell whether the image is a
"parent" or the "leaf".

So, I attach a new patch with your comments applied.

Regards,
Igor Lvovsky

-----Original Message-----
From: Thiemo Seufer [mailto:ths@networkno.de]
Sent: Sunday, May 20, 2007 12:42 AM
To: Igor Lvovsky
Cc: qemu-devel@nongnu.org
Subject: Re: [Qemu-devel] [PATCH] Fix a race condition and non-leaf images growing in VMDK chains.

Thiemo Seufer wrote:
> Igor Lvovsky wrote:
> >
> > Hi,
> > The bug was in my last patch.
> > This is a new one.
> > I'll very appreciate any comments.
>
> Appended is a variant which gets rid of the confusing "child" variable
> which means "is_parent", and the duplicate of it (parent_open).
>
> There was also an uninitialized variable "tmp" which was stored as
> offset in m_data.
>
> I quite possibly missed some bits which broke the patch, please test if
> it still works for you.

Next try, this time with the latest version of the patch.


Thiemo

[-- Attachment #2: block-vmdk-patch.diff --]
[-- Type: application/octet-stream, Size: 9635 bytes --]

Index: block-vmdk.c
===================================================================
RCS file: /sources/qemu/qemu/block-vmdk.c,v
retrieving revision 1.10
diff -u -r1.10 block-vmdk.c
--- block-vmdk.c	24 Jan 2007 21:05:24 -0000	1.10
+++ block-vmdk.c	20 May 2007 10:54:28 -0000
@@ -75,8 +75,25 @@
     unsigned int cluster_sectors;
     uint32_t parent_cid;
+    int is_parent;
 } BDRVVmdkState;
 
+typedef struct VmdkMetaData {
+    uint32_t offset;
+    unsigned int l1_index;
+    unsigned int l2_index;
+    unsigned int l2_offset;
+    int valid;
+} VmdkMetaData;
+
+typedef struct ActiveBDRVState{
+    BlockDriverState *hd;            // active image handler
+    uint64_t cluster_offset;         // current write offset
+}ActiveBDRVState;
+
+static ActiveBDRVState activeBDRV;
+
+
 
 static int vmdk_probe(const uint8_t *buf, int buf_size, const char *filename)
 {
     uint32_t magic;
@@ -305,7 +322,7 @@
         bdrv_close(bs->backing_hd);
 }
 
-
+int parent_open = 0;
 static int vmdk_parent_open(BlockDriverState *bs, const char * filename)
 {
     BDRVVmdkState *s = bs->opaque;
@@ -339,8 +356,10 @@
             bdrv_close(s->hd);
             return -1;
         }
-        if (bdrv_open(s->hd->backing_hd, parent_img_name, 0) < 0)
+        parent_open = 1;
+        if (bdrv_open(s->hd->backing_hd, parent_img_name, BDRV_O_RDONLY) < 0)
             goto failure;
+        parent_open = 0;
     }
 
     return 0;
@@ -352,6 +371,11 @@
     uint32_t magic;
     int l1_size, i, ret;
 
+    if (parent_open)
+        // Parent must be opened as RO.
+ flags = BDRV_O_RDONLY; + fprintf(stderr, "(VMDK) image open: flags=0x%x filename=%s\n", flags, bs->filename); + ret = bdrv_file_open(&s->hd, filename, flags); if (ret < 0) return ret; @@ -386,6 +410,11 @@ / s->l1_entry_sectors; s->l1_table_offset = le64_to_cpu(header.rgd_offset) << 9; s->l1_backup_table_offset = le64_to_cpu(header.gd_offset) << 9; + + if (parent_open) + s->is_parent = 1; + else + s->is_parent = 0; // try to open parent images, if exist if (vmdk_parent_open(bs, filename) != 0) @@ -430,7 +459,8 @@ return -1; } -static uint64_t get_cluster_offset(BlockDriverState *bs, uint64_t offset, int allocate); +static uint64_t get_cluster_offset(BlockDriverState *bs, VmdkMetaData *m_data, + uint64_t offset, int allocate); static int get_whole_cluster(BlockDriverState *bs, uint64_t cluster_offset, uint64_t offset, int allocate) @@ -446,27 +476,55 @@ if (!vmdk_is_cid_valid(bs)) return -1; - parent_cluster_offset = get_cluster_offset(s->hd->backing_hd, offset, allocate); - if (bdrv_pread(ps->hd, parent_cluster_offset, whole_grain, ps->cluster_sectors*512) != - ps->cluster_sectors*512) - return -1; - if (bdrv_pwrite(s->hd, cluster_offset << 9, whole_grain, sizeof(whole_grain)) != - sizeof(whole_grain)) + parent_cluster_offset = get_cluster_offset(s->hd->backing_hd, NULL, offset, allocate); + + if (parent_cluster_offset) { + BDRVVmdkState *act_s = activeBDRV.hd->opaque; + + if (bdrv_pread(ps->hd, parent_cluster_offset, whole_grain, ps->cluster_sectors*512) != ps->cluster_sectors*512) + return -1; + + //Write grain only into the active image + if (bdrv_pwrite(act_s->hd, activeBDRV.cluster_offset << 9, whole_grain, sizeof(whole_grain)) != sizeof(whole_grain)) + return -1; + } + } + return 0; +} + +static int vmdk_L2update(BlockDriverState *bs, VmdkMetaData *m_data) +{ + BDRVVmdkState *s = bs->opaque; + + /* update L2 table */ + if (bdrv_pwrite(s->hd, ((int64_t)m_data->l2_offset * 512) + (m_data->l2_index * sizeof(m_data->offset)), + &(m_data->offset), sizeof(m_data->offset)) != sizeof(m_data->offset)) + return -1; + /* update backup L2 table */ + if (s->l1_backup_table_offset != 0) { + m_data->l2_offset = s->l1_backup_table[m_data->l1_index]; + if (bdrv_pwrite(s->hd, ((int64_t)m_data->l2_offset * 512) + (m_data->l2_index * sizeof(m_data->offset)), + &(m_data->offset), sizeof(m_data->offset)) != sizeof(m_data->offset)) return -1; } + return 0; } -static uint64_t get_cluster_offset(BlockDriverState *bs, +static uint64_t get_cluster_offset(BlockDriverState *bs, VmdkMetaData *m_data, uint64_t offset, int allocate) { BDRVVmdkState *s = bs->opaque; unsigned int l1_index, l2_offset, l2_index; int min_index, i, j; - uint32_t min_count, *l2_table, tmp; + uint32_t min_count, *l2_table, tmp = 0; uint64_t cluster_offset; - + int status; + + if (m_data) + m_data->valid = 0; + l1_index = (offset >> 9) / s->l1_entry_sectors; if (l1_index >= s->l1_size) return 0; @@ -504,32 +562,45 @@ found: l2_index = ((offset >> 9) / s->cluster_sectors) % s->l2_size; cluster_offset = le32_to_cpu(l2_table[l2_index]); + if (!cluster_offset) { struct stat file_buf; if (!allocate) return 0; - stat(s->hd->filename, &file_buf); - cluster_offset = file_buf.st_size; - bdrv_truncate(s->hd, cluster_offset + (s->cluster_sectors << 9)); - - cluster_offset >>= 9; - /* update L2 table */ - tmp = cpu_to_le32(cluster_offset); - l2_table[l2_index] = tmp; - if (bdrv_pwrite(s->hd, ((int64_t)l2_offset * 512) + (l2_index * sizeof(tmp)), - &tmp, sizeof(tmp)) != sizeof(tmp)) - return 0; - /* update backup L2 table */ - if 
(s->l1_backup_table_offset != 0) { - l2_offset = s->l1_backup_table[l1_index]; - if (bdrv_pwrite(s->hd, ((int64_t)l2_offset * 512) + (l2_index * sizeof(tmp)), - &tmp, sizeof(tmp)) != sizeof(tmp)) + // Avoid the L2 tables update for the images that have snapshots. + if (!s->is_parent) { + status = stat(s->hd->filename, &file_buf); + if (status == -1) { + fprintf(stderr, "(VMDK) Fail file stat: filename =%s size=0x%lx errno=%s\n", + s->hd->filename, (uint64_t)file_buf.st_size, strerror(errno)); return 0; - } + } + cluster_offset = file_buf.st_size; + bdrv_truncate(s->hd, cluster_offset + (s->cluster_sectors << 9)); + cluster_offset >>= 9; + tmp = cpu_to_le32(cluster_offset); + l2_table[l2_index] = tmp; + // Save the active image state + activeBDRV.cluster_offset = cluster_offset; + activeBDRV.hd = bs; + } + /* First of all we write grain itself, to avoid race condition + * that may to corrupt the image. + * This problem may occur because of insufficient space on host disk + * or inappropriate VM shutdown. + */ if (get_whole_cluster(bs, cluster_offset, offset, allocate) == -1) return 0; + + if (m_data) { + m_data->offset = tmp; + m_data->l1_index = l1_index; + m_data->l2_index = l2_index; + m_data->l2_offset = l2_offset; + m_data->valid = 1; + } } cluster_offset <<= 9; return cluster_offset; @@ -542,7 +613,7 @@ int index_in_cluster, n; uint64_t cluster_offset; - cluster_offset = get_cluster_offset(bs, sector_num << 9, 0); + cluster_offset = get_cluster_offset(bs, NULL, sector_num << 9, 0); index_in_cluster = sector_num % s->cluster_sectors; n = s->cluster_sectors - index_in_cluster; if (n > nb_sectors) @@ -559,7 +630,7 @@ uint64_t cluster_offset; while (nb_sectors > 0) { - cluster_offset = get_cluster_offset(bs, sector_num << 9, 0); + cluster_offset = get_cluster_offset(bs, NULL, sector_num << 9, 0); index_in_cluster = sector_num % s->cluster_sectors; n = s->cluster_sectors - index_in_cluster; if (n > nb_sectors) @@ -590,20 +661,34 @@ const uint8_t *buf, int nb_sectors) { BDRVVmdkState *s = bs->opaque; + VmdkMetaData m_data; int index_in_cluster, n; uint64_t cluster_offset; static int cid_update = 0; + if (sector_num > bs->total_sectors) { + fprintf(stderr, + "(VMDK) Wrong offset: sector_num=0x%lx total_sectors=0x%lx\n", + sector_num, bs->total_sectors); + return -1; + } + while (nb_sectors > 0) { index_in_cluster = sector_num & (s->cluster_sectors - 1); n = s->cluster_sectors - index_in_cluster; if (n > nb_sectors) n = nb_sectors; - cluster_offset = get_cluster_offset(bs, sector_num << 9, 1); + cluster_offset = get_cluster_offset(bs, &m_data, sector_num << 9, 1); if (!cluster_offset) return -1; + if (bdrv_pwrite(s->hd, cluster_offset + index_in_cluster * 512, buf, n * 512) != n * 512) return -1; + if (m_data.valid) { + /* update L2 tables */ + if (vmdk_L2update(bs, &m_data) == -1) + return -1; + } nb_sectors -= n; sector_num += n; buf += n * 512; ^ permalink raw reply [flat|nested] 8+ messages in thread
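To round off the point about keeping both flags, here is a minimal sketch of the pattern under discussion; the struct and function are invented, while the real code latches the flag into BDRVVmdkState inside vmdk_open(). A transient module-level "opening a parent" flag is only raised around the recursive open of the backing file, and the per-image is_parent field latched from it persists, so the write path can still tell ancestors from the leaf long after the open has finished. Resetting the per-image field itself around bdrv_open(), as in the fragment Igor quotes, would throw that information away.

#include <stdlib.h>

struct image_state {
    int is_parent;                 /* persistent: 1 for every ancestor image */
    struct image_state *backing;   /* parent image, NULL for the base */
};

static int opening_parent = 0;     /* transient: meaningful only while opening */

struct image_state *open_image(int has_parent)
{
    struct image_state *s = calloc(1, sizeof(*s));
    if (s == NULL)
        return NULL;

    /* latch the transient flag: it says what kind of image *this one* is */
    s->is_parent = opening_parent;

    if (has_parent) {
        opening_parent = 1;        /* everything opened next is an ancestor */
        s->backing = open_image(0);/* recurse; a real chain would go deeper */
        opening_parent = 0;        /* back to "leaf" for later opens */
    }
    return s;
}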