* [Qemu-devel] [RFC][PATCH 0/2] block: Add flush after metadata writes
@ 2010-06-17 12:03 Kevin Wolf
2010-06-17 12:03 ` [Qemu-devel] [RFC][PATCH 1/2] block: Add bdrv_(p)write_sync Kevin Wolf
2010-06-17 12:03 ` [Qemu-devel] [RFC][PATCH 2/2] qcow2: Use bdrv_(p)write_sync for metadata writes Kevin Wolf
0 siblings, 2 replies; 8+ messages in thread
From: Kevin Wolf @ 2010-06-17 12:03 UTC (permalink / raw)
To: qemu-devel; +Cc: kwolf, stefanha, hch
This addresses the data integrity problems described at
http://wiki.qemu.org/Features/Qcow2DataIntegrity#Metadata_update_ordering.2C_Part_2
These problems are the same for all image formats (except raw, which doesn't
have any metadata), so I'm going to add more patches for the other formats in
the real (non-RFC) submission.
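To make the ordering requirement concrete, here is a generic sketch (not
actual qcow2 code; all names are made up): whenever a metadata write depends
on an earlier one, a flush has to sit between the two writes so they cannot
be reordered on the way to the disk.

    #include <stdint.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Illustration only: write a new metadata block, make sure it is stable,
     * then write the entry that references it.  Without the flush, a crash
     * could leave the reference on disk while the block it points to is
     * still missing. */
    static int update_dependent_metadata(int fd, off_t block_ofs,
                                         const void *block, size_t block_len,
                                         off_t ref_ofs, uint64_t ref)
    {
        if (pwrite(fd, block, block_len, block_ofs) != (ssize_t)block_len)
            return -1;
        if (fdatasync(fd) < 0)      /* barrier: the new block is on disk now */
            return -1;
        if (pwrite(fd, &ref, sizeof(ref), ref_ofs) != (ssize_t)sizeof(ref))
            return -1;
        return 0;
    }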
Kevin Wolf (2):
block: Add bdrv_(p)write_sync
qcow2: Use bdrv_(p)write_sync for metadata writes
block.c | 37 +++++++++++++++++++++++++++++++++++++
block.h | 4 ++++
block/qcow2-cluster.c | 16 ++++++++--------
block/qcow2-refcount.c | 18 +++++++++---------
block/qcow2-snapshot.c | 14 +++++++-------
block/qcow2.c | 10 +++++-----
6 files changed, 70 insertions(+), 29 deletions(-)
* [Qemu-devel] [RFC][PATCH 1/2] block: Add bdrv_(p)write_sync
2010-06-17 12:03 [Qemu-devel] [RFC][PATCH 0/2] block: Add flush after metadata writes Kevin Wolf
@ 2010-06-17 12:03 ` Kevin Wolf
2010-06-17 12:03 ` [Qemu-devel] [RFC][PATCH 2/2] qcow2: Use bdrv_(p)write_sync for metadata writes Kevin Wolf
1 sibling, 0 replies; 8+ messages in thread
From: Kevin Wolf @ 2010-06-17 12:03 UTC (permalink / raw)
To: qemu-devel; +Cc: kwolf, stefanha, hch
Add new functions that write and flush the written data to disk immediately.
This is what needs to be used for image format metadata to maintain integrity
for cache=... modes that don't use O_DSYNC. (Actually, we only need barriers,
and therefore the functions are defined as such, but a flush is what is
implemented in this patch - we can try to change that later.)
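A typical caller would look roughly like this (call-site fragment only;
'table_offset', 'table' and 'table_len' are invented names, the point is just
the pattern of checking a single return value):

    /* Hypothetical metadata update in a format driver: the helper both
     * writes and flushes before returning. */
    ret = bdrv_pwrite_sync(bs->file, table_offset, table, table_len);
    if (ret < 0) {
        return ret;     /* the write or the flush failed */
    }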
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
block.c | 37 +++++++++++++++++++++++++++++++++++++
block.h | 4 ++++
2 files changed, 41 insertions(+), 0 deletions(-)
diff --git a/block.c b/block.c
index 0765fbc..7b64c2d 100644
--- a/block.c
+++ b/block.c
@@ -1010,6 +1010,43 @@ int bdrv_pwrite(BlockDriverState *bs, int64_t offset,
return count1;
}
+/*
+ * Writes to the file and ensures that no writes are reordered across this
+ * request (acts as a barrier)
+ *
+ * Returns 0 on success, -errno in error cases.
+ */
+int bdrv_pwrite_sync(BlockDriverState *bs, int64_t offset,
+ const void *buf, int count)
+{
+ int ret;
+
+ ret = bdrv_pwrite(bs, offset, buf, count);
+ if (ret < 0) {
+ return ret;
+ }
+
+ /* No flush needed for cache=writethrough, it uses O_DSYNC */
+ if ((bs->open_flags & BDRV_O_CACHE_MASK) != 0) {
+ bdrv_flush(bs);
+ }
+
+ return 0;
+}
+
+/*
+ * Writes to the file and ensures that no writes are reordered across this
+ * request (acts as a barrier)
+ *
+ * Returns 0 on success, -errno in error cases.
+ */
+int bdrv_write_sync(BlockDriverState *bs, int64_t sector_num,
+ const uint8_t *buf, int nb_sectors)
+{
+ return bdrv_pwrite_sync(bs, BDRV_SECTOR_SIZE * sector_num,
+ buf, BDRV_SECTOR_SIZE * nb_sectors);
+}
+
/**
* Truncate file to 'offset' bytes (needed only for file protocols)
*/
diff --git a/block.h b/block.h
index 9df9b38..6a157f4 100644
--- a/block.h
+++ b/block.h
@@ -80,6 +80,10 @@ int bdrv_pread(BlockDriverState *bs, int64_t offset,
void *buf, int count);
int bdrv_pwrite(BlockDriverState *bs, int64_t offset,
const void *buf, int count);
+int bdrv_pwrite_sync(BlockDriverState *bs, int64_t offset,
+ const void *buf, int count);
+int bdrv_write_sync(BlockDriverState *bs, int64_t sector_num,
+ const uint8_t *buf, int nb_sectors);
int bdrv_truncate(BlockDriverState *bs, int64_t offset);
int64_t bdrv_getlength(BlockDriverState *bs);
void bdrv_get_geometry(BlockDriverState *bs, uint64_t *nb_sectors_ptr);
--
1.6.6.1
* [Qemu-devel] [RFC][PATCH 2/2] qcow2: Use bdrv_(p)write_sync for metadata writes
2010-06-17 12:03 [Qemu-devel] [RFC][PATCH 0/2] block: Add flush after metadata writes Kevin Wolf
2010-06-17 12:03 ` [Qemu-devel] [RFC][PATCH 1/2] block: Add bdrv_(p)write_sync Kevin Wolf
@ 2010-06-17 12:03 ` Kevin Wolf
2010-06-17 14:19 ` [Qemu-devel] " Stefan Hajnoczi
1 sibling, 1 reply; 8+ messages in thread
From: Kevin Wolf @ 2010-06-17 12:03 UTC (permalink / raw)
To: qemu-devel; +Cc: kwolf, stefanha, hch
Use bdrv_(p)write_sync to ensure metadata integrity in case of a crash.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
block/qcow2-cluster.c | 16 ++++++++--------
block/qcow2-refcount.c | 18 +++++++++---------
block/qcow2-snapshot.c | 14 +++++++-------
block/qcow2.c | 10 +++++-----
4 files changed, 29 insertions(+), 29 deletions(-)
diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index 5760ad6..05cf6c2 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -64,7 +64,7 @@ int qcow2_grow_l1_table(BlockDriverState *bs, int min_size)
BLKDBG_EVENT(bs->file, BLKDBG_L1_GROW_WRITE_TABLE);
for(i = 0; i < s->l1_size; i++)
new_l1_table[i] = cpu_to_be64(new_l1_table[i]);
- ret = bdrv_pwrite(bs->file, new_l1_table_offset, new_l1_table, new_l1_size2);
+ ret = bdrv_pwrite_sync(bs->file, new_l1_table_offset, new_l1_table, new_l1_size2);
if (ret != new_l1_size2)
goto fail;
for(i = 0; i < s->l1_size; i++)
@@ -74,7 +74,7 @@ int qcow2_grow_l1_table(BlockDriverState *bs, int min_size)
BLKDBG_EVENT(bs->file, BLKDBG_L1_GROW_ACTIVATE_TABLE);
cpu_to_be32w((uint32_t*)data, new_l1_size);
cpu_to_be64w((uint64_t*)(data + 4), new_l1_table_offset);
- ret = bdrv_pwrite(bs->file, offsetof(QCowHeader, l1_size), data,sizeof(data));
+ ret = bdrv_pwrite_sync(bs->file, offsetof(QCowHeader, l1_size), data,sizeof(data));
if (ret != sizeof(data)) {
goto fail;
}
@@ -207,7 +207,7 @@ static int write_l1_entry(BlockDriverState *bs, int l1_index)
}
BLKDBG_EVENT(bs->file, BLKDBG_L1_UPDATE);
- ret = bdrv_pwrite(bs->file, s->l1_table_offset + 8 * l1_start_index,
+ ret = bdrv_pwrite_sync(bs->file, s->l1_table_offset + 8 * l1_start_index,
buf, sizeof(buf));
if (ret < 0) {
return ret;
@@ -263,7 +263,7 @@ static int l2_allocate(BlockDriverState *bs, int l1_index, uint64_t **table)
}
/* write the l2 table to the file */
BLKDBG_EVENT(bs->file, BLKDBG_L2_ALLOC_WRITE);
- ret = bdrv_pwrite(bs->file, l2_offset, l2_table,
+ ret = bdrv_pwrite_sync(bs->file, l2_offset, l2_table,
s->l2_size * sizeof(uint64_t));
if (ret < 0) {
goto fail;
@@ -413,8 +413,8 @@ static int copy_sectors(BlockDriverState *bs, uint64_t start_sect,
&s->aes_encrypt_key);
}
BLKDBG_EVENT(bs->file, BLKDBG_COW_WRITE);
- ret = bdrv_write(bs->file, (cluster_offset >> 9) + n_start,
- s->cluster_data, n);
+ ret = bdrv_write_sync(bs->file, (cluster_offset >> 9) + n_start,
+ s->cluster_data, n);
if (ret < 0)
return ret;
return 0;
@@ -631,7 +631,7 @@ uint64_t qcow2_alloc_compressed_cluster_offset(BlockDriverState *bs,
BLKDBG_EVENT(bs->file, BLKDBG_L2_UPDATE_COMPRESSED);
l2_table[l2_index] = cpu_to_be64(cluster_offset);
- if (bdrv_pwrite(bs->file,
+ if (bdrv_pwrite_sync(bs->file,
l2_offset + l2_index * sizeof(uint64_t),
l2_table + l2_index,
sizeof(uint64_t)) != sizeof(uint64_t))
@@ -655,7 +655,7 @@ static int write_l2_entries(BlockDriverState *bs, uint64_t *l2_table,
int ret;
BLKDBG_EVENT(bs->file, BLKDBG_L2_UPDATE);
- ret = bdrv_pwrite(bs->file, l2_offset + start_offset,
+ ret = bdrv_pwrite_sync(bs->file, l2_offset + start_offset,
&l2_table[l2_start_index], len);
if (ret < 0) {
return ret;
diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
index 41e1da9..540bf49 100644
--- a/block/qcow2-refcount.c
+++ b/block/qcow2-refcount.c
@@ -44,7 +44,7 @@ static int write_refcount_block(BlockDriverState *bs)
}
BLKDBG_EVENT(bs->file, BLKDBG_REFBLOCK_UPDATE);
- if (bdrv_pwrite(bs->file, s->refcount_block_cache_offset,
+ if (bdrv_pwrite_sync(bs->file, s->refcount_block_cache_offset,
s->refcount_block_cache, size) != size)
{
return -EIO;
@@ -269,7 +269,7 @@ static int64_t alloc_refcount_block(BlockDriverState *bs, int64_t cluster_index)
/* Now the new refcount block needs to be written to disk */
BLKDBG_EVENT(bs->file, BLKDBG_REFBLOCK_ALLOC_WRITE);
- ret = bdrv_pwrite(bs->file, new_block, s->refcount_block_cache,
+ ret = bdrv_pwrite_sync(bs->file, new_block, s->refcount_block_cache,
s->cluster_size);
if (ret < 0) {
goto fail_block;
@@ -279,7 +279,7 @@ static int64_t alloc_refcount_block(BlockDriverState *bs, int64_t cluster_index)
if (refcount_table_index < s->refcount_table_size) {
uint64_t data64 = cpu_to_be64(new_block);
BLKDBG_EVENT(bs->file, BLKDBG_REFBLOCK_ALLOC_HOOKUP);
- ret = bdrv_pwrite(bs->file,
+ ret = bdrv_pwrite_sync(bs->file,
s->refcount_table_offset + refcount_table_index * sizeof(uint64_t),
&data64, sizeof(data64));
if (ret < 0) {
@@ -359,7 +359,7 @@ static int64_t alloc_refcount_block(BlockDriverState *bs, int64_t cluster_index)
/* Write refcount blocks to disk */
BLKDBG_EVENT(bs->file, BLKDBG_REFBLOCK_ALLOC_WRITE_BLOCKS);
- ret = bdrv_pwrite(bs->file, meta_offset, new_blocks,
+ ret = bdrv_pwrite_sync(bs->file, meta_offset, new_blocks,
blocks_clusters * s->cluster_size);
qemu_free(new_blocks);
if (ret < 0) {
@@ -372,7 +372,7 @@ static int64_t alloc_refcount_block(BlockDriverState *bs, int64_t cluster_index)
}
BLKDBG_EVENT(bs->file, BLKDBG_REFBLOCK_ALLOC_WRITE_TABLE);
- ret = bdrv_pwrite(bs->file, table_offset, new_table,
+ ret = bdrv_pwrite_sync(bs->file, table_offset, new_table,
table_size * sizeof(uint64_t));
if (ret < 0) {
goto fail_table;
@@ -387,7 +387,7 @@ static int64_t alloc_refcount_block(BlockDriverState *bs, int64_t cluster_index)
cpu_to_be64w((uint64_t*)data, table_offset);
cpu_to_be32w((uint32_t*)(data + 8), table_clusters);
BLKDBG_EVENT(bs->file, BLKDBG_REFBLOCK_ALLOC_SWITCH_TABLE);
- ret = bdrv_pwrite(bs->file, offsetof(QCowHeader, refcount_table_offset),
+ ret = bdrv_pwrite_sync(bs->file, offsetof(QCowHeader, refcount_table_offset),
data, sizeof(data));
if (ret < 0) {
goto fail_table;
@@ -444,7 +444,7 @@ static int write_refcount_block_entries(BlockDriverState *bs,
size = (last_index - first_index) << REFCOUNT_SHIFT;
BLKDBG_EVENT(bs->file, BLKDBG_REFBLOCK_UPDATE_PART);
- ret = bdrv_pwrite(bs->file,
+ ret = bdrv_pwrite_sync(bs->file,
refcount_block_offset + (first_index << REFCOUNT_SHIFT),
&s->refcount_block_cache[first_index], size);
if (ret < 0) {
@@ -826,7 +826,7 @@ int qcow2_update_snapshot_refcount(BlockDriverState *bs,
}
}
if (l2_modified) {
- if (bdrv_pwrite(bs->file,
+ if (bdrv_pwrite_sync(bs->file,
l2_offset, l2_table, l2_size) != l2_size)
goto fail;
}
@@ -850,7 +850,7 @@ int qcow2_update_snapshot_refcount(BlockDriverState *bs,
if (l1_modified) {
for(i = 0; i < l1_size; i++)
cpu_to_be64s(&l1_table[i]);
- if (bdrv_pwrite(bs->file, l1_table_offset, l1_table,
+ if (bdrv_pwrite_sync(bs->file, l1_table_offset, l1_table,
l1_size2) != l1_size2)
goto fail;
for(i = 0; i < l1_size; i++)
diff --git a/block/qcow2-snapshot.c b/block/qcow2-snapshot.c
index 2a21c17..4c33996 100644
--- a/block/qcow2-snapshot.c
+++ b/block/qcow2-snapshot.c
@@ -158,24 +158,24 @@ static int qcow_write_snapshots(BlockDriverState *bs)
h.id_str_size = cpu_to_be16(id_str_size);
h.name_size = cpu_to_be16(name_size);
offset = align_offset(offset, 8);
- if (bdrv_pwrite(bs->file, offset, &h, sizeof(h)) != sizeof(h))
+ if (bdrv_pwrite_sync(bs->file, offset, &h, sizeof(h)) != sizeof(h))
goto fail;
offset += sizeof(h);
- if (bdrv_pwrite(bs->file, offset, sn->id_str, id_str_size) != id_str_size)
+ if (bdrv_pwrite_sync(bs->file, offset, sn->id_str, id_str_size) != id_str_size)
goto fail;
offset += id_str_size;
- if (bdrv_pwrite(bs->file, offset, sn->name, name_size) != name_size)
+ if (bdrv_pwrite_sync(bs->file, offset, sn->name, name_size) != name_size)
goto fail;
offset += name_size;
}
/* update the various header fields */
data64 = cpu_to_be64(snapshots_offset);
- if (bdrv_pwrite(bs->file, offsetof(QCowHeader, snapshots_offset),
+ if (bdrv_pwrite_sync(bs->file, offsetof(QCowHeader, snapshots_offset),
&data64, sizeof(data64)) != sizeof(data64))
goto fail;
data32 = cpu_to_be32(s->nb_snapshots);
- if (bdrv_pwrite(bs->file, offsetof(QCowHeader, nb_snapshots),
+ if (bdrv_pwrite_sync(bs->file, offsetof(QCowHeader, nb_snapshots),
&data32, sizeof(data32)) != sizeof(data32))
goto fail;
@@ -284,7 +284,7 @@ int qcow2_snapshot_create(BlockDriverState *bs, QEMUSnapshotInfo *sn_info)
for(i = 0; i < s->l1_size; i++) {
l1_table[i] = cpu_to_be64(s->l1_table[i]);
}
- if (bdrv_pwrite(bs->file, sn->l1_table_offset,
+ if (bdrv_pwrite_sync(bs->file, sn->l1_table_offset,
l1_table, s->l1_size * sizeof(uint64_t)) !=
(s->l1_size * sizeof(uint64_t)))
goto fail;
@@ -335,7 +335,7 @@ int qcow2_snapshot_goto(BlockDriverState *bs, const char *snapshot_id)
if (bdrv_pread(bs->file, sn->l1_table_offset,
s->l1_table, l1_size2) != l1_size2)
goto fail;
- if (bdrv_pwrite(bs->file, s->l1_table_offset,
+ if (bdrv_pwrite_sync(bs->file, s->l1_table_offset,
s->l1_table, l1_size2) != l1_size2)
goto fail;
for(i = 0;i < s->l1_size; i++) {
diff --git a/block/qcow2.c b/block/qcow2.c
index 1562557..70c5bb3 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -736,7 +736,7 @@ static int qcow2_update_ext_header(BlockDriverState *bs,
backing_file_offset = sizeof(QCowHeader) + offset;
}
- ret = bdrv_pwrite(bs->file, sizeof(QCowHeader), buf, ext_size);
+ ret = bdrv_pwrite_sync(bs->file, sizeof(QCowHeader), buf, ext_size);
if (ret < 0) {
goto fail;
}
@@ -745,13 +745,13 @@ static int qcow2_update_ext_header(BlockDriverState *bs,
uint64_t be_backing_file_offset = cpu_to_be64(backing_file_offset);
uint32_t be_backing_file_size = cpu_to_be32(backing_file_len);
- ret = bdrv_pwrite(bs->file, offsetof(QCowHeader, backing_file_offset),
+ ret = bdrv_pwrite_sync(bs->file, offsetof(QCowHeader, backing_file_offset),
&be_backing_file_offset, sizeof(uint64_t));
if (ret < 0) {
goto fail;
}
- ret = bdrv_pwrite(bs->file, offsetof(QCowHeader, backing_file_size),
+ ret = bdrv_pwrite_sync(bs->file, offsetof(QCowHeader, backing_file_size),
&be_backing_file_size, sizeof(uint32_t));
if (ret < 0) {
goto fail;
@@ -1035,8 +1035,8 @@ static int qcow2_truncate(BlockDriverState *bs, int64_t offset)
/* write updated header.size */
offset = cpu_to_be64(offset);
- ret = bdrv_pwrite(bs->file, offsetof(QCowHeader, size),
- &offset, sizeof(uint64_t));
+ ret = bdrv_pwrite_sync(bs->file, offsetof(QCowHeader, size),
+ &offset, sizeof(uint64_t));
if (ret < 0) {
return ret;
}
--
1.6.6.1
* [Qemu-devel] Re: [RFC][PATCH 2/2] qcow2: Use bdrv_(p)write_sync for metadata writes
2010-06-17 12:03 ` [Qemu-devel] [RFC][PATCH 2/2] qcow2: Use bdrv_(p)write_sync for metadata writes Kevin Wolf
@ 2010-06-17 14:19 ` Stefan Hajnoczi
2010-06-17 14:39 ` Kevin Wolf
0 siblings, 1 reply; 8+ messages in thread
From: Stefan Hajnoczi @ 2010-06-17 14:19 UTC (permalink / raw)
To: Kevin Wolf; +Cc: qemu-devel, hch
On Thu, Jun 17, 2010 at 1:03 PM, Kevin Wolf <kwolf@redhat.com> wrote:
> Use bdrv_(p)write_sync to ensure metadata integrity in case of a crash.
Any performance numbers? This change is necessary for correctness but
I wonder what the performance impact is for users.
Stefan
* [Qemu-devel] Re: [RFC][PATCH 2/2] qcow2: Use bdrv_(p)write_sync for metadata writes
2010-06-17 14:19 ` [Qemu-devel] " Stefan Hajnoczi
@ 2010-06-17 14:39 ` Kevin Wolf
2010-06-17 19:47 ` Stefan Hajnoczi
0 siblings, 1 reply; 8+ messages in thread
From: Kevin Wolf @ 2010-06-17 14:39 UTC (permalink / raw)
To: Stefan Hajnoczi; +Cc: qemu-devel, hch
On 17.06.2010 16:19, Stefan Hajnoczi wrote:
> On Thu, Jun 17, 2010 at 1:03 PM, Kevin Wolf <kwolf@redhat.com> wrote:
>> Use bdrv_(p)write_sync to ensure metadata integrity in case of a crash.
>
> Any performance numbers? This change is necessary for correctness but
> I wonder what the performance impact is for users.
No numbers yet, but as you say we need to do it anyway. It should
definitely be better than any other option that I can think of
(cache=writethrough or some O_DIRECT|O_DSYNC mode) in that it only hurts
performance when metadata is actually changed. As long as we only write
guest data, there is no difference.
Making it a barrier instead of a flush would probably be better, have
you already had a look at this since we talked about it?
Kevin
* [Qemu-devel] Re: [RFC][PATCH 2/2] qcow2: Use bdrv_(p)write_sync for metadata writes
2010-06-17 14:39 ` Kevin Wolf
@ 2010-06-17 19:47 ` Stefan Hajnoczi
2010-06-18 7:54 ` Kevin Wolf
0 siblings, 1 reply; 8+ messages in thread
From: Stefan Hajnoczi @ 2010-06-17 19:47 UTC (permalink / raw)
To: Kevin Wolf; +Cc: qemu-devel, hch
On Thu, Jun 17, 2010 at 3:39 PM, Kevin Wolf <kwolf@redhat.com> wrote:
> On 17.06.2010 16:19, Stefan Hajnoczi wrote:
>> On Thu, Jun 17, 2010 at 1:03 PM, Kevin Wolf <kwolf@redhat.com> wrote:
>>> Use bdrv_(p)write_sync to ensure metadata integrity in case of a crash.
>>
>> Any performance numbers? This change is necessary for correctness but
>> I wonder what the performance impact is for users.
>
> No numbers yet, but as you say we need to do it anyway. It should
> definitely be better than any other option that I can think of
> (cache=writethrough or some O_DIRECT|O_DSYNC mode) in that it only hurts
> performance when metadata is actually changed. As long as we only write
> guest data, there is no difference.
>
> Making it a barrier instead of a flush would probably be better, have
> you already had a look at this since we talked about it?
No, I haven't seen a userspace barrier solution. I'm not actively
working on that at the moment, but am keeping an eye out for solutions.
I'm not sure what to make of sync_file_range(2); the man page
suggests it cannot be relied on, since only already-allocated blocks
are flushed appropriately. Filesystem metadata changes are not
flushed, which is especially problematic for copy-on-write filesystems
(according to the man page).
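For reference, using it as a data-only barrier would look something like
this (sketch; even this only covers already-written page cache data for the
given range):

    #define _GNU_SOURCE
    #include <fcntl.h>

    /* Push dirty pages in [offset, offset + nbytes) to the device and wait
     * for them.  This commits neither filesystem metadata (allocations,
     * i_size) nor the disk's volatile write cache. */
    static int flush_data_range(int fd, off64_t offset, off64_t nbytes)
    {
        return sync_file_range(fd, offset, nbytes,
                               SYNC_FILE_RANGE_WAIT_BEFORE |
                               SYNC_FILE_RANGE_WRITE |
                               SYNC_FILE_RANGE_WAIT_AFTER);
    }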
Thinking about the performance more, I predict that guest OS
installations will slow down but day-to-day usage (especially
modifying already allocated blocks) will be alright for the reasons
you mentioned.
Stefan
* [Qemu-devel] Re: [RFC][PATCH 2/2] qcow2: Use bdrv_(p)write_sync for metadata writes
2010-06-17 19:47 ` Stefan Hajnoczi
@ 2010-06-18 7:54 ` Kevin Wolf
2010-06-18 7:58 ` Christoph Hellwig
0 siblings, 1 reply; 8+ messages in thread
From: Kevin Wolf @ 2010-06-18 7:54 UTC (permalink / raw)
To: Stefan Hajnoczi; +Cc: qemu-devel, hch
On 17.06.2010 21:47, Stefan Hajnoczi wrote:
> On Thu, Jun 17, 2010 at 3:39 PM, Kevin Wolf <kwolf@redhat.com> wrote:
>> On 17.06.2010 16:19, Stefan Hajnoczi wrote:
>>> On Thu, Jun 17, 2010 at 1:03 PM, Kevin Wolf <kwolf@redhat.com> wrote:
>>>> Use bdrv_(p)write_sync to ensure metadata integrity in case of a crash.
>>>
>>> Any performance numbers? This change is necessary for correctness but
>>> I wonder what the performance impact is for users.
>>
>> No numbers yet, but as you say we need to do it anyway. It should
>> definitely be better than any other option that I can think of
>> (cache=writethrough or some O_DIRECT|O_DSYNC mode) in that it only hurts
>> performance when metadata is actually changed. As long as we only write
>> guest data, there is no difference.
>>
>> Making it a barrier instead of a flush would probably be better, have
>> you already had a look at this since we talked about it?
>
> No I haven't seen a userspace barrier solution. I'm not actively
> working on that at the moment but am keeping an eye out for solutions.
> I'm not sure what to make of sync_file_range(2), the man page
> suggests it cannot be relied on since only already allocated blocks
> will be flushed appropriately. Filesystem metadata changes are not
> flushed, which is especially problematic for copy-on-write filesystems
> (according to the man page).
qcow2 files are expected to grow, so this probably doesn't help us. At
best we could try to detect whether the file grows and decide whether to
use sync_file_range or fdatasync.
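Roughly, the idea would be something like this (untested sketch;
bdrv_flush_range() is a hypothetical helper that doesn't exist in the block
layer today):

    /* Untested sketch: only fall back to a full flush when the write grew
     * the file; bdrv_flush_range() is hypothetical. */
    if (offset + count <= bdrv_getlength(bs->file)) {
        bdrv_flush_range(bs->file, offset, count);
    } else {
        bdrv_flush(bs->file);
    }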
Of course, it would be better to have a flag or something to include the
necessary metadata, but I have no idea how it's implemented or whether this
is easily possible.
> Thinking about the performance more, I predict that guest OS
> installations will slow down but day-to-day usage (especially
> modifying already allocated blocks) will be alright for the reasons
> you mentioned.
Right, it's only growth that will be impacted. The point is that it's
not only installation (in most cases you could even use cache=unsafe for
that), but that you start to grow again after taking a snapshot - be it
internal (savevm) or external (backing file). This is where it hurts.
Kevin
* [Qemu-devel] Re: [RFC][PATCH 2/2] qcow2: Use bdrv_(p)write_sync for metadata writes
2010-06-18 7:54 ` Kevin Wolf
@ 2010-06-18 7:58 ` Christoph Hellwig
0 siblings, 0 replies; 8+ messages in thread
From: Christoph Hellwig @ 2010-06-18 7:58 UTC (permalink / raw)
To: Kevin Wolf; +Cc: Stefan Hajnoczi, qemu-devel, hch
On Fri, Jun 18, 2010 at 09:54:24AM +0200, Kevin Wolf wrote:
> qcow2 files are expected to grow, so this probably doesn't help us. At
> best we could try to detect whether the file grows and decide whether to
> use sync_file_range or fdatasync.
No. sync_file_range is always unsafe for use with a real filesystem.
It never even calls into the filesystem and never flushes the disk write
cache. For our use case it's exactly a no-op.
> Of course, it would be better to have a flag or something to include the
> necessary metadata, but I have no idea how it's implemented or whether
> this is easily possible.
We could add a flag to it. But rather than adding more crap to this
bastard that should never have been added I'd rather add a real
fsync_range system call.
Note that for our use case it doesn't matter anyway. For an image file
accessed with cache=none there will be no data in the pagecache, and
the writeback of the inode and associated metadata (btree blocks or
indirect blocks) will be the same whether we give it a range or not, and
the same goes for the disk cache flush.