* [Qemu-devel] [PATCH] qcow2/virtio corruption: Don't allocate the same cluster twice
From: Kevin Wolf @ 2009-05-06 16:39 UTC
To: qemu-devel; +Cc: Kevin Wolf, markmc, avi
A write to a qcow2 image that results in the allocation of a new cluster can be
divided into two parts: the first part happens before the actual data is
written and allocates the clusters in the image file; the second part happens
in the AIO callback handler and creates the L2 entries for the newly allocated
clusters.

When two AIO requests happen to touch the same free cluster, there is a chance
that the second request still sees the cluster as free because the callback of
the first request has not run yet. The second request then reserves another
cluster for the same virtual disk offset and hooks it up in its own callback,
overwriting what the first callback has done. Long story short: Bad Things
happen.
This patch maintains a list of in-flight requests that have allocated new
clusters. A second request touching the same cluster doesn't find an entry in
the L2 table yet, but can now look it up in this list. The second request is
constrained so that it either doesn't touch the allocation of the first
request at all (and can make its own non-overlapping allocation) or doesn't
exceed the end of the first request's allocation (and can reuse that
allocation without allocating anything itself).
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
block-qcow2.c | 40 ++++++++++++++++++++++++++++++++++++++++
1 files changed, 40 insertions(+), 0 deletions(-)
diff --git a/block-qcow2.c b/block-qcow2.c
index 1f33125..d78d1e7 100644
--- a/block-qcow2.c
+++ b/block-qcow2.c
@@ -140,6 +140,7 @@ typedef struct BDRVQcowState {
uint8_t *cluster_cache;
uint8_t *cluster_data;
uint64_t cluster_cache_offset;
+ LIST_HEAD(QCowClusterAlloc, QCowL2Meta) cluster_allocs;
uint64_t *refcount_table;
uint64_t refcount_table_offset;
@@ -351,6 +352,8 @@ static int qcow_open(BlockDriverState *bs, const char *filename, int flags)
if (refcount_init(bs) < 0)
goto fail;
+ LIST_INIT(&s->cluster_allocs);
+
/* read qcow2 extensions */
if (header.backing_file_offset)
ext_end = header.backing_file_offset;
@@ -953,9 +956,11 @@ static uint64_t alloc_compressed_cluster_offset(BlockDriverState *bs,
typedef struct QCowL2Meta
{
uint64_t offset;
+ uint64_t cluster_offset;
int n_start;
int nb_available;
int nb_clusters;
+ LIST_ENTRY(QCowL2Meta) next;
} QCowL2Meta;
static int alloc_cluster_link_l2(BlockDriverState *bs, uint64_t cluster_offset,
@@ -1007,6 +1012,9 @@ static int alloc_cluster_link_l2(BlockDriverState *bs, uint64_t cluster_offset,
for (i = 0; i < j; i++)
free_any_clusters(bs, be64_to_cpu(old_cluster[i]), 1);
+ /* Take the request off the list of running requests */
+ LIST_REMOVE(m, next);
+
ret = 0;
err:
qemu_free(old_cluster);
@@ -1035,6 +1043,7 @@ static uint64_t alloc_cluster_offset(BlockDriverState *bs,
int l2_index, ret;
uint64_t l2_offset, *l2_table, cluster_offset;
int nb_clusters, i = 0;
+ QCowL2Meta *old_alloc;
ret = get_cluster_table(bs, offset, &l2_table, &l2_offset, &l2_index);
if (ret == 0)
@@ -1083,6 +1092,36 @@ static uint64_t alloc_cluster_offset(BlockDriverState *bs,
}
nb_clusters = i;
+ /*
+ * Check if there already is an AIO write request in flight which allocates
+ * the same cluster. In this case take the cluster_offset which was
+ * allocated for the previous write.
+ */
+ LIST_FOREACH(old_alloc, &s->cluster_allocs, next) {
+
+ uint64_t end_offset = offset + nb_clusters * s->cluster_size;
+ uint64_t old_offset = old_alloc->offset;
+ uint64_t old_end_offset = old_alloc->offset +
+ old_alloc->nb_clusters * s->cluster_size;
+
+ if (end_offset < old_offset || offset > old_end_offset) {
+ /* No intersection */
+ } else if (offset < old_offset) {
+ /* Stop at the start of a running allocation */
+ nb_clusters = (old_offset - offset) >> s->cluster_bits;
+ } else {
+ /* Reuse the previously allocated clusters */
+ if (end_offset > old_end_offset) {
+ nb_clusters = (old_end_offset - offset) >> s->cluster_bits;
+ }
+ cluster_offset = old_alloc->cluster_offset + (offset - old_offset);
+ m->nb_clusters = 0;
+ goto out;
+ }
+ }
+
+ LIST_INSERT_HEAD(&s->cluster_allocs, m, next);
+
/* allocate a new cluster */
cluster_offset = alloc_clusters(bs, nb_clusters * s->cluster_size);
@@ -1091,6 +1130,7 @@ static uint64_t alloc_cluster_offset(BlockDriverState *bs,
m->offset = offset;
m->n_start = n_start;
m->nb_clusters = nb_clusters;
+ m->cluster_offset = cluster_offset;
out:
m->nb_available = MIN(nb_clusters << (s->cluster_bits - 9), n_end);
--
1.6.0.6
* [Qemu-devel] Re: [PATCH] qcow2/virtio corruption: Don't allocate the same cluster twice
From: Avi Kivity @ 2009-05-06 16:54 UTC
To: Kevin Wolf; +Cc: markmc, qemu-devel
Kevin Wolf wrote:
> A write on a qcow2 image that results in the allocation of a new cluster can be
> divided into two parts: There is a part which happens before the actual data is
> written, this is about allocating clusters in the image file. The second part
> happens in the AIO callback handler and is about making L2 entries for the
> newly allocated clusters.
>
> When two AIO requests happen to touch the same free cluster, there is a chance
> that the second request still sees the cluster as free because the callback of
> the first request has not yet run. This means that it reserves another cluster
> for the same virtual hard disk offset and hooks it up in its own callback,
> overwriting what the first callback has done. Long story cut short: Bad Things
> happen.
>
> This patch maintains a list of in-flight requests that have allocated new
> clusters. A second request touching the same cluster doesn't find an entry yet
> in the L2 table, but can look it up in the list now. The second request is
> limited so that it either doesn't touch the allocation of the first request
> (so it can have a non-overlapping allocation) or that it doesn't exceed the
> end of the allocation of the first request (so it can reuse this allocation
> and doesn't need to do anything itself).
>
>
> + } else {
> + /* Reuse the previously allocated clusters */
> + if (end_offset > old_end_offset) {
> + nb_clusters = (old_end_offset - offset) >> s->cluster_bits;
> + }
> + cluster_offset = old_alloc->cluster_offset + (offset - old_offset);
> + m->nb_clusters = 0;
> + goto out;
> + }
> + }
> +
> + LIST_INSERT_HEAD(&s->cluster_allocs, m, next);
>
>
What happens if the second request completes before the first? Then,
when the first request completes, alloc_cluster_link_l2() will call
copy_clusters() and overwrite the second request.
Also, the second request now depends on the first to update its
metadata. But if the first request fails, it will not update its
metadata, and the second request will complete without error and also
without updating its metadata.
--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
* [Qemu-devel] Re: [PATCH] qcow2/virtio corruption: Don't allocate the same cluster twice
From: Kevin Wolf @ 2009-05-06 17:03 UTC
To: Avi Kivity; +Cc: markmc, qemu-devel
Avi Kivity schrieb:
> What happens if the second request completes before the first? Then,
> when the first request completes, alloc_cluster_link_l2() will call
> copy_clusters() and overwrite the second request.
Ouch, you're right. I should check not only whether the image stays
consistent, but also whether the data survives.

Will fix that tomorrow.
> Also, the second request now depends on the first to update its
> metadata. But if the first request fail, it will not update its
> metadata, and the second request will complete without error and also
> without updating its metadata.
Hm, right. Need to think about this...
Kevin
* [Qemu-devel] Re: [PATCH] qcow2/virtio corruption: Don't allocate the same cluster twice
From: Avi Kivity @ 2009-05-06 17:08 UTC
To: Kevin Wolf; +Cc: markmc, qemu-devel
Kevin Wolf wrote:
> Avi Kivity schrieb:
>
>> What happens if the second request completes before the first? Then,
>> when the first request completes, alloc_cluster_link_l2() will call
>> copy_clusters() and overwrite the second request.
>>
>
> Ouch, you're right. I should not only check if the image is consistent,
> but also if the data survives.
>
>
We really want qemu-io fsx.
>> Also, the second request now depends on the first to update its
>> metadata. But if the first request fail, it will not update its
>> metadata, and the second request will complete without error and also
>> without updating its metadata.
>>
>
> Hm, right. Need to think about this...
>
I suggest retaining the part where you use inflight l2metas to lay out
data contiguously, but change alloc_cluster_link_l2() not to rely on
n_start and nb_available but instead recompute them on completion.
m->nb_clusters should never be zeroed for this to work.
--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
* Re: [Qemu-devel] Re: [PATCH] qcow2/virtio corruption: Don't allocate the same cluster twice
From: Kevin Wolf @ 2009-05-06 17:52 UTC
To: Avi Kivity; +Cc: markmc, qemu-devel
Avi Kivity schrieb:
> Kevin Wolf wrote:
>> Avi Kivity schrieb:
>>> Also, the second request now depends on the first to update its
>>> metadata. But if the first request fail, it will not update its
>>> metadata, and the second request will complete without error and also
>>> without updating its metadata.
>>>
>> Hm, right. Need to think about this...
>>
>
> I suggest retaining the part where you use inflight l2metas to layout
> data contiguously, but change alloc_cluster_link_l2() not to rely on
> n_start and nb_available but instead recompute them on completion.
> m->nb_clusters should never be zeroed for this to work.
Is there even a reason why we need to copy the unmodified sectors in
alloc_cluster_link_l2() and cannot do that in alloc_cluster_offset()
before we write the new data? Then the callback wouldn't need to mess
around with figuring out which part must be overwritten and which one
mustn't.
Kevin
* Re: [Qemu-devel] Re: [PATCH] qcow2/virtio corruption: Don't allocate the same cluster twice
From: Avi Kivity @ 2009-05-06 18:31 UTC
To: Kevin Wolf; +Cc: markmc, qemu-devel
Kevin Wolf wrote:
>> I suggest retaining the part where you use inflight l2metas to layout
>> data contiguously, but change alloc_cluster_link_l2() not to rely on
>> n_start and nb_available but instead recompute them on completion.
>> m->nb_clusters should never be zeroed for this to work.
>>
>
> Is there even a reason why we need to copy the unmodified sectors in
> alloc_cluster_link_l2() and cannot do that in alloc_cluster_offset()
> before we write the new data? Then the callback wouldn't need to mess
> around with figuring out which part must be overwritten and which one
> mustn't.
>
Then you have the inverse problem: you need to copy only the sectors which
aren't under an in-flight write, and if such an in-flight write fails, you
do need to copy them.
--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
* Re: [Qemu-devel] Re: [PATCH] qcow2/virtio corruption: Don't allocate the same cluster twice
From: Gleb Natapov @ 2009-05-07 7:32 UTC
To: Kevin Wolf; +Cc: markmc, Avi Kivity, qemu-devel
On Wed, May 06, 2009 at 07:52:44PM +0200, Kevin Wolf wrote:
> Avi Kivity schrieb:
> > Kevin Wolf wrote:
> >> Avi Kivity schrieb:
> >>> Also, the second request now depends on the first to update its
> >>> metadata. But if the first request fail, it will not update its
> >>> metadata, and the second request will complete without error and also
> >>> without updating its metadata.
> >>>
> >> Hm, right. Need to think about this...
> >>
> >
> > I suggest retaining the part where you use inflight l2metas to layout
> > data contiguously, but change alloc_cluster_link_l2() not to rely on
> > n_start and nb_available but instead recompute them on completion.
> > m->nb_clusters should never be zeroed for this to work.
>
> Is there even a reason why we need to copy the unmodified sectors in
> alloc_cluster_link_l2() and cannot do that in alloc_cluster_offset()
> before we write the new data? Then the callback wouldn't need to mess
> around with figuring out which part must be overwritten and which one
> mustn't.
>
The reason we need to copy unmodified sectors in alloc_cluster_link_l2()
is exactly to handle concurrent writes into the same cluster. This is
essentially RMW. I don't see why concurrent writes should not work with
the logic in place. There is a bug there currently, of course :) Can
somebody check this patch:
diff --git a/block-qcow2.c b/block-qcow2.c
index 7840634..801d26d 100644
--- a/block-qcow2.c
+++ b/block-qcow2.c
@@ -995,8 +995,8 @@ static int alloc_cluster_link_l2(BlockDriverState *bs, uint64_t cluster_offset,
if(l2_table[l2_index + i] != 0)
old_cluster[j++] = l2_table[l2_index + i];
- l2_table[l2_index + i] = cpu_to_be64((cluster_offset +
- (i << s->cluster_bits)) | QCOW_OFLAG_COPIED);
+ l2_table[l2_index + i] = cpu_to_be64(((cluster_offset +
+ (i << s->cluster_bits)) | QCOW_OFLAG_COPIED));
}
if (bdrv_pwrite(s->hd, l2_offset + l2_index * sizeof(uint64_t),
@@ -1005,7 +1005,8 @@ static int alloc_cluster_link_l2(BlockDriverState *bs, uint64_t cluster_offset,
goto err;
for (i = 0; i < j; i++)
- free_any_clusters(bs, be64_to_cpu(old_cluster[i]), 1);
+ free_any_clusters(bs, be64_to_cpu(old_cluster[i]) & ~QCOW_OFLAG_COPIED,
+ 1);
ret = 0;
err:
--
Gleb.
* Re: [Qemu-devel] Re: [PATCH] qcow2/virtio corruption: Don't allocate the same cluster twice
From: Avi Kivity @ 2009-05-07 7:54 UTC
To: Gleb Natapov; +Cc: Kevin Wolf, markmc, qemu-devel
Gleb Natapov wrote:
>
> The reason we need to copy unmodified sectors in alloc_cluster_link_l2()
> is exactly to handle concurrent writes into the same cluster. This is
> essentially RMW. I don't see why concurrent writes should not work with
> the logic in place. There is a bug there currently of cause :) Can
> somebody check this patch:
>
>
> diff --git a/block-qcow2.c b/block-qcow2.c
> index 7840634..801d26d 100644
> --- a/block-qcow2.c
> +++ b/block-qcow2.c
> @@ -995,8 +995,8 @@ static int alloc_cluster_link_l2(BlockDriverState *bs, uint64_t cluster_offset,
> if(l2_table[l2_index + i] != 0)
> old_cluster[j++] = l2_table[l2_index + i];
>
> - l2_table[l2_index + i] = cpu_to_be64((cluster_offset +
> - (i << s->cluster_bits)) | QCOW_OFLAG_COPIED);
> + l2_table[l2_index + i] = cpu_to_be64(((cluster_offset +
> + (i << s->cluster_bits)) | QCOW_OFLAG_COPIED));
> }
>
> if (bdrv_pwrite(s->hd, l2_offset + l2_index * sizeof(uint64_t),
> @@ -1005,7 +1005,8 @@ static int alloc_cluster_link_l2(BlockDriverState *bs, uint64_t cluster_offset,
> goto err;
>
> for (i = 0; i < j; i++)
> - free_any_clusters(bs, be64_to_cpu(old_cluster[i]), 1);
> + free_any_clusters(bs, be64_to_cpu(old_cluster[i]) & ~QCOW_OFLAG_COPIED,
> + 1);
>
> ret = 0;
> err:
>
I can confirm that it fixes the problem.
--
error compiling committee.c: too many arguments to function
* Re: [Qemu-devel] Re: [PATCH] qcow2/virtio corruption: Don't allocate the same cluster twice
From: Kevin Wolf @ 2009-05-07 8:01 UTC
To: Gleb Natapov; +Cc: markmc, Avi Kivity, qemu-devel
Gleb Natapov schrieb:
> The reason we need to copy unmodified sectors in alloc_cluster_link_l2()
> is exactly to handle concurrent writes into the same cluster. This is
> essentially RMW. I don't see why concurrent writes should not work with
> the logic in place. There is a bug there currently of cause :) Can
> somebody check this patch:
>
>
> diff --git a/block-qcow2.c b/block-qcow2.c
> index 7840634..801d26d 100644
> --- a/block-qcow2.c
> +++ b/block-qcow2.c
> @@ -995,8 +995,8 @@ static int alloc_cluster_link_l2(BlockDriverState *bs, uint64_t cluster_offset,
> if(l2_table[l2_index + i] != 0)
> old_cluster[j++] = l2_table[l2_index + i];
>
> - l2_table[l2_index + i] = cpu_to_be64((cluster_offset +
> - (i << s->cluster_bits)) | QCOW_OFLAG_COPIED);
> + l2_table[l2_index + i] = cpu_to_be64(((cluster_offset +
> + (i << s->cluster_bits)) | QCOW_OFLAG_COPIED));
> }
>
> if (bdrv_pwrite(s->hd, l2_offset + l2_index * sizeof(uint64_t),
> @@ -1005,7 +1005,8 @@ static int alloc_cluster_link_l2(BlockDriverState *bs, uint64_t cluster_offset,
> goto err;
>
> for (i = 0; i < j; i++)
> - free_any_clusters(bs, be64_to_cpu(old_cluster[i]), 1);
> + free_any_clusters(bs, be64_to_cpu(old_cluster[i]) & ~QCOW_OFLAG_COPIED,
> + 1);
>
> ret = 0;
> err:
After Gleb explained to me how the whole thing is meant to work, I agree
that this is the right fix. The first hunk is meaningless, though. I
suggest replacing it with a hunk that adds a big comment explaining how
things work. ;-)
Kevin