* [Qemu-devel] [PATCH v3] block/vdi: Use bdrv_flush after metadata updates
  From: Zhe Qiu @ 2015-05-07  4:04 UTC
  To: qemu-devel; +Cc: phoeagon, qemu-block, mreitz

From: phoeagon <phoeagon@gmail.com>

In reference to b0ad5a455d7e5352d4c86ba945112011dbeadfb8~078a458e077d6b0db262c4b05fee51d01de2d1d2,
metadata writes to qcow2/cow/qcow/vpc/vmdk are all synced prior to
succeeding writes.

Only when write is successful that bdrv_flush is called.

Signed-off-by: Zhe Qiu <phoeagon@gmail.com>
---
 block/vdi.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/block/vdi.c b/block/vdi.c
index 7642ef3..dfe8ade 100644
--- a/block/vdi.c
+++ b/block/vdi.c
@@ -713,6 +713,9 @@ static int vdi_co_write(BlockDriverState *bs,
         logout("will write %u block map sectors starting from entry %u\n",
                n_sectors, bmap_first);
         ret = bdrv_write(bs->file, offset, base, n_sectors);
+        if (ret >= 0) {
+            ret = bdrv_flush(bs->file);
+        }
     }
 
     return ret;
-- 
2.4.0
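[Judging from the diff and the discussion that follows, the new flush runs
only on the block-allocating path of vdi_co_write(), i.e. when a write
touches a not-yet-allocated block and the block map sectors have to be
rewritten. A rough way to exercise that path outside a guest — not taken
from the thread; image name and sizes are arbitrary:

    qemu-img create -f vdi test.vdi 1G
    qemu-io -c 'write 0 64k' test.vdi

The first write into the empty image allocates a data block, so the driver
rewrites the block map and, with this patch applied, flushes the image file
afterwards.]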
* Re: [Qemu-devel] [PATCH v3] block/vdi: Use bdrv_flush after metadata updates
  From: Stefan Hajnoczi @ 2015-05-07 10:09 UTC
  To: Zhe Qiu; +Cc: Kevin Wolf, Stefan Weil, qemu-devel, qemu-block, mreitz

On Thu, May 07, 2015 at 12:04:56PM +0800, Zhe Qiu wrote:
> From: phoeagon <phoeagon@gmail.com>
>
> In reference to b0ad5a455d7e5352d4c86ba945112011dbeadfb8~078a458e077d6b0db262c4b05fee51d01de2d1d2, metadata writes to qcow2/cow/qcow/vpc/vmdk are all synced prior to succeeding writes.
>
> Only when write is successful that bdrv_flush is called.
>
> Signed-off-by: Zhe Qiu <phoeagon@gmail.com>
> ---
>  block/vdi.c | 3 +++
>  1 file changed, 3 insertions(+)

CCing Stefan Weil and Kevin Wolf (see output from
scripts/get_maintainer.pl -f block/vdi.c).

>
> diff --git a/block/vdi.c b/block/vdi.c
> index 7642ef3..dfe8ade 100644
> --- a/block/vdi.c
> +++ b/block/vdi.c
> @@ -713,6 +713,9 @@ static int vdi_co_write(BlockDriverState *bs,
>          logout("will write %u block map sectors starting from entry %u\n",
>                 n_sectors, bmap_first);
>          ret = bdrv_write(bs->file, offset, base, n_sectors);
> +        if (ret >= 0) {
> +            ret = bdrv_flush(bs->file);
> +        }
>      }
>
>      return ret;
> --
> 2.4.0
* Re: [Qemu-devel] [Qemu-block] [PATCH v3] block/vdi: Use bdrv_flush after metadata updates
  From: Eric Blake @ 2015-05-07 15:05 UTC
  To: Zhe Qiu, qemu-devel; +Cc: qemu-block, mreitz

On 05/06/2015 10:04 PM, Zhe Qiu wrote:
> From: phoeagon <phoeagon@gmail.com>
>
> In reference to b0ad5a455d7e5352d4c86ba945112011dbeadfb8~078a458e077d6b0db262c4b05fee51d01de2d1d2, metadata writes to qcow2/cow/qcow/vpc/vmdk are all synced prior to succeeding writes.
>

Please wrap commit comments to be under 80 columns (in fact, under 72 is
good, because 'git log' adds spaces when displaying commit bodies).

Your notation of commit~commit is unusual; ranges in git are usually
spelled commit..commit.  Also, it's okay to abbreviate commit SHAs to 8
bytes or so.  So to shrink your long line, I'd write b0ad5a45..078a458e.

> Only when write is successful that bdrv_flush is called.
>
> Signed-off-by: Zhe Qiu <phoeagon@gmail.com>

Your 'From:' and 'Signed-off-by:' lines have different spellings, which
makes it more confusing when searching for patches by you.  You can add
an entry to .mailmap (as a separate patch) to retro-actively consolidate
entries, but it is better to catch things like this up front before we
even have the problem of separate names.  I suspect you want "Zhe Qiu"
as the spelling of your name in both lines (git config can be taught to
set up the preferred name to attribute both for signing patches and for
sending email).  It is also acceptable to use UTF-8 to spell your name
in native characters, or even a combination of "native name (ascii
counterpart)"

--
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org
* [Qemu-devel] [PATCH v4] block/vdi: Use bdrv_flush after metadata updates
  From: Zhe Qiu @ 2015-05-07 15:16 UTC
  To: qemu-devel; +Cc: kwolf, Zhe Qiu, qemu-block, sw, mreitz

In reference to b0ad5a45...078a458e, metadata writes to
qcow2/cow/qcow/vpc/vmdk are all synced prior to succeeding writes.

Only when write is successful that bdrv_flush is called.

Signed-off-by: Zhe Qiu <phoeagon@gmail.com>
---
 block/vdi.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/block/vdi.c b/block/vdi.c
index 7642ef3..dfe8ade 100644
--- a/block/vdi.c
+++ b/block/vdi.c
@@ -713,6 +713,9 @@ static int vdi_co_write(BlockDriverState *bs,
         logout("will write %u block map sectors starting from entry %u\n",
                n_sectors, bmap_first);
         ret = bdrv_write(bs->file, offset, base, n_sectors);
+        if (ret >= 0) {
+            ret = bdrv_flush(bs->file);
+        }
     }
 
     return ret;
-- 
2.4.0
* Re: [Qemu-devel] [PATCH v4] block/vdi: Use bdrv_flush after metadata updates
  From: Max Reitz @ 2015-05-08 13:14 UTC
  To: Zhe Qiu, qemu-devel; +Cc: kwolf, sw, qemu-block

On 07.05.2015 17:16, Zhe Qiu wrote:
> In reference to b0ad5a45...078a458e, metadata writes to
> qcow2/cow/qcow/vpc/vmdk are all synced prior to succeeding writes.
>
> Only when write is successful that bdrv_flush is called.
>
> Signed-off-by: Zhe Qiu <phoeagon@gmail.com>
> ---
>  block/vdi.c | 3 +++
>  1 file changed, 3 insertions(+)

I missed Kevin's arguments before, but I think that adding this is more
correct than not having it; and when thinking about speed, this is vdi,
a format supported for compatibility. Writing isn't our most important
concern anyway (especially considering that it's even possible to
disable write support for vdi at compile time).

So if we wanted to optimize it, we'd probably have to cache multiple
allocations, do them at once and then flush afterwards (like the
metadata cache we have in qcow2?), but that is complicated (like the
metadata cache in qcow2), certainly too complicated for a format
supported for compatibility (unless someone really wants to implement
it).

Reviewed-by: Max Reitz <mreitz@redhat.com>

>
> diff --git a/block/vdi.c b/block/vdi.c
> index 7642ef3..dfe8ade 100644
> --- a/block/vdi.c
> +++ b/block/vdi.c
> @@ -713,6 +713,9 @@ static int vdi_co_write(BlockDriverState *bs,
>          logout("will write %u block map sectors starting from entry %u\n",
>                 n_sectors, bmap_first);
>          ret = bdrv_write(bs->file, offset, base, n_sectors);
> +        if (ret >= 0) {
> +            ret = bdrv_flush(bs->file);
> +        }
>      }
>
>      return ret;
* Re: [Qemu-devel] [PATCH v4] block/vdi: Use bdrv_flush after metadata updates
  From: Kevin Wolf @ 2015-05-08 13:55 UTC
  To: Max Reitz; +Cc: sw, Zhe Qiu, qemu-devel, qemu-block

On 08.05.2015 15:14, Max Reitz wrote:
> On 07.05.2015 17:16, Zhe Qiu wrote:
> > In reference to b0ad5a45...078a458e, metadata writes to
> > qcow2/cow/qcow/vpc/vmdk are all synced prior to succeeding writes.
> >
> > Only when write is successful that bdrv_flush is called.
> >
> > Signed-off-by: Zhe Qiu <phoeagon@gmail.com>
> > ---
> >  block/vdi.c | 3 +++
> >  1 file changed, 3 insertions(+)
>
> I missed Kevin's arguments before, but I think that adding this is
> more correct than not having it; and when thinking about speed, this
> is vdi, a format supported for compatibility.

If you use it only as a convert target, you probably care more about
speed than about leaks in case of a host crash.

> So if we wanted to optimize it, we'd probably have to cache multiple
> allocations, do them at once and then flush afterwards (like the
> metadata cache we have in qcow2?)

That would defeat the purpose of this patch which aims at having
metadata and data written out almost at the same time. On the other
hand, fully avoiding the problem instead of just making the window
smaller would require a journal, which VDI just doesn't have.

I'm not convinced of this patch, but I'll defer to Stefan Weil as the
VDI maintainer.

Kevin
* Re: [Qemu-devel] [PATCH v4] block/vdi: Use bdrv_flush after metadata updates
  From: phoeagon @ 2015-05-08 14:43 UTC
  To: Kevin Wolf, Max Reitz; +Cc: sw, qemu-devel, qemu-block

Then I would guess the same reason should apply to VMDK/VPC as well...
Their metadata update protocol is not atomic either, and a sync after
metadata update doesn't fix the whole thing theoretically either. Yet
the metadata sync patches as old as since 2010 are still there. It
should also be a performance boost if we remove those write barriers as
well, if conversion performance is our major concern.

I think when converting images, one can always opt for "cache=unsafe"
to avoid potential performance degradation from conservative cache (it
should really be default for qemu-img convert, but I don't know if it's
the case), so conversion performance shouldn't be a reason to sacrifice
VM-runtime consistency.

On Fri, May 8, 2015 at 9:55 PM Kevin Wolf <kwolf@redhat.com> wrote:
> On 08.05.2015 15:14, Max Reitz wrote:
> > On 07.05.2015 17:16, Zhe Qiu wrote:
> > > In reference to b0ad5a45...078a458e, metadata writes to
> > > qcow2/cow/qcow/vpc/vmdk are all synced prior to succeeding writes.
> > >
> > > Only when write is successful that bdrv_flush is called.
> > >
> > > Signed-off-by: Zhe Qiu <phoeagon@gmail.com>
> > > ---
> > >  block/vdi.c | 3 +++
> > >  1 file changed, 3 insertions(+)
> >
> > I missed Kevin's arguments before, but I think that adding this is
> > more correct than not having it; and when thinking about speed, this
> > is vdi, a format supported for compatibility.
>
> If you use it only as a convert target, you probably care more about
> speed than about leaks in case of a host crash.
>
> > So if we wanted to optimize it, we'd probably have to cache multiple
> > allocations, do them at once and then flush afterwards (like the
> > metadata cache we have in qcow2?)
>
> That would defeat the purpose of this patch which aims at having
> metadata and data written out almost at the same time. On the other
> hand, fully avoiding the problem instead of just making the window
> smaller would require a journal, which VDI just doesn't have.
>
> I'm not convinced of this patch, but I'll defer to Stefan Weil as the
> VDI maintainer.
>
> Kevin
* Re: [Qemu-devel] [PATCH v4] block/vdi: Use bdrv_flush after metadata updates
  From: Stefan Weil @ 2015-05-08 21:26 UTC
  To: Kevin Wolf, Max Reitz; +Cc: Zhe Qiu, qemu-devel, qemu-block

On 08.05.2015 15:55, Kevin Wolf wrote:
> On 08.05.2015 15:14, Max Reitz wrote:
>> On 07.05.2015 17:16, Zhe Qiu wrote:
>>> In reference to b0ad5a45...078a458e, metadata writes to
>>> qcow2/cow/qcow/vpc/vmdk are all synced prior to succeeding writes.
>>>
>>> Only when write is successful that bdrv_flush is called.
>>>
>>> Signed-off-by: Zhe Qiu <phoeagon@gmail.com>
>>> ---
>>>  block/vdi.c | 3 +++
>>>  1 file changed, 3 insertions(+)
>> I missed Kevin's arguments before, but I think that adding this is
>> more correct than not having it; and when thinking about speed, this
>> is vdi, a format supported for compatibility.
> If you use it only as a convert target, you probably care more about
> speed than about leaks in case of a host crash.
>
>> So if we wanted to optimize it, we'd probably have to cache multiple
>> allocations, do them at once and then flush afterwards (like the
>> metadata cache we have in qcow2?)
> That would defeat the purpose of this patch which aims at having
> metadata and data written out almost at the same time. On the other
> hand, fully avoiding the problem instead of just making the window
> smaller would require a journal, which VDI just doesn't have.
>
> I'm not convinced of this patch, but I'll defer to Stefan Weil as the
> VDI maintainer.
>
> Kevin

Thanks for asking. I share your concerns regarding reduced performance
caused by bdrv_flush. Conversions to VDI will take longer (how much?),
and also installation of an OS on a new VDI disk image will be slower
because that are the typical scenarios where the disk usage grows.

@phoeagon: Did the benchmark which you used allocate additional disk
storage? If not or if it only allocated once and then spent some time
on already allocated blocks, that benchmark was not valid for this case.

On the other hand I don't see a need for the flushing because the kind
of failures (power failure) and their consequences seem to be acceptable
for typical VDI usage, namely either image conversion or tests with
existing images.

That's why I'd prefer not to use bdrv_flush here. Could we make
bdrv_flush optional (either generally or for cases like this one) so
both people who prefer speed and people who would want bdrv_flush to
decrease the likelihood of inconsistencies can be satisfied?

Stefan
* Re: [Qemu-devel] [PATCH v4] block/vdi: Use bdrv_flush after metadata updates
  From: phoeagon @ 2015-05-09  3:54 UTC
  To: Stefan Weil, Kevin Wolf, Max Reitz; +Cc: qemu-devel, qemu-block

Thanks. Dbench does not logically allocate new disk space all the time,
because it's a FS level benchmark that creates file and deletes them.
Therefore it also depends on the guest FS, say, a btrfs guest FS
allocates about 1.8x space of that from EXT4, due to its COW nature. It
does cause the FS to allocate some space during about 1/3 of the test
duration I think. But this does not mitigate it too much because a FS
often writes in a stride rather than consecutively, which causes write
amplification at allocation times.

So I tested it with qemu-img convert from a 400M raw file:

zheq-PC sdb # time ~/qemu-sync-test/bin/qemu-img convert -f raw -t unsafe -O vdi /run/shm/rand 1.vdi

real    0m0.402s
user    0m0.206s
sys     0m0.202s

zheq-PC sdb # time ~/qemu-sync-test/bin/qemu-img convert -f raw -t writeback -O vdi /run/shm/rand 1.vdi

real    0m8.678s
user    0m0.169s
sys     0m0.500s

zheq-PC sdb # time qemu-img convert -f raw -t writeback -O vdi /run/shm/rand 1.vdi

real    0m4.320s
user    0m0.148s
sys     0m0.471s

zheq-PC sdb # time qemu-img convert -f raw -t unsafe -O vdi /run/shm/rand 1.vdi

real    0m0.489s
user    0m0.173s
sys     0m0.325s

zheq-PC sdb # time qemu-img convert -f raw -O vdi /run/shm/rand 1.vdi

real    0m0.515s
user    0m0.168s
sys     0m0.357s

zheq-PC sdb # time ~/qemu-sync-test/bin/qemu-img convert -f raw -O vdi /run/shm/rand 1.vdi

real    0m0.431s
user    0m0.192s
sys     0m0.248s

Although 400M is not a giant file, it does show the trend.
As you can see when there's drastic allocation needs, and when there no
extra buffering from a virtualized host, the throughput drops about 50%.
But still it has no effect on "unsafe" mode, as predicted. Also I
believe that expecting to use a half-converted image is seldom the use
case, while host crash and power loss are not so unimaginable.
Looks like qemu-img convert is using "unsafe" as default as well, so
even novice "qemu-img convert" users are not likely to find performance
degradation.

I have not yet tried guest OS installation on top, but I guess a new
flag for one-time faster OS installation is not likely useful, and
"cache=unsafe" already does the trick.

On Sat, May 9, 2015 at 5:26 AM Stefan Weil <sw@weilnetz.de> wrote:
> On 08.05.2015 15:55, Kevin Wolf wrote:
> > On 08.05.2015 15:14, Max Reitz wrote:
> >> On 07.05.2015 17:16, Zhe Qiu wrote:
> >>> In reference to b0ad5a45...078a458e, metadata writes to
> >>> qcow2/cow/qcow/vpc/vmdk are all synced prior to succeeding writes.
> >>>
> >>> Only when write is successful that bdrv_flush is called.
> >>>
> >>> Signed-off-by: Zhe Qiu <phoeagon@gmail.com>
> >>> ---
> >>>  block/vdi.c | 3 +++
> >>>  1 file changed, 3 insertions(+)
> >> I missed Kevin's arguments before, but I think that adding this is
> >> more correct than not having it; and when thinking about speed, this
> >> is vdi, a format supported for compatibility.
> > If you use it only as a convert target, you probably care more about
> > speed than about leaks in case of a host crash.
> >
> >> So if we wanted to optimize it, we'd probably have to cache multiple
> >> allocations, do them at once and then flush afterwards (like the
> >> metadata cache we have in qcow2?)
> > That would defeat the purpose of this patch which aims at having
> > metadata and data written out almost at the same time. On the other
> > hand, fully avoiding the problem instead of just making the window
> > smaller would require a journal, which VDI just doesn't have.
> >
> > I'm not convinced of this patch, but I'll defer to Stefan Weil as the
> > VDI maintainer.
> >
> > Kevin
>
> Thanks for asking. I share your concerns regarding reduced performance
> caused by bdrv_flush. Conversions to VDI will take longer (how much?),
> and also installation of an OS on a new VDI disk image will be slower
> because that are the typical scenarios where the disk usage grows.
>
> @phoeagon: Did the benchmark which you used allocate additional disk
> storage? If not or if it only allocated once and then spent some time
> on already allocated blocks, that benchmark was not valid for this case.
>
> On the other hand I don't see a need for the flushing because the kind
> of failures (power failure) and their consequences seem to be acceptable
> for typical VDI usage, namely either image conversion or tests with
> existing images.
>
> That's why I'd prefer not to use bdrv_flush here. Could we make
> bdrv_flush optional (either generally or for cases like this one) so
> both people who prefer speed and people who would want
> bdrv_flush to decrease the likelihood of inconsistencies can be
> satisfied?
>
> Stefan
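[For anyone who wants to repeat the comparison above, a sketch of the
setup — not part of the original mail; the dd invocation and the 400 MB
size are only illustrative, and the target image should live on the
filesystem actually being measured, not on tmpfs:

    dd if=/dev/urandom of=/run/shm/rand bs=1M count=400
    time qemu-img convert -f raw -t writeback -O vdi /run/shm/rand 1.vdi
    time qemu-img convert -f raw -t unsafe    -O vdi /run/shm/rand 1.vdi

Running the writeback case once with a patched and once with a vanilla
qemu-img isolates the cost of the per-allocation flush.]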
* Re: [Qemu-devel] [PATCH v4] block/vdi: Use bdrv_flush after metadata updates
  From: phoeagon @ 2015-05-09  3:59 UTC
  To: Stefan Weil, Kevin Wolf, Max Reitz; +Cc: qemu-devel, qemu-block

BTW, how do you usually measure the time to install a Linux distro
within? Most distros ISOs do NOT have unattended installation ISOs in
place. (True I can bake my own ISOs for this...) But do you have any
ISOs made ready for this purpose?

On Sat, May 9, 2015 at 11:54 AM phoeagon <phoeagon@gmail.com> wrote:

> Thanks. Dbench does not logically allocate new disk space all the time,
> because it's a FS level benchmark that creates file and deletes them.
> Therefore it also depends on the guest FS, say, a btrfs guest FS
> allocates about 1.8x space of that from EXT4, due to its COW nature. It
> does cause the FS to allocate some space during about 1/3 of the test
> duration I think. But this does not mitigate it too much because a FS
> often writes in a stride rather than consecutively, which causes write
> amplification at allocation times.
>
> So I tested it with qemu-img convert from a 400M raw file:
> zheq-PC sdb # time ~/qemu-sync-test/bin/qemu-img convert -f raw -t unsafe -O vdi /run/shm/rand 1.vdi
>
> real    0m0.402s
> user    0m0.206s
> sys     0m0.202s
> zheq-PC sdb # time ~/qemu-sync-test/bin/qemu-img convert -f raw -t writeback -O vdi /run/shm/rand 1.vdi
>
> real    0m8.678s
> user    0m0.169s
> sys     0m0.500s
> zheq-PC sdb # time qemu-img convert -f raw -t writeback -O vdi /run/shm/rand 1.vdi
>
> real    0m4.320s
> user    0m0.148s
> sys     0m0.471s
> zheq-PC sdb # time qemu-img convert -f raw -t unsafe -O vdi /run/shm/rand 1.vdi
>
> real    0m0.489s
> user    0m0.173s
> sys     0m0.325s
>
> zheq-PC sdb # time qemu-img convert -f raw -O vdi /run/shm/rand 1.vdi
>
> real    0m0.515s
> user    0m0.168s
> sys     0m0.357s
> zheq-PC sdb # time ~/qemu-sync-test/bin/qemu-img convert -f raw -O vdi /run/shm/rand 1.vdi
>
> real    0m0.431s
> user    0m0.192s
> sys     0m0.248s
>
> Although 400M is not a giant file, it does show the trend.
> As you can see when there's drastic allocation needs, and when there no
> extra buffering from a virtualized host, the throughput drops about 50%.
> But still it has no effect on "unsafe" mode, as predicted. Also I believe
> that expecting to use a half-converted image is seldom the use case,
> while host crash and power loss are not so unimaginable.
> Looks like qemu-img convert is using "unsafe" as default as well, so even
> novice "qemu-img convert" users are not likely to find performance
> degradation.
>
> I have not yet tried guest OS installation on top, but I guess a new flag
> for one-time faster OS installation is not likely useful, and
> "cache=unsafe" already does the trick.
>
> On Sat, May 9, 2015 at 5:26 AM Stefan Weil <sw@weilnetz.de> wrote:
>
>> On 08.05.2015 15:55, Kevin Wolf wrote:
>> > On 08.05.2015 15:14, Max Reitz wrote:
>> >> On 07.05.2015 17:16, Zhe Qiu wrote:
>> >>> In reference to b0ad5a45...078a458e, metadata writes to
>> >>> qcow2/cow/qcow/vpc/vmdk are all synced prior to succeeding writes.
>> >>>
>> >>> Only when write is successful that bdrv_flush is called.
>> >>>
>> >>> Signed-off-by: Zhe Qiu <phoeagon@gmail.com>
>> >>> ---
>> >>>  block/vdi.c | 3 +++
>> >>>  1 file changed, 3 insertions(+)
>> >> I missed Kevin's arguments before, but I think that adding this is
>> >> more correct than not having it; and when thinking about speed, this
>> >> is vdi, a format supported for compatibility.
>> > If you use it only as a convert target, you probably care more about
>> > speed than about leaks in case of a host crash.
>> >
>> >> So if we wanted to optimize it, we'd probably have to cache multiple
>> >> allocations, do them at once and then flush afterwards (like the
>> >> metadata cache we have in qcow2?)
>> > That would defeat the purpose of this patch which aims at having
>> > metadata and data written out almost at the same time. On the other
>> > hand, fully avoiding the problem instead of just making the window
>> > smaller would require a journal, which VDI just doesn't have.
>> >
>> > I'm not convinced of this patch, but I'll defer to Stefan Weil as the
>> > VDI maintainer.
>> >
>> > Kevin
>>
>> Thanks for asking. I share your concerns regarding reduced performance
>> caused by bdrv_flush. Conversions to VDI will take longer (how much?),
>> and also installation of an OS on a new VDI disk image will be slower
>> because that are the typical scenarios where the disk usage grows.
>>
>> @phoeagon: Did the benchmark which you used allocate additional disk
>> storage? If not or if it only allocated once and then spent some time
>> on already allocated blocks, that benchmark was not valid for this case.
>>
>> On the other hand I don't see a need for the flushing because the kind
>> of failures (power failure) and their consequences seem to be acceptable
>> for typical VDI usage, namely either image conversion or tests with
>> existing images.
>>
>> That's why I'd prefer not to use bdrv_flush here. Could we make
>> bdrv_flush optional (either generally or for cases like this one) so
>> both people who prefer speed and people who would want
>> bdrv_flush to decrease the likelihood of inconsistencies can be
>> satisfied?
>>
>> Stefan
* Re: [Qemu-devel] [PATCH v4] block/vdi: Use bdrv_flush after metadata updates
  From: Stefan Weil @ 2015-05-09  6:39 UTC
  To: phoeagon, Kevin Wolf, Max Reitz; +Cc: qemu-devel, qemu-block

On 09.05.2015 05:59, phoeagon wrote:
> BTW, how do you usually measure the time to install a Linux distro
> within? Most distros ISOs do NOT have unattended installation ISOs in
> place. (True I can bake my own ISOs for this...) But do you have any
> ISOs made ready for this purpose?
>
> On Sat, May 9, 2015 at 11:54 AM phoeagon <phoeagon@gmail.com> wrote:
>
>     Thanks. Dbench does not logically allocate new disk space all the
>     time, because it's a FS level benchmark that creates file and
>     deletes them. Therefore it also depends on the guest FS, say, a
>     btrfs guest FS allocates about 1.8x space of that from EXT4, due
>     to its COW nature. It does cause the FS to allocate some space
>     during about 1/3 of the test duration I think. But this does not
>     mitigate it too much because a FS often writes in a stride rather
>     than consecutively, which causes write amplification at allocation
>     times.
>
>     So I tested it with qemu-img convert from a 400M raw file:
>     zheq-PC sdb # time ~/qemu-sync-test/bin/qemu-img convert -f raw -t unsafe -O vdi /run/shm/rand 1.vdi
>
>     real    0m0.402s
>     user    0m0.206s
>     sys     0m0.202s
>     zheq-PC sdb # time ~/qemu-sync-test/bin/qemu-img convert -f raw -t writeback -O vdi /run/shm/rand 1.vdi

I assume that the target file /run/shm/rand 1.vdi is not on a physical
disk. Then flushing data will be fast. For real hard disks (not SSDs)
the situation is different: the r/w heads of the hard disk have to move
between data location and the beginning of the written file where the
metadata is written, so I expect a larger effect there.

For measuring installation time of an OS, I'd take a reproducible
installation source (hard disk or DVD, no network connection) and take
the time for those parts of the installation where many packets are
installed without any user interaction. For Linux you won't need a stop
watch, because the packet directories in /usr/share/doc have nice
timestamps.

Stefan
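[A concrete form of the timestamp trick Stefan mentions — not from the
original mail, and the exact paths depend on the distribution — is to
compare the oldest and newest package documentation directories written
during the installation:

    ls -ltr /usr/share/doc | head -n 5    # earliest packages installed
    ls -ltr /usr/share/doc | tail -n 5    # latest packages installed

The gap between the two timestamps brackets the unattended
package-installation phase.]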
* Re: [Qemu-devel] [PATCH v4] block/vdi: Use bdrv_flush after metadata updates
  From: phoeagon @ 2015-05-09  7:41 UTC
  To: Stefan Weil, Kevin Wolf, Max Reitz; +Cc: qemu-devel, qemu-block

Full Linux Mint (17.1) installation with writeback:

  With VDI extra sync: 4min35s
  Vanilla:             3min17s

which is consistent with 'qemu-img convert' (slightly less overhead due
to some phases in installation is actually CPU bound). Still much faster
than other "sync-after-metadata" formats like VPC (vanilla VPC 7min43s).

The thing is he who needs to set up a new Linux system every day probably
have pre-installed images to start with, and others just don't install an
OS every day.

On Sat, May 9, 2015 at 2:39 PM Stefan Weil <sw@weilnetz.de> wrote:
> On 09.05.2015 05:59, phoeagon wrote:
> BTW, how do you usually measure the time to install a Linux distro
> within? Most distros ISOs do NOT have unattended installation ISOs in
> place. (True I can bake my own ISOs for this...) But do you have any
> ISOs made ready for this purpose?
>
> On Sat, May 9, 2015 at 11:54 AM phoeagon <phoeagon@gmail.com> wrote:
>
>> Thanks. Dbench does not logically allocate new disk space all the time,
>> because it's a FS level benchmark that creates file and deletes them.
>> Therefore it also depends on the guest FS, say, a btrfs guest FS
>> allocates about 1.8x space of that from EXT4, due to its COW nature. It
>> does cause the FS to allocate some space during about 1/3 of the test
>> duration I think. But this does not mitigate it too much because a FS
>> often writes in a stride rather than consecutively, which causes write
>> amplification at allocation times.
>>
>> So I tested it with qemu-img convert from a 400M raw file:
>> zheq-PC sdb # time ~/qemu-sync-test/bin/qemu-img convert -f raw -t unsafe -O vdi /run/shm/rand 1.vdi
>>
>> real    0m0.402s
>> user    0m0.206s
>> sys     0m0.202s
>> zheq-PC sdb # time ~/qemu-sync-test/bin/qemu-img convert -f raw -t writeback -O vdi /run/shm/rand 1.vdi
>
> I assume that the target file /run/shm/rand 1.vdi is not on a physical
> disk. Then flushing data will be fast. For real hard disks (not SSDs)
> the situation is different: the r/w heads of the hard disk have to move
> between data location and the beginning of the written file where the
> metadata is written, so I expect a larger effect there.
>
> For measuring installation time of an OS, I'd take a reproducible
> installation source (hard disk or DVD, no network connection) and take
> the time for those parts of the installation where many packets are
> installed without any user interaction. For Linux you won't need a stop
> watch, because the packet directories in /usr/share/doc have nice
> timestamps.
>
> Stefan
* Re: [Qemu-devel] [PATCH v4] block/vdi: Use bdrv_flush after metadata updates
  From: Paolo Bonzini @ 2015-05-10 15:01 UTC
  To: phoeagon, Stefan Weil, Kevin Wolf, Max Reitz; +Cc: qemu-devel, qemu-block

On 09/05/2015 05:54, phoeagon wrote:
> zheq-PC sdb # time ~/qemu-sync-test/bin/qemu-img convert -f raw -t writeback -O vdi /run/shm/rand 1.vdi
>
> real    0m8.678s
> user    0m0.169s
> sys     0m0.500s
>
> zheq-PC sdb # time qemu-img convert -f raw -t writeback -O vdi /run/shm/rand 1.vdi
>
> real    0m4.320s
> user    0m0.148s
> sys     0m0.471s

This means that 3.83 seconds are spent when bdrv_close() calls
bdrv_flush(). That's the only difference between writeback and unsafe
in qemu-img convert.

The remaining part of the time (4.85 seconds instead of 0.49 seconds)
means that, at least on your hardware, sequential writes to unallocated
space become 10 times slower with your patch.

Since the default qemu-img convert case isn't slowed down, I would
think that correctness trumps performance. Nevertheless, it's a huge
difference.

Paolo

> zheq-PC sdb # time qemu-img convert -f raw -t unsafe -O vdi /run/shm/rand 1.vdi
>
> real    0m0.489s
> user    0m0.173s
> sys     0m0.325s
* Re: [Qemu-devel] [PATCH v4] block/vdi: Use bdrv_flush after metadata updates
  From: Stefan Weil @ 2015-05-10 16:02 UTC
  To: Paolo Bonzini, phoeagon, Kevin Wolf, Max Reitz; +Cc: qemu-devel, qemu-block

On 10.05.2015 17:01, Paolo Bonzini wrote:
>
> On 09/05/2015 05:54, phoeagon wrote:
>> zheq-PC sdb # time ~/qemu-sync-test/bin/qemu-img convert -f raw -t writeback -O vdi /run/shm/rand 1.vdi
>>
>> real    0m8.678s
>> user    0m0.169s
>> sys     0m0.500s
>>
>> zheq-PC sdb # time qemu-img convert -f raw -t writeback -O vdi /run/shm/rand 1.vdi
>>
>> real    0m4.320s
>> user    0m0.148s
>> sys     0m0.471s
> This means that 3.83 seconds are spent when bdrv_close() calls
> bdrv_flush(). That's the only difference between writeback
> and unsafe in qemu-img convert.
>
> The remaining part of the time (4.85 seconds instead of 0.49
> seconds) means that, at least on your hardware, sequential writes
> to unallocated space become 10 times slower with your patch.
>
> Since the default qemu-img convert case isn't slowed down, I
> would think that correctness trumps performance. Nevertheless,
> it's a huge difference.
>
> Paolo

I doubt that the convert case isn't slowed down.

Writing to a tmpfs as it was obviously done for the test is not a
typical use case. With real hard disks I expect a significant slowdown.

Stefan
* Re: [Qemu-devel] [PATCH v4] block/vdi: Use bdrv_flush after metadata updates
  From: phoeagon @ 2015-05-10 16:05 UTC
  To: Stefan Weil, Paolo Bonzini, Kevin Wolf, Max Reitz; +Cc: qemu-devel, qemu-block

Just for clarity, I was not writing to tmpfs. I was READING from tmpfs.
I was writing to a path named 'sdb' (as you see in the prompt) which is
a btrfs on an SSD Drive. I don't have an HDD to test on though.

On Mon, May 11, 2015 at 12:02 AM Stefan Weil <sw@weilnetz.de> wrote:
> On 10.05.2015 17:01, Paolo Bonzini wrote:
> >
> > On 09/05/2015 05:54, phoeagon wrote:
> >> zheq-PC sdb # time ~/qemu-sync-test/bin/qemu-img convert -f raw -t writeback -O vdi /run/shm/rand 1.vdi
> >>
> >> real    0m8.678s
> >> user    0m0.169s
> >> sys     0m0.500s
> >>
> >> zheq-PC sdb # time qemu-img convert -f raw -t writeback -O vdi /run/shm/rand 1.vdi
> >>
> >> real    0m4.320s
> >> user    0m0.148s
> >> sys     0m0.471s
> > This means that 3.83 seconds are spent when bdrv_close() calls
> > bdrv_flush(). That's the only difference between writeback
> > and unsafe in qemu-img convert.
> >
> > The remaining part of the time (4.85 seconds instead of 0.49
> > seconds) means that, at least on your hardware, sequential writes
> > to unallocated space become 10 times slower with your patch.
> >
> > Since the default qemu-img convert case isn't slowed down, I
> > would think that correctness trumps performance. Nevertheless,
> > it's a huge difference.
> >
> > Paolo
>
> I doubt that the convert case isn't slowed down.
>
> Writing to a tmpfs as it was obviously done for the test is not a
> typical use case.
> With real hard disks I expect a significant slowdown.
>
> Stefan
* Re: [Qemu-devel] [PATCH v4] block/vdi: Use bdrv_flush after metadata updates
  From: Paolo Bonzini @ 2015-05-10 16:10 UTC
  To: Stefan Weil, phoeagon, Kevin Wolf, Max Reitz; +Cc: qemu-devel, qemu-block

On 10/05/2015 18:02, Stefan Weil wrote:
>> Since the default qemu-img convert case isn't slowed down, I
>> would think that correctness trumps performance. Nevertheless,
>> it's a huge difference.
>
> I doubt that the convert case isn't slowed down.

The *default* convert case isn't slowed down because "qemu-img convert"
defaults to the "unsafe" cache mode.

The *non-default* convert case with flushes was slowed down indeed: 2x
in total (if you include the final flush done by bdrv_close), and 10x if
you only consider the sequential write part of convert.

Paolo

> Writing to a tmpfs as it was obviously done for the test is not a
> typical use case.
> With real hard disks I expect a significant slowdown.
>
> Stefan
* Re: [Qemu-devel] [PATCH v4] block/vdi: Use bdrv_flush after metadata updates
  From: Stefan Weil @ 2015-05-10 16:26 UTC
  To: Paolo Bonzini, phoeagon, Kevin Wolf, Max Reitz; +Cc: qemu-devel, qemu-block

On 10.05.2015 18:10, Paolo Bonzini wrote:
> On 10/05/2015 18:02, Stefan Weil wrote:
>>> Since the default qemu-img convert case isn't slowed down, I
>>> would think that correctness trumps performance. Nevertheless,
>>> it's a huge difference.
>> I doubt that the convert case isn't slowed down.
> The *default* convert case isn't slowed down because "qemu-img convert"
> defaults to the "unsafe" cache mode.
>
> The *non-default* convert case with flushes was slowed down indeed: 2x
> in total (if you include the final flush done by bdrv_close), and 10x if
> you only consider the sequential write part of convert.
>
> Paolo

For those who might be interested:

The relevant GPL source code from VirtualBox is available here:

https://www.virtualbox.org/browser/vbox/trunk/src/VBox/Storage

If I interpret that code correctly, blocks are normally written
asynchronously, but changes of metadata (new block allocation) are
written synchronously.

See file VDI.cpp (function vdiBlockAllocUpdate) and VD.cpp
(vdIOIntWriteMeta).

Stefan
* Re: [Qemu-devel] [PATCH v4] block/vdi: Use bdrv_flush after metadata updates
  From: phoeagon @ 2015-05-10 17:14 UTC
  To: Stefan Weil, Paolo Bonzini, Kevin Wolf, Max Reitz; +Cc: qemu-devel, qemu-block

I'm not familiar with the VirtualBox code base, but looks like
"vdIOIntWriteMeta" can work both synchronously & asynchronously, and
"vdiBlockAllocUpdate" looks async to me. Frankly, skimming through the
code for 5 min doesn't enlighten me too much on its detailed
implementation, but looks like at least Virtualbox has VDI-repair that
fixes block leaks relatively easily.

I would agree that a more complete implementation on
VDI-check-and-repair might be better in this particular situation. I'm
not sure if there are other cases where flush after metadata update
might be better, but doesn't look like qemu-img auto repair is coming
to other writable image formats in the near future.

Also, I think that memory exhaustion and consequent page cache eviction
are not too uncommon on computers not originally designed to run VMs.
Many laptops are still trapped with 4GB memory and there seem to
widespread instructions on tuning down the swappiness to favor page
cache drops than swapping out memory, all of which adds to the odds of
metadata inconsistency.

On Mon, May 11, 2015 at 12:26 AM Stefan Weil <sw@weilnetz.de> wrote:
> On 10.05.2015 18:10, Paolo Bonzini wrote:
> > On 10/05/2015 18:02, Stefan Weil wrote:
> >>> Since the default qemu-img convert case isn't slowed down, I
> >>> would think that correctness trumps performance. Nevertheless,
> >>> it's a huge difference.
> >> I doubt that the convert case isn't slowed down.
> > The *default* convert case isn't slowed down because "qemu-img convert"
> > defaults to the "unsafe" cache mode.
> >
> > The *non-default* convert case with flushes was slowed down indeed: 2x
> > in total (if you include the final flush done by bdrv_close), and 10x if
> > you only consider the sequential write part of convert.
> >
> > Paolo
>
> For those who might be interested:
>
> The relevant GPL source code from VirtualBox is available here:
>
> https://www.virtualbox.org/browser/vbox/trunk/src/VBox/Storage
>
> If I interpret that code correctly, blocks are normally written
> asynchronously, but changes of metadata (new block allocation) are
> written synchronously.
>
> See file VDI.cpp (function vdiBlockAllocUpdate) and VD.cpp
> (vdIOIntWriteMeta).
>
> Stefan
* Re: [Qemu-devel] [Qemu-block] [PATCH v3] block/vdi: Use bdrv_flush after metadata updates
  From: Kevin Wolf @ 2015-05-08 10:43 UTC
  To: Zhe Qiu; +Cc: qemu-devel, qemu-block, mreitz

On 07.05.2015 06:04, Zhe Qiu wrote:
> From: phoeagon <phoeagon@gmail.com>
>
> In reference to b0ad5a455d7e5352d4c86ba945112011dbeadfb8~078a458e077d6b0db262c4b05fee51d01de2d1d2, metadata writes to qcow2/cow/qcow/vpc/vmdk are all synced prior to succeeding writes.

The justification for these patches (in 2010!) was more or less that we
didn't know whether the implementations were safe without the flushes.
Many of the flushes added back then have been removed again until today
because they have turned out to be unnecessary.

> Only when write is successful that bdrv_flush is called.
>
> Signed-off-by: Zhe Qiu <phoeagon@gmail.com>

Please describe why VDI needs this flush to be safe. This is best done
by describing a case where not doing the flush would lead to image
corruption in case of a crash.

To my knowledge, the VDI driver is completely safe without it.

Kevin
* Re: [Qemu-devel] [Qemu-block] [PATCH v3] block/vdi: Use bdrv_flush after metadata updates
  From: phoeagon @ 2015-05-08 11:50 UTC
  To: Kevin Wolf; +Cc: qemu-devel, qemu-block, mreitz

In case of correctness, lacking a sync here does not introduce data
corruption I can think of. But this reduces the volatile window during
which the metadata changes are NOT guaranteed on disk. Without a
barrier, in case of power loss you may end up with the bitmap changes
on disk and not the header block, or vice versa. Neither introduces
data corruption directly, but since VDI doesn't have proper fix
mechanism for qemu-img, once the leak is introduced you have to
"convert" to fix it, consuming a long time if the disk is large.

This patch does not fix the issue entirely, and it does not substitute
for proper check-and-fix implementation. But this should bring about
minor performance degradation (only 1 extra sync per allocation) but
greatly reduces the metadata inconsistency window.

On Fri, May 8, 2015 at 6:43 PM Kevin Wolf <kwolf@redhat.com> wrote:
> On 07.05.2015 06:04, Zhe Qiu wrote:
> > From: phoeagon <phoeagon@gmail.com>
> >
> > In reference to b0ad5a455d7e5352d4c86ba945112011dbeadfb8~078a458e077d6b0db262c4b05fee51d01de2d1d2, metadata writes to qcow2/cow/qcow/vpc/vmdk are all synced prior to succeeding writes.
>
> The justification for these patches (in 2010!) was more or less that we
> didn't know whether the implementations were safe without the flushes.
> Many of the flushes added back then have been removed again until today
> because they have turned out to be unnecessary.
>
> > Only when write is successful that bdrv_flush is called.
> >
> > Signed-off-by: Zhe Qiu <phoeagon@gmail.com>
>
> Please describe why VDI needs this flush to be safe. This is best done
> by describing a case where not doing the flush would lead to image
> corruption in case of a crash.
>
> To my knowledge, the VDI driver is completely safe without it.
>
> Kevin
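[The leak scenario described here can be observed from the host command
line; a rough illustration — not from the thread, file names are
placeholders. At the time of this discussion qemu-img could report leaked
VDI blocks but not repair them, which is why rebuilding the image via
convert is the workaround the mail refers to:

    qemu-img check leaky.vdi                         # reports leaked blocks, if any
    qemu-img convert -O vdi leaky.vdi repaired.vdi   # rebuild the image without the leaks
]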
* Re: [Qemu-devel] [Qemu-block] [PATCH v3] block/vdi: Use bdrv_flush after metadata updates
  From: Kevin Wolf @ 2015-05-08 12:02 UTC
  To: phoeagon; +Cc: qemu-devel, qemu-block, mreitz

On 08.05.2015 13:50, phoeagon wrote:
> In case of correctness, lacking a sync here does not introduce data
> corruption I can think of. But this reduces the volatile window during
> which the metadata changes are NOT guaranteed on disk. Without a
> barrier, in case of power loss you may end up with the bitmap changes
> on disk and not the header block, or vice versa. Neither introduces
> data corruption directly, but since VDI doesn't have proper fix
> mechanism for qemu-img, once the leak is introduced you have to
> "convert" to fix it, consuming a long time if the disk is large.

This is true. I'm not sure how big a problem this is in practice,
though.

> This patch does not fix the issue entirely, and it does not substitute
> for proper check-and-fix implementation. But this should bring about
> minor performance degradation (only 1 extra sync per allocation) but
> greatly reduces the metadata inconsistency window.

Did you benchmark this? From the past experience with flushes in qemu
block drivers, one sync per allocation certainly doesn't sound "minor".

What could possibly save us from the worst is that VDI has a relatively
large block size (or rather, that we don't support images with different
block sizes).

Kevin
* Re: [Qemu-devel] [Qemu-block] [PATCH v3] block/vdi: Use bdrv_flush after metadata updates
  From: phoeagon @ 2015-05-08 12:56 UTC
  To: Kevin Wolf; +Cc: qemu-devel, qemu-block, mreitz

I tested it on host-btrfs, guest-ext4, (connected to guest via
virtualized IDE) with 1G VDI image, testing with "dbench 10":

  synced:
    Writeback: 39.4852M/s   326.812ms
    Unsafe:    432.72M/s    1029.313ms
  normal:
    Writeback: 39.1884M/s   311.383ms
    Unsafe:    413.592M/s   280.951ms

Although I hear that dbench is not a good I/O benchmark (and iozone is
just a little too much hassle) but I'm pretty sure the difference, if
any, is within fluctuation in this case. Under my setting even a raw
"dd of=/dev/sdb" (from within a VM) doesn't have higher than 20M/s even
without this extra write barrier in the proposed patch.

So although what described above is not comprehensive (you can probably
extract the most overhead by deliberately O_DIRECT writing at 1M stride,
512K each, no application level sync in "writeback" cache mode of VDI,
what originally mostly resides in host page cache now must be flushed
to hard disk, probably in an inconvenient order if host FS highly
fragmented), I doubt consecutive raw writes cover several Megabytes or
guest-FS based workload would see much performance overhead after
introducing this extra sync.

On Fri, May 8, 2015 at 8:02 PM Kevin Wolf <kwolf@redhat.com> wrote:
> On 08.05.2015 13:50, phoeagon wrote:
> > In case of correctness, lacking a sync here does not introduce data
> > corruption I can think of. But this reduces the volatile window during
> > which the metadata changes are NOT guaranteed on disk. Without a
> > barrier, in case of power loss you may end up with the bitmap changes
> > on disk and not the header block, or vice versa. Neither introduces
> > data corruption directly, but since VDI doesn't have proper fix
> > mechanism for qemu-img, once the leak is introduced you have to
> > "convert" to fix it, consuming a long time if the disk is large.
>
> This is true. I'm not sure how big a problem this is in practice,
> though.
>
> > This patch does not fix the issue entirely, and it does not substitute
> > for proper check-and-fix implementation. But this should bring about
> > minor performance degradation (only 1 extra sync per allocation) but
> > greatly reduces the metadata inconsistency window.
>
> Did you benchmark this? From the past experience with flushes in qemu
> block drivers, one sync per allocation certainly doesn't sound "minor".
>
> What could possibly save us from the worst is that VDI has a relatively
> large block size (or rather, that we don't support images with different
> block sizes).
>
> Kevin
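[For reference, a minimal version of the in-guest measurement described
above might look like the following — not from the original mail; the
image path, drive options and mount point are assumptions:

    qemu-img create -f vdi test.vdi 1G
    # attach it to the guest as a second disk, e.g. with
    #   -drive file=test.vdi,format=vdi,if=ide,cache=writeback
    # then, inside the guest, after mkfs and mounting that disk:
    dbench -D /mnt/test 10

Comparing the throughput reported by dbench for a patched and an
unpatched qemu binary gives the "synced" vs. "normal" numbers above.]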
* Re: [Qemu-devel] [PATCH v3] block/vdi: Use bdrv_flush after metadata updates
  From: Max Reitz @ 2015-05-08 13:10 UTC
  To: Zhe Qiu, qemu-devel; +Cc: qemu-block

On 07.05.2015 06:04, Zhe Qiu wrote:
> From: phoeagon <phoeagon@gmail.com>
>
> In reference to b0ad5a455d7e5352d4c86ba945112011dbeadfb8~078a458e077d6b0db262c4b05fee51d01de2d1d2, metadata writes to qcow2/cow/qcow/vpc/vmdk are all synced prior to succeeding writes.
>
> Only when write is successful that bdrv_flush is called.
>
> Signed-off-by: Zhe Qiu <phoeagon@gmail.com>
> ---
>  block/vdi.c | 3 +++
>  1 file changed, 3 insertions(+)

Reviewed-by: Max Reitz <mreitz@redhat.com>

Thanks!

> diff --git a/block/vdi.c b/block/vdi.c
> index 7642ef3..dfe8ade 100644
> --- a/block/vdi.c
> +++ b/block/vdi.c
> @@ -713,6 +713,9 @@ static int vdi_co_write(BlockDriverState *bs,
>          logout("will write %u block map sectors starting from entry %u\n",
>                 n_sectors, bmap_first);
>          ret = bdrv_write(bs->file, offset, base, n_sectors);
> +        if (ret >= 0) {
> +            ret = bdrv_flush(bs->file);
> +        }
>      }
>
>      return ret;