* [Qemu-devel] [PATCH v3] block/vdi: Use bdrv_flush after metadata updates
  From: Zhe Qiu @ 2015-05-07  4:04 UTC
  To: qemu-devel; +Cc: phoeagon, qemu-block, mreitz

From: phoeagon <phoeagon@gmail.com>

In reference to b0ad5a455d7e5352d4c86ba945112011dbeadfb8~078a458e077d6b0db262c4b05fee51d01de2d1d2,
metadata writes to qcow2/cow/qcow/vpc/vmdk are all synced prior to
succeeding writes.

Only when write is successful that bdrv_flush is called.

Signed-off-by: Zhe Qiu <phoeagon@gmail.com>
---
 block/vdi.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/block/vdi.c b/block/vdi.c
index 7642ef3..dfe8ade 100644
--- a/block/vdi.c
+++ b/block/vdi.c
@@ -713,6 +713,9 @@ static int vdi_co_write(BlockDriverState *bs,
         logout("will write %u block map sectors starting from entry %u\n",
                n_sectors, bmap_first);
         ret = bdrv_write(bs->file, offset, base, n_sectors);
+        if (ret >= 0) {
+            ret = bdrv_flush(bs->file);
+        }
     }
 
     return ret;
-- 
2.4.0
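[Judging from the diff and the discussion that follows, the new flush runs
only on the block-allocating path of vdi_co_write(), i.e. when a write
touches a not-yet-allocated block and the block map sectors have to be
rewritten. A rough way to exercise that path outside a guest — not taken
from the thread; image name and sizes are arbitrary:

    qemu-img create -f vdi test.vdi 1G
    qemu-io -c 'write 0 64k' test.vdi

The first write into the empty image allocates a data block, so the driver
rewrites the block map and, with this patch applied, flushes the image file
afterwards.]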
* Re: [Qemu-devel] [PATCH v3] block/vdi: Use bdrv_flush after metadata updates
  From: Stefan Hajnoczi @ 2015-05-07 10:09 UTC
  To: Zhe Qiu; +Cc: Kevin Wolf, Stefan Weil, qemu-devel, qemu-block, mreitz

On Thu, May 07, 2015 at 12:04:56PM +0800, Zhe Qiu wrote:
> From: phoeagon <phoeagon@gmail.com>
>
> In reference to b0ad5a455d7e5352d4c86ba945112011dbeadfb8~078a458e077d6b0db262c4b05fee51d01de2d1d2, metadata writes to qcow2/cow/qcow/vpc/vmdk are all synced prior to succeeding writes.
>
> Only when write is successful that bdrv_flush is called.
>
> Signed-off-by: Zhe Qiu <phoeagon@gmail.com>
> ---
>  block/vdi.c | 3 +++
>  1 file changed, 3 insertions(+)

CCing Stefan Weil and Kevin Wolf (see output from
scripts/get_maintainer.pl -f block/vdi.c).

>
> diff --git a/block/vdi.c b/block/vdi.c
> index 7642ef3..dfe8ade 100644
> --- a/block/vdi.c
> +++ b/block/vdi.c
> @@ -713,6 +713,9 @@ static int vdi_co_write(BlockDriverState *bs,
>          logout("will write %u block map sectors starting from entry %u\n",
>                 n_sectors, bmap_first);
>          ret = bdrv_write(bs->file, offset, base, n_sectors);
> +        if (ret >= 0) {
> +            ret = bdrv_flush(bs->file);
> +        }
>      }
>
>      return ret;
> --
> 2.4.0
* Re: [Qemu-devel] [Qemu-block] [PATCH v3] block/vdi: Use bdrv_flush after metadata updates
  From: Eric Blake @ 2015-05-07 15:05 UTC
  To: Zhe Qiu, qemu-devel; +Cc: qemu-block, mreitz

On 05/06/2015 10:04 PM, Zhe Qiu wrote:
> From: phoeagon <phoeagon@gmail.com>
>
> In reference to b0ad5a455d7e5352d4c86ba945112011dbeadfb8~078a458e077d6b0db262c4b05fee51d01de2d1d2, metadata writes to qcow2/cow/qcow/vpc/vmdk are all synced prior to succeeding writes.
>

Please wrap commit comments to be under 80 columns (in fact, under 72 is
good, because 'git log' adds spaces when displaying commit bodies).

Your notation of commit~commit is unusual; ranges in git are usually
spelled commit..commit.  Also, it's okay to abbreviate commit SHAs to 8
bytes or so.  So to shrink your long line, I'd write b0ad5a45..078a458e.

> Only when write is successful that bdrv_flush is called.
>
> Signed-off-by: Zhe Qiu <phoeagon@gmail.com>

Your 'From:' and 'Signed-off-by:' lines have different spellings, which
makes it more confusing when searching for patches by you.  You can add
an entry to .mailmap (as a separate patch) to retro-actively consolidate
entries, but it is better to catch things like this up front before we
even have the problem of separate names.  I suspect you want "Zhe Qiu"
as the spelling of your name in both lines (git config can be taught to
set up the preferred name to attribute both for signing patches and for
sending email).  It is also acceptable to use UTF-8 to spell your name
in native characters, or even a combination of "native name (ascii
counterpart)"

--
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org
* [Qemu-devel] [PATCH v4] block/vdi: Use bdrv_flush after metadata updates
  From: Zhe Qiu @ 2015-05-07 15:16 UTC
  To: qemu-devel; +Cc: kwolf, Zhe Qiu, qemu-block, sw, mreitz

In reference to b0ad5a45...078a458e, metadata writes to
qcow2/cow/qcow/vpc/vmdk are all synced prior to succeeding writes.

Only when write is successful that bdrv_flush is called.

Signed-off-by: Zhe Qiu <phoeagon@gmail.com>
---
 block/vdi.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/block/vdi.c b/block/vdi.c
index 7642ef3..dfe8ade 100644
--- a/block/vdi.c
+++ b/block/vdi.c
@@ -713,6 +713,9 @@ static int vdi_co_write(BlockDriverState *bs,
         logout("will write %u block map sectors starting from entry %u\n",
                n_sectors, bmap_first);
         ret = bdrv_write(bs->file, offset, base, n_sectors);
+        if (ret >= 0) {
+            ret = bdrv_flush(bs->file);
+        }
     }
 
     return ret;
-- 
2.4.0
* Re: [Qemu-devel] [PATCH v4] block/vdi: Use bdrv_flush after metadata updates
  From: Max Reitz @ 2015-05-08 13:14 UTC
  To: Zhe Qiu, qemu-devel; +Cc: kwolf, sw, qemu-block

On 07.05.2015 17:16, Zhe Qiu wrote:
> In reference to b0ad5a45...078a458e, metadata writes to
> qcow2/cow/qcow/vpc/vmdk are all synced prior to succeeding writes.
>
> Only when write is successful that bdrv_flush is called.
>
> Signed-off-by: Zhe Qiu <phoeagon@gmail.com>
> ---
>  block/vdi.c | 3 +++
>  1 file changed, 3 insertions(+)

I missed Kevin's arguments before, but I think that adding this is more
correct than not having it; and when thinking about speed, this is vdi,
a format supported for compatibility. Writing isn't our most important
concern anyway (especially considering that it's even possible to
disable write support for vdi at compile time).

So if we wanted to optimize it, we'd probably have to cache multiple
allocations, do them at once and then flush afterwards (like the
metadata cache we have in qcow2?), but that is complicated (like the
metadata cache in qcow2), certainly too complicated for a format
supported for compatibility (unless someone really wants to implement
it).

Reviewed-by: Max Reitz <mreitz@redhat.com>

>
> diff --git a/block/vdi.c b/block/vdi.c
> index 7642ef3..dfe8ade 100644
> --- a/block/vdi.c
> +++ b/block/vdi.c
> @@ -713,6 +713,9 @@ static int vdi_co_write(BlockDriverState *bs,
>          logout("will write %u block map sectors starting from entry %u\n",
>                 n_sectors, bmap_first);
>          ret = bdrv_write(bs->file, offset, base, n_sectors);
> +        if (ret >= 0) {
> +            ret = bdrv_flush(bs->file);
> +        }
>      }
>
>      return ret;
* Re: [Qemu-devel] [PATCH v4] block/vdi: Use bdrv_flush after metadata updates
  From: Kevin Wolf @ 2015-05-08 13:55 UTC
  To: Max Reitz; +Cc: sw, Zhe Qiu, qemu-devel, qemu-block

On 08.05.2015 15:14, Max Reitz wrote:
> On 07.05.2015 17:16, Zhe Qiu wrote:
> > In reference to b0ad5a45...078a458e, metadata writes to
> > qcow2/cow/qcow/vpc/vmdk are all synced prior to succeeding writes.
> >
> > Only when write is successful that bdrv_flush is called.
> >
> > Signed-off-by: Zhe Qiu <phoeagon@gmail.com>
> > ---
> >  block/vdi.c | 3 +++
> >  1 file changed, 3 insertions(+)
>
> I missed Kevin's arguments before, but I think that adding this is
> more correct than not having it; and when thinking about speed, this
> is vdi, a format supported for compatibility.

If you use it only as a convert target, you probably care more about
speed than about leaks in case of a host crash.

> So if we wanted to optimize it, we'd probably have to cache multiple
> allocations, do them at once and then flush afterwards (like the
> metadata cache we have in qcow2?)

That would defeat the purpose of this patch which aims at having
metadata and data written out almost at the same time. On the other
hand, fully avoiding the problem instead of just making the window
smaller would require a journal, which VDI just doesn't have.

I'm not convinced of this patch, but I'll defer to Stefan Weil as the
VDI maintainer.

Kevin
* Re: [Qemu-devel] [PATCH v4] block/vdi: Use bdrv_flush after metadata updates
  From: phoeagon @ 2015-05-08 14:43 UTC
  To: Kevin Wolf, Max Reitz; +Cc: sw, qemu-devel, qemu-block

Then I would guess the same reason should apply to VMDK/VPC as well...
Their metadata update protocol is not atomic either, and a sync after
metadata update doesn't fix the whole thing theoretically either. Yet
the metadata sync patches as old as since 2010 are still there. It
should also be a performance boost if we remove those write barriers as
well, if conversion performance is our major concern.

I think when converting images, one can always opt for "cache=unsafe"
to avoid potential performance degradation from conservative cache (it
should really be default for qemu-img convert, but I don't know if it's
the case), so conversion performance shouldn't be a reason to sacrifice
VM-runtime consistency.

On Fri, May 8, 2015 at 9:55 PM Kevin Wolf <kwolf@redhat.com> wrote:
> On 08.05.2015 15:14, Max Reitz wrote:
> > On 07.05.2015 17:16, Zhe Qiu wrote:
> > > In reference to b0ad5a45...078a458e, metadata writes to
> > > qcow2/cow/qcow/vpc/vmdk are all synced prior to succeeding writes.
> > >
> > > Only when write is successful that bdrv_flush is called.
> > >
> > > Signed-off-by: Zhe Qiu <phoeagon@gmail.com>
> > > ---
> > >  block/vdi.c | 3 +++
> > >  1 file changed, 3 insertions(+)
> >
> > I missed Kevin's arguments before, but I think that adding this is
> > more correct than not having it; and when thinking about speed, this
> > is vdi, a format supported for compatibility.
>
> If you use it only as a convert target, you probably care more about
> speed than about leaks in case of a host crash.
>
> > So if we wanted to optimize it, we'd probably have to cache multiple
> > allocations, do them at once and then flush afterwards (like the
> > metadata cache we have in qcow2?)
>
> That would defeat the purpose of this patch which aims at having
> metadata and data written out almost at the same time. On the other
> hand, fully avoiding the problem instead of just making the window
> smaller would require a journal, which VDI just doesn't have.
>
> I'm not convinced of this patch, but I'll defer to Stefan Weil as the
> VDI maintainer.
>
> Kevin
* Re: [Qemu-devel] [PATCH v4] block/vdi: Use bdrv_flush after metadata updates
  From: Stefan Weil @ 2015-05-08 21:26 UTC
  To: Kevin Wolf, Max Reitz; +Cc: Zhe Qiu, qemu-devel, qemu-block

On 08.05.2015 15:55, Kevin Wolf wrote:
> On 08.05.2015 15:14, Max Reitz wrote:
>> On 07.05.2015 17:16, Zhe Qiu wrote:
>>> In reference to b0ad5a45...078a458e, metadata writes to
>>> qcow2/cow/qcow/vpc/vmdk are all synced prior to succeeding writes.
>>>
>>> Only when write is successful that bdrv_flush is called.
>>>
>>> Signed-off-by: Zhe Qiu <phoeagon@gmail.com>
>>> ---
>>>  block/vdi.c | 3 +++
>>>  1 file changed, 3 insertions(+)
>> I missed Kevin's arguments before, but I think that adding this is
>> more correct than not having it; and when thinking about speed, this
>> is vdi, a format supported for compatibility.
> If you use it only as a convert target, you probably care more about
> speed than about leaks in case of a host crash.
>
>> So if we wanted to optimize it, we'd probably have to cache multiple
>> allocations, do them at once and then flush afterwards (like the
>> metadata cache we have in qcow2?)
> That would defeat the purpose of this patch which aims at having
> metadata and data written out almost at the same time. On the other
> hand, fully avoiding the problem instead of just making the window
> smaller would require a journal, which VDI just doesn't have.
>
> I'm not convinced of this patch, but I'll defer to Stefan Weil as the
> VDI maintainer.
>
> Kevin

Thanks for asking. I share your concerns regarding reduced performance
caused by bdrv_flush. Conversions to VDI will take longer (how much?),
and also installation of an OS on a new VDI disk image will be slower
because that are the typical scenarios where the disk usage grows.

@phoeagon: Did the benchmark which you used allocate additional disk
storage? If not or if it only allocated once and then spent some time
on already allocated blocks, that benchmark was not valid for this case.

On the other hand I don't see a need for the flushing because the kind
of failures (power failure) and their consequences seem to be acceptable
for typical VDI usage, namely either image conversion or tests with
existing images.

That's why I'd prefer not to use bdrv_flush here. Could we make
bdrv_flush optional (either generally or for cases like this one) so
both people who prefer speed and people who would want bdrv_flush to
decrease the likelihood of inconsistencies can be satisfied?

Stefan
* Re: [Qemu-devel] [PATCH v4] block/vdi: Use bdrv_flush after metadata updates
  From: phoeagon @ 2015-05-09  3:54 UTC
  To: Stefan Weil, Kevin Wolf, Max Reitz; +Cc: qemu-devel, qemu-block

Thanks. Dbench does not logically allocate new disk space all the time,
because it's a FS level benchmark that creates file and deletes them.
Therefore it also depends on the guest FS, say, a btrfs guest FS
allocates about 1.8x space of that from EXT4, due to its COW nature. It
does cause the FS to allocate some space during about 1/3 of the test
duration I think. But this does not mitigate it too much because a FS
often writes in a stride rather than consecutively, which causes write
amplification at allocation times.

So I tested it with qemu-img convert from a 400M raw file:

zheq-PC sdb # time ~/qemu-sync-test/bin/qemu-img convert -f raw -t unsafe -O vdi /run/shm/rand 1.vdi

real    0m0.402s
user    0m0.206s
sys     0m0.202s

zheq-PC sdb # time ~/qemu-sync-test/bin/qemu-img convert -f raw -t writeback -O vdi /run/shm/rand 1.vdi

real    0m8.678s
user    0m0.169s
sys     0m0.500s

zheq-PC sdb # time qemu-img convert -f raw -t writeback -O vdi /run/shm/rand 1.vdi

real    0m4.320s
user    0m0.148s
sys     0m0.471s

zheq-PC sdb # time qemu-img convert -f raw -t unsafe -O vdi /run/shm/rand 1.vdi

real    0m0.489s
user    0m0.173s
sys     0m0.325s

zheq-PC sdb # time qemu-img convert -f raw -O vdi /run/shm/rand 1.vdi

real    0m0.515s
user    0m0.168s
sys     0m0.357s

zheq-PC sdb # time ~/qemu-sync-test/bin/qemu-img convert -f raw -O vdi /run/shm/rand 1.vdi

real    0m0.431s
user    0m0.192s
sys     0m0.248s

Although 400M is not a giant file, it does show the trend.
As you can see when there's drastic allocation needs, and when there no
extra buffering from a virtualized host, the throughput drops about 50%.
But still it has no effect on "unsafe" mode, as predicted. Also I
believe that expecting to use a half-converted image is seldom the use
case, while host crash and power loss are not so unimaginable.
Looks like qemu-img convert is using "unsafe" as default as well, so
even novice "qemu-img convert" users are not likely to find performance
degradation.

I have not yet tried guest OS installation on top, but I guess a new
flag for one-time faster OS installation is not likely useful, and
"cache=unsafe" already does the trick.

On Sat, May 9, 2015 at 5:26 AM Stefan Weil <sw@weilnetz.de> wrote:
> On 08.05.2015 15:55, Kevin Wolf wrote:
> > On 08.05.2015 15:14, Max Reitz wrote:
> >> On 07.05.2015 17:16, Zhe Qiu wrote:
> >>> In reference to b0ad5a45...078a458e, metadata writes to
> >>> qcow2/cow/qcow/vpc/vmdk are all synced prior to succeeding writes.
> >>>
> >>> Only when write is successful that bdrv_flush is called.
> >>>
> >>> Signed-off-by: Zhe Qiu <phoeagon@gmail.com>
> >>> ---
> >>>  block/vdi.c | 3 +++
> >>>  1 file changed, 3 insertions(+)
> >> I missed Kevin's arguments before, but I think that adding this is
> >> more correct than not having it; and when thinking about speed, this
> >> is vdi, a format supported for compatibility.
> > If you use it only as a convert target, you probably care more about
> > speed than about leaks in case of a host crash.
> >
> >> So if we wanted to optimize it, we'd probably have to cache multiple
> >> allocations, do them at once and then flush afterwards (like the
> >> metadata cache we have in qcow2?)
> > That would defeat the purpose of this patch which aims at having
> > metadata and data written out almost at the same time. On the other
> > hand, fully avoiding the problem instead of just making the window
> > smaller would require a journal, which VDI just doesn't have.
> >
> > I'm not convinced of this patch, but I'll defer to Stefan Weil as the
> > VDI maintainer.
> >
> > Kevin
>
> Thanks for asking. I share your concerns regarding reduced performance
> caused by bdrv_flush. Conversions to VDI will take longer (how much?),
> and also installation of an OS on a new VDI disk image will be slower
> because that are the typical scenarios where the disk usage grows.
>
> @phoeagon: Did the benchmark which you used allocate additional disk
> storage? If not or if it only allocated once and then spent some time
> on already allocated blocks, that benchmark was not valid for this case.
>
> On the other hand I don't see a need for the flushing because the kind
> of failures (power failure) and their consequences seem to be acceptable
> for typical VDI usage, namely either image conversion or tests with
> existing images.
>
> That's why I'd prefer not to use bdrv_flush here. Could we make
> bdrv_flush optional (either generally or for cases like this one) so
> both people who prefer speed and people who would want
> bdrv_flush to decrease the likelihood of inconsistencies can be
> satisfied?
>
> Stefan
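[For anyone who wants to repeat the comparison above, a sketch of the
setup — not part of the original mail; the dd invocation and the 400 MB
size are only illustrative, and the target image should live on the
filesystem actually being measured, not on tmpfs:

    dd if=/dev/urandom of=/run/shm/rand bs=1M count=400
    time qemu-img convert -f raw -t writeback -O vdi /run/shm/rand 1.vdi
    time qemu-img convert -f raw -t unsafe    -O vdi /run/shm/rand 1.vdi

Running the writeback case once with a patched and once with a vanilla
qemu-img isolates the cost of the per-allocation flush.]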
* Re: [Qemu-devel] [PATCH v4] block/vdi: Use bdrv_flush after metadata updates
  From: phoeagon @ 2015-05-09  3:59 UTC
  To: Stefan Weil, Kevin Wolf, Max Reitz; +Cc: qemu-devel, qemu-block

BTW, how do you usually measure the time to install a Linux distro
within? Most distros ISOs do NOT have unattended installation ISOs in
place. (True I can bake my own ISOs for this...) But do you have any
ISOs made ready for this purpose?

On Sat, May 9, 2015 at 11:54 AM phoeagon <phoeagon@gmail.com> wrote:

> Thanks. Dbench does not logically allocate new disk space all the time,
> because it's a FS level benchmark that creates file and deletes them.
> Therefore it also depends on the guest FS, say, a btrfs guest FS
> allocates about 1.8x space of that from EXT4, due to its COW nature. It
> does cause the FS to allocate some space during about 1/3 of the test
> duration I think. But this does not mitigate it too much because a FS
> often writes in a stride rather than consecutively, which causes write
> amplification at allocation times.
>
> So I tested it with qemu-img convert from a 400M raw file:
> zheq-PC sdb # time ~/qemu-sync-test/bin/qemu-img convert -f raw -t unsafe -O vdi /run/shm/rand 1.vdi
>
> real    0m0.402s
> user    0m0.206s
> sys     0m0.202s
> zheq-PC sdb # time ~/qemu-sync-test/bin/qemu-img convert -f raw -t writeback -O vdi /run/shm/rand 1.vdi
>
> real    0m8.678s
> user    0m0.169s
> sys     0m0.500s
> zheq-PC sdb # time qemu-img convert -f raw -t writeback -O vdi /run/shm/rand 1.vdi
>
> real    0m4.320s
> user    0m0.148s
> sys     0m0.471s
> zheq-PC sdb # time qemu-img convert -f raw -t unsafe -O vdi /run/shm/rand 1.vdi
>
> real    0m0.489s
> user    0m0.173s
> sys     0m0.325s
>
> zheq-PC sdb # time qemu-img convert -f raw -O vdi /run/shm/rand 1.vdi
>
> real    0m0.515s
> user    0m0.168s
> sys     0m0.357s
> zheq-PC sdb # time ~/qemu-sync-test/bin/qemu-img convert -f raw -O vdi /run/shm/rand 1.vdi
>
> real    0m0.431s
> user    0m0.192s
> sys     0m0.248s
>
> Although 400M is not a giant file, it does show the trend.
> As you can see when there's drastic allocation needs, and when there no
> extra buffering from a virtualized host, the throughput drops about 50%.
> But still it has no effect on "unsafe" mode, as predicted. Also I believe
> that expecting to use a half-converted image is seldom the use case,
> while host crash and power loss are not so unimaginable.
> Looks like qemu-img convert is using "unsafe" as default as well, so even
> novice "qemu-img convert" users are not likely to find performance
> degradation.
>
> I have not yet tried guest OS installation on top, but I guess a new flag
> for one-time faster OS installation is not likely useful, and
> "cache=unsafe" already does the trick.
>
> On Sat, May 9, 2015 at 5:26 AM Stefan Weil <sw@weilnetz.de> wrote:
>
>> On 08.05.2015 15:55, Kevin Wolf wrote:
>> > On 08.05.2015 15:14, Max Reitz wrote:
>> >> On 07.05.2015 17:16, Zhe Qiu wrote:
>> >>> In reference to b0ad5a45...078a458e, metadata writes to
>> >>> qcow2/cow/qcow/vpc/vmdk are all synced prior to succeeding writes.
>> >>>
>> >>> Only when write is successful that bdrv_flush is called.
>> >>>
>> >>> Signed-off-by: Zhe Qiu <phoeagon@gmail.com>
>> >>> ---
>> >>>  block/vdi.c | 3 +++
>> >>>  1 file changed, 3 insertions(+)
>> >> I missed Kevin's arguments before, but I think that adding this is
>> >> more correct than not having it; and when thinking about speed, this
>> >> is vdi, a format supported for compatibility.
>> > If you use it only as a convert target, you probably care more about
>> > speed than about leaks in case of a host crash.
>> >
>> >> So if we wanted to optimize it, we'd probably have to cache multiple
>> >> allocations, do them at once and then flush afterwards (like the
>> >> metadata cache we have in qcow2?)
>> > That would defeat the purpose of this patch which aims at having
>> > metadata and data written out almost at the same time. On the other
>> > hand, fully avoiding the problem instead of just making the window
>> > smaller would require a journal, which VDI just doesn't have.
>> >
>> > I'm not convinced of this patch, but I'll defer to Stefan Weil as the
>> > VDI maintainer.
>> >
>> > Kevin
>>
>> Thanks for asking. I share your concerns regarding reduced performance
>> caused by bdrv_flush. Conversions to VDI will take longer (how much?),
>> and also installation of an OS on a new VDI disk image will be slower
>> because that are the typical scenarios where the disk usage grows.
>>
>> @phoeagon: Did the benchmark which you used allocate additional disk
>> storage? If not or if it only allocated once and then spent some time
>> on already allocated blocks, that benchmark was not valid for this case.
>>
>> On the other hand I don't see a need for the flushing because the kind
>> of failures (power failure) and their consequences seem to be acceptable
>> for typical VDI usage, namely either image conversion or tests with
>> existing images.
>>
>> That's why I'd prefer not to use bdrv_flush here. Could we make
>> bdrv_flush optional (either generally or for cases like this one) so
>> both people who prefer speed and people who would want
>> bdrv_flush to decrease the likelihood of inconsistencies can be
>> satisfied?
>>
>> Stefan
* Re: [Qemu-devel] [PATCH v4] block/vdi: Use bdrv_flush after metadata updates
  From: Stefan Weil @ 2015-05-09  6:39 UTC
  To: phoeagon, Kevin Wolf, Max Reitz; +Cc: qemu-devel, qemu-block

On 09.05.2015 05:59, phoeagon wrote:
> BTW, how do you usually measure the time to install a Linux distro
> within? Most distros ISOs do NOT have unattended installation ISOs in
> place. (True I can bake my own ISOs for this...) But do you have any
> ISOs made ready for this purpose?
>
> On Sat, May 9, 2015 at 11:54 AM phoeagon <phoeagon@gmail.com> wrote:
>
>     Thanks. Dbench does not logically allocate new disk space all the
>     time, because it's a FS level benchmark that creates file and
>     deletes them. Therefore it also depends on the guest FS, say, a
>     btrfs guest FS allocates about 1.8x space of that from EXT4, due
>     to its COW nature. It does cause the FS to allocate some space
>     during about 1/3 of the test duration I think. But this does not
>     mitigate it too much because a FS often writes in a stride rather
>     than consecutively, which causes write amplification at allocation
>     times.
>
>     So I tested it with qemu-img convert from a 400M raw file:
>     zheq-PC sdb # time ~/qemu-sync-test/bin/qemu-img convert -f raw -t unsafe -O vdi /run/shm/rand 1.vdi
>
>     real    0m0.402s
>     user    0m0.206s
>     sys     0m0.202s
>     zheq-PC sdb # time ~/qemu-sync-test/bin/qemu-img convert -f raw -t writeback -O vdi /run/shm/rand 1.vdi

I assume that the target file /run/shm/rand 1.vdi is not on a physical
disk. Then flushing data will be fast. For real hard disks (not SSDs)
the situation is different: the r/w heads of the hard disk have to move
between data location and the beginning of the written file where the
metadata is written, so I expect a larger effect there.

For measuring installation time of an OS, I'd take a reproducible
installation source (hard disk or DVD, no network connection) and take
the time for those parts of the installation where many packets are
installed without any user interaction. For Linux you won't need a stop
watch, because the packet directories in /usr/share/doc have nice
timestamps.

Stefan
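[A concrete form of the timestamp trick Stefan mentions — not from the
original mail, and the exact paths depend on the distribution — is to
compare the oldest and newest package documentation directories written
during the installation:

    ls -ltr /usr/share/doc | head -n 5    # earliest packages installed
    ls -ltr /usr/share/doc | tail -n 5    # latest packages installed

The gap between the two timestamps brackets the unattended
package-installation phase.]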
* Re: [Qemu-devel] [PATCH v4] block/vdi: Use bdrv_flush after metadata updates
  From: phoeagon @ 2015-05-09  7:41 UTC
  To: Stefan Weil, Kevin Wolf, Max Reitz; +Cc: qemu-devel, qemu-block

Full Linux Mint (17.1) installation with writeback:

  With VDI extra sync: 4min35s
  Vanilla:             3min17s

which is consistent with 'qemu-img convert' (slightly less overhead due
to some phases in installation is actually CPU bound). Still much faster
than other "sync-after-metadata" formats like VPC (vanilla VPC 7min43s).

The thing is he who needs to set up a new Linux system every day probably
have pre-installed images to start with, and others just don't install an
OS every day.

On Sat, May 9, 2015 at 2:39 PM Stefan Weil <sw@weilnetz.de> wrote:
> On 09.05.2015 05:59, phoeagon wrote:
> BTW, how do you usually measure the time to install a Linux distro
> within? Most distros ISOs do NOT have unattended installation ISOs in
> place. (True I can bake my own ISOs for this...) But do you have any
> ISOs made ready for this purpose?
>
> On Sat, May 9, 2015 at 11:54 AM phoeagon <phoeagon@gmail.com> wrote:
>
>> Thanks. Dbench does not logically allocate new disk space all the time,
>> because it's a FS level benchmark that creates file and deletes them.
>> Therefore it also depends on the guest FS, say, a btrfs guest FS
>> allocates about 1.8x space of that from EXT4, due to its COW nature. It
>> does cause the FS to allocate some space during about 1/3 of the test
>> duration I think. But this does not mitigate it too much because a FS
>> often writes in a stride rather than consecutively, which causes write
>> amplification at allocation times.
>>
>> So I tested it with qemu-img convert from a 400M raw file:
>> zheq-PC sdb # time ~/qemu-sync-test/bin/qemu-img convert -f raw -t unsafe -O vdi /run/shm/rand 1.vdi
>>
>> real    0m0.402s
>> user    0m0.206s
>> sys     0m0.202s
>> zheq-PC sdb # time ~/qemu-sync-test/bin/qemu-img convert -f raw -t writeback -O vdi /run/shm/rand 1.vdi
>
> I assume that the target file /run/shm/rand 1.vdi is not on a physical
> disk. Then flushing data will be fast. For real hard disks (not SSDs)
> the situation is different: the r/w heads of the hard disk have to move
> between data location and the beginning of the written file where the
> metadata is written, so I expect a larger effect there.
>
> For measuring installation time of an OS, I'd take a reproducible
> installation source (hard disk or DVD, no network connection) and take
> the time for those parts of the installation where many packets are
> installed without any user interaction. For Linux you won't need a stop
> watch, because the packet directories in /usr/share/doc have nice
> timestamps.
>
> Stefan
* Re: [Qemu-devel] [PATCH v4] block/vdi: Use bdrv_flush after metadata updates
  From: Paolo Bonzini @ 2015-05-10 15:01 UTC
  To: phoeagon, Stefan Weil, Kevin Wolf, Max Reitz; +Cc: qemu-devel, qemu-block

On 09/05/2015 05:54, phoeagon wrote:
> zheq-PC sdb # time ~/qemu-sync-test/bin/qemu-img convert -f raw -t writeback -O vdi /run/shm/rand 1.vdi
>
> real    0m8.678s
> user    0m0.169s
> sys     0m0.500s
>
> zheq-PC sdb # time qemu-img convert -f raw -t writeback -O vdi /run/shm/rand 1.vdi
>
> real    0m4.320s
> user    0m0.148s
> sys     0m0.471s

This means that 3.83 seconds are spent when bdrv_close() calls
bdrv_flush(). That's the only difference between writeback and unsafe
in qemu-img convert.

The remaining part of the time (4.85 seconds instead of 0.49 seconds)
means that, at least on your hardware, sequential writes to unallocated
space become 10 times slower with your patch.

Since the default qemu-img convert case isn't slowed down, I would
think that correctness trumps performance. Nevertheless, it's a huge
difference.

Paolo

> zheq-PC sdb # time qemu-img convert -f raw -t unsafe -O vdi /run/shm/rand 1.vdi
>
> real    0m0.489s
> user    0m0.173s
> sys     0m0.325s
* Re: [Qemu-devel] [PATCH v4] block/vdi: Use bdrv_flush after metadata updates
  From: Stefan Weil @ 2015-05-10 16:02 UTC
  To: Paolo Bonzini, phoeagon, Kevin Wolf, Max Reitz; +Cc: qemu-devel, qemu-block

On 10.05.2015 17:01, Paolo Bonzini wrote:
>
> On 09/05/2015 05:54, phoeagon wrote:
>> zheq-PC sdb # time ~/qemu-sync-test/bin/qemu-img convert -f raw -t writeback -O vdi /run/shm/rand 1.vdi
>>
>> real    0m8.678s
>> user    0m0.169s
>> sys     0m0.500s
>>
>> zheq-PC sdb # time qemu-img convert -f raw -t writeback -O vdi /run/shm/rand 1.vdi
>>
>> real    0m4.320s
>> user    0m0.148s
>> sys     0m0.471s
> This means that 3.83 seconds are spent when bdrv_close() calls
> bdrv_flush(). That's the only difference between writeback
> and unsafe in qemu-img convert.
>
> The remaining part of the time (4.85 seconds instead of 0.49
> seconds) means that, at least on your hardware, sequential writes
> to unallocated space become 10 times slower with your patch.
>
> Since the default qemu-img convert case isn't slowed down, I
> would think that correctness trumps performance. Nevertheless,
> it's a huge difference.
>
> Paolo

I doubt that the convert case isn't slowed down.

Writing to a tmpfs as it was obviously done for the test is not a
typical use case. With real hard disks I expect a significant slowdown.

Stefan
* Re: [Qemu-devel] [PATCH v4] block/vdi: Use bdrv_flush after metadata updates
  From: phoeagon @ 2015-05-10 16:05 UTC
  To: Stefan Weil, Paolo Bonzini, Kevin Wolf, Max Reitz; +Cc: qemu-devel, qemu-block

Just for clarity, I was not writing to tmpfs. I was READING from tmpfs.
I was writing to a path named 'sdb' (as you see in the prompt) which is
a btrfs on an SSD Drive. I don't have an HDD to test on though.

On Mon, May 11, 2015 at 12:02 AM Stefan Weil <sw@weilnetz.de> wrote:
> On 10.05.2015 17:01, Paolo Bonzini wrote:
> >
> > On 09/05/2015 05:54, phoeagon wrote:
> >> zheq-PC sdb # time ~/qemu-sync-test/bin/qemu-img convert -f raw -t writeback -O vdi /run/shm/rand 1.vdi
> >>
> >> real    0m8.678s
> >> user    0m0.169s
> >> sys     0m0.500s
> >>
> >> zheq-PC sdb # time qemu-img convert -f raw -t writeback -O vdi /run/shm/rand 1.vdi
> >>
> >> real    0m4.320s
> >> user    0m0.148s
> >> sys     0m0.471s
> > This means that 3.83 seconds are spent when bdrv_close() calls
> > bdrv_flush(). That's the only difference between writeback
> > and unsafe in qemu-img convert.
> >
> > The remaining part of the time (4.85 seconds instead of 0.49
> > seconds) means that, at least on your hardware, sequential writes
> > to unallocated space become 10 times slower with your patch.
> >
> > Since the default qemu-img convert case isn't slowed down, I
> > would think that correctness trumps performance. Nevertheless,
> > it's a huge difference.
> >
> > Paolo
>
> I doubt that the convert case isn't slowed down.
>
> Writing to a tmpfs as it was obviously done for the test is not a
> typical use case.
> With real hard disks I expect a significant slowdown.
>
> Stefan
* Re: [Qemu-devel] [PATCH v4] block/vdi: Use bdrv_flush after metadata updates
  From: Paolo Bonzini @ 2015-05-10 16:10 UTC
  To: Stefan Weil, phoeagon, Kevin Wolf, Max Reitz; +Cc: qemu-devel, qemu-block

On 10/05/2015 18:02, Stefan Weil wrote:
>> Since the default qemu-img convert case isn't slowed down, I
>> would think that correctness trumps performance. Nevertheless,
>> it's a huge difference.
>
> I doubt that the convert case isn't slowed down.

The *default* convert case isn't slowed down because "qemu-img convert"
defaults to the "unsafe" cache mode.

The *non-default* convert case with flushes was slowed down indeed: 2x
in total (if you include the final flush done by bdrv_close), and 10x if
you only consider the sequential write part of convert.

Paolo

> Writing to a tmpfs as it was obviously done for the test is not a
> typical use case.
> With real hard disks I expect a significant slowdown.
>
> Stefan
* Re: [Qemu-devel] [PATCH v4] block/vdi: Use bdrv_flush after metadata updates
  From: Stefan Weil @ 2015-05-10 16:26 UTC
  To: Paolo Bonzini, phoeagon, Kevin Wolf, Max Reitz; +Cc: qemu-devel, qemu-block

On 10.05.2015 18:10, Paolo Bonzini wrote:
> On 10/05/2015 18:02, Stefan Weil wrote:
>>> Since the default qemu-img convert case isn't slowed down, I
>>> would think that correctness trumps performance. Nevertheless,
>>> it's a huge difference.
>> I doubt that the convert case isn't slowed down.
> The *default* convert case isn't slowed down because "qemu-img convert"
> defaults to the "unsafe" cache mode.
>
> The *non-default* convert case with flushes was slowed down indeed: 2x
> in total (if you include the final flush done by bdrv_close), and 10x if
> you only consider the sequential write part of convert.
>
> Paolo

For those who might be interested:

The relevant GPL source code from VirtualBox is available here:

https://www.virtualbox.org/browser/vbox/trunk/src/VBox/Storage

If I interpret that code correctly, blocks are normally written
asynchronously, but changes of metadata (new block allocation) are
written synchronously.

See file VDI.cpp (function vdiBlockAllocUpdate) and VD.cpp
(vdIOIntWriteMeta).

Stefan
* Re: [Qemu-devel] [PATCH v4] block/vdi: Use bdrv_flush after metadata updates
  From: phoeagon @ 2015-05-10 17:14 UTC
  To: Stefan Weil, Paolo Bonzini, Kevin Wolf, Max Reitz; +Cc: qemu-devel, qemu-block

I'm not familiar with the VirtualBox code base, but looks like
"vdIOIntWriteMeta" can work both synchronously & asynchronously, and
"vdiBlockAllocUpdate" looks async to me. Frankly, skimming through the
code for 5 min doesn't enlighten me too much on its detailed
implementation, but looks like at least Virtualbox has VDI-repair that
fixes block leaks relatively easily.

I would agree that a more complete implementation on
VDI-check-and-repair might be better in this particular situation. I'm
not sure if there are other cases where flush after metadata update
might be better, but doesn't look like qemu-img auto repair is coming
to other writable image formats in the near future.

Also, I think that memory exhaustion and consequent page cache eviction
are not too uncommon on computers not originally designed to run VMs.
Many laptops are still trapped with 4GB memory and there seem to
widespread instructions on tuning down the swappiness to favor page
cache drops than swapping out memory, all of which adds to the odds of
metadata inconsistency.

On Mon, May 11, 2015 at 12:26 AM Stefan Weil <sw@weilnetz.de> wrote:
> On 10.05.2015 18:10, Paolo Bonzini wrote:
> > On 10/05/2015 18:02, Stefan Weil wrote:
> >>> Since the default qemu-img convert case isn't slowed down, I
> >>> would think that correctness trumps performance. Nevertheless,
> >>> it's a huge difference.
> >> I doubt that the convert case isn't slowed down.
> > The *default* convert case isn't slowed down because "qemu-img convert"
> > defaults to the "unsafe" cache mode.
> >
> > The *non-default* convert case with flushes was slowed down indeed: 2x
> > in total (if you include the final flush done by bdrv_close), and 10x if
> > you only consider the sequential write part of convert.
> >
> > Paolo
>
> For those who might be interested:
>
> The relevant GPL source code from VirtualBox is available here:
>
> https://www.virtualbox.org/browser/vbox/trunk/src/VBox/Storage
>
> If I interpret that code correctly, blocks are normally written
> asynchronously, but changes of metadata (new block allocation) are
> written synchronously.
>
> See file VDI.cpp (function vdiBlockAllocUpdate) and VD.cpp
> (vdIOIntWriteMeta).
>
> Stefan
* Re: [Qemu-devel] [Qemu-block] [PATCH v3] block/vdi: Use bdrv_flush after metadata updates
  From: Kevin Wolf @ 2015-05-08 10:43 UTC
  To: Zhe Qiu; +Cc: qemu-devel, qemu-block, mreitz

On 07.05.2015 06:04, Zhe Qiu wrote:
> From: phoeagon <phoeagon@gmail.com>
>
> In reference to b0ad5a455d7e5352d4c86ba945112011dbeadfb8~078a458e077d6b0db262c4b05fee51d01de2d1d2, metadata writes to qcow2/cow/qcow/vpc/vmdk are all synced prior to succeeding writes.

The justification for these patches (in 2010!) was more or less that we
didn't know whether the implementations were safe without the flushes.
Many of the flushes added back then have been removed again until today
because they have turned out to be unnecessary.

> Only when write is successful that bdrv_flush is called.
>
> Signed-off-by: Zhe Qiu <phoeagon@gmail.com>

Please describe why VDI needs this flush to be safe. This is best done
by describing a case where not doing the flush would lead to image
corruption in case of a crash.

To my knowledge, the VDI driver is completely safe without it.

Kevin
* Re: [Qemu-devel] [Qemu-block] [PATCH v3] block/vdi: Use bdrv_flush after metadata updates
  From: phoeagon @ 2015-05-08 11:50 UTC
  To: Kevin Wolf; +Cc: qemu-devel, qemu-block, mreitz

In case of correctness, lacking a sync here does not introduce data
corruption I can think of. But this reduces the volatile window during
which the metadata changes are NOT guaranteed on disk. Without a
barrier, in case of power loss you may end up with the bitmap changes
on disk and not the header block, or vice versa. Neither introduces
data corruption directly, but since VDI doesn't have proper fix
mechanism for qemu-img, once the leak is introduced you have to
"convert" to fix it, consuming a long time if the disk is large.

This patch does not fix the issue entirely, and it does not substitute
for proper check-and-fix implementation. But this should bring about
minor performance degradation (only 1 extra sync per allocation) but
greatly reduces the metadata inconsistency window.

On Fri, May 8, 2015 at 6:43 PM Kevin Wolf <kwolf@redhat.com> wrote:
> On 07.05.2015 06:04, Zhe Qiu wrote:
> > From: phoeagon <phoeagon@gmail.com>
> >
> > In reference to b0ad5a455d7e5352d4c86ba945112011dbeadfb8~078a458e077d6b0db262c4b05fee51d01de2d1d2, metadata writes to qcow2/cow/qcow/vpc/vmdk are all synced prior to succeeding writes.
>
> The justification for these patches (in 2010!) was more or less that we
> didn't know whether the implementations were safe without the flushes.
> Many of the flushes added back then have been removed again until today
> because they have turned out to be unnecessary.
>
> > Only when write is successful that bdrv_flush is called.
> >
> > Signed-off-by: Zhe Qiu <phoeagon@gmail.com>
>
> Please describe why VDI needs this flush to be safe. This is best done
> by describing a case where not doing the flush would lead to image
> corruption in case of a crash.
>
> To my knowledge, the VDI driver is completely safe without it.
>
> Kevin
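[The leak scenario described here can be observed from the host command
line; a rough illustration — not from the thread, file names are
placeholders. At the time of this discussion qemu-img could report leaked
VDI blocks but not repair them, which is why rebuilding the image via
convert is the workaround the mail refers to:

    qemu-img check leaky.vdi                         # reports leaked blocks, if any
    qemu-img convert -O vdi leaky.vdi repaired.vdi   # rebuild the image without the leaks
]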
* Re: [Qemu-devel] [Qemu-block] [PATCH v3] block/vdi: Use bdrv_flush after metadata updates
  From: Kevin Wolf @ 2015-05-08 12:02 UTC
  To: phoeagon; +Cc: qemu-devel, qemu-block, mreitz

On 08.05.2015 13:50, phoeagon wrote:
> In case of correctness, lacking a sync here does not introduce data
> corruption I can think of. But this reduces the volatile window during
> which the metadata changes are NOT guaranteed on disk. Without a
> barrier, in case of power loss you may end up with the bitmap changes
> on disk and not the header block, or vice versa. Neither introduces
> data corruption directly, but since VDI doesn't have proper fix
> mechanism for qemu-img, once the leak is introduced you have to
> "convert" to fix it, consuming a long time if the disk is large.

This is true. I'm not sure how big a problem this is in practice,
though.

> This patch does not fix the issue entirely, and it does not substitute
> for proper check-and-fix implementation. But this should bring about
> minor performance degradation (only 1 extra sync per allocation) but
> greatly reduces the metadata inconsistency window.

Did you benchmark this? From the past experience with flushes in qemu
block drivers, one sync per allocation certainly doesn't sound "minor".

What could possibly save us from the worst is that VDI has a relatively
large block size (or rather, that we don't support images with different
block sizes).

Kevin
* Re: [Qemu-devel] [Qemu-block] [PATCH v3] block/vdi: Use bdrv_flush after metadata updates
  From: phoeagon @ 2015-05-08 12:56 UTC
  To: Kevin Wolf; +Cc: qemu-devel, qemu-block, mreitz

I tested it on host-btrfs, guest-ext4, (connected to guest via
virtualized IDE) with 1G VDI image, testing with "dbench 10":

  synced:
    Writeback: 39.4852M/s   326.812ms
    Unsafe:    432.72M/s    1029.313ms
  normal:
    Writeback: 39.1884M/s   311.383ms
    Unsafe:    413.592M/s   280.951ms

Although I hear that dbench is not a good I/O benchmark (and iozone is
just a little too much hassle) but I'm pretty sure the difference, if
any, is within fluctuation in this case. Under my setting even a raw
"dd of=/dev/sdb" (from within a VM) doesn't have higher than 20M/s even
without this extra write barrier in the proposed patch.

So although what described above is not comprehensive (you can probably
extract the most overhead by deliberately O_DIRECT writing at 1M stride,
512K each, no application level sync in "writeback" cache mode of VDI,
what originally mostly resides in host page cache now must be flushed
to hard disk, probably in an inconvenient order if host FS highly
fragmented), I doubt consecutive raw writes cover several Megabytes or
guest-FS based workload would see much performance overhead after
introducing this extra sync.

On Fri, May 8, 2015 at 8:02 PM Kevin Wolf <kwolf@redhat.com> wrote:
> On 08.05.2015 13:50, phoeagon wrote:
> > In case of correctness, lacking a sync here does not introduce data
> > corruption I can think of. But this reduces the volatile window during
> > which the metadata changes are NOT guaranteed on disk. Without a
> > barrier, in case of power loss you may end up with the bitmap changes
> > on disk and not the header block, or vice versa. Neither introduces
> > data corruption directly, but since VDI doesn't have proper fix
> > mechanism for qemu-img, once the leak is introduced you have to
> > "convert" to fix it, consuming a long time if the disk is large.
>
> This is true. I'm not sure how big a problem this is in practice,
> though.
>
> > This patch does not fix the issue entirely, and it does not substitute
> > for proper check-and-fix implementation. But this should bring about
> > minor performance degradation (only 1 extra sync per allocation) but
> > greatly reduces the metadata inconsistency window.
>
> Did you benchmark this? From the past experience with flushes in qemu
> block drivers, one sync per allocation certainly doesn't sound "minor".
>
> What could possibly save us from the worst is that VDI has a relatively
> large block size (or rather, that we don't support images with different
> block sizes).
>
> Kevin
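[For reference, a minimal version of the in-guest measurement described
above might look like the following — not from the original mail; the
image path, drive options and mount point are assumptions:

    qemu-img create -f vdi test.vdi 1G
    # attach it to the guest as a second disk, e.g. with
    #   -drive file=test.vdi,format=vdi,if=ide,cache=writeback
    # then, inside the guest, after mkfs and mounting that disk:
    dbench -D /mnt/test 10

Comparing the throughput reported by dbench for a patched and an
unpatched qemu binary gives the "synced" vs. "normal" numbers above.]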
* Re: [Qemu-devel] [PATCH v3] block/vdi: Use bdrv_flush after metadata updates
  From: Max Reitz @ 2015-05-08 13:10 UTC
  To: Zhe Qiu, qemu-devel; +Cc: qemu-block

On 07.05.2015 06:04, Zhe Qiu wrote:
> From: phoeagon <phoeagon@gmail.com>
>
> In reference to b0ad5a455d7e5352d4c86ba945112011dbeadfb8~078a458e077d6b0db262c4b05fee51d01de2d1d2, metadata writes to qcow2/cow/qcow/vpc/vmdk are all synced prior to succeeding writes.
>
> Only when write is successful that bdrv_flush is called.
>
> Signed-off-by: Zhe Qiu <phoeagon@gmail.com>
> ---
>  block/vdi.c | 3 +++
>  1 file changed, 3 insertions(+)

Reviewed-by: Max Reitz <mreitz@redhat.com>

Thanks!

> diff --git a/block/vdi.c b/block/vdi.c
> index 7642ef3..dfe8ade 100644
> --- a/block/vdi.c
> +++ b/block/vdi.c
> @@ -713,6 +713,9 @@ static int vdi_co_write(BlockDriverState *bs,
>          logout("will write %u block map sectors starting from entry %u\n",
>                 n_sectors, bmap_first);
>          ret = bdrv_write(bs->file, offset, base, n_sectors);
> +        if (ret >= 0) {
> +            ret = bdrv_flush(bs->file);
> +        }
>      }
>
>      return ret;