Subject: Re: [Qemu-devel] [PATCH v4] block/vdi: Use bdrv_flush after metadata updates
From: phoeagon <phoeagon@gmail.com>
Date: Sat, 09 May 2015 03:54:45 +0000
To: Stefan Weil, Kevin Wolf, Max Reitz
Cc: qemu-devel@nongnu.org, qemu-block@nongnu.org

Thanks. Dbench does not logically allocate new disk space all the time,
because it is an FS-level benchmark that creates files and deletes them.
The result therefore also depends on the guest FS: a btrfs guest FS, for
example, allocates about 1.8x the space that ext4 does, due to its COW
nature. I think it does cause the FS to allocate some space during roughly
1/3 of the test duration. But that does not dilute the impact much, because
a FS often writes in strides rather than consecutively, which causes write
amplification at allocation time.

So I tested it with qemu-img convert from a 400M raw file (the
qemu-sync-test build is the one with this patch applied):

zheq-PC sdb # time ~/qemu-sync-test/bin/qemu-img convert -f raw -t unsafe -O vdi /run/shm/rand 1.vdi

real    0m0.402s
user    0m0.206s
sys     0m0.202s

zheq-PC sdb # time ~/qemu-sync-test/bin/qemu-img convert -f raw -t writeback -O vdi /run/shm/rand 1.vdi

real    0m8.678s
user    0m0.169s
sys     0m0.500s

zheq-PC sdb # time qemu-img convert -f raw -t writeback -O vdi /run/shm/rand 1.vdi

real    0m4.320s
user    0m0.148s
sys     0m0.471s

zheq-PC sdb # time qemu-img convert -f raw -t unsafe -O vdi /run/shm/rand 1.vdi

real    0m0.489s
user    0m0.173s
sys     0m0.325s

zheq-PC sdb # time qemu-img convert -f raw -O vdi /run/shm/rand 1.vdi

real    0m0.515s
user    0m0.168s
sys     0m0.357s

zheq-PC sdb # time ~/qemu-sync-test/bin/qemu-img convert -f raw -O vdi /run/shm/rand 1.vdi

real    0m0.431s
user    0m0.192s
sys     0m0.248s

Although 400M is not a giant file, it does show the trend. As you can see,
when there is heavy allocation and no extra buffering from a virtualized
host, throughput drops by about 50% (writeback: 8.7s patched vs. 4.3s
unpatched). But it still has no effect in "unsafe" mode, as predicted.
Also, I believe wanting to use a half-converted image is rarely the use
case, while host crashes and power loss are not so unimaginable. It looks
like qemu-img convert uses "unsafe" as its default as well, so even novice
"qemu-img convert" users are unlikely to notice any performance
degradation.

I have not yet tried a guest OS installation on top, but I guess a new flag
for a one-time faster OS installation is not likely to be useful, and
"cache=unsafe" already does the trick.
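To make it easier to see where that cost comes from, below is a small
stand-alone toy program (not QEMU code; the file layout, block size and
the "flush" argument are all made up for this illustration) that mimics
the I/O pattern the patch enforces, using plain POSIX calls: write a data
block, update a block-map entry near the start of the file, and optionally
sync before moving on. Timing a run with and without the per-allocation
sync shows the same kind of gap as the writeback numbers above.

/* flushcost.c -- toy model of "flush the block map after every allocation".
 * This is NOT QEMU code; it only mimics the I/O pattern under discussion
 * so the cost of the extra sync per allocation is easy to see.
 *
 * Build:  cc -O2 -o flushcost flushcost.c
 * Run:    time ./flushcost /path/to/scratch.img          (no flush)
 *         time ./flushcost /path/to/scratch.img flush    (flush per block)
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLOCK_SIZE  (1 << 20)   /* 1 MiB data blocks, like VDI's default */
#define NUM_BLOCKS  256         /* 256 MiB toy image */

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "toy.img";
    int do_flush = argc > 2 && strcmp(argv[2], "flush") == 0;

    int fd = open(path, O_CREAT | O_TRUNC | O_WRONLY, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    char *buf = calloc(1, BLOCK_SIZE);
    if (!buf) {
        perror("calloc");
        return 1;
    }

    for (uint32_t i = 0; i < NUM_BLOCKS; i++) {
        uint32_t bmap_entry = i;

        /* "Allocate" a new data block at the end of the image... */
        if (pwrite(fd, buf, BLOCK_SIZE, (off_t)BLOCK_SIZE * (i + 1)) < 0) {
            perror("pwrite data");
            return 1;
        }
        /* ...then update its block-map entry near the start of the file. */
        if (pwrite(fd, &bmap_entry, sizeof(bmap_entry),
                   (off_t)sizeof(bmap_entry) * i) < 0) {
            perror("pwrite bmap");
            return 1;
        }
        /* The proposed change adds the equivalent of this sync, so the map
         * entry is on stable storage before the write is reported done. */
        if (do_flush && fdatasync(fd) < 0) {
            perror("fdatasync");
            return 1;
        }
    }

    free(buf);
    close(fd);
    return 0;
}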
On Sat, May 9, 2015 at 5:26 AM Stefan Weil <sw@weilnetz.de> wrote:
> On 08.05.2015 at 15:55, Kevin Wolf wrote:
> > On 08.05.2015 at 15:14, Max Reitz wrote:
> >> On 07.05.2015 17:16, Zhe Qiu wrote:
> >>> In reference to b0ad5a45...078a458e, metadata writes to
> >>> qcow2/cow/qcow/vpc/vmdk are all synced prior to succeeding writes.
> >>>
> >>> bdrv_flush is called only when the write is successful.
> >>>
> >>> Signed-off-by: Zhe Qiu <phoeagon@gmail.com>
> >>> ---
> >>>  block/vdi.c | 3 +++
> >>>  1 file changed, 3 insertions(+)
> >> I missed Kevin's arguments before, but I think that adding this is
> >> more correct than not having it; and when thinking about speed, this
> >> is VDI, a format supported for compatibility.
> > If you use it only as a convert target, you probably care more about
> > speed than about leaks in case of a host crash.
> >
> >> So if we wanted to optimize it, we'd probably have to cache multiple
> >> allocations, do them at once and then flush afterwards (like the
> >> metadata cache we have in qcow2?)
> > That would defeat the purpose of this patch, which aims at having
> > metadata and data written out almost at the same time. On the other
> > hand, fully avoiding the problem instead of just making the window
> > smaller would require a journal, which VDI just doesn't have.
> >
> > I'm not convinced of this patch, but I'll defer to Stefan Weil as the
> > VDI maintainer.
> >
> > Kevin
>
> Thanks for asking. I share your concerns regarding reduced performance
> caused by bdrv_flush. Conversions to VDI will take longer (how much?),
> and installation of an OS on a new VDI disk image will also be slower,
> because those are the typical scenarios where the disk usage grows.
>
> @phoeagon: Did the benchmark which you used allocate additional disk
> storage? If not, or if it only allocated once and then spent some time
> on already allocated blocks, that benchmark was not valid for this case.
>
> On the other hand, I don't see a need for the flushing, because the kind
> of failures (power failure) and their consequences seem to be acceptable
> for typical VDI usage, namely either image conversion or tests with
> existing images.
>
> That's why I'd prefer not to use bdrv_flush here. Could we make
> bdrv_flush optional (either generally or for cases like this one) so
> that both people who prefer speed and people who would want
> bdrv_flush to decrease the likelihood of inconsistencies can be
> satisfied?
>
> Stefan