Subject: Re: [Qemu-devel] [PATCH v4] block/vdi: Use bdrv_flush after metadata updates
From: phoeagon <phoeagon@gmail.com>
Date: Sat, 09 May 2015 03:54:45 +0000
To: Stefan Weil, Kevin Wolf, Max Reitz
Cc: qemu-devel@nongnu.org, qemu-block@nongnu.org

Thanks. Dbench does not logically allocate new disk space all the time,
because it is an FS-level benchmark that creates files and deletes them.
The result therefore also depends on the guest FS: a btrfs guest FS, for
example, allocates about 1.8x the space that ext4 does, due to its COW
nature. I think it does cause the FS to allocate some space during roughly
1/3 of the test duration. But that does not dilute the impact much, because
a FS often writes in strides rather than consecutively, which causes write
amplification at allocation time.

So I tested it with qemu-img convert from a 400M raw file (the
qemu-sync-test build is the one with this patch applied):

zheq-PC sdb # time ~/qemu-sync-test/bin/qemu-img convert -f raw -t unsafe -O vdi /run/shm/rand 1.vdi

real    0m0.402s
user    0m0.206s
sys     0m0.202s

zheq-PC sdb # time ~/qemu-sync-test/bin/qemu-img convert -f raw -t writeback -O vdi /run/shm/rand 1.vdi

real    0m8.678s
user    0m0.169s
sys     0m0.500s

zheq-PC sdb # time qemu-img convert -f raw -t writeback -O vdi /run/shm/rand 1.vdi

real    0m4.320s
user    0m0.148s
sys     0m0.471s

zheq-PC sdb # time qemu-img convert -f raw -t unsafe -O vdi /run/shm/rand 1.vdi

real    0m0.489s
user    0m0.173s
sys     0m0.325s

zheq-PC sdb # time qemu-img convert -f raw -O vdi /run/shm/rand 1.vdi

real    0m0.515s
user    0m0.168s
sys     0m0.357s

zheq-PC sdb # time ~/qemu-sync-test/bin/qemu-img convert -f raw -O vdi /run/shm/rand 1.vdi

real    0m0.431s
user    0m0.192s
sys     0m0.248s

Although 400M is not a giant file, it does show the trend. As you can see,
when there is heavy allocation and no extra buffering from a virtualized
host, throughput drops by about 50% (writeback: 8.7s patched vs. 4.3s
unpatched). But it still has no effect in "unsafe" mode, as predicted.
Also, I believe wanting to use a half-converted image is rarely the use
case, while host crashes and power loss are not so unimaginable. It looks
like qemu-img convert uses "unsafe" as its default as well, so even novice
"qemu-img convert" users are unlikely to notice any performance
degradation.

I have not yet tried a guest OS installation on top, but I guess a new flag
for a one-time faster OS installation is not likely to be useful, and
"cache=unsafe" already does the trick.
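To make it easier to see where that cost comes from, below is a small
stand-alone toy program (not QEMU code; the file layout, block size and
the "flush" argument are all made up for this illustration) that mimics
the I/O pattern the patch enforces, using plain POSIX calls: write a data
block, update a block-map entry near the start of the file, and optionally
sync before moving on. Timing a run with and without the per-allocation
sync shows the same kind of gap as the writeback numbers above.

/* flushcost.c -- toy model of "flush the block map after every allocation".
 * This is NOT QEMU code; it only mimics the I/O pattern under discussion
 * so the cost of the extra sync per allocation is easy to see.
 *
 * Build:  cc -O2 -o flushcost flushcost.c
 * Run:    time ./flushcost /path/to/scratch.img          (no flush)
 *         time ./flushcost /path/to/scratch.img flush    (flush per block)
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLOCK_SIZE  (1 << 20)   /* 1 MiB data blocks, like VDI's default */
#define NUM_BLOCKS  256         /* 256 MiB toy image */

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "toy.img";
    int do_flush = argc > 2 && strcmp(argv[2], "flush") == 0;

    int fd = open(path, O_CREAT | O_TRUNC | O_WRONLY, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    char *buf = calloc(1, BLOCK_SIZE);
    if (!buf) {
        perror("calloc");
        return 1;
    }

    for (uint32_t i = 0; i < NUM_BLOCKS; i++) {
        uint32_t bmap_entry = i;

        /* "Allocate" a new data block at the end of the image... */
        if (pwrite(fd, buf, BLOCK_SIZE, (off_t)BLOCK_SIZE * (i + 1)) < 0) {
            perror("pwrite data");
            return 1;
        }
        /* ...then update its block-map entry near the start of the file. */
        if (pwrite(fd, &bmap_entry, sizeof(bmap_entry),
                   (off_t)sizeof(bmap_entry) * i) < 0) {
            perror("pwrite bmap");
            return 1;
        }
        /* The proposed change adds the equivalent of this sync, so the map
         * entry is on stable storage before the write is reported done. */
        if (do_flush && fdatasync(fd) < 0) {
            perror("fdatasync");
            return 1;
        }
    }

    free(buf);
    close(fd);
    return 0;
}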
On Sat, May 9, 2015 at 5:26 AM Stefan Weil <sw@weilnetz.de> wrote:
> On 08.05.2015 at 15:55, Kevin Wolf wrote:
> > On 08.05.2015 at 15:14, Max Reitz wrote:
> >> On 07.05.2015 17:16, Zhe Qiu wrote:
> >>> In reference to b0ad5a45...078a458e, metadata writes to
> >>> qcow2/cow/qcow/vpc/vmdk are all synced prior to succeeding writes.
> >>>
> >>> bdrv_flush is called only when the write is successful.
> >>>
> >>> Signed-off-by: Zhe Qiu <phoeagon@gmail.com>
> >>> ---
> >>>  block/vdi.c | 3 +++
> >>>  1 file changed, 3 insertions(+)
> >> I missed Kevin's arguments before, but I think that adding this is
> >> more correct than not having it; and when thinking about speed, this
> >> is VDI, a format supported for compatibility.
> > If you use it only as a convert target, you probably care more about
> > speed than about leaks in case of a host crash.
> >
> >> So if we wanted to optimize it, we'd probably have to cache multiple
> >> allocations, do them at once and then flush afterwards (like the
> >> metadata cache we have in qcow2?)
> > That would defeat the purpose of this patch, which aims at having
> > metadata and data written out almost at the same time. On the other
> > hand, fully avoiding the problem instead of just making the window
> > smaller would require a journal, which VDI just doesn't have.
> >
> > I'm not convinced of this patch, but I'll defer to Stefan Weil as the
> > VDI maintainer.
> >
> > Kevin
>
> Thanks for asking. I share your concerns regarding reduced performance
> caused by bdrv_flush. Conversions to VDI will take longer (how much?),
> and installation of an OS on a new VDI disk image will also be slower,
> because those are the typical scenarios where the disk usage grows.
>
> @phoeagon: Did the benchmark which you used allocate additional disk
> storage? If not, or if it only allocated once and then spent some time
> on already allocated blocks, that benchmark was not valid for this case.
>
> On the other hand, I don't see a need for the flushing, because the kind
> of failures (power failure) and their consequences seem to be acceptable
> for typical VDI usage, namely either image conversion or tests with
> existing images.
>
> That's why I'd prefer not to use bdrv_flush here. Could we make
> bdrv_flush optional (either generally or for cases like this one) so
> that both people who prefer speed and people who would want
> bdrv_flush to decrease the likelihood of inconsistencies can be
> satisfied?
>
> Stefan