qemu-devel.nongnu.org archive mirror
* [Qemu-devel] Effect of qemu-img convert -m and -W options
@ 2017-11-15 11:52 Richard W.M. Jones
  2017-11-16 14:47 ` Stefan Hajnoczi
  0 siblings, 1 reply; 9+ messages in thread
From: Richard W.M. Jones @ 2017-11-15 11:52 UTC
  To: libguestfs; +Cc: qemu-devel, pl

[CC to qemu-devel since I'm obviously doing something wrong here,
I'm just not sure what.]

I was getting ready to add multiple threads to ‘qemu-img convert’ (the
longest part of v2v conversions) when I noticed that it had them
already!  (To be fair this was only added in February this year so no
wonder we didn't notice.)

To enable parallel convert we would need to use the ‘qemu-img convert
-W’ option (which allows out-of-order writes to the destination) and
‘qemu-img convert -m <#num-coroutines>’ option to select the
parallelism (defaults to 8).  The documentation refers to coroutines
but I verified from strace that it is using real threads.

I did some testing to see what effect this has.  For this I used a
large guest image which is approximately a third full of random data
(the rest being sparsely allocated):

  Source format:       raw
  Source virtual size: 100 GB
  Source used space:   31 GB  
  Target format:       qcow2
  Version:             qemu-img-2.10.0-7.fc28.x86_64
  Conversion command:
    rm -f /to/target
    time qemu-img convert [-W] [-m ##] -f raw source.img -O qcow2 /to/target

Source and target are regular files on two different disks.  The test
machine is a Xeon E5 with 16 real cores.
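
As a rough illustration of the test matrix (this is not the script
used for the measurements), the following Python sketch builds one
command line per table cell.  The function names and defaults are
hypothetical; only the qemu-img flags are taken from the conversion
commands in this message.

```python
def build_convert_command(source, target, coroutines=None,
                          out_of_order=False, reuse_target=False):
    """Return the argv for one 'qemu-img convert' run (raw -> qcow2)."""
    cmd = ["qemu-img", "convert"]
    if reuse_target:
        cmd.append("-n")            # write into the existing target file
    if out_of_order:
        cmd.append("-W")            # allow out-of-order writes
    if coroutines is not None:
        cmd += ["-m", str(coroutines)]
    cmd += ["-f", "raw", source, "-O", "qcow2", target]
    return cmd


def benchmark_matrix(source, target, coroutine_counts=(1, 4, 8, 16),
                     reuse_target=False):
    """Yield (coroutines, out_of_order, argv) for every tested combination."""
    for m in coroutine_counts:
        for out_of_order in (False, True):
            if out_of_order and m == 1:
                continue            # '-m 1 -W' was not measured
            yield m, out_of_order, build_convert_command(
                source, target, m, out_of_order, reuse_target)
```

Each argv could then be timed by passing it to subprocess.run(),
removing the target first in the non-preallocated case.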

----------------------------------------------------------------------
                  Non-preallocated output
                  (times are in seconds)
                   without -W       -W

-m 1                153              -

-m 4                155             157

-m 8 [default]      158             231

-m 16 [max]         166             166
----------------------------------------------------------------------

The documentation for ‘-W’ notes that this is only recommended for
preallocated outputs (which the test above does not use).  So let's
try using a preallocated qcow2 output.

  Conversion command:
    # the same target file is reused each time
    time qemu-img convert -n [-W] [-m ##] -f raw source.img -O qcow2 /to/target

----------------------------------------------------------------------
                   Preallocated output
                  (times are in seconds)
                   without -W       -W

-m 1                147              -

-m 4                146             145

-m 8 [default]      146             199

-m 16 [max]         147             146
----------------------------------------------------------------------

Based on this there seems to be some issue with the ‘-W’ option.  I
even thought I might have it backwards, but checking the code, ‘-W’
does seem to enable (rather than disable) out-of-order writes.  There
also seems to be some bizarre interaction between ‘-W’ and ‘-m 8’:
that combination is markedly slower than any other in both tables.
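
To put numbers on that, here is a small Python sketch that computes
the ratio of the ‘-W’ time to the time without ‘-W’ for each ‘-m’
value.  The times are copied from the two tables; the names are
hypothetical.

```python
# Seconds, copied from the tables: {-m value: (without -W, with -W)}.
# None marks the '-m 1 -W' cell, which was not measured.
NON_PREALLOCATED = {1: (153, None), 4: (155, 157), 8: (158, 231), 16: (166, 166)}
PREALLOCATED = {1: (147, None), 4: (146, 145), 8: (146, 199), 16: (147, 146)}


def w_slowdown(table):
    """Map each -m value to (time with -W) / (time without -W)."""
    return {
        m: with_w / without_w
        for m, (without_w, with_w) in table.items()
        if with_w is not None
    }
```

For ‘-m 8’ the ratio is about 1.46 (non-preallocated) and 1.36
(preallocated); every other measured ratio is within about 1.5% of 1.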

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
libguestfs lets you edit virtual machines.  Supports shell scripting,
bindings from many languages.  http://libguestfs.org


* Re: [Qemu-devel] Effect of qemu-img convert -m and -W options
  2017-11-15 11:52 [Qemu-devel] Effect of qemu-img convert -m and -W options Richard W.M. Jones
@ 2017-11-16 14:47 ` Stefan Hajnoczi
  2017-11-16 14:51   ` Peter Lieven
  2017-11-16 15:10   ` Richard W.M. Jones
  0 siblings, 2 replies; 9+ messages in thread
From: Stefan Hajnoczi @ 2017-11-16 14:47 UTC
  To: Richard W.M. Jones; +Cc: libguestfs, pl, qemu-devel


On Wed, Nov 15, 2017 at 11:52:46AM +0000, Richard W.M. Jones wrote:
> [CC to qemu-devel since I'm obviously doing something wrong here,
> I'm just not sure what.]
> 
> I was getting ready to add multiple threads to ‘qemu-img convert’ (the
> longest part of v2v conversions) when I noticed that it had them
> already!  (To be fair this was only added in February this year so no
> wonder we didn't notice.)
> 
> To enable parallel convert we would need to use the ‘qemu-img convert
> -W’ option (which allows out-of-order writes to the destination) and
> ‘qemu-img convert -m <#num-coroutines>’ option to select the
> parallelism (defaults to 8).  The documentation refers to coroutines
> but I verified from strace that it is using real threads.

The threads you observed are the thread pool that performs
preadv(2)/pwritev(2) syscalls.  The Linux AIO API could be used instead
and does not use threads for read and write operations.  So these
threads are just an implementation detail.  The caller doing the reads
and writes is not multi-threaded but a number of coroutines executing in
a single thread.

The qemu-img convert logic runs in coroutines from just one main loop
thread in qemu-img.c:convert_do_copy():

  for (i = 0; i < s->num_coroutines; i++) {
      s->co[i] = qemu_coroutine_create(convert_co_do_copy, s);
      s->wait_sector_num[i] = -1;
      qemu_coroutine_enter(s->co[i]);
  }

  while (s->running_coroutines) {
      main_loop_wait(false);
  }
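
The same structure can be imitated outside QEMU.  The following is a
minimal Python sketch, not QEMU code: several asyncio coroutines run
in one event-loop thread, while the blocking pread/pwrite calls are
handed to a thread pool.  CHUNK, the pool size and all function names
are assumptions made for illustration, and os.pread()/os.pwrite() are
only available on Unix.

```python
import asyncio
import os
import threading
from collections import deque
from concurrent.futures import ThreadPoolExecutor

CHUNK = 64 * 1024  # hypothetical request size, not qemu-img's actual value


def _pread(fd, length, offset, io_threads):
    io_threads.add(threading.get_ident())  # record which thread ran the syscall
    return os.pread(fd, length, offset)


def _pwrite(fd, data, offset, io_threads):
    io_threads.add(threading.get_ident())
    return os.pwrite(fd, data, offset)  # short writes are ignored for brevity


async def _copy_worker(pool, src_fd, dst_fd, offsets, co_threads, io_threads):
    loop = asyncio.get_running_loop()
    co_threads.add(threading.get_ident())  # thread the coroutine itself runs in
    while offsets:
        offset = offsets.popleft()
        data = await loop.run_in_executor(
            pool, _pread, src_fd, CHUNK, offset, io_threads)
        await loop.run_in_executor(
            pool, _pwrite, dst_fd, data, offset, io_threads)


def copy_file(src_path, dst_path, num_coroutines=8, pool_size=4):
    """Copy src to dst; return (coroutine thread ids, I/O thread ids)."""
    offsets = deque(range(0, os.path.getsize(src_path), CHUNK))
    co_threads, io_threads = set(), set()
    src_fd = os.open(src_path, os.O_RDONLY)
    dst_fd = os.open(dst_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)

    async def main():
        with ThreadPoolExecutor(max_workers=pool_size) as pool:
            await asyncio.gather(*(
                _copy_worker(pool, src_fd, dst_fd, offsets,
                             co_threads, io_threads)
                for _ in range(num_coroutines)))

    try:
        asyncio.run(main())
    finally:
        os.close(src_fd)
        os.close(dst_fd)
    return co_threads, io_threads
```

All coroutines report the same thread id, while the syscalls show up
under the pool's thread ids, which is the pattern strace would show
as multiple threads.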

> 
> I did some testing to see what effect this has.  For this I used a
> large guest image which is approximately a third full of random data
> (the rest being sparsely allocated):
> 
>   Source format:       raw
>   Source virtual size: 100 GB
>   Source used space:   31 GB  
>   Target format:       qcow2
>   Version:             qemu-img-2.10.0-7.fc28.x86_64
>   Conversion command:
>     rm -f /to/target
>     time qemu-img convert [-W] [-m ##] -f raw source.img -O qcow2 /to/target
> 
> Source and target are regular files on two different disks.  The test
> machine is a Xeon E5 with 16 real cores.
> 
> ----------------------------------------------------------------------
>                   Non-preallocated output
>                   (times are in seconds)
>                    without -W       -W
> 
> -m 1                153              -
> 
> -m 4                155             157
> 
> -m 8 [default]      158             231
> 
> -m 16 [max]         166             166
> ----------------------------------------------------------------------
> 
> The documentation for ‘-W’ notes that this is only recommended for
> preallocated outputs (which the test above does not use).  So let's
> try using a preallocated qcow2 output.
> 
>   Conversion command:
>     # the same target file is reused each time
>     time qemu-img convert -n [-W] [-m ##] -f raw source.img -O qcow2 /to/target
> 
> ----------------------------------------------------------------------
>                    Preallocated output
>                   (times are in seconds)
>                    without -W       -W
> 
> -m 1                147              -
> 
> -m 4                146             145
> 
> -m 8 [default]      146             199
> 
> -m 16 [max]         147             146
> ----------------------------------------------------------------------
> 
> Based on this there seems to be some issue with the ‘-W’ option.  I
> even thought I might have it backwards, but checking the code, ‘-W’
> does seem to enable (rather than disable) out-of-order writes.  There
> also seems to be some bizarre interaction between ‘-W’ and ‘-m 8’:
> that combination is markedly slower than any other in both tables.

Interesting.  Did you perform multiple runs of each setting to verify
that the benchmark results are stable with little volatility?
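
One way to check stability is to repeat each setting a few times and
look at the spread.  The following Python sketch shows the idea; the
function names and the choice of statistics are assumptions, not
something taken from an existing tool.

```python
import statistics
import time


def time_runs(action, repeats=3):
    """Call action() 'repeats' times and return the wall-clock durations."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        action()
        samples.append(time.perf_counter() - start)
    return samples


def summarize(samples):
    """Return mean, sample standard deviation and coefficient of variation."""
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples) if len(samples) > 1 else 0.0
    return {"mean": mean, "stdev": stdev, "cv": stdev / mean if mean else 0.0}
```

Here action would be a callable that runs one qemu-img convert
command, for example through subprocess.run().  A coefficient of
variation of a few percent or less would suggest the numbers are
stable enough to compare.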

Which command-line did you use to create the preallocated qcow2 file?

Are the source and target files on the same file system and host block
device?  The benefit of using multiple coroutines depends on the
performance characteristics of the source and target files.

I don't think this qemu-img convert mode has been heavily tested, so it
wouldn't be surprising if you encounter unexpected behavior.  Hopefully
it can be fixed to get even better performance.

Stefan



* Re: [Qemu-devel] Effect of qemu-img convert -m and -W options
  2017-11-16 14:47 ` Stefan Hajnoczi
@ 2017-11-16 14:51   ` Peter Lieven
  2017-11-16 15:12     ` Richard W.M. Jones
  2017-11-16 15:10   ` Richard W.M. Jones
  1 sibling, 1 reply; 9+ messages in thread
From: Peter Lieven @ 2017-11-16 14:51 UTC
  To: Stefan Hajnoczi, Richard W.M. Jones; +Cc: libguestfs, qemu-devel

On 16.11.2017 at 15:47, Stefan Hajnoczi wrote:
> Interesting.  Did you perform multiple runs of each setting to verify
> that the benchmark results are stable with little volatility?
>
> Which command-line did you use to create the preallocated qcow2 file?
>
> Are the source and target files on the same file system and host block
> device?  The benefit of using multiple coroutines depends on the
> performance characteristics of the source and target files.
>
> I don't think this qemu-img convert mode has been heavily tested, so it
> wouldn't be surprising if you encounter unexpected behavior.  Hopefully
> it can be fixed to get even better performance.

The scenario in which I tested this patch was reading a QCOW2 image
over NFS (with libnfs) and writing to a RAW iSCSI target (with
libiscsi), neither of which has any cache on the local host.

AFAIK all writes to the same QCOW2 image serialize because of the
s->lock that is held during the write.  So it's not surprising that
there is no benefit from multiple threads as long as reading from the
RAW file involves no delay, which is likely thanks to OS readahead.

Peter


* Re: [Qemu-devel] Effect of qemu-img convert -m and -W options
  2017-11-16 14:47 ` Stefan Hajnoczi
  2017-11-16 14:51   ` Peter Lieven
@ 2017-11-16 15:10   ` Richard W.M. Jones
  2017-11-16 17:30     ` Stefan Hajnoczi
  1 sibling, 1 reply; 9+ messages in thread
From: Richard W.M. Jones @ 2017-11-16 15:10 UTC
  To: Stefan Hajnoczi; +Cc: libguestfs, pl, qemu-devel

On Thu, Nov 16, 2017 at 02:47:46PM +0000, Stefan Hajnoczi wrote:
> The threads you observed are the thread pool that performs
> preadv(2)/pwritev(2) syscalls.  The Linux AIO API could be used instead
> and does not use threads for read and write operations.

I guess if I used AIO then I wouldn't get any parallelism at all since
Linux doesn't block on local file access (at least, it never used to)?

> Interesting.  Did you perform multiple runs of each setting to verify
> that the benchmark results are stable with little volatility?

I retested the -m 8 cases, both with and without -W, because those
results were so unexpected, and they are repeatable.

> Which command-line did you use to create the preallocated qcow2 file?

What I actually did was qemu-img convert -n into the existing qcow2
file, so there was no separate command for that.

> Are the source and target files on the same file system and host block
> device?  The benefit of using multiple coroutines depends on the
> performance characteristics of the source and target files.

Both local filesystems, but on different SATA devices.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-builder quickly builds VMs from scratch
http://libguestfs.org/virt-builder.1.html


* Re: [Qemu-devel] Effect of qemu-img convert -m and -W options
  2017-11-16 14:51   ` Peter Lieven
@ 2017-11-16 15:12     ` Richard W.M. Jones
  2017-11-16 17:20       ` Stefan Hajnoczi
  0 siblings, 1 reply; 9+ messages in thread
From: Richard W.M. Jones @ 2017-11-16 15:12 UTC
  To: Peter Lieven; +Cc: Stefan Hajnoczi, libguestfs, qemu-devel

On Thu, Nov 16, 2017 at 03:51:29PM +0100, Peter Lieven wrote:
> AFAIK all writes to the same QCOW2 serialize because of the s->lock that
> is held during the write. So it's not surprising that there is no benefit from
> multiple threads as long as reading from the RAW file involves no delay,
> which is likely due to readahead of the OS.

I guess this would explain it, thanks.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-df lists disk usage of guests without needing to install any
software inside the virtual machine.  Supports Linux and Windows.
http://people.redhat.com/~rjones/virt-df/


* Re: [Qemu-devel] Effect of qemu-img convert -m and -W options
  2017-11-16 15:12     ` Richard W.M. Jones
@ 2017-11-16 17:20       ` Stefan Hajnoczi
  0 siblings, 0 replies; 9+ messages in thread
From: Stefan Hajnoczi @ 2017-11-16 17:20 UTC
  To: Richard W.M. Jones; +Cc: Peter Lieven, libguestfs, qemu-devel

On Thu, Nov 16, 2017 at 3:12 PM, Richard W.M. Jones <rjones@redhat.com> wrote:
> On Thu, Nov 16, 2017 at 03:51:29PM +0100, Peter Lieven wrote:
>> AFAIK all writes to the same QCOW2 serialize because of the s->lock that
>> is held during the write. So it's not surprising that there is no benefit from
>> multiple threads as long as reading from the RAW file involves no delay,
>> which is likely due to readahead of the OS.
>
> I guess this would explain it, thanks.

The theory is false for preallocated qcow2 image files.  Parallel
writes do occur in block/qcow2.c:qcow2_co_pwritev() if the cluster is
already allocated:

            /* s->lock is dropped here, so data writes proceed in parallel */
            qemu_co_mutex_unlock(&s->lock);
            BLKDBG_EVENT(bs->file, BLKDBG_WRITE_AIO);
            trace_qcow2_writev_data(qemu_coroutine_self(),
                                    cluster_offset + offset_in_cluster);
            ret = bdrv_co_pwritev(bs->file,
                                  cluster_offset + offset_in_cluster,
                                  cur_bytes, &hd_qiov, 0);
            qemu_co_mutex_lock(&s->lock);
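
The effect of dropping the lock around the data write can be shown
with a toy model.  This Python sketch is not QEMU code: an
asyncio.Lock stands in for s->lock, asyncio.sleep() stands in for
bdrv_co_pwritev(), and all names and the delay value are hypothetical.
It counts how many simulated writes are in flight at the same time.

```python
import asyncio


async def _write_request(lock, stats, drop_lock_for_io, io_delay):
    await lock.acquire()            # metadata lookup happens under the lock
    if drop_lock_for_io:
        lock.release()              # allocated cluster: unlock around the I/O
    stats["in_flight"] += 1
    stats["max_in_flight"] = max(stats["max_in_flight"], stats["in_flight"])
    await asyncio.sleep(io_delay)   # simulated data write
    stats["in_flight"] -= 1
    if drop_lock_for_io:
        await lock.acquire()        # retake the lock after the write
    lock.release()


def max_parallel_writes(num_coroutines, drop_lock_for_io, io_delay=0.01):
    """Return the peak number of simulated writes in flight at once."""
    async def main():
        lock = asyncio.Lock()
        stats = {"in_flight": 0, "max_in_flight": 0}
        await asyncio.gather(*(
            _write_request(lock, stats, drop_lock_for_io, io_delay)
            for _ in range(num_coroutines)))
        return stats["max_in_flight"]

    return asyncio.run(main())
```

With the lock held across the write, at most one request is ever in
flight; with the lock dropped, all coroutines can have a write
outstanding at once.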

Stefan


* Re: [Qemu-devel] Effect of qemu-img convert -m and -W options
  2017-11-16 15:10   ` Richard W.M. Jones
@ 2017-11-16 17:30     ` Stefan Hajnoczi
  2017-11-16 18:00       ` Richard W.M. Jones
  0 siblings, 1 reply; 9+ messages in thread
From: Stefan Hajnoczi @ 2017-11-16 17:30 UTC
  To: Richard W.M. Jones; +Cc: libguestfs, Peter Lieven, qemu-devel

On Thu, Nov 16, 2017 at 3:10 PM, Richard W.M. Jones <rjones@redhat.com> wrote:
> On Thu, Nov 16, 2017 at 02:47:46PM +0000, Stefan Hajnoczi wrote:
>> The threads you observed are the thread pool that performs
>> preadv(2)/pwritev(2) syscalls.  The Linux AIO API could be used instead
>> and does not use threads for read and write operations.
>
> I guess if I used AIO then I wouldn't get any parallelism at all since
> Linux doesn't block on local file access (at least, it never used to)?

Even assuming there is enough free page cache, file systems can
definitely block for metadata updates (e.g. space allocation as a file
grows).  I don't think it's possible to assume that pwritev(2) doesn't
block.

>> Are the source and target files on the same file system and host block
>> device?  The benefit of using multiple coroutines depends on the
>> performance characteristics of the source and target files.
>
> Both local filesystems, but on different SATA devices.

Okay.  I'm curious what the strace -f output looks like (only the
preadv(2)/pwritev(2) syscalls are interesting at the moment).

Stefan


* Re: [Qemu-devel] Effect of qemu-img convert -m and -W options
  2017-11-16 17:30     ` Stefan Hajnoczi
@ 2017-11-16 18:00       ` Richard W.M. Jones
  2017-11-20 15:57         ` Stefan Hajnoczi
  0 siblings, 1 reply; 9+ messages in thread
From: Richard W.M. Jones @ 2017-11-16 18:00 UTC
  To: Stefan Hajnoczi; +Cc: libguestfs, Peter Lieven, qemu-devel

On Thu, Nov 16, 2017 at 05:30:48PM +0000, Stefan Hajnoczi wrote:
> On Thu, Nov 16, 2017 at 3:10 PM, Richard W.M. Jones <rjones@redhat.com> wrote:
> > Both local filesystems, but on different SATA devices.
> 
> Okay.  I'm curious what the strace -f output looks like (only the
> preadv(2)/pwritev(2) syscalls are interesting at the moment).

Here's the full strace from one run:

  http://oirase.annexia.org/tmp/strace.xz

The command is (with the output.qcow2 file created in a previous run):

  $ strace -o /tmp/strace -f -s 1024 qemu-img convert -f raw fedora-27.img -O qcow2 /var/tmp/output.qcow2 -m 8 -W -n

The version of qemu is: qemu-2.10.0-7.fc28.x86_64

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-builder quickly builds VMs from scratch
http://libguestfs.org/virt-builder.1.html


* Re: [Qemu-devel] Effect of qemu-img convert -m and -W options
  2017-11-16 18:00       ` Richard W.M. Jones
@ 2017-11-20 15:57         ` Stefan Hajnoczi
  0 siblings, 0 replies; 9+ messages in thread
From: Stefan Hajnoczi @ 2017-11-20 15:57 UTC
  To: Richard W.M. Jones; +Cc: libguestfs, Peter Lieven, qemu-devel


On Thu, Nov 16, 2017 at 06:00:54PM +0000, Richard W.M. Jones wrote:
> On Thu, Nov 16, 2017 at 05:30:48PM +0000, Stefan Hajnoczi wrote:
> > On Thu, Nov 16, 2017 at 3:10 PM, Richard W.M. Jones <rjones@redhat.com> wrote:
> > > Both local filesystems, but on different SATA devices.
> > 
> > Okay.  I'm curious what the strace -f output looks like (only the
> > preadv(2)/pwritev(2) syscalls are interesting at the moment).
> 
> Here's the full strace from one run:
> 
>   http://oirase.annexia.org/tmp/strace.xz
> 
> The command is (with the output.qcow2 file created in a previous run):
> 
>   $ strace -o /tmp/strace -f -s 1024 qemu-img convert -f raw fedora-27.img -O qcow2 /var/tmp/output.qcow2 -m 8 -W -n
> 
> The version of qemu is: qemu-2.10.0-7.fc28.x86_64

Requests are being handled in parallel and the exact order is random due
to I/O completion times and thread scheduling.

I didn't see an obvious pattern to optimize but only spent a little time
studying the strace.
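
For anyone who wants to look further, a quick way to summarize such a
trace is to count the I/O syscalls per thread.  The Python sketch
below assumes the usual 'strace -f -o FILE' layout, where each line
starts with the thread's PID followed by the syscall name; the
function name and the default syscall list are hypothetical.

```python
import re
from collections import Counter

# A line that starts a syscall looks like "PID name(args...".  Lines of
# the form "PID <... name resumed> ..." do not match, so each call is
# counted once, where it begins.
_CALL_RE = re.compile(r"^(\d+)\s+(\w+)\(")


def count_io_syscalls(lines, syscalls=("preadv", "pwritev")):
    """Return a Counter keyed by (pid, syscall name)."""
    counts = Counter()
    for line in lines:
        match = _CALL_RE.match(line)
        if match and match.group(2) in syscalls:
            counts[(int(match.group(1)), match.group(2))] += 1
    return counts
```

The compressed trace can be read directly with
lzma.open(path, "rt", errors="replace") and passed in as lines.  If
the build issues pread64/pwrite64 instead, those names can be passed
through the syscalls argument.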

Stefan


