* [Qemu-devel] Performance about x-data-plane
@ 2016-12-22 6:34 Weiwei Jia
2016-12-22 9:30 ` [Qemu-devel] [libvirt] " Daniel P. Berrange
2017-01-03 15:50 ` [Qemu-devel] " Stefan Hajnoczi
0 siblings, 2 replies; 9+ messages in thread
From: Weiwei Jia @ 2016-12-22 6:34 UTC (permalink / raw)
To: Stefan Hajnoczi; +Cc: qemu-devel, eblake, libvir-list
Hi,
With QEMU's x-data-plane, I find that performance has not improved
very much. Please see the following two settings.
Setting 1: An I/O thread in the host OS (VMM) reads 4 KB at a time from
the disk (8 GB in total). The I/O thread is pinned to pCPU 5, which
serves it exclusively. I find the throughput is around 250 MB/s.
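A rough sketch of such a baseline run (fio is shown only as an example of
such a reader; /dev/sdX stands for the backing device):

  # host: sequential 4 KB direct reads of 8 GB, pinned to pCPU 5
  taskset -c 5 fio --name=seqread --filename=/dev/sdX --rw=read --bs=4k \
      --size=8G --direct=1 --ioengine=libaio --iodepth=32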
Setting 2: An I/O thread in the guest OS (VM) reads 4 KB at a time from
the virtual disk (8 GB in total). The I/O thread is pinned to vCPU 5,
and the vCPU 5 thread is pinned to pCPU 5, so that vCPU 5 handles this
I/O thread exclusively and pCPU 5 serves vCPU 5 exclusively. To keep
vCPU 5 from going idle, I also pin a CPU-intensive thread
(while (1) { i++; }) to vCPU 5 so that the I/O thread on it is served
without delay. With this setting, I find the throughput of the I/O
thread is around 190 MB/s.
NOTE: For setting 2, I also pin the dedicated QEMU IOThread
(x-data-plane) in the host OS to a pCPU so that it handles I/O requests
from the guest OS exclusively.
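For illustration only, the guest/host pinning for setting 2 could look
roughly like this sketch (fio and the shell busy loop stand in for the
actual workload; "vm" is the libvirt domain name used below):

  # guest: keep CPU 5 busy and run the reader on it
  taskset -c 5 sh -c 'while :; do :; done' &
  taskset -c 5 fio --name=seqread --filename=/dev/vda --rw=read --bs=4k \
      --size=8G --direct=1 --ioengine=libaio --iodepth=32
  # host: bind vCPU 5 of the domain to pCPU 5
  virsh vcpupin vm 5 5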
I expected the performance of the I/O thread in setting 2 to be almost
the same as in setting 1, and I cannot understand why it is 60 MB/s
lower. I am wondering whether something is wrong with my x-data-plane
or virtio settings for the VM. Would you please give me some hints?
Thank you.
Libvirt version: 2.4.0
QEMU version: 2.3.0
The libvirt XML configuration file is as follows (I start only one VM
with this configuration).
<domain type='kvm' id='1'
xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
<name>vm</name>
<uuid>3290a8d0-9d9f-b2c4-dd46-5d0d8a730cd6</uuid>
<memory unit='KiB'>8290304</memory>
<currentMemory unit='KiB'>8290304</currentMemory>
<vcpu placement='static'>15</vcpu>
<cputune>
<vcpupin vcpu='0' cpuset='0'/>
<vcpupin vcpu='1' cpuset='1'/>
<vcpupin vcpu='2' cpuset='2'/>
<vcpupin vcpu='3' cpuset='3'/>
<vcpupin vcpu='4' cpuset='4'/>
<vcpupin vcpu='5' cpuset='5'/>
<vcpupin vcpu='6' cpuset='6'/>
<vcpupin vcpu='7' cpuset='7'/>
<vcpupin vcpu='8' cpuset='8'/>
<vcpupin vcpu='9' cpuset='9'/>
<vcpupin vcpu='10' cpuset='10'/>
<vcpupin vcpu='11' cpuset='11'/>
<vcpupin vcpu='12' cpuset='12'/>
<vcpupin vcpu='13' cpuset='13'/>
<vcpupin vcpu='14' cpuset='14'/>
</cputune>
<resource>
<partition>/machine</partition>
</resource>
<os>
<type arch='x86_64' machine='pc-i440fx-2.2'>hvm</type>
<boot dev='hd'/>
</os>
<features>
<acpi/>
<apic/>
<pae/>
</features>
<clock offset='utc'/>
<on_poweroff>destroy</on_poweroff>
<on_reboot>restart</on_reboot>
<on_crash>restart</on_crash>
<devices>
<emulator>/usr/bin/kvm-spice</emulator>
<disk type='file' device='disk'>
<driver name='qemu' type='raw' cache='none' io='native'/>
<source file='/var/lib/libvirt/images/vm.img'/>
<target dev='vda' bus='virtio'/>
<alias name='virtio-disk0'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x06'
function='0x0'/>
</disk>
<disk type='block' device='cdrom'>
<driver name='qemu' type='raw'/>
<target dev='hdc' bus='ide'/>
<readonly/>
<alias name='ide0-1-0'/>
<address type='drive' controller='0' bus='1' target='0' unit='0'/>
</disk>
<controller type='usb' index='0'>
<alias name='usb0'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x01'
function='0x2'/>
</controller>
<controller type='pci' index='0' model='pci-root'>
<alias name='pci.0'/>
</controller>
<controller type='ide' index='0'>
<alias name='ide0'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x01'
function='0x1'/>
</controller>
<controller type='scsi' index='0'>
<alias name='scsi0'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x04'
function='0x0'/>
</controller>
<interface type='network'>
<mac address='52:54:00:8e:3d:06'/>
<source network='default'/>
<target dev='vnet0'/>
<model type='virtio'/>
<alias name='net0'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x03'
function='0x0'/>
</interface>
<serial type='pty'>
<source path='/dev/pts/8'/>
<target port='0'/>
<alias name='serial0'/>
</serial>
<console type='pty' tty='/dev/pts/8'>
<source path='/dev/pts/8'/>
<target type='serial' port='0'/>
<alias name='serial0'/>
</console>
<input type='mouse' bus='ps2'/>
<input type='keyboard' bus='ps2'/>
<graphics type='vnc' port='5900' autoport='yes' listen='127.0.0.1'>
<listen type='address' address='127.0.0.1'/>
</graphics>
<video>
<model type='cirrus' vram='9216' heads='1'/>
<alias name='video0'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x02'
function='0x0'/>
</video>
<memballoon model='virtio'>
<alias name='balloon0'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x05'
function='0x0'/>
</memballoon>
</devices>
<seclabel type='none'/>
<qemu:commandline>
<qemu:arg value='-set'/>
<qemu:arg value='device.virtio-disk0.scsi=off'/>
<qemu:arg value='-set'/>
<qemu:arg value='device.virtio-disk0.config-wce=off'/>
<qemu:arg value='-set'/>
<qemu:arg value='device.virtio-disk0.x-data-plane=on'/>
</qemu:commandline>
</domain>
Thank you,
Weiwei Jia
* Re: [Qemu-devel] [libvirt] Performance about x-data-plane
2016-12-22 6:34 [Qemu-devel] Performance about x-data-plane Weiwei Jia
@ 2016-12-22 9:30 ` Daniel P. Berrange
2016-12-22 19:53 ` Weiwei Jia
2017-01-03 15:50 ` [Qemu-devel] " Stefan Hajnoczi
1 sibling, 1 reply; 9+ messages in thread
From: Daniel P. Berrange @ 2016-12-22 9:30 UTC (permalink / raw)
To: Weiwei Jia; +Cc: Stefan Hajnoczi, libvir-list, qemu-devel
On Thu, Dec 22, 2016 at 01:34:47AM -0500, Weiwei Jia wrote:
> Hi,
>
> With QEMU x-data-plane, I find the performance has not been improved
> very much. Please see following two settings.
>
> Setting 1: I/O thread in host OS (VMM) reads 4KB each time from disk
> (8GB in total). Pin the I/O thread to pCPU 5 which will serve I/O
> thread dedicatedly. I find the performance is around 250 MB/s.
>
> Setting 2: I/O thread in guest OS (VMM) reads 4KB each time from
> virtual disk (8GB in total). Pin the I/O thread to vCPU 5 and pin vCPU
> 5 thread to pCPU5 so that vCPU 5 handles this I/O thread dedicatedly
> and pCPU5 serve vCPU5 dedicatedly. In order to keep vCPU5 not to be
> idle, I also pin one cpu intensive thread (while (1) {i++}) on vCPU 5
> so that the I/O thread on it can be served without delay. For this
> setting, I find the performance for this I/O thread is around 190
> MB/s.
>
> NOTE: For setting 2, I also pin the QEMU dedicated IOthread
> (x-data-plane) in host OS to pCPU to handle I/O requests from guest OS
> dedicatedly.
>
> I think for setting 2, the performance of I/O thread should be almost
> the same as setting 1. I cannot understand why it is 60 MB/s lower
> than setting 1. I am wondering whether there are something wrong with
> my x-data-plane setting or virtio setting for VM. Would you please
> give me some hints? Thank you.
The x-data-plane option is obsolete and should not be used. You should
use the modern iothread option instead, which is explicitly supported
by libvirt XML.
http://libvirt.org/formatdomain.html#elementsIOThreadsAllocation
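For example, a minimal sketch of that approach (assuming a libvirt new
enough for the iothread* virsh commands, and reusing the domain name "vm"
from the original XML):

  # define IOThread 1 for the domain and pin it to a spare pCPU
  virsh iothreadadd vm 1 --config
  virsh iothreadpin vm 1 15 --config
  # then reference it from the disk via "virsh edit vm":
  #   <driver name='qemu' type='raw' cache='none' io='native' iothread='1'/>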
Regards,
Daniel
--
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://entangle-photo.org -o- http://search.cpan.org/~danberr/ :|
* Re: [Qemu-devel] [libvirt] Performance about x-data-plane
2016-12-22 9:30 ` [Qemu-devel] [libvirt] " Daniel P. Berrange
@ 2016-12-22 19:53 ` Weiwei Jia
0 siblings, 0 replies; 9+ messages in thread
From: Weiwei Jia @ 2016-12-22 19:53 UTC (permalink / raw)
To: Daniel P. Berrange; +Cc: Stefan Hajnoczi, libvir-list, qemu-devel
On Thu, Dec 22, 2016 at 4:30 AM, Daniel P. Berrange <berrange@redhat.com> wrote:
> On Thu, Dec 22, 2016 at 01:34:47AM -0500, Weiwei Jia wrote:
>> Hi,
>>
>> With QEMU x-data-plane, I find the performance has not been improved
>> very much. Please see following two settings.
>>
>> Setting 1: I/O thread in host OS (VMM) reads 4KB each time from disk
>> (8GB in total). Pin the I/O thread to pCPU 5 which will serve I/O
>> thread dedicatedly. I find the performance is around 250 MB/s.
>>
>> Setting 2: I/O thread in guest OS (VMM) reads 4KB each time from
>> virtual disk (8GB in total). Pin the I/O thread to vCPU 5 and pin vCPU
>> 5 thread to pCPU5 so that vCPU 5 handles this I/O thread dedicatedly
>> and pCPU5 serve vCPU5 dedicatedly. In order to keep vCPU5 not to be
>> idle, I also pin one cpu intensive thread (while (1) {i++}) on vCPU 5
>> so that the I/O thread on it can be served without delay. For this
>> setting, I find the performance for this I/O thread is around 190
>> MB/s.
>>
>> NOTE: For setting 2, I also pin the QEMU dedicated IOthread
>> (x-data-plane) in host OS to pCPU to handle I/O requests from guest OS
>> dedicatedly.
>>
>> I think for setting 2, the performance of I/O thread should be almost
>> the same as setting 1. I cannot understand why it is 60 MB/s lower
>> than setting 1. I am wondering whether there are something wrong with
>> my x-data-plane setting or virtio setting for VM. Would you please
>> give me some hints? Thank you.
>
> The x-data-plane option is obsolete and should not be used. You should
> use the modern iothread option instead, which is explicitly supported
> by libvirt XML.
>
> http://libvirt.org/formatdomain.html#elementsIOThreadsAllocation
Thanks for your reply. However, I think this is not the main point of
my question. Let me see whether Stefan has any comments. Thank you.
Best,
Weiwei Jia
* Re: [Qemu-devel] Performance about x-data-plane
2016-12-22 6:34 [Qemu-devel] Performance about x-data-plane Weiwei Jia
2016-12-22 9:30 ` [Qemu-devel] [libvirt] " Daniel P. Berrange
@ 2017-01-03 15:50 ` Stefan Hajnoczi
2017-01-03 17:02 ` Weiwei Jia
1 sibling, 1 reply; 9+ messages in thread
From: Stefan Hajnoczi @ 2017-01-03 15:50 UTC (permalink / raw)
To: Weiwei Jia; +Cc: qemu-devel, eblake, libvir-list
On Thu, Dec 22, 2016 at 01:34:47AM -0500, Weiwei Jia wrote:
> With QEMU x-data-plane, I find the performance has not been improved
> very much. Please see following two settings.
Using IOThreads improves scalability for SMP guests with many disks. It
does not improve performance for a single disk benchmark because there
is nothing to spread across IOThreads.
> Setting 1: I/O thread in host OS (VMM) reads 4KB each time from disk
> (8GB in total). Pin the I/O thread to pCPU 5 which will serve I/O
> thread dedicatedly. I find the performance is around 250 MB/s.
250 MB/s / 4 KB = 64k IOPS
This seems like a reasonable result for a single thread with a single
disk. I guess the benchmark's queue depth setting is larger than 1,
though, because a queue depth of 1 would allow only about 15
microseconds per request.
> Setting 2: I/O thread in guest OS (VMM) reads 4KB each time from
> virtual disk (8GB in total). Pin the I/O thread to vCPU 5 and pin vCPU
> 5 thread to pCPU5 so that vCPU 5 handles this I/O thread dedicatedly
> and pCPU5 serve vCPU5 dedicatedly. In order to keep vCPU5 not to be
> idle, I also pin one cpu intensive thread (while (1) {i++}) on vCPU 5
> so that the I/O thread on it can be served without delay. For this
> setting, I find the performance for this I/O thread is around 190
> MB/s.
190 MB/s / 4 KB = 48k IOPS
I worry that your while (1) {i++} thread may prevent achieving the best
performance if the guest kernel scheduler allows it to use its time
slice.
Two options that might work better are:
1. idle=poll guest kernel command-line parameter
2. kvm.ko's halt_poll_ns host kernel module parameter
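A sketch of both knobs (values are arbitrary examples; halt_poll_ns
requires a host kernel new enough to expose it, and grubby is only one
way to edit the guest kernel command line):

  # guest: add idle=poll to the kernel command line and reboot
  grubby --update-kernel=ALL --args="idle=poll"
  # host: widen kvm.ko's halt-polling window (nanoseconds)
  echo 400000 | sudo tee /sys/module/kvm/parameters/halt_poll_ns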
> NOTE: For setting 2, I also pin the QEMU dedicated IOthread
> (x-data-plane) in host OS to pCPU to handle I/O requests from guest OS
> dedicatedly.
Which pCPU did you pin the dataplane thread to? Did you try changing
this?
> I think for setting 2, the performance of I/O thread should be almost
> the same as setting 1. I cannot understand why it is 60 MB/s lower
> than setting 1. I am wondering whether there are something wrong with
> my x-data-plane setting or virtio setting for VM. Would you please
> give me some hints? Thank you.
Ideally QEMU should achieve the same performance as bare metal. In
practice the overhead increases as IOPS increases. You may be able to
achieve 260 MB/s inside the guest with a larger request size since it
involves fewer I/O requests.
The expensive part is the virtqueue kick. Recently we tried polling the
virtqueue instead of waiting for the ioeventfd file descriptor and got
double-digit performance improvements:
https://lists.gnu.org/archive/html/qemu-devel/2016-12/msg00148.html
If you want to understand the performance of your benchmark you'll have
to compare host/guest disk stats (e.g. request lifetime, disk
utilization, queue depth, average request size) to check that the bare
metal and guest workloads are really sending comparable I/O patterns to
the physical disk.
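For example (device names are placeholders), extended iostat output on
both sides exposes most of those fields:

  iostat -dxm 1 /dev/sdX    # host: the physical/RAID device
  iostat -dxm 1 /dev/vda    # guest: the virtio disk
  # compare avgrq-sz (request size), avgqu-sz (queue depth), await, %util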
Then you can use Linux and/or QEMU tracing to analyze the request
latency by looking at interesting points in the request lifecycle like
the virtqueue kick, host Linux AIO io_submit(2), etc.
> Libvirt version: 2.4.0
> QEMU version: 2.3.0
>
> The libvirt xml configuration file is like following (I only start
> one VM with following xml config).
>
> <domain type='kvm' id='1'
> xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
> <name>vm</name>
> <uuid>3290a8d0-9d9f-b2c4-dd46-5d0d8a730cd6</uuid>
> <memory unit='KiB'>8290304</memory>
> <currentMemory unit='KiB'>8290304</currentMemory>
> <vcpu placement='static'>15</vcpu>
> <cputune>
> <vcpupin vcpu='0' cpuset='0'/>
> <vcpupin vcpu='1' cpuset='1'/>
> <vcpupin vcpu='2' cpuset='2'/>
> <vcpupin vcpu='3' cpuset='3'/>
> <vcpupin vcpu='4' cpuset='4'/>
> <vcpupin vcpu='5' cpuset='5'/>
> <vcpupin vcpu='6' cpuset='6'/>
> <vcpupin vcpu='7' cpuset='7'/>
> <vcpupin vcpu='8' cpuset='8'/>
> <vcpupin vcpu='9' cpuset='9'/>
> <vcpupin vcpu='10' cpuset='10'/>
> <vcpupin vcpu='11' cpuset='11'/>
> <vcpupin vcpu='12' cpuset='12'/>
> <vcpupin vcpu='13' cpuset='13'/>
> <vcpupin vcpu='14' cpuset='14'/>
> </cputune>
> <resource>
> <partition>/machine</partition>
> </resource>
> <os>
> <type arch='x86_64' machine='pc-i440fx-2.2'>hvm</type>
> <boot dev='hd'/>
> </os>
> <features>
> <acpi/>
> <apic/>
> <pae/>
> </features>
> <clock offset='utc'/>
> <on_poweroff>destroy</on_poweroff>
> <on_reboot>restart</on_reboot>
> <on_crash>restart</on_crash>
> <devices>
> <emulator>/usr/bin/kvm-spice</emulator>
> <disk type='file' device='disk'>
> <driver name='qemu' type='raw' cache='none' io='native'/>
> <source file='/var/lib/libvirt/images/vm.img'/>
> <target dev='vda' bus='virtio'/>
> <alias name='virtio-disk0'/>
> <address type='pci' domain='0x0000' bus='0x00' slot='0x06'
> function='0x0'/>
> </disk>
> <disk type='block' device='cdrom'>
> <driver name='qemu' type='raw'/>
> <target dev='hdc' bus='ide'/>
> <readonly/>
> <alias name='ide0-1-0'/>
> <address type='drive' controller='0' bus='1' target='0' unit='0'/>
> </disk>
> <controller type='usb' index='0'>
> <alias name='usb0'/>
> <address type='pci' domain='0x0000' bus='0x00' slot='0x01'
> function='0x2'/>
> </controller>
> <controller type='pci' index='0' model='pci-root'>
> <alias name='pci.0'/>
> </controller>
> <controller type='ide' index='0'>
> <alias name='ide0'/>
> <address type='pci' domain='0x0000' bus='0x00' slot='0x01'
> function='0x1'/>
> </controller>
> <controller type='scsi' index='0'>
> <alias name='scsi0'/>
> <address type='pci' domain='0x0000' bus='0x00' slot='0x04'
> function='0x0'/>
> </controller>
> <interface type='network'>
> <mac address='52:54:00:8e:3d:06'/>
> <source network='default'/>
> <target dev='vnet0'/>
> <model type='virtio'/>
> <alias name='net0'/>
> <address type='pci' domain='0x0000' bus='0x00' slot='0x03'
> function='0x0'/>
> </interface>
> <serial type='pty'>
> <source path='/dev/pts/8'/>
> <target port='0'/>
> <alias name='serial0'/>
> </serial>
> <console type='pty' tty='/dev/pts/8'>
> <source path='/dev/pts/8'/>
> <target type='serial' port='0'/>
> <alias name='serial0'/>
> </console>
> <input type='mouse' bus='ps2'/>
> <input type='keyboard' bus='ps2'/>
> <graphics type='vnc' port='5900' autoport='yes' listen='127.0.0.1'>
> <listen type='address' address='127.0.0.1'/>
> </graphics>
> <video>
> <model type='cirrus' vram='9216' heads='1'/>
> <alias name='video0'/>
> <address type='pci' domain='0x0000' bus='0x00' slot='0x02'
> function='0x0'/>
> </video>
> <memballoon model='virtio'>
> <alias name='balloon0'/>
> <address type='pci' domain='0x0000' bus='0x00' slot='0x05'
> function='0x0'/>
> </memballoon>
> </devices>
> <seclabel type='none'/>
> <qemu:commandline>
> <qemu:arg value='-set'/>
> <qemu:arg value='device.virtio-disk0.scsi=off'/>
> <qemu:arg value='-set'/>
> <qemu:arg value='device.virtio-disk0.config-wce=off'/>
> <qemu:arg value='-set'/>
> <qemu:arg value='device.virtio-disk0.x-data-plane=on'/>
> </qemu:commandline>
> </domain>
>
>
> Thank you,
> Weiwei Jia
* Re: [Qemu-devel] Performance about x-data-plane
2017-01-03 15:50 ` [Qemu-devel] " Stefan Hajnoczi
@ 2017-01-03 17:02 ` Weiwei Jia
2017-01-16 13:15 ` Stefan Hajnoczi
0 siblings, 1 reply; 9+ messages in thread
From: Weiwei Jia @ 2017-01-03 17:02 UTC (permalink / raw)
To: Stefan Hajnoczi; +Cc: qemu-devel, eblake, libvir-list
Hi Stefan,
Thanks for your reply.
On Tue, Jan 3, 2017 at 10:50 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> On Thu, Dec 22, 2016 at 01:34:47AM -0500, Weiwei Jia wrote:
>> With QEMU x-data-plane, I find the performance has not been improved
>> very much. Please see following two settings.
>
> Using IOThreads improves scalability for SMP guests with many disks. It
> does not improve performance for a single disk benchmark because there
> is nothing to spread across IOThreads.
>
>> Setting 1: I/O thread in host OS (VMM) reads 4KB each time from disk
>> (8GB in total). Pin the I/O thread to pCPU 5 which will serve I/O
>> thread dedicatedly. I find the performance is around 250 MB/s.
>
> 250 MB/s / 4 KB = 64k IOPS
>
> This seems like a reasonable result for single thread with a single
> disk. I guess the benchmark queue depth setting is larger than 1 though
> because it would only allow 15 microseconds per request.
For this test, we have two disks in a RAID 0 array. Each disk is a
2.5-inch drive with around 130 MB/s sequential-read bandwidth according
to its specification.
>
>> Setting 2: I/O thread in guest OS (VMM) reads 4KB each time from
>> virtual disk (8GB in total). Pin the I/O thread to vCPU 5 and pin vCPU
>> 5 thread to pCPU5 so that vCPU 5 handles this I/O thread dedicatedly
>> and pCPU5 serve vCPU5 dedicatedly. In order to keep vCPU5 not to be
>> idle, I also pin one cpu intensive thread (while (1) {i++}) on vCPU 5
>> so that the I/O thread on it can be served without delay. For this
>> setting, I find the performance for this I/O thread is around 190
>> MB/s.
>
> 190 MB/s / 4 KB = 48k IOPS
>
> I worry that your while (1) {i++} thread may prevent achieving the best
> performance if the guest kernel scheduler allows it to use its time
> slice.
>
> Two options that might work better are:
> 1. idle=poll guest kernel command-line parameter
> 2. kvm.ko's halt_poll_ns host kernel module parameter
Yes, I currently add "idle=poll", but I have not tried kvm.ko's
halt_poll_ns yet; I will try it. In recent days, I have built a RAID 0
array with four disks; each is a 2.5-inch drive with 130 MB/s
sequential-read bandwidth (from its specification). I also fixed some
interrupt-delay problems. See below for the latest experiment and
problems.
When I run one I/O thread (sequentially reading 4 KB at a time, 8 GB in
total) in the host OS, the throughput is around 420 MB/s. However, when
I run this I/O thread in one VM (no other VM is created and data-plane
is enabled) with dedicated hardware, the throughput is around 350 MB/s.
The VM's experimental setup is as follows.
In the VM, there are 15 vCPUs (vCPU0 - vCPU14); each vCPU is pinned to
its corresponding dedicated pCPU (for example, vCPU0 is pinned to pCPU0
... vCPU14 is pinned to pCPU14). "idle=poll" is added to the VM's boot
command line in GRUB so that the vCPUs do not go idle. In the VM, all
interrupts are pinned to vCPU0 to guarantee they are handled promptly,
and the I/O thread runs on one of the vCPUs other than vCPU0.
In the host OS, there are 16 pCPUs (pCPU0 - pCPU15). pCPU0 - pCPU14 are
dedicated to vCPU0 - vCPU14 of the VM, and pCPU15 is used exclusively
by the QEMU IOThread (data-plane) to handle I/O read requests from the
VM.
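Roughly, the pinning is enforced as in the following sketch (the TID and
the pgrep pattern are placeholders for however the data-plane thread is
located):

  # host: pin vCPU0-vCPU14 of domain "vm" to pCPU0-pCPU14
  for i in $(seq 0 14); do virsh vcpupin vm "$i" "$i"; done
  # host: list the QEMU threads, find the data-plane thread's TID, pin it to pCPU15
  ps -T -o tid,comm -p "$(pgrep -f kvm-spice | head -1)"
  taskset -pc 15 12345          # 12345 is a placeholder TID
  # guest: route all interrupts to vCPU0 (affinity mask 1)
  for irq in /proc/irq/[0-9]*; do echo 1 > "$irq/smp_affinity" 2>/dev/null; done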
Kernel version: 3.16.39
QEMU version: 2.4.1
I don't know why there is a 70 MB/s difference between the host OS and
the guest OS in the above experiment. Has anyone seen something
similar? Any comments? Thank you in advance.
BTW, I have checked my hardware configuration several times, and I
think the throughput difference above is related to QEMU; perhaps I am
missing some QEMU configuration.
>
>> NOTE: For setting 2, I also pin the QEMU dedicated IOthread
>> (x-data-plane) in host OS to pCPU to handle I/O requests from guest OS
>> dedicatedly.
>
> Which pCPU did you pin the dataplane thread to? Did you try changing
> this?
No, I did not try changing it; it is pinned exclusively to pCPU15.
Please see the latest experiment and problem described above.
>
>> I think for setting 2, the performance of I/O thread should be almost
>> the same as setting 1. I cannot understand why it is 60 MB/s lower
>> than setting 1. I am wondering whether there are something wrong with
>> my x-data-plane setting or virtio setting for VM. Would you please
>> give me some hints? Thank you.
>
> Ideally QEMU should achieve the same performance as bare metal. In
> practice the overhead increases as IOPS increases. You may be able to
> achieve 260 MB/s inside the guest with a larger request size since it
> involves fewer I/O requests.
Yes, I agree that it should ideally achieve bare-metal performance. I
will try a larger request size. Thank you.
>
> The expensive part is the virtqueue kick. Recently we tried polling the
> virtqueue instead of waiting for the ioeventfd file descriptor and got
> double-digit performance improvements:
> https://lists.gnu.org/archive/html/qemu-devel/2016-12/msg00148.html
>
> If you want to understand the performance of your benchmark you'll have
> compare host/guest disk stats (e.g. request lifetime, disk utilization,
> queue depth, average request size) to check that the bare metal and
> guest workloads are really sending comparable I/O patterns to the
> physical disk.
>
> Then you using Linux and/or QEMU tracing to analyze the request latency
> by looking at interesting points in the request lifecycle like virtqueue
> kick, host Linux AIO io_submit(2), etc.
>
Thank you. I will look into polling the virtqueue as you suggested
above.
Currently, I just use blktrace to see disk stats and add logging to the
I/O workload to measure the latency of each request. What kinds of
tools are you using to analyze the request lifecycle (virtqueue kick,
host Linux AIO io_submit, etc.)?
Do you trace the lifecycle as described here
(http://www.linux-kvm.org/page/Virtio/Block/Latency#Performance_data)?
It seems to be out of date. Does this tree
(http://repo.or.cz/qemu-kvm/stefanha.git/shortlog/refs/heads/tracing-dev-0.12.4)
still work with QEMU 2.4.1?
Thank you,
Weiwei Jia
* Re: [Qemu-devel] Performance about x-data-plane
2017-01-03 17:02 ` Weiwei Jia
@ 2017-01-16 13:15 ` Stefan Hajnoczi
2017-01-16 15:00 ` Karl Rister
2017-01-16 19:37 ` Weiwei Jia
0 siblings, 2 replies; 9+ messages in thread
From: Stefan Hajnoczi @ 2017-01-16 13:15 UTC (permalink / raw)
To: Weiwei Jia; +Cc: qemu-devel, eblake, libvir-list, krister
On Tue, Jan 03, 2017 at 12:02:14PM -0500, Weiwei Jia wrote:
> > The expensive part is the virtqueue kick. Recently we tried polling the
> > virtqueue instead of waiting for the ioeventfd file descriptor and got
> > double-digit performance improvements:
> > https://lists.gnu.org/archive/html/qemu-devel/2016-12/msg00148.html
> >
> > If you want to understand the performance of your benchmark you'll have
> > compare host/guest disk stats (e.g. request lifetime, disk utilization,
> > queue depth, average request size) to check that the bare metal and
> > guest workloads are really sending comparable I/O patterns to the
> > physical disk.
> >
> > Then you using Linux and/or QEMU tracing to analyze the request latency
> > by looking at interesting points in the request lifecycle like virtqueue
> > kick, host Linux AIO io_submit(2), etc.
> >
>
> Thank you. I will look into "polling the virtqueue" as you said above.
> Currently, I just use blktrace to see disk stats and add logs in the
> I/O workload to see the time latency for each request. What kind of
> tools are you using to analyze request lifecycle like virtqueue kick,
> host Linux AIO iosubmit, etc.
>
> Do you trace the lifecycle like this
> (http://www.linux-kvm.org/page/Virtio/Block/Latency#Performance_data)
> but it seems to be out of date. Does it
> (http://repo.or.cz/qemu-kvm/stefanha.git/shortlog/refs/heads/tracing-dev-0.12.4)
> still work on QEMU 2.4.1?
The details are out of date but the general approach to tracing the I/O
request lifecycle still applies.
There are multiple tracing tools that can do what you need. I've CCed
Karl Rister, who did the latest virtio-blk dataplane tracing.
"perf record -a -e kvm:\*" is a good start. You can use "perf probe" to
trace QEMU's trace events (recent versions have sdt support, which means
SystemTap tracepoints work) and also trace any function in QEMU:
http://blog.vmsplice.net/2011/03/how-to-use-perf-probe.html
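For example (the QEMU binary path and probed function are only
illustrative, not specific to this setup):

  # record all KVM tracepoints system-wide while the benchmark runs
  perf record -a -e 'kvm:*' -- sleep 10
  perf report
  # add a dynamic probe on a QEMU function and list it before recording it too
  perf probe -x /usr/bin/qemu-system-x86_64 virtio_queue_notify
  perf probe --list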
Stefan
* Re: [Qemu-devel] Performance about x-data-plane
2017-01-16 13:15 ` Stefan Hajnoczi
@ 2017-01-16 15:00 ` Karl Rister
2017-01-16 19:38 ` Weiwei Jia
2017-01-16 19:37 ` Weiwei Jia
1 sibling, 1 reply; 9+ messages in thread
From: Karl Rister @ 2017-01-16 15:00 UTC (permalink / raw)
To: Stefan Hajnoczi, Weiwei Jia; +Cc: qemu-devel, eblake, libvir-list
On 01/16/2017 07:15 AM, Stefan Hajnoczi wrote:
> On Tue, Jan 03, 2017 at 12:02:14PM -0500, Weiwei Jia wrote:
>>> The expensive part is the virtqueue kick. Recently we tried polling the
>>> virtqueue instead of waiting for the ioeventfd file descriptor and got
>>> double-digit performance improvements:
>>> https://lists.gnu.org/archive/html/qemu-devel/2016-12/msg00148.html
>>>
>>> If you want to understand the performance of your benchmark you'll have
>>> compare host/guest disk stats (e.g. request lifetime, disk utilization,
>>> queue depth, average request size) to check that the bare metal and
>>> guest workloads are really sending comparable I/O patterns to the
>>> physical disk.
>>>
>>> Then you using Linux and/or QEMU tracing to analyze the request latency
>>> by looking at interesting points in the request lifecycle like virtqueue
>>> kick, host Linux AIO io_submit(2), etc.
>>>
>>
>> Thank you. I will look into "polling the virtqueue" as you said above.
>> Currently, I just use blktrace to see disk stats and add logs in the
>> I/O workload to see the time latency for each request. What kind of
>> tools are you using to analyze request lifecycle like virtqueue kick,
>> host Linux AIO iosubmit, etc.
>>
>> Do you trace the lifecycle like this
>> (http://www.linux-kvm.org/page/Virtio/Block/Latency#Performance_data)
>> but it seems to be out of date. Does it
>> (http://repo.or.cz/qemu-kvm/stefanha.git/shortlog/refs/heads/tracing-dev-0.12.4)
>> still work on QEMU 2.4.1?
>
> The details are out of date but the general approach to tracing the I/O
> request lifecycle still apply.
>
> There are multiple tracing tools that can do what you need. I've CCed
> Karl Rister who did the latest virtio-blk dataplane tracing.
I roughly followed this guide by Luiz Capitulino:
https://lists.nongnu.org/archive/html/qemu-devel/2016-03/msg00887.html
I tweaked his trace-host-and-guest script to avoid doing I/O while
tracing is enabled; my version is available here:
I built QEMU with --enable-trace-backends=ftrace and then turned on the
QEMU trace events I was interested in with this bit of bash:
for event in $(/usr/libexec/qemu-kvm -trace help 2>&1 | grep virtio | \
               grep -v "gpu\|console\|serial\|rng\|balloon\|ccw"); do
    virsh qemu-monitor-command master --hmp trace-event ${event} on
done
At this point, the QEMU trace events are automatically inserted into the
ftrace buffers and the methodology outlined by Luiz gets the guest
kernel, host kernel, and QEMU events properly interleaved.
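After a run, the combined host-kernel + QEMU trace can then be pulled
out of the ftrace ring buffer, e.g. (assuming debugfs is mounted in the
usual location):

  cat /sys/kernel/debug/tracing/trace > combined-trace.txt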
>
> "perf record -a -e kvm:\*" is a good start. You can use "perf probe" to
> trace QEMU's trace events (recent versions have sdt support, which means
> SystemTap tracepoints work) and also trace any function in QEMU:
> http://blog.vmsplice.net/2011/03/how-to-use-perf-probe.html
>
> Stefan
>
--
Karl Rister <krister@redhat.com>
* Re: [Qemu-devel] Performance about x-data-plane
2017-01-16 13:15 ` Stefan Hajnoczi
2017-01-16 15:00 ` Karl Rister
@ 2017-01-16 19:37 ` Weiwei Jia
1 sibling, 0 replies; 9+ messages in thread
From: Weiwei Jia @ 2017-01-16 19:37 UTC (permalink / raw)
To: Stefan Hajnoczi; +Cc: qemu-devel, eblake, libvir-list, krister
On Mon, Jan 16, 2017 at 8:15 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> On Tue, Jan 03, 2017 at 12:02:14PM -0500, Weiwei Jia wrote:
>> > The expensive part is the virtqueue kick. Recently we tried polling the
>> > virtqueue instead of waiting for the ioeventfd file descriptor and got
>> > double-digit performance improvements:
>> > https://lists.gnu.org/archive/html/qemu-devel/2016-12/msg00148.html
>> >
>> > If you want to understand the performance of your benchmark you'll have
>> > compare host/guest disk stats (e.g. request lifetime, disk utilization,
>> > queue depth, average request size) to check that the bare metal and
>> > guest workloads are really sending comparable I/O patterns to the
>> > physical disk.
>> >
>> > Then you using Linux and/or QEMU tracing to analyze the request latency
>> > by looking at interesting points in the request lifecycle like virtqueue
>> > kick, host Linux AIO io_submit(2), etc.
>> >
>>
>> Thank you. I will look into "polling the virtqueue" as you said above.
>> Currently, I just use blktrace to see disk stats and add logs in the
>> I/O workload to see the time latency for each request. What kind of
>> tools are you using to analyze request lifecycle like virtqueue kick,
>> host Linux AIO iosubmit, etc.
>>
>> Do you trace the lifecycle like this
>> (http://www.linux-kvm.org/page/Virtio/Block/Latency#Performance_data)
>> but it seems to be out of date. Does it
>> (http://repo.or.cz/qemu-kvm/stefanha.git/shortlog/refs/heads/tracing-dev-0.12.4)
>> still work on QEMU 2.4.1?
>
> The details are out of date but the general approach to tracing the I/O
> request lifecycle still apply.
>
> There are multiple tracing tools that can do what you need. I've CCed
> Karl Rister who did the latest virtio-blk dataplane tracing.
>
> "perf record -a -e kvm:\*" is a good start. You can use "perf probe" to
> trace QEMU's trace events (recent versions have sdt support, which means
> SystemTap tracepoints work) and also trace any function in QEMU:
> http://blog.vmsplice.net/2011/03/how-to-use-perf-probe.html
Thank you. I will try it.
Best,
Weiwei Jia
* Re: [Qemu-devel] Performance about x-data-plane
2017-01-16 15:00 ` Karl Rister
@ 2017-01-16 19:38 ` Weiwei Jia
0 siblings, 0 replies; 9+ messages in thread
From: Weiwei Jia @ 2017-01-16 19:38 UTC (permalink / raw)
To: krister; +Cc: Stefan Hajnoczi, qemu-devel, eblake, libvir-list
On Mon, Jan 16, 2017 at 10:00 AM, Karl Rister <krister@redhat.com> wrote:
> On 01/16/2017 07:15 AM, Stefan Hajnoczi wrote:
>> On Tue, Jan 03, 2017 at 12:02:14PM -0500, Weiwei Jia wrote:
>>>> The expensive part is the virtqueue kick. Recently we tried polling the
>>>> virtqueue instead of waiting for the ioeventfd file descriptor and got
>>>> double-digit performance improvements:
>>>> https://lists.gnu.org/archive/html/qemu-devel/2016-12/msg00148.html
>>>>
>>>> If you want to understand the performance of your benchmark you'll have
>>>> compare host/guest disk stats (e.g. request lifetime, disk utilization,
>>>> queue depth, average request size) to check that the bare metal and
>>>> guest workloads are really sending comparable I/O patterns to the
>>>> physical disk.
>>>>
>>>> Then you using Linux and/or QEMU tracing to analyze the request latency
>>>> by looking at interesting points in the request lifecycle like virtqueue
>>>> kick, host Linux AIO io_submit(2), etc.
>>>>
>>>
>>> Thank you. I will look into "polling the virtqueue" as you said above.
>>> Currently, I just use blktrace to see disk stats and add logs in the
>>> I/O workload to see the time latency for each request. What kind of
>>> tools are you using to analyze request lifecycle like virtqueue kick,
>>> host Linux AIO iosubmit, etc.
>>>
>>> Do you trace the lifecycle like this
>>> (http://www.linux-kvm.org/page/Virtio/Block/Latency#Performance_data)
>>> but it seems to be out of date. Does it
>>> (http://repo.or.cz/qemu-kvm/stefanha.git/shortlog/refs/heads/tracing-dev-0.12.4)
>>> still work on QEMU 2.4.1?
>>
>> The details are out of date but the general approach to tracing the I/O
>> request lifecycle still apply.
>>
>> There are multiple tracing tools that can do what you need. I've CCed
>> Karl Rister who did the latest virtio-blk dataplane tracing.
>
> I roughly followed this guide by Luiz Capitulino:
>
> https://lists.nongnu.org/archive/html/qemu-devel/2016-03/msg00887.html
>
> I tweaked his trace-host-and-guest script to avoid doing IO while
> tracing is enabled, my version is available here:
>
> http://people.redhat.com/~krister/tracing/trace-host-and-guest
>
> I built QEMU with --enable-trace-backends=ftrace and then turned on the
> QEMU trace events I was interested in with this bit of bash:
>
> for event in $(/usr/libexec/qemu-kvm -trace help 2>&1|grep virtio|grep
> -v "gpu\|console\|serial\|rng\|balloon\|ccw"); do virsh
> qemu-monitor-command master --hmp trace-event ${event} on; done
>
> At this point, the QEMU trace events are automatically inserted into the
> ftrace buffers and the methodology outlined by Luiz gets the guest
> kernel, host kernel, and QEMU events properly interleaved.
Thank you. I will try it.
Best,
Weiwei Jia