* [Qemu-devel] Poor 8K random IO performance inside the guest
@ 2017-07-14 4:28 Nagarajan, Padhu (HPE Storage)
2017-07-14 6:07 ` Fam Zheng
2017-07-17 10:13 ` Stefan Hajnoczi
0 siblings, 2 replies; 3+ messages in thread
From: Nagarajan, Padhu (HPE Storage) @ 2017-07-14 4:28 UTC (permalink / raw)
To: qemu-devel@nongnu.org
During an 8K random-read fio benchmark, we observed poor performance inside the guest in comparison to the performance seen on the host block device. The table below shows the IOPS on the host and inside the guest with both virtioscsi (scsimq) and virtioblk (blkmq).
-----------------------------------
config | IOPS | fio gst hst
-----------------------------------
host-q32-t1 | 79478 | 401 271
scsimq-q8-t4 | 45958 | 693 639 351
blkmq-q8-t4 | 49247 | 647 589 308
-----------------------------------
host-q48-t1 | 85599 | 559 291
scsimq-q12-t4 | 50237 | 952 807 358
blkmq-q12-t4 | 54016 | 885 786 329
-----------------------------------
fio gst hst => latencies in usecs, as
seen by fio, guest and
host block layers.
q8-t4 => qdepth=8, numjobs=4
host => fio run directly on the host
scsimq,blkmq => fio run inside the guest
Shouldn't we get much better performance inside the guest?
When fio inside the guest was generating 32 outstanding IOs, iostat on the host showed an avgqu-sz of only 16. For 48 outstanding IOs inside the guest, avgqu-sz on the host was only marginally better.
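For reference, the avgqu-sz figures come from iostat's extended device statistics; a minimal invocation against the /dev/sdc backing device used below (the one-second interval is only illustrative):

  # -x: extended statistics (includes avgqu-sz and await), sampled every second
  iostat -x 1 /dev/sdc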
qemu command line: qemu-system-x86_64 -L /usr/share/seabios/ -name node1,debug-threads=on -name node1 -S -machine pc,accel=kvm,usb=off -cpu SandyBridge -m 7680 -realtime mlock=off -smp 4,sockets=4,cores=1,threads=1 -object iothread,id=iothread1 -object iothread,id=iothread2 -object iothread,id=iothread3 -object iothread,id=iothread4 -uuid XX -nographic -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/node1.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=discard -no-hpet -no-shutdown -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device lsi,id=scsi0,bus=pci.0,addr=0x6 -device virtio-scsi-pci,ioeventfd=on,num_queues=4,iothread=iothread2,id=scsi1,bus=pci.0,addr=0x7 -device virtio-scsi-pci,ioeventfd=on,num_queues=4,iothread=iothread2,id=scsi2,bus=pci.0,addr=0x8 -drive file=rhel7.qcow2,if=none,id=drive-virtio-disk0,format=qcow2 -device virtio-blk-pci,ioeventfd=on,num-queues=4,iothread=iothread1,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -drive file=/dev/sdc,if=none,id=drive-virtio-disk1,format=raw,cache=none,aio=native -device virtio-blk-pci,ioeventfd=on,num-queues=4,iothread=iothread1,iothread=iothread1,scsi=off,bus=pci.0,addr=0x17,drive=drive-virtio-disk1,id=virtio-disk1 -drive file=/dev/sdc,if=none,id=drive-scsi1-0-0-0,format=raw,cache=none,aio=native -device scsi-hd,bus=scsi1.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi1-0-0-0,id=scsi1-0-0-0 -netdev tap,fd=24,id=hostnet0,vhost=on,vhostfd=25 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=XXX,bus=pci.0,addr=0x2 -netdev tap,fd=26,id=hostnet1,vhost=on,vhostfd=27 -device virtio-net-pci,netdev=hostnet1,id=net1,mac=YYY,bus=pci.0,multifunction=on,addr=0x15 -netdev tap,fd=28,id=hostnet2,vhost=on,vhostfd=29 -device virtio-net-pci,netdev=hostnet2,id=net2,mac=ZZZ,bus=pci.0,multifunction=on,addr=0x16 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3 -msg timestamp=on
fio command line: /tmp/fio --time_based --ioengine=libaio --randrepeat=1 --direct=1 --invalidate=1 --verify=0 --offset=0 --verify_fatal=0 --group_reporting --numjobs=$jobs --name=randread --rw=randread --blocksize=8K --iodepth=$qd --runtime=60 --filename={/dev/vdb or /dev/sda}
# qemu-system-x86_64 --version
QEMU emulator version 2.8.0(Debian 1:2.8+dfsg-3~bpo8+1)
Copyright (c) 2003-2016 Fabrice Bellard and the QEMU Project developers
The guest was running RHEL 7.3 and the host was Debian 8.
Any thoughts on what could be happening here?
~Padhu.
* Re: [Qemu-devel] Poor 8K random IO performance inside the guest
2017-07-14 4:28 [Qemu-devel] Poor 8K random IO performance inside the guest Nagarajan, Padhu (HPE Storage)
@ 2017-07-14 6:07 ` Fam Zheng
2017-07-17 10:13 ` Stefan Hajnoczi
1 sibling, 0 replies; 3+ messages in thread
From: Fam Zheng @ 2017-07-14 6:07 UTC (permalink / raw)
To: Nagarajan, Padhu (HPE Storage); +Cc: qemu-devel@nongnu.org
On Fri, 07/14 04:28, Nagarajan, Padhu (HPE Storage) wrote:
> During an 8K random-read fio benchmark, we observed poor performance inside
> the guest in comparison to the performance seen on the host block device. The
> table below shows the IOPS on the host and inside the guest with both
> virtioscsi (scsimq) and virtioblk (blkmq).
>
> -----------------------------------
> config | IOPS | fio gst hst
> -----------------------------------
> host-q32-t1 | 79478 | 401 271
> scsimq-q8-t4 | 45958 | 693 639 351
> blkmq-q8-t4 | 49247 | 647 589 308
> -----------------------------------
> host-q48-t1 | 85599 | 559 291
> scsimq-q12-t4 | 50237 | 952 807 358
> blkmq-q12-t4 | 54016 | 885 786 329
> -----------------------------------
> fio gst hst => latencies in usecs, as
> seen by fio, guest and
> host block layers.
Out of curiosity, how are gst and hst collected here? It's also interesting that hst for q32-t1 is better than for q8-t4.
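For example, are these the iostat await numbers sampled on each side while fio runs, or something finer-grained like blktrace? A rough sketch of the iostat variant, with placeholder device names:

  # host block layer view of the backing device
  iostat -x 1 /dev/sdc
  # guest block layer view of the virtio disk under test
  iostat -x 1 /dev/vdb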
> q8-t4 => qdepth=8, numjobs=4
> host => fio run directly on the host
> scsimq,blkmq => fio run inside the guest
>
> Shouldn't we get a much better performance inside the guest ?
>
> When fio inside the guest was generating 32 outstanding IOs, iostat on the
> host shows avgqu-sz of only 16. For 48 outstanding IOs inside the guest,
> avgqu-sz on the host was only marginally better.
>
> qemu command line: qemu-system-x86_64 -L /usr/share/seabios/ -name
> node1,debug-threads=on -name node1 -S -machine pc,accel=kvm,usb=off -cpu
> SandyBridge -m 7680 -realtime mlock=off -smp 4,sockets=4,cores=1,threads=1
> -object iothread,id=iothread1 -object iothread,id=iothread2 -object
> iothread,id=iothread3 -object iothread,id=iothread4 -uuid XX -nographic
> -no-user-config -nodefaults -chardev
> socket,id=charmonitor,path=/var/lib/libvirt/qemu/node1.monitor,server,nowait
> -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc,driftfix=slew
> -global kvm-pit.lost_tick_policy=discard -no-hpet -no-shutdown -boot strict=on
> -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device
> lsi,id=scsi0,bus=pci.0,addr=0x6 -device
> virtio-scsi-pci,ioeventfd=on,num_queues=4,iothread=iothread2,id=scsi1,bus=pci.0,addr=0x7
> -device
> virtio-scsi-pci,ioeventfd=on,num_queues=4,iothread=iothread2,id=scsi2,bus=pci.0,addr=0x8
> -drive file=rhel7.qcow2,if=none,id=drive-virtio-disk0,format=qcow2 -device
> virtio-blk-pci,ioeventfd=on,num-queues=4,iothread=iothread1,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1
> -drive
> file=/dev/sdc,if=none,id=drive-virtio-disk1,format=raw,cache=none,aio=native
> -device
> virtio-blk-pci,ioeventfd=on,num-queues=4,iothread=iothread1,iothread=iothread1,scsi=off,bus=pci.0,addr=0x17,drive=drive-virtio-disk1,id=virtio-disk1
num-queues here will not make much of a difference with the current implementation in QEMU, because all the queues of a device get processed in the same iothread.
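One way to check whether a single iothread is the bottleneck is to split the load across devices that each have their own iothread; a rough sketch (the second backing device /dev/sdd and the drive/device ids are made up purely for illustration):

  -object iothread,id=iothread1 -object iothread,id=iothread2
  -drive file=/dev/sdc,if=none,id=drive-data0,format=raw,cache=none,aio=native
  -device virtio-blk-pci,drive=drive-data0,iothread=iothread1
  -drive file=/dev/sdd,if=none,id=drive-data1,format=raw,cache=none,aio=native
  -device virtio-blk-pci,drive=drive-data1,iothread=iothread2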
> -drive
> file=/dev/sdc,if=none,id=drive-scsi1-0-0-0,format=raw,cache=none,aio=native
> -device
> scsi-hd,bus=scsi1.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi1-0-0-0,id=scsi1-0-0-0
> -netdev tap,fd=24,id=hostnet0,vhost=on,vhostfd=25 -device
> virtio-net-pci,netdev=hostnet0,id=net0,mac=XXX,bus=pci.0,addr=0x2 -netdev
> tap,fd=26,id=hostnet1,vhost=on,vhostfd=27 -device
> virtio-net-pci,netdev=hostnet1,id=net1,mac=YYY,bus=pci.0,multifunction=on,addr=0x15
> -netdev tap,fd=28,id=hostnet2,vhost=on,vhostfd=29 -device
> virtio-net-pci,netdev=hostnet2,id=net2,mac=ZZZ,bus=pci.0,multifunction=on,addr=0x16
> -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0
> -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3 -msg timestamp=on
>
> fio command line: /tmp/fio --time_based --ioengine=libaio --randrepeat=1
> --direct=1 --invalidate=1 --verify=0 --offset=0 --verify_fatal=0
> --group_reporting --numjobs=$jobs --name=randread --rw=randread --blocksize=8K
> --iodepth=$qd --runtime=60 --filename={/dev/vdb or /dev/sda}
>
> # qemu-system-x86_64 --version
> QEMU emulator version 2.8.0(Debian 1:2.8+dfsg-3~bpo8+1)
> Copyright (c) 2003-2016 Fabrice Bellard and the QEMU Project developers
>
> The guest was running RHEL 7.3 and the host was Debian 8.
>
> Any thoughts on what could be happening here ?
While there could be things to optimize or tune, the results are not too surprising to me. You have fast disks here, so the overhead is more obvious.
Fam
* Re: [Qemu-devel] Poor 8K random IO performance inside the guest
2017-07-14 4:28 [Qemu-devel] Poor 8K random IO performance inside the guest Nagarajan, Padhu (HPE Storage)
2017-07-14 6:07 ` Fam Zheng
@ 2017-07-17 10:13 ` Stefan Hajnoczi
1 sibling, 0 replies; 3+ messages in thread
From: Stefan Hajnoczi @ 2017-07-17 10:13 UTC (permalink / raw)
To: Nagarajan, Padhu (HPE Storage); +Cc: qemu-devel@nongnu.org
On Fri, Jul 14, 2017 at 04:28:12AM +0000, Nagarajan, Padhu (HPE Storage) wrote:
> During an 8K random-read fio benchmark, we observed poor performance inside the guest in comparison to the performance seen on the host block device. The table below shows the IOPS on the host and inside the guest with both virtioscsi (scsimq) and virtioblk (blkmq).
>
> -----------------------------------
> config | IOPS | fio gst hst
> -----------------------------------
> host-q32-t1 | 79478 | 401 271
hst->fio adds 200 microseconds of latency/request? That seems very
high.
> scsimq-q8-t4 | 45958 | 693 639 351
> blkmq-q8-t4 | 49247 | 647 589 308
The gst->fio latency is much lower than hst->fio in the host-q32-t1 case. That is strange unless the physical HBA driver is very slow or you have an md or device-mapper configuration on the host but not in the guest.
What is the storage configuration (guest, host, and hardware)?
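For example, something along these lines would show most of it (the exact commands are only a suggestion):

  # on the host: device topology, rotational flag, and I/O scheduler for the backing device
  lsblk -o NAME,SIZE,TYPE,ROTA,SCHED /dev/sdc
  # on the host: HBA and disk model information
  lsscsi -v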
Please also look at the latency percentiles in the fio output. It's
possible that the latency distribution is very different from a normal
distribution and the mean latency isn't very meaningful.
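If the default set of percentiles is not detailed enough, fio's percentile_list option can be appended to the existing command line; the list below is just an example:

  # report completion latency at these percentiles
  --percentile_list=50:90:95:99:99.9:99.99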
> -----------------------------------
> host-q48-t1 | 85599 | 559 291
> scsimq-q12-t4 | 50237 | 952 807 358
> blkmq-q12-t4 | 54016 | 885 786 329
> -----------------------------------
> fio gst hst => latencies in usecs, as
> seen by fio, guest and
> host block layers.
> q8-t4 => qdepth=8, numjobs=4
> host => fio run directly on the host
> scsimq,blkmq => fio run inside the guest
>
> Shouldn't we get a much better performance inside the guest ?
>
> When fio inside the guest was generating 32 outstanding IOs, iostat on the host shows avgqu-sz of only 16. For 48 outstanding IOs inside the guest, avgqu-sz on the host was only marginally better.
The latency numbers you posted support the avgqu-sz result: fio minus hst is roughly equal to hst. If the software overhead is ~50% of the entire request duration, then it makes sense that the host queue size is only 50% of the intended benchmark queue size.
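As a rough sanity check with the scsimq-q8-t4 numbers (4 jobs x 8 = 32 outstanding requests, assuming the mean latencies are representative):

  host queue depth ~= 32 * (hst / fio) = 32 * 351 / 693 ~= 16

which lines up with the avgqu-sz of 16 you observed.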
>
> qemu command line: qemu-system-x86_64 -L /usr/share/seabios/ -name node1,debug-threads=on -name node1 -S -machine pc,accel=kvm,usb=off -cpu SandyBridge -m 7680 -realtime mlock=off -smp 4,sockets=4,cores=1,threads=1 -object iothread,id=iothread1 -object iothread,id=iothread2 -object iothread,id=iothread3 -object iothread,id=iothread4 -uuid XX -nographic -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/node1.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=discard -no-hpet -no-shutdown -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device lsi,id=scsi0,bus=pci.0,addr=0x6 -device virtio-scsi-pci,ioeventfd=on,num_queues=4,iothread=iothread2,id=scsi1,bus=pci.0,addr=0x7 -device virtio-scsi-pci,ioeventfd=on,num_queues=4,iothread=iothread2,id=scsi2,bus=pci.0,addr=0x8 -drive file=rhel7.qcow2,if=none,id=drive-virtio-disk0,format=qcow2 -device virtio-blk-pci,ioeventfd=on,num-queues=4,iothread=iothread1,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -drive file=/dev/sdc,if=none,id=drive-virtio-disk1,format=raw,cache=none,aio=native -device virtio-blk-pci,ioeventfd=on,num-queues=4,iothread=iothread1,iothread=iothread1,scsi=off,bus=pci.0,addr=0x17,drive=drive-virtio-disk1,id=virtio-disk1 -drive file=/dev/sdc,if=none,id=drive-scsi1-0-0-0,format=raw,cache=none,aio=native -device scsi-hd,bus=scsi1.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi1-0-0-0,id=scsi1-0-0-0 -netdev tap,fd=24,id=hostnet0,vhost=on,vhostfd=25 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=XXX,bus=pci.0,addr=0x2 -netdev tap,fd=26,id=hostnet1,vhost=on,vhostfd=27 -device virtio-net-pci,netdev=hostnet1,id=net1,mac=YYY,bus=pci.0,multifunction=on,addr=0x15 -netdev tap,fd=28,id=hostnet2,vhost=on,vhostfd=29 -device virtio-net-pci,netdev=hostnet2,id=net2,mac=ZZZ,bus=pci.0,multifunction=on,addr=0x16 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3 -msg timestamp=on
Are you pinning vcpus and iothreads so that the physical HBA interrupts
are processed by the same host CPU as the vcpu/iothread?
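If not, it is worth experimenting with explicit pinning; a rough sketch (the qemu process match, thread id, CPU number, and irq number are all placeholders):

  # list qemu's per-thread names and ids (debug-threads=on gives vcpu/iothread threads readable names)
  ps -T -o tid,comm -p $(pgrep -f 'name node1' | head -1)
  # pin a chosen vcpu or iothread thread to one host CPU
  taskset -cp 4 <tid>
  # steer the HBA interrupt to the same CPU
  echo 4 > /proc/irq/<irq>/smp_affinity_list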
> fio command line: /tmp/fio --time_based --ioengine=libaio --randrepeat=1 --direct=1 --invalidate=1 --verify=0 --offset=0 --verify_fatal=0 --group_reporting --numjobs=$jobs --name=randread --rw=randread --blocksize=8K --iodepth=$qd --runtime=60 --filename={/dev/vdb or /dev/sda}
>
> # qemu-system-x86_64 --version
> QEMU emulator version 2.8.0(Debian 1:2.8+dfsg-3~bpo8+1)
> Copyright (c) 2003-2016 Fabrice Bellard and the QEMU Project developers
>
> The guest was running RHEL 7.3 and the host was Debian 8.