From: Asias He
Date: Wed, 21 Nov 2012 13:22:22 +0800
Subject: Re: [Qemu-devel] [PATCH 0/7] virtio: virtio-blk data plane
To: Stefan Hajnoczi
Cc: Kevin Wolf, Anthony Liguori, "Michael S. Tsirkin", qemu-devel, Khoa Huynh, Stefan Hajnoczi, Paolo Bonzini

On 11/20/2012 08:21 PM, Stefan Hajnoczi wrote:
> On Tue, Nov 20, 2012 at 10:02 AM, Asias He wrote:
>> Hello Stefan,
>>
>> On 11/15/2012 11:18 PM, Stefan Hajnoczi wrote:
>>> This series adds the -device virtio-blk-pci,x-data-plane=on property that
>>> enables a high performance I/O codepath.  A dedicated thread is used to
>>> process virtio-blk requests outside the global mutex and without going
>>> through the QEMU block layer.
>>>
>>> Khoa Huynh reported an increase from 140,000 IOPS to 600,000 IOPS for a
>>> single VM using virtio-blk-data-plane in July:
>>>
>>>   http://comments.gmane.org/gmane.comp.emulators.kvm.devel/94580
>>>
>>> The virtio-blk-data-plane approach was originally presented at Linux
>>> Plumbers Conference 2010.  The following slides contain a brief overview:
>>>
>>>   http://linuxplumbersconf.org/2010/ocw/system/presentations/651/original/Optimizing_the_QEMU_Storage_Stack.pdf
>>>
>>> The basic approach is:
>>> 1. Each virtio-blk device has a thread dedicated to handling ioeventfd
>>>    signalling when the guest kicks the virtqueue.
>>> 2. Requests are processed without going through the QEMU block layer,
>>>    using Linux AIO directly.
>>> 3. Completion interrupts are injected via irqfd from the dedicated thread.
>>>
>>> To try it out:
>>>
>>>   qemu -drive if=none,id=drive0,cache=none,aio=native,format=raw,file=...
>>>        -device virtio-blk-pci,drive=drive0,scsi=off,x-data-plane=on
>>
>> Is this the latest dataplane bits:
>> (git://github.com/stefanha/qemu.git virtio-blk-data-plane)
>>
>> commit 7872075c24fa01c925d4f41faa9d04ce69bf5328
>> Author: Stefan Hajnoczi
>> Date:   Wed Nov 14 15:45:38 2012 +0100
>>
>>     virtio-blk: add x-data-plane=on|off performance feature
>>
>> With this commit on a ramdisk-based box, I am seeing about 10K IOPS with
>> x-data-plane on and 90K IOPS with x-data-plane off.
>>
>> Any ideas?
>>
>> Command line I used:
>>
>> IMG=/dev/ram0
>> x86_64-softmmu/qemu-system-x86_64 \
>>   -drive file=/root/img/sid.img,if=ide \
>>   -drive file=${IMG},if=none,cache=none,aio=native,id=disk1 -device virtio-blk-pci,x-data-plane=off,drive=disk1,scsi=off \
>>   -kernel $KERNEL -append "root=/dev/sdb1 console=tty0" \
>>   -L /tmp/qemu-dataplane/share/qemu/ -nographic -vnc :0 -enable-kvm -m 2048 -smp 4 -cpu qemu64,+x2apic -M pc
>
> Was just about to send out the latest patch series which addresses
> review comments, so I have tested the latest code
> (61b70fef489ce51ecd18d69afb9622c110b9315c).
>
> I was unable to reproduce a ramdisk performance regression on Linux
> 3.6.6-3.fc18.x86_64 with Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz with
> 8 GB RAM.

I am using the latest upstream kernel.

> The ramdisk is 4 GB and I used your QEMU command-line with a RHEL 6.3 guest.
>
> Summary results:
>   x-data-plane-on:  iops=132856 aggrb=1039.1MB/s
>   x-data-plane-off: iops=126236 aggrb=988.40MB/s
>
> virtio-blk-data-plane is ~5% faster in this benchmark.
>
> fio jobfile:
>
> [global]
> filename=/dev/vda
> blocksize=8k
> ioengine=libaio
> direct=1
> iodepth=8
> runtime=120
> time_based=1
>
> [reads]
> readwrite=randread
> numjobs=4
>
> Perf top (data-plane-on):
>   3.71%  [kvm]               [k] kvm_arch_vcpu_ioctl_run
>   3.27%  [kernel]            [k] memset    <--- ramdisk
>   2.98%  [kernel]            [k] do_blockdev_direct_IO
>   2.82%  [kvm_intel]         [k] vmx_vcpu_run
>   2.66%  [kernel]            [k] _raw_spin_lock_irqsave
>   2.06%  [kernel]            [k] put_compound_page
>   2.06%  [kernel]            [k] __get_page_tail
>   1.83%  [i915]              [k] __gen6_gt_force_wake_mt_get
>   1.75%  [kernel]            [k] _raw_spin_unlock_irqrestore
>   1.33%  qemu-system-x86_64  [.] vring_pop    <--- virtio-blk-data-plane
>   1.19%  [kernel]            [k] compound_unlock_irqrestore
>   1.13%  [kernel]            [k] gup_huge_pmd
>   1.11%  [kernel]            [k] __audit_syscall_exit
>   1.07%  [kernel]            [k] put_page_testzero
>   1.01%  [kernel]            [k] fget
>   1.01%  [kernel]            [k] do_io_submit
>
> Since the ramdisk (memset and page-related functions) is so prominent
> in perf top, I also tried a 1-job 8k dd sequential write test on a
> Samsung 830 Series SSD where virtio-blk-data-plane was 9% faster than
> virtio-blk.  Optimizing against ramdisk isn't a good idea IMO because
> it acts very differently from real hardware where the driver relies on
> mmio, DMA, and interrupts (vs synchronous memcpy/memset).

For the memset in the ramdisk case, you can simply patch drivers/block/brd.c
to do a nop instead of the memset for testing.

Yes, if you have a fast SSD device (sometimes you need more than one, which I
do not have), it makes more sense to test on real hardware. However, a ramdisk
test is still useful: it gives rough performance numbers, and if A and B are
both tested against the same ramdisk, the difference between A and B is still
meaningful.

> Full results:
>
> $ cat data-plane-off
> reads: (g=0): rw=randread, bs=8K-8K/8K-8K, ioengine=libaio, iodepth=8
> ...
> reads: (g=0): rw=randread, bs=8K-8K/8K-8K, ioengine=libaio, iodepth=8
> fio 1.57
> Starting 4 processes
>
> reads: (groupid=0, jobs=1): err= 0: pid=1851
>   read : io=29408MB, bw=250945KB/s, iops=31368 , runt=120001msec
>     slat (usec): min=2 , max=27829 , avg=11.06, stdev=78.05
>     clat (usec): min=1 , max=28028 , avg=241.41, stdev=388.47
>      lat (usec): min=33 , max=28035 , avg=253.17, stdev=396.66
>     bw (KB/s) : min=197141, max=335365, per=24.78%, avg=250797.02, stdev=29376.35
>   cpu : usr=6.55%, sys=31.34%, ctx=310932, majf=0, minf=41
>   IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued r/w/d: total=3764202/0/0, short=0/0/0
>      lat (usec): 2=0.01%, 4=0.01%, 20=0.01%, 50=1.78%, 100=27.11%
>      lat (usec): 250=38.97%, 500=27.11%, 750=2.09%, 1000=0.71%
>      lat (msec): 2=1.32%, 4=0.70%, 10=0.20%, 20=0.01%, 50=0.01%
> reads: (groupid=0, jobs=1): err= 0: pid=1852
>   read : io=29742MB, bw=253798KB/s, iops=31724 , runt=120001msec
>     slat (usec): min=2 , max=17007 , avg=10.61, stdev=67.51
>     clat (usec): min=1 , max=41531 , avg=239.00, stdev=379.03
>      lat (usec): min=32 , max=41547 , avg=250.33, stdev=385.21
>     bw (KB/s) : min=194336, max=347497, per=25.02%, avg=253204.25, stdev=31172.37
>   cpu : usr=6.66%, sys=32.58%, ctx=327250, majf=0, minf=41
>   IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued r/w/d: total=3806999/0/0, short=0/0/0
>      lat (usec): 2=0.01%, 20=0.01%, 50=1.54%, 100=26.45%, 250=40.04%
>      lat (usec): 500=27.15%, 750=1.95%, 1000=0.71%
>      lat (msec): 2=1.29%, 4=0.68%, 10=0.18%, 20=0.01%, 50=0.01%
> reads: (groupid=0, jobs=1): err= 0: pid=1853
>   read : io=29859MB, bw=254797KB/s, iops=31849 , runt=120001msec
>     slat (usec): min=2 , max=16821 , avg=11.35, stdev=76.54
>     clat (usec): min=1 , max=17659 , avg=237.25, stdev=375.31
>      lat (usec): min=31 , max=17673 , avg=249.27, stdev=383.62
>     bw (KB/s) : min=194864, max=345280, per=25.15%, avg=254534.63, stdev=30549.32
>   cpu : usr=6.52%, sys=31.84%, ctx=303763, majf=0, minf=39
>   IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued r/w/d: total=3821989/0/0, short=0/0/0
>      lat (usec): 2=0.01%, 10=0.01%, 20=0.01%, 50=2.09%, 100=29.19%
>      lat (usec): 250=37.31%, 500=26.41%, 750=2.08%, 1000=0.71%
>      lat (msec): 2=1.32%, 4=0.70%, 10=0.20%, 20=0.01%
> reads: (groupid=0, jobs=1): err= 0: pid=1854
>   read : io=29598MB, bw=252565KB/s, iops=31570 , runt=120001msec
>     slat (usec): min=2 , max=26413 , avg=11.21, stdev=78.32
>     clat (usec): min=16 , max=27993 , avg=239.56, stdev=381.67
>      lat (usec): min=34 , max=28006 , avg=251.49, stdev=390.13
>     bw (KB/s) : min=194256, max=369424, per=24.94%, avg=252462.86, stdev=29420.58
>   cpu : usr=6.57%, sys=31.33%, ctx=305623, majf=0, minf=41
>   IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued r/w/d: total=3788507/0/0, short=0/0/0
>      lat (usec): 20=0.01%, 50=2.13%, 100=28.30%, 250=37.74%, 500=26.66%
>      lat (usec): 750=2.17%, 1000=0.75%
>      lat (msec): 2=1.35%, 4=0.70%, 10=0.19%, 20=0.01%, 50=0.01%
>
> Run status group 0 (all jobs):
>    READ: io=118607MB, aggrb=988.40MB/s, minb=256967KB/s, maxb=260912KB/s, mint=120001msec, maxt=120001msec
>
> Disk stats (read/write):
>   vda: ios=15148328/0, merge=0/0, ticks=1550570/0, in_queue=1536232, util=96.56%
>
> $ cat data-plane-on
> reads: (g=0): rw=randread, bs=8K-8K/8K-8K, ioengine=libaio, iodepth=8
> ...
> reads: (g=0): rw=randread, bs=8K-8K/8K-8K, ioengine=libaio, iodepth=8
> fio 1.57
> Starting 4 processes
>
> reads: (groupid=0, jobs=1): err= 0: pid=1796
>   read : io=32081MB, bw=273759KB/s, iops=34219 , runt=120001msec
>     slat (usec): min=1 , max=20404 , avg=21.08, stdev=125.49
>     clat (usec): min=10 , max=135743 , avg=207.62, stdev=532.90
>      lat (usec): min=21 , max=136055 , avg=229.60, stdev=556.82
>     bw (KB/s) : min=56480, max=951952, per=25.49%, avg=271488.81, stdev=149773.57
>   cpu : usr=7.01%, sys=43.26%, ctx=336854, majf=0, minf=41
>   IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued r/w/d: total=4106413/0/0, short=0/0/0
>      lat (usec): 20=0.01%, 50=2.46%, 100=61.13%, 250=21.58%, 500=3.11%
>      lat (usec): 750=3.04%, 1000=3.88%
>      lat (msec): 2=4.50%, 4=0.13%, 10=0.11%, 20=0.06%, 50=0.01%
>      lat (msec): 250=0.01%
> reads: (groupid=0, jobs=1): err= 0: pid=1797
>   read : io=30104MB, bw=256888KB/s, iops=32110 , runt=120001msec
>     slat (usec): min=1 , max=17595 , avg=22.20, stdev=120.29
>     clat (usec): min=13 , max=136264 , avg=221.21, stdev=528.19
>      lat (usec): min=22 , max=136280 , avg=244.35, stdev=551.73
>     bw (KB/s) : min=57312, max=838880, per=23.93%, avg=254798.51, stdev=139546.57
>   cpu : usr=6.82%, sys=41.87%, ctx=360348, majf=0, minf=41
>   IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued r/w/d: total=3853351/0/0, short=0/0/0
>      lat (usec): 20=0.01%, 50=2.10%, 100=58.47%, 250=22.38%, 500=3.68%
>      lat (usec): 750=3.69%, 1000=4.52%
>      lat (msec): 2=4.87%, 4=0.14%, 10=0.11%, 20=0.05%, 250=0.01%
> reads: (groupid=0, jobs=1): err= 0: pid=1798
>   read : io=31698MB, bw=270487KB/s, iops=33810 , runt=120001msec
>     slat (usec): min=1 , max=17457 , avg=20.93, stdev=125.33
>     clat (usec): min=16 , max=134663 , avg=210.19, stdev=535.77
>      lat (usec): min=21 , max=134671 , avg=232.02, stdev=559.27
>     bw (KB/s) : min=57248, max=841952, per=25.29%, avg=269330.21, stdev=148661.08
>   cpu : usr=6.92%, sys=42.81%, ctx=337799, majf=0, minf=39
>   IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued r/w/d: total=4057340/0/0, short=0/0/0
>      lat (usec): 20=0.01%, 50=1.98%, 100=62.00%, 250=20.70%, 500=3.22%
>      lat (usec): 750=3.23%, 1000=4.16%
>      lat (msec): 2=4.41%, 4=0.13%, 10=0.10%, 20=0.06%, 250=0.01%
> reads: (groupid=0, jobs=1): err= 0: pid=1799
>   read : io=30913MB, bw=263789KB/s, iops=32973 , runt=120000msec
>     slat (usec): min=1 , max=17565 , avg=21.52, stdev=120.17
>     clat (usec): min=15 , max=136064 , avg=215.53, stdev=529.56
>      lat (usec): min=27 , max=136070 , avg=237.99, stdev=552.50
>     bw (KB/s) : min=57632, max=900896, per=24.74%, avg=263431.57, stdev=148379.15
>   cpu : usr=6.90%, sys=42.56%, ctx=348217, majf=0, minf=41
>   IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued r/w/d: total=3956830/0/0, short=0/0/0
>      lat (usec): 20=0.01%, 50=1.76%, 100=59.96%, 250=22.21%, 500=3.45%
>      lat (usec): 750=3.35%, 1000=4.33%
>      lat (msec): 2=4.65%, 4=0.13%, 10=0.11%, 20=0.05%, 250=0.01%
>
> Run status group 0 (all jobs):
>    READ: io=124796MB, aggrb=1039.1MB/s, minb=263053KB/s, maxb=280328KB/s, mint=120000msec, maxt=120001msec
>
> Disk stats (read/write):
>   vda: ios=15942789/0, merge=0/0, ticks=336240/0, in_queue=317832, util=97.47%

-- 
Asias
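
P.S. For anyone who wants to poke at the pattern outside of QEMU: below is a
tiny stand-alone sketch of the three steps Stefan lists above. It is my own
illustration, not QEMU code; the single-request flow, the file name, and all
identifiers are made up. A dedicated loop blocks on an eventfd standing in
for the ioeventfd kicked by the guest, submits the request with Linux AIO,
and signals completion through a second eventfd standing in for the irqfd.
Build with: gcc -o dataplane-sketch dataplane-sketch.c -laio

/* dataplane-sketch.c: illustration only, not QEMU source.
 *
 * Shape of the virtio-blk-data-plane idea using plain Linux primitives:
 * wait on an eventfd (stand-in for the ioeventfd the guest kicks), submit
 * a read with Linux AIO, reap the completion, and signal a second eventfd
 * (stand-in for the irqfd that injects the completion interrupt).
 * One request, one loop iteration, no error recovery.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <libaio.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/eventfd.h>
#include <unistd.h>

#define BLOCK_SIZE 8192

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <file-or-block-device>\n", argv[0]);
        return 1;
    }

    int kick_fd = eventfd(0, 0);              /* guest "kick" (ioeventfd)  */
    int irq_fd  = eventfd(0, 0);              /* completion notify (irqfd) */
    int disk_fd = open(argv[1], O_RDONLY | O_DIRECT);
    if (kick_fd < 0 || irq_fd < 0 || disk_fd < 0) {
        perror("setup");
        return 1;
    }

    io_context_t ctx = 0;
    if (io_setup(8, &ctx) < 0) {              /* queue depth 8, like the fio job */
        perror("io_setup");
        return 1;
    }

    void *buf;
    if (posix_memalign(&buf, 4096, BLOCK_SIZE) != 0) {
        perror("posix_memalign");
        return 1;
    }

    /* Pretend the guest kicked the virtqueue once so the loop has work. */
    uint64_t one = 1;
    (void)write(kick_fd, &one, sizeof(one));

    /* Dedicated-thread main loop (a single iteration shown). */
    uint64_t kicks;
    (void)read(kick_fd, &kicks, sizeof(kicks));    /* 1. wait for the kick */

    struct iocb iocb, *iocbs[1] = { &iocb };
    io_prep_pread(&iocb, disk_fd, buf, BLOCK_SIZE, 0);
    io_submit(ctx, 1, iocbs);                      /* 2. Linux AIO, no QEMU block layer */

    struct io_event event;
    io_getevents(ctx, 1, 1, &event, NULL);         /* reap the completion */
    (void)write(irq_fd, &one, sizeof(one));        /* 3. "inject" the interrupt */

    printf("read %ld bytes at offset 0\n", (long)event.res);

    io_destroy(ctx);
    free(buf);
    close(disk_fd);
    close(irq_fd);
    close(kick_fd);
    return 0;
}

This is obviously much simpler than what the series does (no vring parsing,
no in-flight request tracking), but it is the same three steps, and it shows
why the dedicated thread never needs to take the global mutex or enter the
QEMU block layer.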