Date: Thu, 22 Nov 2012 13:16:52 +0100
From: Stefan Hajnoczi
Message-ID: <20121122121652.GE13571@stefanha-thinkpad.redhat.com>
References: <1352992746-8767-1-git-send-email-stefanha@redhat.com> <50AB470F.7050408@redhat.com> <50AC650E.2080207@redhat.com>
In-Reply-To: <50AC650E.2080207@redhat.com>
Subject: Re: [Qemu-devel] [PATCH 0/7] virtio: virtio-blk data plane
To: Asias He
Cc: Kevin Wolf, Anthony Liguori, "Michael S. Tsirkin", Stefan Hajnoczi,
 qemu-devel, Khoa Huynh, Paolo Bonzini

On Wed, Nov 21, 2012 at 01:22:22PM +0800, Asias He wrote:
> On 11/20/2012 08:21 PM, Stefan Hajnoczi wrote:
> > On Tue, Nov 20, 2012 at 10:02 AM, Asias He wrote:
> >> Hello Stefan,
> >>
> >> On 11/15/2012 11:18 PM, Stefan Hajnoczi wrote:
> >>> This series adds the -device virtio-blk-pci,x-data-plane=on property
> >>> that enables a high performance I/O codepath.  A dedicated thread is
> >>> used to process virtio-blk requests outside the global mutex and
> >>> without going through the QEMU block layer.
> >>>
> >>> Khoa Huynh reported an increase from 140,000 IOPS to 600,000 IOPS
> >>> for a single VM using virtio-blk-data-plane in July:
> >>>
> >>>   http://comments.gmane.org/gmane.comp.emulators.kvm.devel/94580
> >>>
> >>> The virtio-blk-data-plane approach was originally presented at Linux
> >>> Plumbers Conference 2010.  The following slides contain a brief
> >>> overview:
> >>>
> >>>   http://linuxplumbersconf.org/2010/ocw/system/presentations/651/original/Optimizing_the_QEMU_Storage_Stack.pdf
> >>>
> >>> The basic approach is:
> >>> 1. Each virtio-blk device has a thread dedicated to handling ioeventfd
> >>>    signalling when the guest kicks the virtqueue.
> >>> 2. Requests are processed without going through the QEMU block layer
> >>>    using Linux AIO directly.
> >>> 3. Completion interrupts are injected via irqfd from the dedicated
> >>>    thread.
> >>>
> >>> To try it out:
> >>>
> >>>   qemu -drive if=none,id=drive0,cache=none,aio=native,format=raw,file=...
> >>>        -device virtio-blk-pci,drive=drive0,scsi=off,x-data-plane=on
> >>
> >> Is this the latest dataplane bits:
> >> (git://github.com/stefanha/qemu.git virtio-blk-data-plane)
> >>
> >> commit 7872075c24fa01c925d4f41faa9d04ce69bf5328
> >> Author: Stefan Hajnoczi
> >> Date:   Wed Nov 14 15:45:38 2012 +0100
> >>
> >>     virtio-blk: add x-data-plane=on|off performance feature
> >>
> >> With this commit on a ramdisk based box, I am seeing about 10K IOPS
> >> with x-data-plane on and 90K IOPS with x-data-plane off.
> >>
> >> Any ideas?
> >>
> >> Command line I used:
> >>
> >> IMG=/dev/ram0
> >> x86_64-softmmu/qemu-system-x86_64 \
> >> -drive file=/root/img/sid.img,if=ide \
> >> -drive file=${IMG},if=none,cache=none,aio=native,id=disk1 \
> >> -device virtio-blk-pci,x-data-plane=off,drive=disk1,scsi=off \
> >> -kernel $KERNEL -append "root=/dev/sdb1 console=tty0" \
> >> -L /tmp/qemu-dataplane/share/qemu/ -nographic -vnc :0 -enable-kvm \
> >> -m 2048 -smp 4 -cpu qemu64,+x2apic -M pc
> >
> > I was just about to send out the latest patch series, which addresses
> > the review comments, so I have tested the latest code
> > (61b70fef489ce51ecd18d69afb9622c110b9315c).
> >
> > I was unable to reproduce a ramdisk performance regression on Linux
> > 3.6.6-3.fc18.x86_64 with an Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz
> > and 8 GB RAM.
>
> I am using the latest upstream kernel.
>
> > The ramdisk is 4 GB and I used your QEMU command line with a RHEL 6.3
> > guest.
> >
> > Summary results:
> > x-data-plane-on:  iops=132856 aggrb=1039.1MB/s
> > x-data-plane-off: iops=126236 aggrb=988.40MB/s
> >
> > virtio-blk-data-plane is ~5% faster in this benchmark.
> >
> > fio jobfile:
> > [global]
> > filename=/dev/vda
> > blocksize=8k
> > ioengine=libaio
> > direct=1
> > iodepth=8
> > runtime=120
> > time_based=1
> >
> > [reads]
> > readwrite=randread
> > numjobs=4
> >
> > Perf top (data-plane-on):
> >   3.71%  [kvm]               [k] kvm_arch_vcpu_ioctl_run
> >   3.27%  [kernel]            [k] memset                     <--- ramdisk
> >   2.98%  [kernel]            [k] do_blockdev_direct_IO
> >   2.82%  [kvm_intel]         [k] vmx_vcpu_run
> >   2.66%  [kernel]            [k] _raw_spin_lock_irqsave
> >   2.06%  [kernel]            [k] put_compound_page
> >   2.06%  [kernel]            [k] __get_page_tail
> >   1.83%  [i915]              [k] __gen6_gt_force_wake_mt_get
> >   1.75%  [kernel]            [k] _raw_spin_unlock_irqrestore
> >   1.33%  qemu-system-x86_64  [.] vring_pop                  <--- virtio-blk-data-plane
> >   1.19%  [kernel]            [k] compound_unlock_irqrestore
> >   1.13%  [kernel]            [k] gup_huge_pmd
> >   1.11%  [kernel]            [k] __audit_syscall_exit
> >   1.07%  [kernel]            [k] put_page_testzero
> >   1.01%  [kernel]            [k] fget
> >   1.01%  [kernel]            [k] do_io_submit
> >
> > Since the ramdisk (memset and page-related functions) is so prominent
> > in perf top, I also tried a 1-job 8k dd sequential write test on a
> > Samsung 830 Series SSD, where virtio-blk-data-plane was 9% faster than
> > virtio-blk.  Optimizing against ramdisk isn't a good idea IMO because
> > it acts very differently from real hardware, where the driver relies
> > on mmio, DMA, and interrupts (vs synchronous memcpy/memset).
>
> For the memset in the ramdisk, you can simply patch drivers/block/brd.c
> to do a nop instead of the memset for testing.
>
> Yes, if you have a fast SSD device (sometimes you need multiple, which
> I do not have), it makes more sense to test on real hardware.  However,
> a ramdisk test is still useful: it gives rough performance numbers, and
> if A and B are both tested against the ramdisk, the difference between
> A and B is still meaningful.

Optimizing the difference between A and B on ramdisk is only guaranteed
to optimize the ramdisk case.  On real hardware the bottleneck might be
elsewhere and we'd be chasing the wrong lead.

I don't think it's a waste of time, but I think to stay healthy we need
to focus on real disks and SSDs most of the time.

Stefan
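
P.S. For anyone skimming the thread who hasn't read the hw/dataplane
patches, the approach described at the top (per-device thread, ioeventfd
kick, Linux AIO submission, irqfd completion) boils down to roughly the
loop below.  This is a simplified, self-contained sketch with placeholder
names and all setup (io_setup(), eventfd creation, KVM ioeventfd/irqfd
registration) assumed to have happened elsewhere; it is not the actual
QEMU code:

  /*
   * Sketch: a dedicated thread waits on an ioeventfd for guest kicks,
   * submits I/O with Linux AIO, and injects completion interrupts
   * through an irqfd.  Virtqueue handling is omitted.
   */
  #include <libaio.h>          /* io_submit, io_getevents */
  #include <stdint.h>
  #include <unistd.h>

  struct dataplane {
      int ioeventfd;           /* signalled by KVM when the guest kicks the vq */
      int irqfd;               /* written by us to raise a guest interrupt */
      io_context_t aio_ctx;    /* Linux AIO context for the backing file/device */
  };

  static void *dataplane_thread(void *opaque)
  {
      struct dataplane *s = opaque;
      struct io_event events[128];
      uint64_t val;

      for (;;) {
          /* 1. Block until the guest kicks the virtqueue (ioeventfd). */
          if (read(s->ioeventfd, &val, sizeof(val)) != sizeof(val)) {
              continue;
          }

          /*
           * 2. Pop requests from the vring and submit them directly with
           *    Linux AIO, bypassing the QEMU block layer (omitted here):
           *
           *      struct iocb *iocbs[] = { ... };
           *      io_submit(s->aio_ctx, nr, iocbs);
           */

          /* 3. Reap completions and inject the interrupt via irqfd. */
          int nr = io_getevents(s->aio_ctx, 0, 128, events, NULL);
          if (nr > 0) {
              uint64_t one = 1;
              /* ...fill in used ring entries for the nr completions... */
              write(s->irqfd, &one, sizeof(one));
          }
      }
      return NULL;
  }

The real series also has to parse and complete vring descriptors and
handle errors; the sketch only shows how ioeventfd, Linux AIO, and irqfd
fit together without ever taking the global mutex.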