From: Kevin Wolf
Date: Thu, 14 Aug 2014 12:46:37 +0200
To: Paolo Bonzini
Cc: tom.leiming@gmail.com, Ming Lei, Fam Zheng, qemu-devel, Stefan Hajnoczi
Subject: Re: [Qemu-devel] [PATCH v1 00/17] dataplane: optimization and multi virtqueue support
Message-ID: <20140814104637.GB3820@noname.redhat.com>
In-Reply-To: <53E91B5D.4090009@redhat.com>
References: <1407209598-2572-1-git-send-email-ming.lei@canonical.com>
 <20140805094844.GF4391@noname.str.redhat.com>
 <20140805134815.GD12251@stefanha-thinkpad.redhat.com>
 <20140805144728.GH4391@noname.str.redhat.com>
 <20140806084855.GA4090@noname.str.redhat.com>
 <20140810114624.0305b7af@tom-ThinkPad-T410>
 <53E91B5D.4090009@redhat.com>

On 11.08.2014 at 21:37, Paolo Bonzini wrote:
> On 10/08/2014 05:46, Ming Lei wrote:
> > Hi Kevin, Paolo, Stefan and all,
> >
> > On Wed, 6 Aug 2014 10:48:55 +0200
> > Kevin Wolf wrote:
> >
> >> On 06.08.2014 at 07:33, Ming Lei wrote:
> >>
> >> Anyhow, the coroutine version of your benchmark is buggy: it leaks all
> >> coroutines instead of exiting them, so it can't make any use of the
> >> coroutine pool. On my laptop, I get this (where "fixed coro" is a
> >> version that simply removes the yield at the end):
> >>
> >>                 | bypass        | fixed coro    | buggy coro
> >> ----------------+---------------+---------------+---------------
> >> time            | 1.09s         | 1.10s         | 1.62s
> >> L1-dcache-loads | 921,836,360   | 932,781,747   | 1,298,067,438
> >> insns per cycle | 2.39          | 2.39          | 1.90
> >>
> >> This begs the question whether you see a similar effect on a real qemu
> >> and the coroutine pool is simply not big enough. With correct use of
> >> coroutines, the difference seems to be barely measurable even without
> >> any I/O involved.
> >
> > Now I have fixed the coroutine leak bug. The previous crypt benchmark put
> > quite a high load on each iteration, which kept the operation rate very
> > low (~40K/sec), so I have written a new, simpler one that generates
> > hundreds of thousands of operations per second; that rate should match
> > some fast storage devices, and it does show that the effect of coroutines
> > is not small.
> >
> > In the extreme case where just a getppid() syscall is run in each
> > iteration, only 3M operations/sec can be reached with coroutines, while
> > without coroutines the number reaches 16M/sec, more than a 4x difference!
>
> I should be on vacation, but I'm following a couple of threads on the
> mailing list and I'm a bit tired of hearing the same argument again and
> again...
>
> The characteristics of asynchronous I/O and of any synchronous workload
> are so different that it is hard to be sure that microbenchmarks make
> sense.
>
> The patch below is basically the minimal change needed to bypass
> coroutines. Of course the block.c part is not acceptable as is (the change
> to refresh_total_sectors is broken, the others are just ugly), but it is a
> start. Please run it with your fio workloads, or write an aio-based
> version of a qemu-img/qemu-io *I/O* benchmark.

So, to finally reply with some numbers: I'm running fio tests based on Ming's
configuration on a loop-mounted tmpfs image, using dataplane. I've extended
the tests to cover not only random reads but also sequential reads. I haven't
tested writes yet, and hardly any block sizes larger than 4k, so I'm not
including those here.

The "base" case is with Ming's patches applied, but with the set_bypass(true)
call commented out in the virtio-blk code. All the other cases are patches
applied on top of this.
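For anyone who wants to reproduce this, the job file I'm talking about has
roughly the shape sketched below. Take it only as an illustration of the
workload: the device name, queue depth and runtime here are assumptions on my
part, not necessarily the exact values from Ming's configuration.

  # Hypothetical 4k read jobs against the virtio-blk disk inside the guest.
  # filename, iodepth and runtime are assumed values, not Ming's originals.
  [global]
  ioengine=libaio
  direct=1
  bs=4k
  iodepth=64
  time_based=1
  runtime=30
  filename=/dev/vdb

  [randread-4k]
  rw=randread

  [seqread-4k]
  # stonewall makes this job start only after the random read job has finished
  stonewall
  rw=read

Here are the numbers I get: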
                | Random throughput | Sequential throughput
----------------+-------------------+-----------------------
master          | 442 MB/s          | 730 MB/s
base            | 453 MB/s          | 757 MB/s
bypass (Ming)   | 461 MB/s          | 734 MB/s
coroutine       | 468 MB/s          | 716 MB/s
bypass (Paolo)  | 476 MB/s          | 682 MB/s

So while your patches look pretty good in Ming's test case of random reads, I
think the sequential case is worrying. The same is true for my latest
coroutine optimisations, even though the degradation is smaller there. This
needs some more investigation.

Kevin