From mboxrd@z Thu Jan 1 00:00:00 1970
MIME-Version: 1.0
In-Reply-To: <53D981C0.4030708@redhat.com>
References: <1406720388-18671-1-git-send-email-ming.lei@canonical.com>
 <1406720388-18671-2-git-send-email-ming.lei@canonical.com>
 <53D8F6F0.7040106@redhat.com>
 <53D981C0.4030708@redhat.com>
Date: Thu, 31 Jul 2014 16:59:47 +0800
From: Ming Lei
Content-Type: text/plain; charset=UTF-8
Subject: Re: [Qemu-devel] [PATCH 01/15] qemu coroutine: support bypass mode
To: Paolo Bonzini
Cc: Kevin Wolf, Peter Maydell, Fam Zheng, "Michael S. Tsirkin",
 qemu-devel, Stefan Hajnoczi

On Thu, Jul 31, 2014 at 7:37 AM, Paolo Bonzini wrote:
> On 30/07/2014 19:15, Ming Lei wrote:
>> On Wed, Jul 30, 2014 at 9:45 PM, Paolo Bonzini wrote:
>>> On 30/07/2014 13:39, Ming Lei wrote:
>>>> This patch introduces several APIs for supporting bypass qemu coroutine
>>>> in case of being not necessary and for performance's sake.
>>>
>>> No, this is wrong.
>>> Dataplane *must* use the same code as non-dataplane,
>>> anything else is a step backwards.
>>
>> As we saw, coroutine has brought up performance regression
>> on dataplane, and it isn't necessary to use co in some cases, is it?
>
> Yes, and it's not necessary on non-dataplane either.  It's not necessary
> on virtio-scsi, and it will not be necessary on virtio-scsi dataplane
> either.
>
>>> If you want to bypass coroutines, bdrv_aio_readv/writev must detect the
>>> conditions that allow doing that and call the bdrv_aio_readv/writev
>>> directly.
>>
>> That is easy to detect, please see the 5th patch.
>
> No, that's not enough.  Dataplane right now prevents block jobs, but
> that's going to change and it could require coroutines even for raw devices.
>
>>> To begin with, have you benchmarked QEMU and can you provide a trace of
>>> *where* the coroutine overhead lies?
>>
>> I guess it may be caused by the stack switch, at least in one of
>> my box, bypassing co can improve throughput by ~7%, and by
>> ~15% in another box.
>
> No guesses please.  Actually that's also my guess, but since you are
> submitting the patch you must do better and show profiles where stack
> switching disappears after the patches.

Below are the hardware events reported by 'perf stat' when running the
fio randread benchmark for 2 min in the VM (single vq, 2 jobs):

sudo ~/bin/perf stat -e L1-dcache-loads,L1-dcache-load-misses,cpu-cycles,instructions,branch-instructions,branch-misses,branch-loads,branch-load-misses,dTLB-loads,dTLB-load-misses \
    ./nqemu-start-mq 4 1

1), without bypassing coroutine, by forcing 's->raw_format' to false
(see patch 5/15)

- throughput: 95K

 Performance counter stats for './nqemu-start-mq 4 1':

    69,231,035,842 L1-dcache-loads                                          [40.10%]
     1,909,978,930 L1-dcache-load-misses  #  2.76% of all L1-dcache hits    [39.98%]
   263,731,501,086 cpu-cycles                                               [40.03%]
   232,564,905,115 instructions           #  0.88 insns per cycle           [50.23%]
    46,157,868,745 branch-instructions                                      [49.82%]
       785,618,591 branch-misses          #  1.70% of all branches          [49.99%]
    46,280,342,654 branch-loads                                             [49.95%]
    34,934,790,140 branch-load-misses                                       [50.02%]
    69,447,857,237 dTLB-loads                                               [40.13%]
       169,617,374 dTLB-load-misses       #  0.24% of all dTLB cache hits   [40.04%]

     161.991075781 seconds time elapsed

2), with bypassing coroutine

- throughput: 115K

 Performance counter stats for './nqemu-start-mq 4 1':

    76,784,224,509 L1-dcache-loads                                          [39.93%]
     1,334,036,447 L1-dcache-load-misses  #  1.74% of all L1-dcache hits    [39.91%]
   262,697,428,470 cpu-cycles                                               [40.03%]
   255,526,629,881 instructions           #  0.97 insns per cycle           [50.01%]
    50,160,082,611 branch-instructions                                      [49.97%]
       564,407,788 branch-misses          #  1.13% of all branches          [50.08%]
    50,331,510,702 branch-loads                                             [50.08%]
    35,760,766,459 branch-load-misses                                       [50.03%]
    76,706,000,951 dTLB-loads                                               [40.00%]
       123,291,001 dTLB-load-misses       #  0.16% of all dTLB cache hits   [40.02%]

     162.333465490 seconds time elapsed
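
For anyone who wants to compare the two runs side by side, the derived
ratios can be recomputed from the raw counters with a short script. This
is only a sketch: the run labels ("coroutine", "bypass") and the helper
name derived() are made up for illustration and are not part of perf or
of the patch series. The counter values are copied from the two
'perf stat' runs above, and the throughput figures are the 95K and 115K
reported for them.

```python
# Raw counters copied from the two 'perf stat' runs above.
# The labels "coroutine" and "bypass" are illustrative names only.
runs = {
    "coroutine": {
        "throughput_k": 95,
        "L1-dcache-loads": 69_231_035_842,
        "L1-dcache-load-misses": 1_909_978_930,
        "cpu-cycles": 263_731_501_086,
        "instructions": 232_564_905_115,
        "branch-instructions": 46_157_868_745,
        "branch-misses": 785_618_591,
        "dTLB-loads": 69_447_857_237,
        "dTLB-load-misses": 169_617_374,
    },
    "bypass": {
        "throughput_k": 115,
        "L1-dcache-loads": 76_784_224_509,
        "L1-dcache-load-misses": 1_334_036_447,
        "cpu-cycles": 262_697_428_470,
        "instructions": 255_526_629_881,
        "branch-instructions": 50_160_082_611,
        "branch-misses": 564_407_788,
        "dTLB-loads": 76_706_000_951,
        "dTLB-load-misses": 123_291_001,
    },
}


def derived(c):
    """Return instructions per cycle and miss rates (in percent) for one run."""
    return {
        "ipc": c["instructions"] / c["cpu-cycles"],
        "l1_miss_pct": 100.0 * c["L1-dcache-load-misses"] / c["L1-dcache-loads"],
        "branch_miss_pct": 100.0 * c["branch-misses"] / c["branch-instructions"],
        "dtlb_miss_pct": 100.0 * c["dTLB-load-misses"] / c["dTLB-loads"],
    }


# Relative throughput change between the two runs, in percent.
gain_pct = 100.0 * (runs["bypass"]["throughput_k"]
                    / runs["coroutine"]["throughput_k"] - 1.0)

if __name__ == "__main__":
    for name, counters in runs.items():
        ratios = derived(counters)
        line = ", ".join(f"{k}={v:.2f}" for k, v in ratios.items())
        print(f"{name:10s} {line}")
    print(f"throughput gain with bypass: {gain_pct:.1f}%")
```

The ratios it prints match the annotations perf already shows in the
'#' column, so the script adds nothing new beyond putting the two runs
next to each other and giving the relative throughput change. Note that
both runs burn about the same number of cycles; the difference is in how
many instructions retire in them and in the miss rates.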