From mboxrd@z Thu Jan  1 00:00:00 1970
From: Sridhar Samudrala
Subject: Re: [RFC PATCH 0/4] Implement multiqueue virtio-net
Date: Thu, 09 Sep 2010 16:00:24 -0700
Message-ID: <4C896708.1050607@us.ibm.com>
References: <20100908072859.23769.97363.sendpatchset@krkumar2.in.ibm.com>
 <20100908081011.GC23051@redhat.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: anthony@codemonkey.ws, davem@davemloft.net, kvm@vger.kernel.org,
 "Michael S. Tsirkin", netdev@vger.kernel.org, rusty@rustcorp.com.au
To: Krishna Kumar2
Return-path:
Received: from e32.co.us.ibm.com ([32.97.110.150]:42954 "EHLO e32.co.us.ibm.com"
 rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756912Ab0IIXA7
 (ORCPT); Thu, 9 Sep 2010 19:00:59 -0400
In-Reply-To:
Sender: netdev-owner@vger.kernel.org
List-ID:

On 9/9/2010 2:45 AM, Krishna Kumar2 wrote:
>> Krishna Kumar2/India/IBM wrote on 09/08/2010 10:17:49 PM:
>
> Some more results, and the likely cause for the single-netperf
> degradation, below.
>
>> Guest -> Host (single netperf):
>> I am getting a drop of almost 20%. I am trying to figure out
>> why.
>>
>> Host -> Guest (single netperf):
>> I am getting an improvement of almost 15%. Again - unexpected.
>>
>> Guest -> Host TCP_RR: I get an average 7.4% increase in #packets
>> for runs up to 128 sessions. With fewer netperfs (under 8), there
>> was a drop of 3-7% in #packets, but beyond that, the #packets
>> improved significantly to give an average improvement of 7.4%.
>>
>> So it seems that fewer sessions have, for some reason, a negative
>> effect on the tx side. The code path in virtio-net has not
>> changed much, so the drop in some cases is quite unexpected.
>
> The drop for the single netperf seems to be due to multiple vhosts.
> I changed the patch to start a *single* vhost:
>
> Guest -> Host (1 netperf, 64K): BW: 10.79%, SD: -1.45%
> Guest -> Host (1 netperf)     : Latency: -3%, SD: 3.5%

I remember seeing a similar issue when using a separate vhost thread
for the TX and RX queues. Basically, we should have the same vhost
thread process a TCP flow in both directions. I guess this allows the
data and ACKs to be processed in sync.

Thanks
Sridhar
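
To illustrate the point above: one possible way to keep both
directions of a flow on the same vhost thread is to select the worker
from a symmetric flow hash, so that a segment and the ACKs coming back
for it pick the same worker. The sketch below is only illustrative;
the names and the worker count are assumptions, not anything from the
patch set:

    /*
     * Illustrative sketch only: pick a vhost worker from a
     * direction-independent flow hash, so that A->B and B->A
     * map to the same worker.  All names here are hypothetical.
     */
    #include <stdint.h>

    #define NR_VHOST_WORKERS 4        /* assumed number of vhost threads */

    struct flow_key {
            uint32_t saddr, daddr;    /* IPv4 addresses */
            uint16_t sport, dport;    /* TCP ports */
    };

    /* XOR makes the hash symmetric: hash(a->b) == hash(b->a). */
    static unsigned int flow_to_worker(const struct flow_key *k)
    {
            uint32_t h = (k->saddr ^ k->daddr) ^
                         (uint32_t)(k->sport ^ k->dport);

            h ^= h >> 16;             /* fold in the high bits */
            return h % NR_VHOST_WORKERS;
    }

With such a scheme, flow_to_worker() on {saddr, daddr, sport, dport}
and on the reversed key return the same index, so data and ACKs end
up on one thread.
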
> A single vhost performs well but hits a barrier at 16 netperf
> sessions:
>
> SINGLE vhost (Guest -> Host):
>      1 netperf:  BW: 10.7%   SD: -1.4%
>      4 netperfs: BW: 3%      SD: 1.4%
>      8 netperfs: BW: 17.7%   SD: -10%
>     16 netperfs: BW: 4.7%    SD: -7.0%
>     32 netperfs: BW: -6.1%   SD: -5.7%
> BW and SD both improve (the guest's multiple txqs help); at 32
> netperfs BW drops, but SD still improves.
>
> But with multiple vhosts, the guest is able to send more packets
> and BW increases much more (SD increases too, but I think that is
> expected). From the earlier results:
>
> N#    BW1     BW2    (%)        SD1    SD2    (%)       RSD1   RSD2   (%)
> __________________________________________________________________________
>  4   26387   40716  (54.30)      20     28  (40.00)      86     85  (-1.16)
>  8   24356   41843  (71.79)      88    129  (46.59)     372    362  (-2.68)
> 16   23587   40546  (71.89)     375    564  (50.40)    1558   1519  (-2.50)
> 32   22927   39490  (72.24)    1617   2171  (34.26)    6694   5722  (-14.52)
> 48   23067   39238  (70.10)    3931   5170  (31.51)   15823  13552  (-14.35)
> 64   22927   38750  (69.01)    7142   9914  (38.81)   28972  26173  (-9.66)
> 96   22568   38520  (70.68)   16258  27844  (71.26)   65944  73031  (10.74)
> __________________________________________________________________________
> (All tests were done without any tuning.)
>
> From my testing:
>
> 1. A single vhost improves mq guest performance up to 16
>    netperfs but degrades it after that.
> 2. Multiple vhosts degrade single-netperf guest performance,
>    but significantly improve performance for larger numbers of
>    netperf sessions.
>
> Likely cause for the 1-stream degradation with the multiple-vhost
> patch:
>
> 1. Two vhosts run, handling RX and TX respectively. I think the
>    issue is related to cache ping-pong, especially since these
>    run on different cpus/sockets.
> 2. I (re-)modified the patch to share RX with TX[0]. The
>    performance drop is the same, but the reason is that the guest
>    is not using txq[0] in most cases (dev_pick_tx), so vhost's rx
>    and tx are running on different threads. But whenever the
>    guest uses txq[0], only one vhost runs and the performance is
>    similar to the original.
>
> I went back to my *submitted* patch and started a guest with
> numtxq=16 and pinned every vhost to cpus #0 and #1. Now, whether
> the guest used txq[0] or txq[n], the performance is similar to or
> better than the original code (by 10-27% across 10 runs). There is
> also a -6% to -24% improvement in SD.
>
> I will start a full test run of the original vs. the submitted
> code with minimal tuning (Avi also suggested the same), and
> re-send. Please let me know if you need any other data.
>
> Thanks,
>
> - KK
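
The pinning described above can also be reproduced from user space
with sched_setaffinity(2), given the vhost thread ids. A minimal
sketch follows; the thread ids are placeholders, not values from the
test setup:

    /* Pin vhost threads to cpus 0 and 1 via sched_setaffinity(2). */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <sys/types.h>

    static int pin_to_cpus_0_and_1(pid_t tid)
    {
            cpu_set_t set;

            CPU_ZERO(&set);
            CPU_SET(0, &set);
            CPU_SET(1, &set);

            if (sched_setaffinity(tid, sizeof(set), &set) != 0) {
                    perror("sched_setaffinity");
                    return -1;
            }
            return 0;
    }

    int main(void)
    {
            /* Placeholder thread ids; substitute the real vhost tids. */
            pid_t vhost_tids[] = { 1234, 1235 };
            unsigned int i;

            for (i = 0; i < sizeof(vhost_tids) / sizeof(vhost_tids[0]); i++)
                    pin_to_cpus_0_and_1(vhost_tids[i]);
            return 0;
    }
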