From: Matthew Rosato
Subject: Re: Regression in throughput between kvm guests over virtual bridge
Date: Wed, 13 Sep 2017 12:59:02 -0400
Message-ID: <627d0c7a-dce5-3094-d5d4-c1507fcb8080@linux.vnet.ibm.com>
References: <4c7e2924-b10f-0e97-c388-c8809ecfdeeb@linux.vnet.ibm.com>
To: Jason Wang, netdev@vger.kernel.org
Cc: davem@davemloft.net, mst@redhat.com

On 09/13/2017 04:13 AM, Jason Wang wrote:
>
> On 2017/09/13 09:16, Jason Wang wrote:
>>
>> On 2017/09/13 01:56, Matthew Rosato wrote:
>>> We are seeing a regression for a subset of workloads across KVM guests
>>> over a virtual bridge between host kernel 4.12 and 4.13. Bisecting
>>> points to c67df11f "vhost_net: try batch dequing from skb array".
>>>
>>> In the regressed environment, we are running 4 kvm guests, 2 running as
>>> uperf servers and 2 running as uperf clients, all on a single host.
>>> They are connected via a virtual bridge.
>>> The uperf client profile looks like:
>>>
>>> [uperf profile XML not preserved by the archive]
>>>
>>> So, 1 tcp streaming instance per client. When upgrading the host kernel
>>> from 4.12->4.13, we see about a 30% drop in throughput for this
>>> scenario. After the bisect, I further verified that reverting c67df11f
>>> on 4.13 "fixes" the throughput for this scenario.
>>>
>>> On the other hand, if we increase the load by upping the number of
>>> streaming instances to 50 (nprocs="50") or even 10, we see instead a
>>> ~10% increase in throughput when upgrading the host from 4.12->4.13.
>>>
>>> So it may be that the issue is specific to "light load" scenarios. I
>>> would expect some overhead for the batching, but 30% seems
>>> significant... Any thoughts on what might be happening here?
>>>
>>
>> Hi, thanks for the bisecting. Will try to see if I can reproduce.
>> Various factors could have an impact on stream performance. If possible,
>> could you collect the #pkts and average packet size during the test?
>> And if your guest version is above 4.12, could you please retry with
>> napi_tx=true?

Original runs were done with guest kernel 4.4 (from ubuntu 16.04.3 -
4.4.0-93-generic specifically). Here's a throughput report (uperf) and
#pkts and average packet size (tcpstat) for one of the uperf clients:

host 4.12 / guest 4.4:
throughput: 29.98Gb/s
#pkts=33465571  avg packet size=33755.70

host 4.13 / guest 4.4:
throughput: 20.36Gb/s
#pkts=21233399  avg packet size=36130.69

I ran the test again using net-next.git as the guest kernel, with and
without napi_tx=true. napi_tx did not seem to have any significant
impact on throughput. However, the guest kernel shift from 4.4->net-next
improved things.
I can still see a regression between host 4.12 and 4.13, but it's more
on the order of 10-15% - another sample:

host 4.12 / guest net-next (without napi_tx):
throughput: 28.88Gb/s
#pkts=31743116  avg packet size=33779.78

host 4.13 / guest net-next (without napi_tx):
throughput: 24.34Gb/s
#pkts=25532724  avg packet size=35963.20

>>
>> Thanks
>
> Unfortunately, I could not reproduce it locally. I'm using net-next.git
> as guest. I can get ~42Gb/s on an Intel(R) Xeon(R) CPU E5-2650 0 @
> 2.00GHz both before and after the commit. I use 1 vcpu and 1 queue, and
> pin the vcpu and vhost threads onto separate cpus on the host manually
> (in the same numa node).

The environment is quite a bit different -- I'm running in an LPAR on a
z13 (s390x). We've seen the issue in various configurations; the
smallest thus far was a host partition w/ 40G and 20 CPUs defined (the
numbers above were gathered w/ this configuration). Each guest has 4GB
and 4 vcpus. No pinning / affinity configured.

>
> Can you hit this regression constantly, and what's your qemu command
> line

Yes, the regression seems consistent. I can try tweaking some of the
host and guest definitions to see if it makes a difference.
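Since no pinning is configured here, one experiment would be to mimic Jason's setup from libvirt. A rough sketch, where the guest name matches the qemu command line below but the cpu numbers and thread-matching are hypothetical and would need adjusting to the actual host topology:

```shell
# Sketch only: pin each vcpu, then each vhost worker thread, to its own
# host cpu, keeping everything within one NUMA node. The cpu numbers
# here (0-3 for vcpus, 4+ for vhost) are made up for illustration.
GUEST=mjrs34g1

# pin vcpus 0-3 of the domain to host cpus 0-3
for v in 0 1 2 3; do
    virsh vcpupin "$GUEST" "$v" "$v"
done

# vhost kernel threads are named vhost-<qemu pid>; pin them to the
# next cpus after the vcpus
QPID=$(pgrep -f "guest=$GUEST" | head -n1)
c=4
for t in $(pgrep "vhost-$QPID"); do
    taskset -cp "$c" "$t"
    c=$((c + 1))
done
```

That would at least tell us whether scheduler placement is a factor in the delta between our environments.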
The guests are instantiated from libvirt - here's one of the resulting
qemu command lines:

/usr/bin/qemu-system-s390x -name guest=mjrs34g1,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-1-mjrs34g1/master-key.aes -machine s390-ccw-virtio-2.10,accel=kvm,usb=off,dump-guest-core=off -m 4096 -realtime mlock=off -smp 4,sockets=4,cores=1,threads=1 -uuid 44710587-e783-4bd8-8590-55ff421431b1 -display none -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-1-mjrs34g1/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -boot strict=on -drive file=/dev/disk/by-id/scsi-3600507630bffc0380000000000001803,format=raw,if=none,id=drive-virtio-disk0 -device virtio-blk-ccw,scsi=off,devno=fe.0.0000,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -netdev tap,fd=25,id=hostnet0,vhost=on,vhostfd=27 -device virtio-net-ccw,netdev=hostnet0,id=net0,mac=02:de:26:53:14:01,devno=fe.0.0001 -netdev tap,fd=28,id=hostnet1,vhost=on,vhostfd=29 -device virtio-net-ccw,netdev=hostnet1,id=net1,mac=02:54:00:89:d4:01,devno=fe.0.00a1 -chardev pty,id=charconsole0 -device sclpconsole,chardev=charconsole0,id=console0 -device virtio-balloon-ccw,id=balloon0,devno=fe.0.0002 -msg timestamp=on

In the above, net0 is used for a macvtap connection (not used in the
experiment, just for a reliable ssh connection - can remove if needed).
net1 is the bridge connection used for the uperf tests.

> and #cpus on host? Is zerocopy enabled?

Host info provided above.

# cat /sys/module/vhost_net/parameters/experimental_zcopytx
1
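If zerocopy is a suspect, one way to rule it out would be to rerun with it disabled - a sketch, assuming all guests using vhost are shut down first (otherwise the module removal will fail):

```shell
# Sketch: reload vhost_net with experimental zerocopy TX disabled,
# then verify the parameter took effect before restarting the guests.
modprobe -r vhost_net
modprobe vhost_net experimental_zcopytx=0
cat /sys/module/vhost_net/parameters/experimental_zcopytx   # should now print 0
```

Happy to run that comparison here if you think zerocopy could interact badly with the batched dequeue path.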