From: Jason Wang
Subject: Re: Regression in throughput between kvm guests over virtual bridge
Date: Tue, 28 Nov 2017 09:36:37 +0800
To: Wei Xu, Matthew Rosato
Cc: mst@redhat.com, netdev@vger.kernel.org, davem@davemloft.net
References: <20171026094415.uyogf2iw7yoavnoc@Wei-Dev>
 <20171031070717.wcbgrp6thrjmtrh3@Wei-Dev>
 <56710dc8-f289-0211-db97-1a1ea29e38f7@linux.vnet.ibm.com>
 <20171104233519.7jwja7t2itooyeak@Wei-Dev>
 <1611b26f-0997-3b22-95f5-debf57b7be8c@linux.vnet.ibm.com>
 <101d1fdf-9df1-44bd-73a7-e7d8fbc09160@linux.vnet.ibm.com>
 <20171112183406.zuuj7w3fmtb4eduf@Wei-Dev>
 <9996b0f1-ffa6-ff95-2e9c-0deccf4623ae@linux.vnet.ibm.com>
 <20171127162109.eriexz7gpvz6vxnx@Wei-Dev>
In-Reply-To: <20171127162109.eriexz7gpvz6vxnx@Wei-Dev>

On 2017-11-28 00:21, Wei Xu wrote:
> On Mon, Nov 20, 2017 at 02:25:17PM -0500, Matthew Rosato wrote:
>> On 11/14/2017 03:11 PM, Matthew Rosato wrote:
>>> On 11/12/2017 01:34 PM, Wei Xu wrote:
>>>> On Sat, Nov 11, 2017 at 03:59:54PM -0500, Matthew Rosato wrote:
>>>>>>> This case should be quite similar to pktgen; if you see an
>>>>>>> improvement with pktgen, it is usually the same for UDP. Could you
>>>>>>> please try to disable tso, gso, gro, ufo on all host tap devices
>>>>>>> and guest virtio-net devices? Currently the most significant tests
>>>>>>> would be like this AFAICT:
>>>>>>>
>>>>>>> Host->VM     4.12    4.13
>>>>>>> TCP:
>>>>>>> UDP:
>>>>>>> pktgen:
>> So, I automated these scenarios for extended overnight runs and started
>> experiencing OOM conditions overnight on a 40G system. I did a bisect
>> and it also points to c67df11f. I can see a leak in at least all of the
>> Host->VM testcases (TCP, UDP, pktgen), but the pktgen scenario shows the
>> fastest leak.
>>
>> I enabled slub_debug on base 4.13 and ran my pktgen scenario in short
>> intervals until a large % of host memory was consumed. Numbers below are
>> from after the last pktgen run completed. The summary is that a very
>> large # of active skbuff_head_cache entries can be seen - the alloc/free
>> call counts match up, but the # of active skbuff_head_cache entries
>> keeps growing each time the workload is run and never goes back down in
>> between runs.
>>
>> free -h:
>>              total    used    free   shared  buff/cache  available
>> Mem:           39G     31G    6.6G     472K        1.4G       6.8G
>>
>>    OBJS  ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
>>
>> 1001952 1000610  99%    0.75K  23856       42    763392K skbuff_head_cache
>>  126192  126153  99%    0.36K   2868       44     45888K ksm_rmap_item
>>  100485  100435  99%    0.41K   1305       77     41760K kernfs_node_cache
>>   63294   39598  62%    0.48K    959       66     30688K dentry
>>   31968   31719  99%    0.88K    888       36     28416K inode_cache
>>
>> /sys/kernel/slab/skbuff_head_cache/alloc_calls:
>>     259 __alloc_skb+0x68/0x188 age=1/135076/135741 pid=0-11776 cpus=0,2,4,18
>> 1000351 __build_skb+0x42/0xb0 age=8114/63172/117830 pid=0-11863 cpus=0,10
>>
>> /sys/kernel/slab/skbuff_head_cache/free_calls:
>>   13492 age=4295073614 pid=0 cpus=0
>>  978298 tun_do_read.part.10+0x18c/0x6a0 age=8532/63624/110571 pid=11733
>>         cpus=1-19
>>       6 skb_free_datagram+0x32/0x78 age=11648/73253/110173 pid=11325
>>         cpus=4,8,10,12,14
>>       3 __dev_kfree_skb_any+0x5e/0x70 age=108957/115043/118269
>>         pid=0-11605 cpus=5,7,12
>>       1 netlink_broadcast_filtered+0x172/0x470 age=136165 pid=1 cpus=4
>>       2 netlink_dump+0x268/0x2a8 age=73236/86857/100479 pid=11325 cpus=4,12
>>       1 netlink_unicast+0x1ae/0x220 age=12991 pid=9922 cpus=12
>>       1 tcp_recvmsg+0x2e2/0xa60 age=0 pid=11776 cpus=6
>>       3 unix_stream_read_generic+0x810/0x908 age=15443/50904/118273
>>         pid=9915-11581 cpus=8,16,18
>>       2 tap_do_read+0x16a/0x488 [tap] age=42338/74246/106155
>>         pid=11605-11699 cpus=2,9
>>       1 macvlan_process_broadcast+0x17e/0x1e0 [macvlan] age=18835
>>         pid=331 cpus=11
>>    8800 pktgen_thread_worker+0x80a/0x16d8 [pktgen] age=8545/62184/110571
>>         pid=11863 cpus=0
>>
>>
>> By comparison, when running 4.13 with c67df11f reverted, here's the same
>> output after the exact same test:
>>
>> free -h:
>>              total    used    free   shared  buff/cache  available
>> Mem:           39G    783M     37G     472K        637M        37G
>>
>> slabtop:
>>    OBJS  ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
>>     714     256  35%    0.75K     17       42       544K skbuff_head_cache
>>
>> /sys/kernel/slab/skbuff_head_cache/alloc_calls:
>>     257 __alloc_skb+0x68/0x188 age=0/65252/65507 pid=1-11768 cpus=10,15
>> /sys/kernel/slab/skbuff_head_cache/free_calls:
>>     255 age=4295003081 pid=0 cpus=0
>>       1 netlink_broadcast_filtered+0x2e8/0x4e0 age=65601 pid=1 cpus=15
>>       1 tcp_recvmsg+0x2e2/0xa60 age=0 pid=11768 cpus=16
>>
> Thanks a lot for the test, and sorry for the late update; I was working
> on the code path and didn't find anything helpful for you until today.
>
> I did some tests and initially it turned out that the bottleneck was on
> the guest kernel stack (napi) side. After tracking the traffic
> footprints, it appeared that the loss happened when the vring was full
> and could not be drained out by the guest; this then triggered an SKB
> drop in the vhost driver because there was no headcount to fill the SKB
> with. It can be avoided by deferring consuming the SKB until a
> sufficient headcount has been obtained, as in the patch below.
>
> Could you please try it? It is based on 4.13 and I also applied Jason's
> 'conditionally enable tx polling' patch.
> https://lkml.org/lkml/2016/6/1/39

This patch has already been merged.

> I only tested the single-instance case from Host -> VM with uperf &
> iperf3; I like iperf3 a bit more since it reports the retransmissions
> and cwnd during the test.
:)

> To maximize the performance of the single-instance case, two vcpus are
> needed: one handles the kernel napi and the other one serves the socket
> syscalls (mostly reads) from uperf/iperf userspace. So I gave the guest
> two vcpus and pinned the iperf/uperf slave to the one not used by the
> kernel napi; you may need to check which one to pin by looking at the
> CPU utilization in a quick trial run before starting the long-duration
> test.
>
> With the patch there is a slight performance improvement for TCP
> (host/guest offload off) on x86, although 4.12 still wins 20-30% of the
> time. The cwnd and retransmission statistics are almost the same now;
> before, 'retrans' was about 10x higher and cwnd was 6x smaller than on
> 4.12.
>
> Here is one typical sample of my tests.
>                  4.12        4.13
> offload on:      36.8Gbits   37.4Gbits
> offload off:     7.68Gbits   7.84Gbits
>
> I also borrowed an s390x machine with 6 cpus and 4G memory from the
> System z team; it seems 4.12 is still a bit faster than 4.13 there.
> Could you please see whether this is aligned with your test bed?
>                  4.12        4.13
> offload on:      37.3Gbits   38.3Gbits
> offload off:     6.26Gbits   6.06Gbits
>
> For pktgen, I got a 10% improvement (xdp1 drop on guest), which is a bit
> faster than Jason's earlier number.
>                  4.12        4.13
>                  3.33 Mpps   3.70 Mpps
>
> Thanks again for all the tests you have done.
>
> Wei
>
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -776,8 +776,6 @@ static void handle_rx(struct vhost_net *net)
>  		/* On error, stop handling until the next kick. */
>  		if (unlikely(headcount < 0))
>  			goto out;
> -		if (nvq->rx_array)
> -			msg.msg_control = vhost_net_buf_consume(&nvq->rxq);
>  		/* On overrun, truncate and discard */
>  		if (unlikely(headcount > UIO_MAXIOV)) {

I think you need to do msg.msg_control = vhost_net_buf_consume() here too.

>  			iov_iter_init(&msg.msg_iter, READ, vq->iov, 1, 1);
> @@ -798,6 +796,10 @@ static void handle_rx(struct vhost_net *net)
>  			 * they refilled. */
>  			goto out;
>  		}
> +
> +		if (nvq->rx_array)
> +			msg.msg_control = vhost_net_buf_consume(&nvq->rxq);
> +
>  		/* We don't need to be notified again. */
>  		iov_iter_init(&msg.msg_iter, READ, vq->iov, in, vhost_len);
>  		fixup = msg.msg_iter;
>

Good catch, this fixes the memory leak too. I suggest posting a formal
patch to -net as soon as possible, since it is a valid fix even if it
does not help performance.

Thanks
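
For reference, below is a rough, untested sketch of how the per-packet
receive loop in drivers/vhost/net.c could look with both changes applied:
Wei's deferral of vhost_net_buf_consume() until after the headcount
checks, plus consuming the batched buffer on the overrun path as
suggested above. It is only an illustration against the 4.13 code quoted
in this thread, not a formal patch; declarations and the rest of
handle_rx() are omitted.

	/* Sketch of the reordered receive path inside handle_rx()
	 * (names follow the 4.13 code quoted in this thread). */
	headcount = get_rx_bufs(vq, vq->heads, vhost_len,
				&in, vq_log, &log,
				likely(mergeable) ? UIO_MAXIOV : 1);
	/* On error, stop handling until the next kick. */
	if (unlikely(headcount < 0))
		goto out;
	/* On overrun, truncate and discard; the batched skb still has
	 * to be pulled out of the internal rx queue here, otherwise it
	 * is never read or freed. */
	if (unlikely(headcount > UIO_MAXIOV)) {
		if (nvq->rx_array)
			msg.msg_control = vhost_net_buf_consume(&nvq->rxq);
		iov_iter_init(&msg.msg_iter, READ, vq->iov, 1, 1);
		err = sock->ops->recvmsg(sock, &msg, 1,
					 MSG_DONTWAIT | MSG_TRUNC);
		pr_debug("Discarded rx packet: len %zd\n", sock_len);
		continue;
	}
	/* OK, now we need to know about added descriptors. */
	if (!headcount) {
		if (unlikely(vhost_enable_notify(&net->dev, vq))) {
			/* They have slipped one in as we were
			 * doing it: check again. */
			vhost_disable_notify(&net->dev, vq);
			continue;
		}
		/* Nothing new?  Wait for eventfd to tell us
		 * they refilled. */
		goto out;
	}
	/* Only dequeue the skb from the internal rx queue once the
	 * guest is known to have descriptors for it; consuming it
	 * before the !headcount exit above is what leaked
	 * skbuff_head_cache entries. */
	if (nvq->rx_array)
		msg.msg_control = vhost_net_buf_consume(&nvq->rxq);
	/* We don't need to be notified again. */
	iov_iter_init(&msg.msg_iter, READ, vq->iov, in, vhost_len);
	fixup = msg.msg_iter;

The only behavioural difference from the patch quoted above is the extra
vhost_net_buf_consume() in the truncate/discard branch.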