From: Matthew Rosato
Subject: Re: Regression in throughput between kvm guests over virtual bridge
Date: Mon, 27 Nov 2017 21:44:07 -0500
Message-ID:
References: <20171026094415.uyogf2iw7yoavnoc@Wei-Dev>
 <20171031070717.wcbgrp6thrjmtrh3@Wei-Dev>
 <56710dc8-f289-0211-db97-1a1ea29e38f7@linux.vnet.ibm.com>
 <20171104233519.7jwja7t2itooyeak@Wei-Dev>
 <1611b26f-0997-3b22-95f5-debf57b7be8c@linux.vnet.ibm.com>
 <101d1fdf-9df1-44bd-73a7-e7d8fbc09160@linux.vnet.ibm.com>
 <20171112183406.zuuj7w3fmtb4eduf@Wei-Dev>
 <9996b0f1-ffa6-ff95-2e9c-0deccf4623ae@linux.vnet.ibm.com>
 <20171127162109.eriexz7gpvz6vxnx@Wei-Dev>
In-Reply-To:
To: Jason Wang, Wei Xu
Cc: mst@redhat.com, netdev@vger.kernel.org, davem@davemloft.net

On 11/27/2017 08:36 PM, Jason Wang wrote:
>
>
> On 11/28/2017 00:21, Wei Xu wrote:
>> On Mon, Nov 20, 2017 at 02:25:17PM -0500, Matthew Rosato wrote:
>>> On 11/14/2017 03:11 PM, Matthew Rosato wrote:
>>>> On 11/12/2017 01:34 PM, Wei Xu wrote:
>>>>> On Sat, Nov 11, 2017 at 03:59:54PM -0500, Matthew Rosato wrote:
>>>>>>>> This case should be quite similar to pktgen: if you get an
>>>>>>>> improvement with pktgen, usually it is the same for UDP.  Could
>>>>>>>> you please try disabling tso, gso, gro and ufo on all host tap
>>>>>>>> devices and guest virtio-net devices?  Currently the most
>>>>>>>> significant tests would be like this AFAICT:
>>>>>>>>
>>>>>>>> Host->VM     4.12    4.13
>>>>>>>>    TCP:
>>>>>>>>    UDP:
>>>>>>>> pktgen:
>>>
>>> So, I automated these scenarios for extended overnight runs and started
>>> experiencing OOM conditions overnight on a 40G system.  I did a bisect
>>> and it also points to c67df11f.  I can see a leak in at least all of
>>> the Host->VM testcases (TCP, UDP, pktgen), but the pktgen scenario
>>> shows the fastest leak.
>>>
>>> I enabled slub_debug on base 4.13 and ran my pktgen scenario in short
>>> intervals until a large % of host memory was consumed.  The numbers
>>> below were taken after the last pktgen run completed.  The summary is
>>> that a very large # of active skbuff_head_cache entries can be seen:
>>> the sums of alloc/free calls match up, but the # of active
>>> skbuff_head_cache entries keeps growing each time the workload is run
>>> and never goes back down in between runs.
>>>
>>> free -h:
>>>               total        used        free      shared  buff/cache   available
>>> Mem:            39G         31G        6.6G        472K        1.4G        6.8G
>>>
>>>    OBJS  ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
>>> 1001952 1000610  99%    0.75K  23856       42    763392K skbuff_head_cache
>>>  126192  126153  99%    0.36K   2868       44     45888K ksm_rmap_item
>>>  100485  100435  99%    0.41K   1305       77     41760K kernfs_node_cache
>>>   63294   39598  62%    0.48K    959       66     30688K dentry
>>>   31968   31719  99%    0.88K    888       36     28416K inode_cache
>>>
>>> /sys/kernel/slab/skbuff_head_cache/alloc_calls:
>>>      259 __alloc_skb+0x68/0x188 age=1/135076/135741 pid=0-11776 cpus=0,2,4,18
>>> 1000351 __build_skb+0x42/0xb0 age=8114/63172/117830 pid=0-11863 cpus=0,10
>>>
>>> /sys/kernel/slab/skbuff_head_cache/free_calls:
>>>    13492 age=4295073614 pid=0 cpus=0
>>>   978298 tun_do_read.part.10+0x18c/0x6a0 age=8532/63624/110571 pid=11733 cpus=1-19
>>>        6 skb_free_datagram+0x32/0x78 age=11648/73253/110173 pid=11325 cpus=4,8,10,12,14
>>>        3 __dev_kfree_skb_any+0x5e/0x70 age=108957/115043/118269 pid=0-11605 cpus=5,7,12
>>>        1 netlink_broadcast_filtered+0x172/0x470 age=136165 pid=1 cpus=4
>>>        2 netlink_dump+0x268/0x2a8 age=73236/86857/100479 pid=11325 cpus=4,12
>>>        1 netlink_unicast+0x1ae/0x220 age=12991 pid=9922 cpus=12
>>>        1 tcp_recvmsg+0x2e2/0xa60 age=0 pid=11776 cpus=6
>>>        3 unix_stream_read_generic+0x810/0x908 age=15443/50904/118273 pid=9915-11581 cpus=8,16,18
>>>        2 tap_do_read+0x16a/0x488 [tap] age=42338/74246/106155 pid=11605-11699 cpus=2,9
>>>        1 macvlan_process_broadcast+0x17e/0x1e0 [macvlan] age=18835 pid=331 cpus=11
>>>     8800 pktgen_thread_worker+0x80a/0x16d8 [pktgen] age=8545/62184/110571 pid=11863 cpus=0
>>>
>>>
>>> By comparison, when running 4.13 with c67df11f reverted, here's the
>>> same output after the exact same test:
>>>
>>> free -h:
>>>               total        used        free      shared  buff/cache   available
>>> Mem:            39G        783M         37G        472K        637M         37G
>>>
>>> slabtop:
>>>    OBJS  ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
>>>     714     256  35%    0.75K     17       42       544K skbuff_head_cache
>>>
>>> /sys/kernel/slab/skbuff_head_cache/alloc_calls:
>>>      257 __alloc_skb+0x68/0x188 age=0/65252/65507 pid=1-11768 cpus=10,15
>>> /sys/kernel/slab/skbuff_head_cache/free_calls:
>>>      255 age=4295003081 pid=0 cpus=0
>>>        1 netlink_broadcast_filtered+0x2e8/0x4e0 age=65601 pid=1 cpus=15
>>>        1 tcp_recvmsg+0x2e2/0xa60 age=0 pid=11768 cpus=16
>>>
>> Thanks a lot for the tests, and sorry for the late update; I was working
>> on the code path and didn't find anything helpful for you until today.
>>
>> I did some tests and initially it turned out that the bottleneck was on
>> the guest kernel stack (napi) side.  Tracking the traffic footprints, it
>> appears the loss happens when the vring is full and cannot be drained by
>> the guest; vhost then drops the SKB because there is no headcount left
>> to fill it with.  This can be avoided by deferring consumption of the
>> SKB until a sufficient headcount has been obtained, as in the patch
>> below.
>>
>> Could you please try it?  It is based on 4.13 and I also applied Jason's
>> 'conditionally enable tx polling' patch.
>>      https://lkml.org/lkml/2016/6/1/39
>
> This patch has already been merged.
>
>> I only tested the single-instance case from Host -> VM with uperf and
>> iperf3.  I like iperf3 a bit more since it reports the retransmits and
>> cwnd during the test. :)
>>
>> To maximize performance in the single-instance case, two vcpus are
>> needed: one does the kernel napi work and the other serves the socket
>> syscalls (mostly reads) from uperf/iperf userspace.  So I gave the guest
>> two vcpus and pinned the iperf/uperf slave to the one not used by kernel
>> napi.  You may need to check which one to pin by watching CPU
>> utilization in a quick trial before running the long-duration test.
>>
>> There is a slight TCP performance improvement with the patch (host/guest
>> offload off) on x86, and 4.12 still wins 20-30% of the time, but the
>> cwnd and retransmit statistics are almost the same now, whereas before
>> the patch 'retrans' was about 10x higher and cwnd about 6x smaller than
>> on 4.12.
>>
>> Here is one typical sample of my tests.
>>                  4.12          4.13
>> offload on:   36.8Gbits     37.4Gbits
>> offload off:  7.68Gbits     7.84Gbits
>>
>> I also borrowed an s390x machine with 6 cpus and 4G memory from the
>> System z team.  It seems 4.12 is still a bit faster than 4.13 there;
>> could you please see whether this is aligned with your test bed?
>>                  4.12          4.13
>> offload on:   37.3Gbits     38.3Gbits
>> offload off:  6.26Gbits     6.06Gbits
>>
>> For pktgen, I got a 10% improvement (xdp1 drop on the guest), which is a
>> bit faster than Jason's earlier number.
>>                  4.12          4.13
>>                3.33 Mpps     3.70 Mpps
>>
>> Thanks again for all the tests you have done.
>>
>> Wei
>>
>> --- a/drivers/vhost/net.c
>> +++ b/drivers/vhost/net.c
>> @@ -776,8 +776,6 @@ static void handle_rx(struct vhost_net *net)
>>                  /* On error, stop handling until the next kick. */
>>                  if (unlikely(headcount < 0))
>>                          goto out;
>> -               if (nvq->rx_array)
>> -                       msg.msg_control = vhost_net_buf_consume(&nvq->rxq);
>>                  /* On overrun, truncate and discard */
>>                  if (unlikely(headcount > UIO_MAXIOV)) {
>
> I think you need to do msg.msg_control = vhost_net_buf_consume() here too.
>
>>                          iov_iter_init(&msg.msg_iter, READ, vq->iov, 1, 1);
>> @@ -798,6 +796,10 @@ static void handle_rx(struct vhost_net *net)
>>                           * they refilled. */
>>                          goto out;
>>                  }
>> +
>> +               if (nvq->rx_array)
>> +                       msg.msg_control = vhost_net_buf_consume(&nvq->rxq);
>> +
>>                  /* We don't need to be notified again. */
>>                  iov_iter_init(&msg.msg_iter, READ, vq->iov, in, vhost_len);
>>                  fixup = msg.msg_iter;
>>
>>
>
> Good catch, this fixes the memory leak too.
>
> I suggest posting a formal patch for -net as soon as possible, since it
> is a valid fix even if it does not help performance.
>> Thanks
>

+1 to posting this patch formally.  I also verified that it resolves the
memory leak I was experiencing.

In terms of performance, here are quick numbers using the original
environment where the regression was noted (4GB, 4-vcpu guests, no CPU
binding, TCP VM<->VM), where 4.13+ is 4.13 with the above patch applied:

4.12:   34.71 Gb/s
4.13:   18.80 Gb/s
4.13+:  38.26 Gb/s

I'll keep running numbers, but that looks very promising.
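
For my own reference while re-testing, here is how I read the incremental
change Jason is asking for on top of Wei's diff.  This is just a rough
sketch against the 4.13 code, reusing the nvq->rx_array / nvq->rxq names
from the diff above; it is not the formal patch:

--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ ... @@ static void handle_rx(struct vhost_net *net)
                 /* On overrun, truncate and discard */
                 if (unlikely(headcount > UIO_MAXIOV)) {
+                        /* Still consume the batched skb on the discard
+                         * path, so the truncating recvmsg() below frees
+                         * it instead of leaving it in the rx batch.
+                         */
+                        if (nvq->rx_array)
+                                msg.msg_control = vhost_net_buf_consume(&nvq->rxq);
                         iov_iter_init(&msg.msg_iter, READ, vq->iov, 1, 1);

That way the skb is only dequeued from the batch once we know what will
happen to it: either we have enough heads and it goes through recvmsg()
normally, or we truncate and discard it, but in both cases it gets freed
rather than accumulating in skbuff_head_cache.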