From: Jason Wang
Subject: Re: Regression in throughput between kvm guests over virtual bridge
Date: Tue, 28 Nov 2017 09:36:37 +0800
To: Wei Xu, Matthew Rosato
Cc: mst@redhat.com, netdev@vger.kernel.org, davem@davemloft.net
References: <20171026094415.uyogf2iw7yoavnoc@Wei-Dev>
 <20171031070717.wcbgrp6thrjmtrh3@Wei-Dev>
 <56710dc8-f289-0211-db97-1a1ea29e38f7@linux.vnet.ibm.com>
 <20171104233519.7jwja7t2itooyeak@Wei-Dev>
 <1611b26f-0997-3b22-95f5-debf57b7be8c@linux.vnet.ibm.com>
 <101d1fdf-9df1-44bd-73a7-e7d8fbc09160@linux.vnet.ibm.com>
 <20171112183406.zuuj7w3fmtb4eduf@Wei-Dev>
 <9996b0f1-ffa6-ff95-2e9c-0deccf4623ae@linux.vnet.ibm.com>
 <20171127162109.eriexz7gpvz6vxnx@Wei-Dev>
In-Reply-To: <20171127162109.eriexz7gpvz6vxnx@Wei-Dev>

On 2017-11-28 00:21, Wei Xu wrote:
> On Mon, Nov 20, 2017 at 02:25:17PM -0500, Matthew Rosato wrote:
>> On 11/14/2017 03:11 PM, Matthew Rosato wrote:
>>> On 11/12/2017 01:34 PM, Wei Xu wrote:
>>>> On Sat, Nov 11, 2017 at 03:59:54PM -0500, Matthew Rosato wrote:
>>>>>>> This case should be quite similar to pktgen; if you see an
>>>>>>> improvement with pktgen, it is usually the same for UDP. Could you
>>>>>>> please try to disable tso, gso, gro, ufo on all host tap devices
>>>>>>> and guest virtio-net devices? Currently the most significant tests
>>>>>>> would be like this AFAICT:
>>>>>>>
>>>>>>> Host->VM     4.12    4.13
>>>>>>> TCP:
>>>>>>> UDP:
>>>>>>> pktgen:
>> So, I automated these scenarios for extended overnight runs and started
>> experiencing OOM conditions overnight on a 40G system. I did a bisect
>> and it also points to c67df11f. I can see a leak in at least all of the
>> Host->VM testcases (TCP, UDP, pktgen), but the pktgen scenario shows the
>> fastest leak.
>>
>> I enabled slub_debug on base 4.13 and ran my pktgen scenario in short
>> intervals until a large % of host memory was consumed. Numbers below are
>> from after the last pktgen run completed. The summary is that a very
>> large # of active skbuff_head_cache entries can be seen - the alloc/free
>> call counts match up, but the # of active skbuff_head_cache entries
>> keeps growing each time the workload is run and never goes back down in
>> between runs.
>>
>> free -h:
>>              total    used    free   shared  buff/cache  available
>> Mem:           39G     31G    6.6G     472K        1.4G       6.8G
>>
>>    OBJS  ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
>>
>> 1001952 1000610  99%    0.75K  23856       42    763392K skbuff_head_cache
>>  126192  126153  99%    0.36K   2868       44     45888K ksm_rmap_item
>>  100485  100435  99%    0.41K   1305       77     41760K kernfs_node_cache
>>   63294   39598  62%    0.48K    959       66     30688K dentry
>>   31968   31719  99%    0.88K    888       36     28416K inode_cache
>>
>> /sys/kernel/slab/skbuff_head_cache/alloc_calls:
>>     259 __alloc_skb+0x68/0x188 age=1/135076/135741 pid=0-11776 cpus=0,2,4,18
>> 1000351 __build_skb+0x42/0xb0 age=8114/63172/117830 pid=0-11863 cpus=0,10
>>
>> /sys/kernel/slab/skbuff_head_cache/free_calls:
>>   13492 age=4295073614 pid=0 cpus=0
>>  978298 tun_do_read.part.10+0x18c/0x6a0 age=8532/63624/110571 pid=11733
>>         cpus=1-19
>>       6 skb_free_datagram+0x32/0x78 age=11648/73253/110173 pid=11325
>>         cpus=4,8,10,12,14
>>       3 __dev_kfree_skb_any+0x5e/0x70 age=108957/115043/118269
>>         pid=0-11605 cpus=5,7,12
>>       1 netlink_broadcast_filtered+0x172/0x470 age=136165 pid=1 cpus=4
>>       2 netlink_dump+0x268/0x2a8 age=73236/86857/100479 pid=11325 cpus=4,12
>>       1 netlink_unicast+0x1ae/0x220 age=12991 pid=9922 cpus=12
>>       1 tcp_recvmsg+0x2e2/0xa60 age=0 pid=11776 cpus=6
>>       3 unix_stream_read_generic+0x810/0x908 age=15443/50904/118273
>>         pid=9915-11581 cpus=8,16,18
>>       2 tap_do_read+0x16a/0x488 [tap] age=42338/74246/106155
>>         pid=11605-11699 cpus=2,9
>>       1 macvlan_process_broadcast+0x17e/0x1e0 [macvlan] age=18835
>>         pid=331 cpus=11
>>    8800 pktgen_thread_worker+0x80a/0x16d8 [pktgen] age=8545/62184/110571
>>         pid=11863 cpus=0
>>
>>
>> By comparison, when running 4.13 with c67df11f reverted, here's the same
>> output after the exact same test:
>>
>> free -h:
>>              total    used    free   shared  buff/cache  available
>> Mem:           39G    783M     37G     472K        637M        37G
>>
>> slabtop:
>>    OBJS  ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
>>     714     256  35%    0.75K     17       42       544K skbuff_head_cache
>>
>> /sys/kernel/slab/skbuff_head_cache/alloc_calls:
>>     257 __alloc_skb+0x68/0x188 age=0/65252/65507 pid=1-11768 cpus=10,15
>> /sys/kernel/slab/skbuff_head_cache/free_calls:
>>     255 age=4295003081 pid=0 cpus=0
>>       1 netlink_broadcast_filtered+0x2e8/0x4e0 age=65601 pid=1 cpus=15
>>       1 tcp_recvmsg+0x2e2/0xa60 age=0 pid=11768 cpus=16
>>
> Thanks a lot for the test, and sorry for the late update; I was working
> on the code path and didn't find anything helpful for you until today.
>
> I did some tests and initially it turned out that the bottleneck was on
> the guest kernel stack (napi) side. After tracking the traffic
> footprints, it appeared that the loss happened when the vring was full
> and could not be drained out by the guest; this then triggered an SKB
> drop in the vhost driver because there was no headcount to fill the SKB
> with. It can be avoided by deferring consuming the SKB until a
> sufficient headcount has been obtained, as in the patch below.
>
> Could you please try it? It is based on 4.13 and I also applied Jason's
> 'conditionally enable tx polling' patch.
> https://lkml.org/lkml/2016/6/1/39

This patch has already been merged.

> I only tested the single-instance case from Host -> VM with uperf &
> iperf3; I like iperf3 a bit more since it reports the retransmissions
> and cwnd during the test.
:)

> To maximize the performance of the single-instance case, two vcpus are
> needed: one handles the kernel napi and the other one serves the socket
> syscalls (mostly reads) from uperf/iperf userspace. So I gave the guest
> two vcpus and pinned the iperf/uperf slave to the one not used by the
> kernel napi; you may need to check which one to pin by looking at the
> CPU utilization in a quick trial run before starting the long-duration
> test.
>
> With the patch there is a slight performance improvement for TCP
> (host/guest offload off) on x86, although 4.12 still wins 20-30% of the
> time. The cwnd and retransmission statistics are almost the same now;
> before, 'retrans' was about 10x higher and cwnd was 6x smaller than on
> 4.12.
>
> Here is one typical sample of my tests.
>                  4.12        4.13
> offload on:      36.8Gbits   37.4Gbits
> offload off:     7.68Gbits   7.84Gbits
>
> I also borrowed an s390x machine with 6 cpus and 4G memory from the
> System z team; it seems 4.12 is still a bit faster than 4.13 there.
> Could you please see whether this is aligned with your test bed?
>                  4.12        4.13
> offload on:      37.3Gbits   38.3Gbits
> offload off:     6.26Gbits   6.06Gbits
>
> For pktgen, I got a 10% improvement (xdp1 drop on guest), which is a bit
> faster than Jason's earlier number.
>                  4.12        4.13
>                  3.33 Mpps   3.70 Mpps
>
> Thanks again for all the tests you have done.
>
> Wei
>
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -776,8 +776,6 @@ static void handle_rx(struct vhost_net *net)
>  		/* On error, stop handling until the next kick. */
>  		if (unlikely(headcount < 0))
>  			goto out;
> -		if (nvq->rx_array)
> -			msg.msg_control = vhost_net_buf_consume(&nvq->rxq);
>  		/* On overrun, truncate and discard */
>  		if (unlikely(headcount > UIO_MAXIOV)) {

I think you need to do msg.msg_control = vhost_net_buf_consume() here too.

>  			iov_iter_init(&msg.msg_iter, READ, vq->iov, 1, 1);
> @@ -798,6 +796,10 @@ static void handle_rx(struct vhost_net *net)
>  			 * they refilled. */
>  			goto out;
>  		}
> +
> +		if (nvq->rx_array)
> +			msg.msg_control = vhost_net_buf_consume(&nvq->rxq);
> +
>  		/* We don't need to be notified again. */
>  		iov_iter_init(&msg.msg_iter, READ, vq->iov, in, vhost_len);
>  		fixup = msg.msg_iter;
>

Good catch, this fixes the memory leak too. I suggest posting a formal
patch to -net as soon as possible, since it is a valid fix even if it
does not help performance.

Thanks
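
For reference, below is a rough, untested sketch of how the per-packet
receive loop in drivers/vhost/net.c could look with both changes applied:
Wei's deferral of vhost_net_buf_consume() until after the headcount
checks, plus consuming the batched buffer on the overrun path as
suggested above. It is only an illustration against the 4.13 code quoted
in this thread, not a formal patch; declarations and the rest of
handle_rx() are omitted.

	/* Sketch of the reordered receive path inside handle_rx()
	 * (names follow the 4.13 code quoted in this thread). */
	headcount = get_rx_bufs(vq, vq->heads, vhost_len,
				&in, vq_log, &log,
				likely(mergeable) ? UIO_MAXIOV : 1);
	/* On error, stop handling until the next kick. */
	if (unlikely(headcount < 0))
		goto out;
	/* On overrun, truncate and discard; the batched skb still has
	 * to be pulled out of the internal rx queue here, otherwise it
	 * is never read or freed. */
	if (unlikely(headcount > UIO_MAXIOV)) {
		if (nvq->rx_array)
			msg.msg_control = vhost_net_buf_consume(&nvq->rxq);
		iov_iter_init(&msg.msg_iter, READ, vq->iov, 1, 1);
		err = sock->ops->recvmsg(sock, &msg, 1,
					 MSG_DONTWAIT | MSG_TRUNC);
		pr_debug("Discarded rx packet: len %zd\n", sock_len);
		continue;
	}
	/* OK, now we need to know about added descriptors. */
	if (!headcount) {
		if (unlikely(vhost_enable_notify(&net->dev, vq))) {
			/* They have slipped one in as we were
			 * doing it: check again. */
			vhost_disable_notify(&net->dev, vq);
			continue;
		}
		/* Nothing new?  Wait for eventfd to tell us
		 * they refilled. */
		goto out;
	}
	/* Only dequeue the skb from the internal rx queue once the
	 * guest is known to have descriptors for it; consuming it
	 * before the !headcount exit above is what leaked
	 * skbuff_head_cache entries. */
	if (nvq->rx_array)
		msg.msg_control = vhost_net_buf_consume(&nvq->rxq);
	/* We don't need to be notified again. */
	iov_iter_init(&msg.msg_iter, READ, vq->iov, in, vhost_len);
	fixup = msg.msg_iter;

The only behavioural difference from the patch quoted above is the extra
vhost_net_buf_consume() in the truncate/discard branch.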