public inbox for netdev@vger.kernel.org
From: Wei Xu <wexu@redhat.com>
To: Matthew Rosato <mjrosato@linux.vnet.ibm.com>
Cc: Jason Wang <jasowang@redhat.com>,
	mst@redhat.com, netdev@vger.kernel.org, davem@davemloft.net
Subject: Re: Regression in throughput between kvm guests over virtual bridge
Date: Wed, 29 Nov 2017 02:00:47 +0800	[thread overview]
Message-ID: <20171128180047.5tojixyh4zagened@Wei-Dev> (raw)
In-Reply-To: <edb28fe5-cedb-8e63-88b2-122d3dfe3014@linux.vnet.ibm.com>

On Mon, Nov 27, 2017 at 09:44:07PM -0500, Matthew Rosato wrote:
> On 11/27/2017 08:36 PM, Jason Wang wrote:
> > 
> > 
> > On 11/28/2017 00:21, Wei Xu wrote:
> >> On Mon, Nov 20, 2017 at 02:25:17PM -0500, Matthew Rosato wrote:
> >>> On 11/14/2017 03:11 PM, Matthew Rosato wrote:
> >>>> On 11/12/2017 01:34 PM, Wei Xu wrote:
> >>>>> On Sat, Nov 11, 2017 at 03:59:54PM -0500, Matthew Rosato wrote:
> >>>>>>>> This case should be quite similar to pktgen: if you got an
> >>>>>>>> improvement with pktgen, it is usually the same for UDP.  Could
> >>>>>>>> you please try disabling tso, gso, gro, and ufo on all host tap
> >>>>>>>> devices and guest virtio-net devices?  Currently the most
> >>>>>>>> significant tests would be like this AFAICT:
> >>>>>>>>
> >>>>>>>> Host->VM     4.12    4.13
> >>>>>>>>   TCP:
> >>>>>>>>   UDP:
> >>>>>>>> pktgen:
> >>> So, I automated these scenarios for extended overnight runs and started
> >>> experiencing OOM conditions overnight on a 40G system.  I did a bisect
> >>> and it also points to c67df11f.  I can see a leak in at least all of the
> >>> Host->VM testcases (TCP, UDP, pktgen), but the pktgen scenario shows the
> >>> fastest leak.
> >>>
> >>> I enabled slub_debug on base 4.13 and ran my pktgen scenario in short
> >>> intervals until a large% of host memory was consumed.  Numbers below
> >>> after the last pktgen run completed.  The summary is that a very large
> >>> number of active skbuff_head_cache entries can be seen: the sums of
> >>> alloc and free calls match up, but the number of active
> >>> skbuff_head_cache entries keeps growing each time the workload is run
> >>> and never goes back down between runs.
> >>>
> >>> free -h:
> >>>       total        used        free      shared  buff/cache   available
> >>> Mem:   39G         31G        6.6G        472K        1.4G        6.8G
> >>>
> >>>    OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
> >>>
> >>> 1001952 1000610  99%    0.75K  23856       42    763392K skbuff_head_cache
> >>> 126192 126153  99%    0.36K   2868     44     45888K ksm_rmap_item
> >>> 100485 100435  99%    0.41K   1305     77     41760K kernfs_node_cache
> >>>   63294  39598  62%    0.48K    959     66     30688K dentry
> >>>   31968  31719  99%    0.88K    888     36     28416K inode_cache
> >>>
> >>> /sys/kernel/slab/skbuff_head_cache/alloc_calls :
> >>>      259 __alloc_skb+0x68/0x188 age=1/135076/135741 pid=0-11776 cpus=0,2,4,18
> >>> 1000351 __build_skb+0x42/0xb0 age=8114/63172/117830 pid=0-11863 cpus=0,10
> >>>
> >>> /sys/kernel/slab/skbuff_head_cache/free_calls:
> >>>    13492 <not-available> age=4295073614 pid=0 cpus=0
> >>>   978298 tun_do_read.part.10+0x18c/0x6a0 age=8532/63624/110571 pid=11733 cpus=1-19
> >>>        6 skb_free_datagram+0x32/0x78 age=11648/73253/110173 pid=11325 cpus=4,8,10,12,14
> >>>        3 __dev_kfree_skb_any+0x5e/0x70 age=108957/115043/118269 pid=0-11605 cpus=5,7,12
> >>>        1 netlink_broadcast_filtered+0x172/0x470 age=136165 pid=1 cpus=4
> >>>        2 netlink_dump+0x268/0x2a8 age=73236/86857/100479 pid=11325 cpus=4,12
> >>>        1 netlink_unicast+0x1ae/0x220 age=12991 pid=9922 cpus=12
> >>>        1 tcp_recvmsg+0x2e2/0xa60 age=0 pid=11776 cpus=6
> >>>        3 unix_stream_read_generic+0x810/0x908 age=15443/50904/118273 pid=9915-11581 cpus=8,16,18
> >>>        2 tap_do_read+0x16a/0x488 [tap] age=42338/74246/106155 pid=11605-11699 cpus=2,9
> >>>        1 macvlan_process_broadcast+0x17e/0x1e0 [macvlan] age=18835 pid=331 cpus=11
> >>>     8800 pktgen_thread_worker+0x80a/0x16d8 [pktgen] age=8545/62184/110571 pid=11863 cpus=0
> >>>
> >>>
> >>> By comparison, when running 4.13 with c67df11f reverted, here's the same
> >>> output after the exact same test:
> >>>
> >>> free -h:
> >>>       total        used        free      shared  buff/cache   available
> >>> Mem:    39G        783M         37G        472K        637M         37G
> >>>
> >>> slabtop:
> >>>    OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
> >>>     714    256  35%    0.75K     17     42      544K skbuff_head_cache
> >>>
> >>> /sys/kernel/slab/skbuff_head_cache/alloc_calls:
> >>>      257 __alloc_skb+0x68/0x188 age=0/65252/65507 pid=1-11768 cpus=10,15
> >>> /sys/kernel/slab/skbuff_head_cache/free_calls:
> >>>      255 <not-available> age=4295003081 pid=0 cpus=0
> >>>        1 netlink_broadcast_filtered+0x2e8/0x4e0 age=65601 pid=1 cpus=15
> >>>        1 tcp_recvmsg+0x2e2/0xa60 age=0 pid=11768 cpus=16
> >>>
> >> Thanks a lot for the test, and sorry for the late update; I was working
> >> on the code path and didn't find anything helpful to you until today.
> >>
> >> I did some tests, and initially it looked like the bottleneck was on
> >> the guest kernel stack (napi) side.  After tracking the traffic
> >> footprints, it appeared that the loss happened when the vring was full
> >> and could not be drained by the guest; that then triggered an SKB drop
> >> in the vhost driver because there was no headcount left to fill the SKB
> >> with.  This can be avoided by deferring consuming the SKB until a
> >> sufficient headcount has been obtained, as in the patch below.
> >>
> >> Could you please try it?  It is based on 4.13, and I also applied
> >> Jason's 'conditionally enable tx polling' patch:
> >>      https://lkml.org/lkml/2016/6/1/39
> > 
> > This patch has already been merged.
> > 
> >>
> >> I only tested the one-instance case from Host -> VM with uperf and
> >> iperf3.  I like iperf3 a bit more since it spontaneously reports the
> >> retransmissions and cwnd during testing. :)
> >>
> >> To maximize the performance of the one-instance case, two vcpus are
> >> needed: one does the kernel napi work and the other serves the socket
> >> syscalls (mostly reading) from uperf/iperf userspace.  So I gave the
> >> guest two vcpus and pinned the iperf/uperf slave to the one not used
> >> by kernel napi.  You may need to check which one to pin by watching
> >> CPU utilization in a quick trial run before the long-duration test.
> >>
> >> There is a slight tcp performance improvement with the patch
> >> (host/guest offload off) on x86.  4.12 still wins 20-30% of the time,
> >> but the cwnd and retransmission statistics are almost the same now;
> >> before, 'retrans' was about 10x higher and cwnd was 6x smaller than
> >> on 4.12.
> >>
> >> Here is one typical sample of my tests.
> >>                  4.12          4.13
> >> offload on:   36.8Gbits     37.4Gbits
> >> offload off:  7.68Gbits     7.84Gbits
> >>
> >> I also borrowed an s390x machine with 6 cpus and 4G memory from the
> >> System z team; it seems 4.12 is still a bit faster than 4.13 there.
> >> Could you please see whether this is aligned with your test bed?
> >>                  4.12          4.13
> >> offload on:   37.3Gbits     38.3Gbits
> >> offload off:  6.26Gbits     6.06Gbits
> >>
> >> For pktgen, I got a 10% improvement (xdp1 drop on the guest), which
> >> is a bit faster than Jason's earlier number.
> >>                  4.12          4.13
> >>                3.33 Mpps     3.70 Mpps
> >>
> >> Thanks again for all the tests you have done.
> >>
> >> Wei
> >>
> >> --- a/drivers/vhost/net.c
> >> +++ b/drivers/vhost/net.c
> >> @@ -776,8 +776,6 @@ static void handle_rx(struct vhost_net *net)
> >>                  /* On error, stop handling until the next kick. */
> >>                  if (unlikely(headcount < 0))
> >>                          goto out;
> >> -               if (nvq->rx_array)
> >> -                       msg.msg_control = vhost_net_buf_consume(&nvq->rxq);
> >>                  /* On overrun, truncate and discard */
> >>                  if (unlikely(headcount > UIO_MAXIOV)) {
> > 
> > I think you need do msg.msg_control = vhost_net_buf_consume() here too.
> > 
> >>                          iov_iter_init(&msg.msg_iter, READ, vq->iov, 1, 1);
> >> @@ -798,6 +796,10 @@ static void handle_rx(struct vhost_net *net)
> >>                           * they refilled. */
> >>                          goto out;
> >>                  }
> >> +
> >> +               if (nvq->rx_array)
> >> +                       msg.msg_control = vhost_net_buf_consume(&nvq->rxq);
> >> +
> >> +
> >>                  /* We don't need to be notified again. */
> >>                  iov_iter_init(&msg.msg_iter, READ, vq->iov, in, vhost_len);
> >>                  fixup = msg.msg_iter;
> >>
> >>
> > 
> > Good catch, this fixes the memory leak too.
> > 
> > I suggest posting a formal patch to -net as soon as possible, since it
> > is a valid fix even if it does not help performance.
> >> Thanks
> > 
> 
> +1 to posting this patch formally.  I also verified that it resolves the
> memory leak I was experiencing.
> 
> In terms of performance numbers, here are quick #s using the original
> environment where the regression was noted (4GB, 4vcpu guests, no CPU
> binding, TCP VM<->VM):
> 
> 4.12:	34.71Gb/s
> 4.13:	18.80Gb/s
> 4.13+:	38.26Gb/s
> 

Great to know the numbers; patch sent.  Thank you so much for all your
thorough tests, they really helped in figuring this out.

Wei

> I'll keep running numbers, but that looks very promising.
> 


Thread overview: 42+ messages
2017-09-12 17:56 Regression in throughput between kvm guests over virtual bridge Matthew Rosato
2017-09-13  1:16 ` Jason Wang
2017-09-13  8:13   ` Jason Wang
2017-09-13 16:59     ` Matthew Rosato
2017-09-14  4:21       ` Jason Wang
2017-09-15  3:36         ` Matthew Rosato
2017-09-15  8:55           ` Jason Wang
2017-09-15 19:19             ` Matthew Rosato
2017-09-18  3:13               ` Jason Wang
2017-09-18  4:14                 ` [PATCH] vhost_net: conditionally enable tx polling kbuild test robot
2017-09-18  7:36                 ` Regression in throughput between kvm guests over virtual bridge Jason Wang
2017-09-18 18:11                   ` Matthew Rosato
2017-09-20  6:27                     ` Jason Wang
2017-09-20 19:38                       ` Matthew Rosato
2017-09-22  4:03                         ` Jason Wang
2017-09-25 20:18                           ` Matthew Rosato
2017-10-05 20:07                             ` Matthew Rosato
2017-10-11  2:41                               ` Jason Wang
2017-10-12 18:31                               ` Wei Xu
2017-10-18 20:17                                 ` Matthew Rosato
2017-10-23  2:06                                   ` Jason Wang
2017-10-23  2:13                                     ` Michael S. Tsirkin
2017-10-25 20:21                                     ` Matthew Rosato
2017-10-26  9:44                                       ` Wei Xu
2017-10-26 17:53                                         ` Matthew Rosato
2017-10-31  7:07                                           ` Wei Xu
2017-10-31  7:00                                             ` Jason Wang
2017-11-03  4:30                                             ` Matthew Rosato
2017-11-04 23:35                                               ` Wei Xu
2017-11-08  1:02                                                 ` Matthew Rosato
2017-11-11 20:59                                                   ` Matthew Rosato
2017-11-12 18:34                                                     ` Wei Xu
2017-11-14 20:11                                                       ` Matthew Rosato
2017-11-20 19:25                                                         ` Matthew Rosato
2017-11-27 16:21                                                           ` Wei Xu
2017-11-28  1:36                                                             ` Jason Wang
2017-11-28  2:44                                                               ` Matthew Rosato
2017-11-28 18:00                                                                 ` Wei Xu [this message]
2017-11-28  3:51                                                               ` Wei Xu
2017-11-12 15:40                                                   ` Wei Xu
2017-10-23 13:57                                   ` Wei Xu
2017-10-25 20:31                                     ` Matthew Rosato
