From: Wei Xu <wexu@redhat.com>
To: Jason Wang <jasowang@redhat.com>
Cc: Matthew Rosato <mjrosato@linux.vnet.ibm.com>,
mst@redhat.com, netdev@vger.kernel.org, davem@davemloft.net
Subject: Re: Regression in throughput between kvm guests over virtual bridge
Date: Tue, 28 Nov 2017 11:51:04 +0800 [thread overview]
Message-ID: <20171128035104.a3dcfkwqu65uhzib@Wei-Dev> (raw)
In-Reply-To: <bcd4051d-5573-0841-a86b-8fccf03931c9@redhat.com>
On Tue, Nov 28, 2017 at 09:36:37AM +0800, Jason Wang wrote:
>
>
> > On 2017-11-28 00:21, Wei Xu wrote:
> > On Mon, Nov 20, 2017 at 02:25:17PM -0500, Matthew Rosato wrote:
> > > On 11/14/2017 03:11 PM, Matthew Rosato wrote:
> > > > On 11/12/2017 01:34 PM, Wei Xu wrote:
> > > > > On Sat, Nov 11, 2017 at 03:59:54PM -0500, Matthew Rosato wrote:
> > > > > > > > This case should be quite similar to pktgen; if you saw an improvement with
> > > > > > > > pktgen, it usually holds for UDP as well. Could you please try disabling
> > > > > > > > tso, gso, gro and ufo on all host tap devices and guest virtio-net devices?
> > > > > > > > Currently the most significant tests would be these, AFAICT:
> > > > > > > >
> > > > > > > > Host->VM 4.12 4.13
> > > > > > > > TCP:
> > > > > > > > UDP:
> > > > > > > > pktgen:
> > > So, I automated these scenarios for extended overnight runs and started
> > > experiencing OOM conditions on a 40G system. A bisect again points to
> > > c67df11f. I can see a leak in all of the Host->VM testcases (TCP, UDP,
> > > pktgen), with the pktgen scenario leaking fastest.
> > >
> > > I enabled slub_debug on base 4.13 and ran my pktgen scenario in short
> > > intervals until a large % of host memory was consumed. Numbers below are
> > > from after the last pktgen run completed. In summary, a very large # of
> > > active skbuff_head_cache entries can be seen: the sum of alloc/free
> > > calls matches up, but the # of active skbuff_head_cache entries keeps
> > > growing each time the workload is run and never goes back down in
> > > between runs.
> > >
> > > free -h:
> > > total used free shared buff/cache available
> > > Mem: 39G 31G 6.6G 472K 1.4G 6.8G
> > >
> > > OBJS ACTIVE USE OBJ SIZE SLABS OBJ/SLAB CACHE SIZE NAME
> > >
> > > 1001952 1000610 99% 0.75K 23856 42 763392K skbuff_head_cache
> > > 126192 126153 99% 0.36K 2868 44 45888K ksm_rmap_item
> > > 100485 100435 99% 0.41K 1305 77 41760K kernfs_node_cache
> > > 63294 39598 62% 0.48K 959 66 30688K dentry
> > > 31968 31719 99% 0.88K 888 36 28416K inode_cache
> > >
> > > /sys/kernel/slab/skbuff_head_cache/alloc_calls :
> > > 259 __alloc_skb+0x68/0x188 age=1/135076/135741 pid=0-11776 cpus=0,2,4,18
> > > 1000351 __build_skb+0x42/0xb0 age=8114/63172/117830 pid=0-11863 cpus=0,10
> > >
> > > /sys/kernel/slab/skbuff_head_cache/free_calls:
> > > 13492 <not-available> age=4295073614 pid=0 cpus=0
> > > 978298 tun_do_read.part.10+0x18c/0x6a0 age=8532/63624/110571 pid=11733
> > > cpus=1-19
> > > 6 skb_free_datagram+0x32/0x78 age=11648/73253/110173 pid=11325
> > > cpus=4,8,10,12,14
> > > 3 __dev_kfree_skb_any+0x5e/0x70 age=108957/115043/118269
> > > pid=0-11605 cpus=5,7,12
> > > 1 netlink_broadcast_filtered+0x172/0x470 age=136165 pid=1 cpus=4
> > > 2 netlink_dump+0x268/0x2a8 age=73236/86857/100479 pid=11325 cpus=4,12
> > > 1 netlink_unicast+0x1ae/0x220 age=12991 pid=9922 cpus=12
> > > 1 tcp_recvmsg+0x2e2/0xa60 age=0 pid=11776 cpus=6
> > > 3 unix_stream_read_generic+0x810/0x908 age=15443/50904/118273
> > > pid=9915-11581 cpus=8,16,18
> > > 2 tap_do_read+0x16a/0x488 [tap] age=42338/74246/106155
> > > pid=11605-11699 cpus=2,9
> > > 1 macvlan_process_broadcast+0x17e/0x1e0 [macvlan] age=18835
> > > pid=331 cpus=11
> > > 8800 pktgen_thread_worker+0x80a/0x16d8 [pktgen] age=8545/62184/110571
> > > pid=11863 cpus=0
> > >
> > >
> > > By comparison, when running 4.13 with c67df11f reverted, here's the same
> > > output after the exact same test:
> > >
> > > free -h:
> > > total used free shared buff/cache available
> > > Mem: 39G 783M 37G 472K 637M 37G
> > >
> > > slabtop:
> > > OBJS ACTIVE USE OBJ SIZE SLABS OBJ/SLAB CACHE SIZE NAME
> > > 714 256 35% 0.75K 17 42 544K skbuff_head_cache
> > >
> > > /sys/kernel/slab/skbuff_head_cache/alloc_calls:
> > > 257 __alloc_skb+0x68/0x188 age=0/65252/65507 pid=1-11768 cpus=10,15
> > > /sys/kernel/slab/skbuff_head_cache/free_calls:
> > > 255 <not-available> age=4295003081 pid=0 cpus=0
> > > 1 netlink_broadcast_filtered+0x2e8/0x4e0 age=65601 pid=1 cpus=15
> > > 1 tcp_recvmsg+0x2e2/0xa60 age=0 pid=11768 cpus=16
> > >
Thanks a lot for the tests, and sorry for the late update; I was working on
> > the code path and didn't have anything helpful for you until today.
> >
> > I did some tests and initially it turned out that the bottleneck was on the
> > guest kernel stack (napi) side. Tracking the traffic footprints further, it
> > appears the loss happens when the vring is full and cannot be drained by
> > the guest; vhost then drops the SKB because there is no headcount left to
> > fill it with. This can be avoided by deferring consuming the SKB until a
> > sufficient headcount has been obtained, as in the patch below.
> >
> > Could you please try it? It is based on 4.13 and I also applied Jason's
> > 'conditionally enable tx polling' patch.
> > https://lkml.org/lkml/2016/6/1/39
>
> This patch has already been merged.
>
> >
> > I only tested the single-instance case from Host -> VM with uperf & iperf3.
> > I like iperf3 a bit more since it reports the retransmit and cwnd numbers
> > while the test is running. :)
> >
> > To maximize performance in the single-instance case, two vcpus are needed:
> > one handles the kernel napi work and the other serves the socket syscalls
> > (mostly reads) from uperf/iperf userspace. So I gave the guest two vcpus
> > and pinned the iperf/uperf slave to the one not used by kernel napi. You may
> > need to figure out which vcpu to pin to by checking the CPU utilization in
> > a quick trial run before starting the long-duration test.
> >
> > There is a slight tcp performance improvement with the patch (host/guest
> > offload off) on x86, though 4.12 still wins roughly 20-30% of the runs.
> > The cwnd and retransmit statistics are almost the same now; before the
> > patch, 'retrans' was about 10x higher and cwnd about 6x smaller than 4.12.
> >
> > Here is one typical sample of my tests.
> > 4.12 4.13
> > offload on: 36.8Gbits 37.4Gbits
> > offload off: 7.68Gbits 7.84Gbits
> >
> > I also borrowed an s390x machine with 6 cpus and 4G memory from the system z
> > team; it seems 4.12 is still a bit faster than 4.13 there. Could you please
> > see if this is aligned with your test bed?
> > 4.12 4.13
> > offload on: 37.3Gbits 38.3Gbits
> > offload off: 6.26Gbits 6.06Gbits
> >
> > For pktgen, I got a 10% improvement (xdp1 drop on guest), which is a bit
> > faster than Jason's earlier number.
> > 4.12 4.13
> > 3.33 Mpps 3.70 Mpps
> >
> > Thanks again for all the tests you have done.
> >
> > Wei
> >
> > --- a/drivers/vhost/net.c
> > +++ b/drivers/vhost/net.c
> > @@ -776,8 +776,6 @@ static void handle_rx(struct vhost_net *net)
> > /* On error, stop handling until the next kick. */
> > if (unlikely(headcount < 0))
> > goto out;
> > - if (nvq->rx_array)
> > - msg.msg_control = vhost_net_buf_consume(&nvq->rxq);
> > /* On overrun, truncate and discard */
> > if (unlikely(headcount > UIO_MAXIOV)) {
>
> I think you need do msg.msg_control = vhost_net_buf_consume() here too.
>
> > iov_iter_init(&msg.msg_iter, READ, vq->iov, 1, 1);
> > @@ -798,6 +796,10 @@ static void handle_rx(struct vhost_net *net)
> > * they refilled. */
> > goto out;
> > }
> > +
> > + if (nvq->rx_array)
> > + msg.msg_control = vhost_net_buf_consume(&nvq->rxq);
> > +
> > /* We don't need to be notified again. */
> > iov_iter_init(&msg.msg_iter, READ, vq->iov, in, vhost_len);
> > fixup = msg.msg_iter;
> >
> >
>
> Good catch, this fixes the memory leak too.
>
> I suggest posting a formal patch to -net as soon as possible, since it
> is a valid fix even if it does not help performance.
OK, will post it soon.
Wei
>
> Thanks
2017-09-12 17:56 Regression in throughput between kvm guests over virtual bridge Matthew Rosato
2017-09-13 1:16 ` Jason Wang
2017-09-13 8:13 ` Jason Wang
2017-09-13 16:59 ` Matthew Rosato
2017-09-14 4:21 ` Jason Wang
2017-09-15 3:36 ` Matthew Rosato
2017-09-15 8:55 ` Jason Wang
2017-09-15 19:19 ` Matthew Rosato
2017-09-18 3:13 ` Jason Wang
2017-09-18 4:14 ` [PATCH] vhost_net: conditionally enable tx polling kbuild test robot
2017-09-18 7:36 ` Regression in throughput between kvm guests over virtual bridge Jason Wang
2017-09-18 18:11 ` Matthew Rosato
2017-09-20 6:27 ` Jason Wang
2017-09-20 19:38 ` Matthew Rosato
2017-09-22 4:03 ` Jason Wang
2017-09-25 20:18 ` Matthew Rosato
2017-10-05 20:07 ` Matthew Rosato
2017-10-11 2:41 ` Jason Wang
2017-10-12 18:31 ` Wei Xu
2017-10-18 20:17 ` Matthew Rosato
2017-10-23 2:06 ` Jason Wang
2017-10-23 2:13 ` Michael S. Tsirkin
2017-10-25 20:21 ` Matthew Rosato
2017-10-26 9:44 ` Wei Xu
2017-10-26 17:53 ` Matthew Rosato
2017-10-31 7:07 ` Wei Xu
2017-10-31 7:00 ` Jason Wang
2017-11-03 4:30 ` Matthew Rosato
2017-11-04 23:35 ` Wei Xu
2017-11-08 1:02 ` Matthew Rosato
2017-11-11 20:59 ` Matthew Rosato
2017-11-12 18:34 ` Wei Xu
2017-11-14 20:11 ` Matthew Rosato
2017-11-20 19:25 ` Matthew Rosato
2017-11-27 16:21 ` Wei Xu
2017-11-28 1:36 ` Jason Wang
2017-11-28 2:44 ` Matthew Rosato
2017-11-28 18:00 ` Wei Xu
2017-11-28 3:51 ` Wei Xu [this message]
2017-11-12 15:40 ` Wei Xu
2017-10-23 13:57 ` Wei Xu
2017-10-25 20:31 ` Matthew Rosato