From: Jesper Dangaard Brouer <brouer@redhat.com>
To: "Paweł Staszewski" <pstaszewski@itcare.pl>
Cc: brouer@redhat.com,
	Linux Kernel Network Developers <netdev@vger.kernel.org>,
	Alexander Duyck <alexander.duyck@gmail.com>,
	Saeed Mahameed <saeedm@mellanox.com>,
	Tariq Toukan <tariqt@mellanox.com>
Subject: Re: Kernel 4.13.0-rc4-next-20170811 - IP Routing / Forwarding performance vs Core/RSS number / HT on
Date: Tue, 15 Aug 2017 11:23:07 +0200
Message-ID: <20170815112307.2dd366fe@redhat.com>
In-Reply-To: <1ff1b747-758e-afdd-9376-80ff3bd8a6d5@itcare.pl>

On Tue, 15 Aug 2017 02:38:56 +0200
Paweł Staszewski <pstaszewski@itcare.pl> wrote:

> W dniu 2017-08-14 o 18:19, Jesper Dangaard Brouer pisze:
> > On Sun, 13 Aug 2017 18:58:58 +0200 Paweł Staszewski <pstaszewski@itcare.pl> wrote:
> >  
> >> To show some difference, below is a comparison of vlan/no-vlan traffic
> >>
> >> 10Mpps forwarded traffic with no-vlan vs 6.9Mpps with vlan  
> > I'm trying to reproduce in my testlab (with ixgbe).  I do see a
> > performance reduction of about 10-19% when I forward out a VLAN
> > interface.  This is larger than I expected, but still lower than the
> > 30-40% slowdown you reported.
> >
> > [...]  
> Ok, the Mellanox arrived (MT27700 - mlx5 driver)
> And to compare Mellanox with vlans and without: 33% performance 
> degradation (less than with ixgbe, where I reached ~40% with the same settings)
> 
> Mellanox without TX traffic on vlan:
> ID;CPU_CORES / RSS QUEUES;PKT_SIZE;PPS_RX;BPS_RX;PPS_TX;BPS_TX
> 0;16;64;11089305;709715520;8871553;567779392
> 1;16;64;11096292;710162688;11095566;710116224
> 2;16;64;11095770;710129280;11096799;710195136
> 3;16;64;11097199;710220736;11097702;710252928
> 4;16;64;11080984;567081856;11079662;709098368
> 5;16;64;11077696;708972544;11077039;708930496
> 6;16;64;11082991;709311424;8864802;567347328
> 7;16;64;11089596;709734144;8870927;709789184
> 8;16;64;11094043;710018752;11095391;710105024
> 
> Mellanox with TX traffic on vlan:
> ID;CPU_CORES / RSS QUEUES;PKT_SIZE;PPS_RX;BPS_RX;PPS_TX;BPS_TX
> 0;16;64;7369914;471674496;7370281;471697980
> 1;16;64;7368896;471609408;7368043;471554752
> 2;16;64;7367577;471524864;7367759;471536576
> 3;16;64;7368744;377305344;7369391;471641024
> 4;16;64;7366824;471476736;7364330;471237120
> 5;16;64;7368352;471574528;7367239;471503296
> 6;16;64;7367459;471517376;7367806;471539584
> 7;16;64;7367190;471500160;7367988;471551232
> 8;16;64;7368023;471553472;7368076;471556864

I wonder if the driver's page recycler is active/working or not, and if
the situation is different between VLAN vs no-vlan (given
page_frag_free is so high in your perf top).  The Mellanox driver
fortunately has stats counters to tell us this explicitly (which the
ixgbe driver doesn't).

You can use my ethtool_stats.pl script to watch these stats:
 https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl
(Hint perl dependency:  dnf install perl-Time-HiRes)
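
If the script is not at hand, a quick-and-dirty alternative is to watch
the stats directly with plain ethtool -S.  A minimal sketch, assuming
the page-cache counters have names along the lines of rx_cache_reuse /
rx_cache_full (the exact names may differ between mlx5 driver versions):

  # Watch the mlx5 page-cache/recycler counters, refresh every second.
  # Adjust the grep pattern to whatever counters your driver exposes.
  watch -d -n 1 'ethtool -S enp175s0f0 | grep -E "cache|page"'

If those counters behave differently with and without the VLAN, that
would support the page-recycler theory.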


> ethtool settings for both tests:
> ifc='enp175s0f0 enp175s0f1'
> for i in $ifc
>          do
>          ip link set up dev $i
>          ethtool -A $i autoneg off rx off tx off
>          ethtool -G $i rx 128 tx 256

The ring queue size recommendations might be different for the mlx5
driver (Cc'ing the Mellanox maintainers).
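
For reference, the hardware maximum and currently configured ring sizes
can be read back with standard ethtool before deciding on new -G values
(interface names taken from your setup below):

  # Show maximum and current RX/TX ring sizes
  ethtool -g enp175s0f0
  ethtool -g enp175s0f1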


>          ip link set $i txqueuelen 1000
>          ethtool -C $i rx-usecs 25
>          ethtool -L $i combined 16
>          ethtool -K $i gro off tso off gso off sg on l2-fwd-offload off tx-nocache-copy off ntuple on
>          ethtool -N $i rx-flow-hash udp4 sdfn
>          done

Thanks for being explicit about what your setup is :-)
 
> and perf top:
>     PerfTop:   83650 irqs/sec  kernel:99.7%  exact:  0.0% [4000Hz cycles],  (all, 56 CPUs)
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> 
>      14.25%  [kernel]       [k] dst_release
>      14.17%  [kernel]       [k] skb_dst_force
>      13.41%  [kernel]       [k] rt_cache_valid
>      11.47%  [kernel]       [k] ip_finish_output2
>       7.01%  [kernel]       [k] do_raw_spin_lock
>       5.07%  [kernel]       [k] page_frag_free
>       3.47%  [mlx5_core]    [k] mlx5e_xmit
>       2.88%  [kernel]       [k] fib_table_lookup
>       2.43%  [mlx5_core]    [k] skb_from_cqe.isra.32
>       1.97%  [kernel]       [k] virt_to_head_page
>       1.81%  [mlx5_core]    [k] mlx5e_poll_tx_cq
>       0.93%  [kernel]       [k] __dev_queue_xmit
>       0.87%  [kernel]       [k] __build_skb
>       0.84%  [kernel]       [k] ipt_do_table
>       0.79%  [kernel]       [k] ip_rcv
>       0.79%  [kernel]       [k] acpi_processor_ffh_cstate_enter
>       0.78%  [kernel]       [k] netif_skb_features
>       0.73%  [kernel]       [k] __netif_receive_skb_core
>       0.52%  [kernel]       [k] dev_hard_start_xmit
>       0.52%  [kernel]       [k] build_skb
>       0.51%  [kernel]       [k] ip_route_input_rcu
>       0.50%  [kernel]       [k] skb_unref
>       0.49%  [kernel]       [k] ip_forward
>       0.48%  [mlx5_core]    [k] mlx5_cqwq_get_cqe
>       0.44%  [kernel]       [k] udp_v4_early_demux
>       0.41%  [kernel]       [k] napi_consume_skb
>       0.40%  [kernel]       [k] __local_bh_enable_ip
>       0.39%  [kernel]       [k] ip_rcv_finish
>       0.39%  [kernel]       [k] kmem_cache_alloc
>       0.38%  [kernel]       [k] sch_direct_xmit
>       0.33%  [kernel]       [k] validate_xmit_skb
>       0.32%  [mlx5_core]    [k] mlx5e_free_rx_wqe_reuse
>       0.29%  [kernel]       [k] netdev_pick_tx
>       0.28%  [mlx5_core]    [k] mlx5e_build_rx_skb
>       0.27%  [kernel]       [k] deliver_ptype_list_skb
>       0.26%  [kernel]       [k] fib_validate_source
>       0.26%  [mlx5_core]    [k] mlx5e_napi_poll
>       0.26%  [mlx5_core]    [k] mlx5e_handle_rx_cqe
>       0.26%  [mlx5_core]    [k] mlx5e_rx_cache_get
>       0.25%  [kernel]       [k] eth_header
>       0.23%  [kernel]       [k] skb_network_protocol
>       0.20%  [kernel]       [k] nf_hook_slow
>       0.20%  [kernel]       [k] vlan_passthru_hard_header
>       0.20%  [kernel]       [k] vlan_dev_hard_start_xmit
>       0.19%  [kernel]       [k] swiotlb_map_page
>       0.18%  [kernel]       [k] compound_head
>       0.18%  [kernel]       [k] neigh_connected_output
>       0.18%  [mlx5_core]    [k] mlx5e_alloc_rx_wqe
>       0.18%  [kernel]       [k] ip_output
>       0.17%  [kernel]       [k] prefetch_freepointer.isra.70
>       0.17%  [kernel]       [k] __slab_free
>       0.16%  [kernel]       [k] eth_type_vlan
>       0.16%  [kernel]       [k] ip_finish_output
>       0.15%  [kernel]       [k] kmem_cache_free_bulk
>       0.14%  [kernel]       [k] netif_receive_skb_internal
> 
> 
> 
> 
> wondering why this:
>       1.97%  [kernel]       [k] virt_to_head_page
> is in the top...

This is related to the page_frag_free() call, but it is weird that it
shows up, because it is supposed to be inlined (it is explicitly marked
inline in include/linux/mm.h).
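
For reference, a rough sketch of that call chain, paraphrased from
memory of the 4.13-era mm/page_alloc.c (not a verbatim quote):

  void page_frag_free(void *addr)
  {
          /* Map the fragment's virtual address back to its head page.
           * virt_to_head_page() is expected to be inlined right here,
           * so it should normally not show up as a separate symbol.
           */
          struct page *page = virt_to_head_page(addr);

          if (unlikely(put_page_testzero(page)))
                  __free_pages_ok(page, compound_order(page));
  }

Seeing virt_to_head_page() as its own symbol suggests the compiler
chose not to inline it in this build, or that the samples come from
some other (un-inlined) caller.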


> >>>>> perf top:
> >>>>>
> >>>>>     PerfTop:   77835 irqs/sec  kernel:99.7%
> >>>>> ---------------------------------------------
> >>>>>
> >>>>>        16.32%  [kernel]       [k] skb_dst_force
> >>>>>        16.30%  [kernel]       [k] dst_release
> >>>>>        15.11%  [kernel]       [k] rt_cache_valid
> >>>>>        12.62%  [kernel]       [k] ipv4_mtu  
> >>>> It seems a little strange that these 4 functions are on the top  
> > I don't see these in my test.
> >  
> >>>>     
> >>>>>         5.60%  [kernel]       [k] do_raw_spin_lock  
> >>>> Why is calling/taking this lock? (Use perf call-graph recording).  
> >>> can be hard to paste it here:)
> >>> attached file  
> > The attachment was very big. Please don't attach such big files on mailing
> > lists.  Next time please share them via e.g. pastebin. The output was a
> > capture from your terminal, which made it more difficult to
> > read.  Hint: You can/could use perf report --stdio and place it in a file
> > instead.
> >
> > The output (extracted below) didn't show who called 'do_raw_spin_lock',
> > BUT it showed another interesting thing.  The kernel code in
> > __dev_queue_xmit() might create a route dst-cache problem for itself(?),
> > as it will first call skb_dst_force() and then skb_dst_drop() when the
> > packet is transmitted on a VLAN.
> >
> >   static int __dev_queue_xmit(struct sk_buff *skb, void *accel_priv)
> >   {
> >   [...]
> > 	/* If device/qdisc don't need skb->dst, release it right now while
> > 	 * its hot in this cpu cache.
> > 	 */
> > 	if (dev->priv_flags & IFF_XMIT_DST_RELEASE)
> > 		skb_dst_drop(skb);
> > 	else
> > 		skb_dst_force(skb);
> >
> >
> >
> > Extracted part of attached perf output:
> >
> >   --5.37%--ip_rcv_finish
> >     |
> >     |--4.02%--ip_forward
> >     |   |
> >     |    --3.92%--ip_forward_finish
> >     |       |
> >     |        --3.91%--ip_output
> >     |          |
> >     |           --3.90%--ip_finish_output
> >     |              |
> >     |               --3.88%--ip_finish_output2
> >     |                  |
> >     |                   --2.77%--neigh_connected_output
> >     |                     |
> >     |                      --2.74%--dev_queue_xmit
> >     |                         |
> >     |                          --2.73%--__dev_queue_xmit
> >     |                             |
> >     |                             |--1.66%--dev_hard_start_xmit
> >     |                             |   |
> >     |                             |    --1.64%--vlan_dev_hard_start_xmit
> >     |                             |       |
> >     |                             |        --1.63%--dev_queue_xmit
> >     |                             |           |
> >     |                             |            --1.62%--__dev_queue_xmit
> >     |                             |               |
> >     |                             |               |--0.99%--skb_dst_drop.isra.77
> >     |                             |               |   |
> >     |                             |               |   --0.99%--dst_release
> >     |                             |               |
> >     |                             |                --0.55%--sch_direct_xmit
> >     |                             |
> >     |                              --0.99%--skb_dst_force
> >     |
> >      --1.29%--ip_route_input_noref
> >          |
> >           --1.29%--ip_route_input_rcu
> >               |
> >                --1.05%--rt_cache_valid
> >  
> 
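
(Side note on the force-then-drop pattern above: if I remember
correctly, vlan_dev_init() calls netif_keep_dst(), so the VLAN device
has IFF_XMIT_DST_RELEASE cleared and takes the skb_dst_force() branch,
while the real device still has the flag set and releases the dst
again.  A rough sketch of the helper, from memory of
include/linux/netdevice.h:

  /* Clearing these flags makes __dev_queue_xmit() keep a forced
   * reference on skb->dst instead of releasing it early.
   */
  static inline void netif_keep_dst(struct net_device *dev)
  {
          dev->priv_flags &= ~(IFF_XMIT_DST_RELEASE | IFF_XMIT_DST_RELEASE_PERM);
  }

That would explain why dst_release/skb_dst_force dominate the VLAN
profile but not the no-vlan one.)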



-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
