From: Jesper Dangaard Brouer <brouer@redhat.com>
To: "Paweł Staszewski" <pstaszewski@itcare.pl>
Cc: brouer@redhat.com,
	Linux Kernel Network Developers <netdev@vger.kernel.org>,
	Alexander Duyck <alexander.duyck@gmail.com>,
	Saeed Mahameed <saeedm@mellanox.com>,
	Tariq Toukan <tariqt@mellanox.com>
Subject: Re: Kernel 4.13.0-rc4-next-20170811 - IP Routing / Forwarding performance vs Core/RSS number / HT on
Date: Tue, 15 Aug 2017 11:23:07 +0200
Message-ID: <20170815112307.2dd366fe@redhat.com>
In-Reply-To: <1ff1b747-758e-afdd-9376-80ff3bd8a6d5@itcare.pl>

On Tue, 15 Aug 2017 02:38:56 +0200
Paweł Staszewski <pstaszewski@itcare.pl> wrote:

> On 2017-08-14 at 18:19, Jesper Dangaard Brouer wrote:
> > On Sun, 13 Aug 2017 18:58:58 +0200 Paweł Staszewski <pstaszewski@itcare.pl> wrote:
> >  
> >> To show the difference, below is a comparison of VLAN vs. no-VLAN traffic:
> >>
> >> 10Mpps forwarded traffic with no VLAN vs 6.9Mpps with VLAN  
> > I'm trying to reproduce this in my testlab (with ixgbe).  I do see a
> > performance reduction of about 10-19% when I forward out a VLAN
> > interface.  This is larger than I expected, but still lower than the
> > 30-40% slowdown you reported.
> >
> > [...]  
> OK, the Mellanox card arrived (MT27700 - mlx5 driver).
> Comparing Mellanox with and without VLANs: 33% performance
> degradation (less than with ixgbe, where I reach ~40% with the same settings)
> 
> Mellanox without TX traffic on vlan:
> ID;CPU_CORES / RSS QUEUES;PKT_SIZE;PPS_RX;BPS_RX;PPS_TX;BPS_TX
> 0;16;64;11089305;709715520;8871553;567779392
> 1;16;64;11096292;710162688;11095566;710116224
> 2;16;64;11095770;710129280;11096799;710195136
> 3;16;64;11097199;710220736;11097702;710252928
> 4;16;64;11080984;567081856;11079662;709098368
> 5;16;64;11077696;708972544;11077039;708930496
> 6;16;64;11082991;709311424;8864802;567347328
> 7;16;64;11089596;709734144;8870927;709789184
> 8;16;64;11094043;710018752;11095391;710105024
> 
> Mellanox with TX traffic on vlan:
> ID;CPU_CORES / RSS QUEUES;PKT_SIZE;PPS_RX;BPS_RX;PPS_TX;BPS_TX
> 0;16;64;7369914;471674496;7370281;471697980
> 1;16;64;7368896;471609408;7368043;471554752
> 2;16;64;7367577;471524864;7367759;471536576
> 3;16;64;7368744;377305344;7369391;471641024
> 4;16;64;7366824;471476736;7364330;471237120
> 5;16;64;7368352;471574528;7367239;471503296
> 6;16;64;7367459;471517376;7367806;471539584
> 7;16;64;7367190;471500160;7367988;471551232
> 8;16;64;7368023;471553472;7368076;471556864
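
The reported ~33% figure can be re-derived from the two tables above. The
following is only a minimal sketch: it copies the PPS_RX columns by hand
and compares the means; the helper names (mean, relative_drop) are
hypothetical and not part of any tool used in this thread.

```python
# PPS_RX samples copied from the two quoted tables (16 cores, 64-byte packets).
NO_VLAN_PPS_RX = [11089305, 11096292, 11095770, 11097199, 11080984,
                  11077696, 11082991, 11089596, 11094043]
VLAN_PPS_RX = [7369914, 7368896, 7367577, 7368744, 7366824,
               7368352, 7367459, 7367190, 7368023]


def mean(samples):
    """Arithmetic mean of a list of per-interval packet rates."""
    return sum(samples) / len(samples)


def relative_drop(baseline, degraded):
    """Fractional reduction of the degraded mean relative to the baseline mean."""
    return 1.0 - mean(degraded) / mean(baseline)


drop = relative_drop(NO_VLAN_PPS_RX, VLAN_PPS_RX)
print(f"no-VLAN: {mean(NO_VLAN_PPS_RX) / 1e6:.2f} Mpps")
print(f"VLAN:    {mean(VLAN_PPS_RX) / 1e6:.2f} Mpps")
print(f"drop:    {drop:.1%}")
```

With these samples the drop comes out between 33% and 34%, which matches
the number quoted above.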

I wonder if the driver's page recycler is active/working or not, and if
the situation is different between VLAN vs no-VLAN (given
page_frag_free is so high in your perf top).  The Mellanox drivers
fortunately have a stats counter to tell us this explicitly (which the
ixgbe driver doesn't).

You can use my ethtool_stats.pl script to watch these stats:
 https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl
(Hint perl dependency:  dnf install perl-Time-HiRes)
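
If the perl script is not at hand, the same idea (sample "ethtool -S"
twice and print per-second deltas) can be sketched in a few lines of
Python.  This is not the ethtool_stats.pl implementation, only an
illustration.  Assumptions: the page-recycler counters are taken to start
with "rx_cache_" (check the actual "ethtool -S" output of your driver
version), and the two sample texts below are made up for the example,
not real measurements.

```python
import re
import subprocess


def parse_ethtool_stats(text):
    """Parse `ethtool -S <dev>` style output into a {counter_name: int} dict."""
    stats = {}
    for line in text.splitlines():
        m = re.match(r"\s*([^:]+):\s*(-?\d+)\s*$", line)
        if m:  # skips the "NIC statistics:" header and blank lines
            stats[m.group(1).strip()] = int(m.group(2))
    return stats


def rates(before, after, interval_sec, prefix="rx_cache_"):
    """Per-second change of every counter whose name starts with prefix."""
    return {name: (after[name] - before[name]) / interval_sec
            for name in sorted(after)
            if name.startswith(prefix) and name in before}


def read_stats(dev):
    """Take one live sample (needs ethtool installed; not called below)."""
    out = subprocess.run(["ethtool", "-S", dev], check=True,
                         capture_output=True, text=True).stdout
    return parse_ethtool_stats(out)


# Made-up samples, taken 2 seconds apart, only to exercise the parser.
SAMPLE_T0 = """NIC statistics:
     rx_packets: 1000
     rx_cache_reuse: 900
     rx_cache_full: 10
"""
SAMPLE_T1 = """NIC statistics:
     rx_packets: 3000
     rx_cache_reuse: 1800
     rx_cache_full: 20
"""

example = rates(parse_ethtool_stats(SAMPLE_T0),
                parse_ethtool_stats(SAMPLE_T1), interval_sec=2.0)
for name, per_sec in example.items():
    print(f"{name:20s} {per_sec:12.1f} /sec")
```

On a live system one would call read_stats("enp175s0f0") twice with a
sleep in between and compare the VLAN and no-VLAN runs.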


> ethtool settings for both tests:
> ifc='enp175s0f0 enp175s0f1'
> for i in $ifc
>          do
>          ip link set up dev $i
>          ethtool -A $i autoneg off rx off tx off
>          ethtool -G $i rx 128 tx 256

The ring queue size recommendations might be different for the mlx5
driver (Cc'ing Mellanox maintainers).


>          ip link set $i txqueuelen 1000
>          ethtool -C $i rx-usecs 25
>          ethtool -L $i combined 16
>          ethtool -K $i gro off tso off gso off sg on l2-fwd-offload off tx-nocache-copy off ntuple on
>          ethtool -N $i rx-flow-hash udp4 sdfn
>          done

Thanks for being explicit about what your setup is :-)
 
> and perf top:
>     PerfTop:   83650 irqs/sec  kernel:99.7%  exact:  0.0% [4000Hz cycles],  (all, 56 CPUs)
> ------------------------------------------------------------------------
> 
>      14.25%  [kernel]       [k] dst_release
>      14.17%  [kernel]       [k] skb_dst_force
>      13.41%  [kernel]       [k] rt_cache_valid
>      11.47%  [kernel]       [k] ip_finish_output2
>       7.01%  [kernel]       [k] do_raw_spin_lock
>       5.07%  [kernel]       [k] page_frag_free
>       3.47%  [mlx5_core]    [k] mlx5e_xmit
>       2.88%  [kernel]       [k] fib_table_lookup
>       2.43%  [mlx5_core]    [k] skb_from_cqe.isra.32
>       1.97%  [kernel]       [k] virt_to_head_page
>       1.81%  [mlx5_core]    [k] mlx5e_poll_tx_cq
>       0.93%  [kernel]       [k] __dev_queue_xmit
>       0.87%  [kernel]       [k] __build_skb
>       0.84%  [kernel]       [k] ipt_do_table
>       0.79%  [kernel]       [k] ip_rcv
>       0.79%  [kernel]       [k] acpi_processor_ffh_cstate_enter
>       0.78%  [kernel]       [k] netif_skb_features
>       0.73%  [kernel]       [k] __netif_receive_skb_core
>       0.52%  [kernel]       [k] dev_hard_start_xmit
>       0.52%  [kernel]       [k] build_skb
>       0.51%  [kernel]       [k] ip_route_input_rcu
>       0.50%  [kernel]       [k] skb_unref
>       0.49%  [kernel]       [k] ip_forward
>       0.48%  [mlx5_core]    [k] mlx5_cqwq_get_cqe
>       0.44%  [kernel]       [k] udp_v4_early_demux
>       0.41%  [kernel]       [k] napi_consume_skb
>       0.40%  [kernel]       [k] __local_bh_enable_ip
>       0.39%  [kernel]       [k] ip_rcv_finish
>       0.39%  [kernel]       [k] kmem_cache_alloc
>       0.38%  [kernel]       [k] sch_direct_xmit
>       0.33%  [kernel]       [k] validate_xmit_skb
>       0.32%  [mlx5_core]    [k] mlx5e_free_rx_wqe_reuse
>       0.29%  [kernel]       [k] netdev_pick_tx
>       0.28%  [mlx5_core]    [k] mlx5e_build_rx_skb
>       0.27%  [kernel]       [k] deliver_ptype_list_skb
>       0.26%  [kernel]       [k] fib_validate_source
>       0.26%  [mlx5_core]    [k] mlx5e_napi_poll
>       0.26%  [mlx5_core]    [k] mlx5e_handle_rx_cqe
>       0.26%  [mlx5_core]    [k] mlx5e_rx_cache_get
>       0.25%  [kernel]       [k] eth_header
>       0.23%  [kernel]       [k] skb_network_protocol
>       0.20%  [kernel]       [k] nf_hook_slow
>       0.20%  [kernel]       [k] vlan_passthru_hard_header
>       0.20%  [kernel]       [k] vlan_dev_hard_start_xmit
>       0.19%  [kernel]       [k] swiotlb_map_page
>       0.18%  [kernel]       [k] compound_head
>       0.18%  [kernel]       [k] neigh_connected_output
>       0.18%  [mlx5_core]    [k] mlx5e_alloc_rx_wqe
>       0.18%  [kernel]       [k] ip_output
>       0.17%  [kernel]       [k] prefetch_freepointer.isra.70
>       0.17%  [kernel]       [k] __slab_free
>       0.16%  [kernel]       [k] eth_type_vlan
>       0.16%  [kernel]       [k] ip_finish_output
>       0.15%  [kernel]       [k] kmem_cache_free_bulk
>       0.14%  [kernel]       [k] netif_receive_skb_internal
> 
> 
> 
> 
> wondering why this:
>       1.97%  [kernel]       [k] virt_to_head_page
> is in top...

This is related to the page_frag_free() call, but it is weird that it
shows up because it is supposed to be inlined (it is explicitly marked
inline in include/linux/mm.h).


> >>>>> perf top:
> >>>>>
> >>>>>     PerfTop:   77835 irqs/sec  kernel:99.7%
> >>>>> ---------------------------------------------
> >>>>>
> >>>>>        16.32%  [kernel]       [k] skb_dst_force
> >>>>>        16.30%  [kernel]       [k] dst_release
> >>>>>        15.11%  [kernel]       [k] rt_cache_valid
> >>>>>        12.62%  [kernel]       [k] ipv4_mtu  
> >>>> It seems a little strange that these 4 functions are on the top  
> > I don't see these in my test.
> >  
> >>>>     
> >>>>>         5.60%  [kernel]       [k] do_raw_spin_lock  
> >>>> Who is calling/taking this lock? (Use perf call-graph recording).  
> >>> can be hard to paste it here:)
> >>> attached file  
> > The attachment was very big. Please don't attach such big files to mailing
> > lists.  Next time please share them via e.g. pastebin. The output was a
> > capture from your terminal, which made it more difficult to
> > read.  Hint: You can use perf report --stdio and redirect the output to a
> > file instead.
> >
> > The output (extracted below) didn't show who called 'do_raw_spin_lock',
> > BUT it showed another interesting thing.  The kernel code in
> > __dev_queue_xmit() might create a route dst-cache problem for itself(?),
> > as it will first call skb_dst_force() and then skb_dst_drop() when the
> > packet is transmitted on a VLAN.
> >
> >   static int __dev_queue_xmit(struct sk_buff *skb, void *accel_priv)
> >   {
> >   [...]
> > 	/* If device/qdisc don't need skb->dst, release it right now while
> > 	 * its hot in this cpu cache.
> > 	 */
> > 	if (dev->priv_flags & IFF_XMIT_DST_RELEASE)
> > 		skb_dst_drop(skb);
> > 	else
> > 		skb_dst_force(skb);
> >
> >
> >
> > Extracted part of attached perf output:
> >
> >   --5.37%--ip_rcv_finish
> >     |
> >     |--4.02%--ip_forward
> >     |   |
> >     |    --3.92%--ip_forward_finish
> >     |       |
> >     |        --3.91%--ip_output
> >     |          |
> >     |           --3.90%--ip_finish_output
> >     |              |
> >     |               --3.88%--ip_finish_output2
> >     |                  |
> >     |                   --2.77%--neigh_connected_output
> >     |                     |
> >     |                      --2.74%--dev_queue_xmit
> >     |                         |
> >     |                          --2.73%--__dev_queue_xmit
> >     |                             |
> >     |                             |--1.66%--dev_hard_start_xmit
> >     |                             |   |
> >     |                             |    --1.64%--vlan_dev_hard_start_xmit
> >     |                             |       |
> >     |                             |        --1.63%--dev_queue_xmit
> >     |                             |           |
> >     |                             |            --1.62%--__dev_queue_xmit
> >     |                             |               |
> >     |                             |               |--0.99%--skb_dst_drop.isra.77
> >     |                             |               |   |
> >     |                             |               |   --0.99%--dst_release
> >     |                             |               |
> >     |                             |                --0.55%--sch_direct_xmit
> >     |                             |
> >     |                              --0.99%--skb_dst_force
> >     |
> >      --1.29%--ip_route_input_noref
> >          |
> >           --1.29%--ip_route_input_rcu
> >               |
> >                --1.05%--rt_cache_valid
> >  
> 
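
To make the skb_dst_force()/skb_dst_drop() sequence from the quoted
analysis concrete, here is a toy model in Python.  It is not kernel code:
the classes, the atomic_ops counter and the per-device flag values are
assumptions read off the call graph above (the VLAN device pass calls
skb_dst_force(), the lower device pass calls skb_dst_drop()), and the
model only counts reference-count updates on the shared route entry.

```python
class Dst:
    """Stand-in for a shared route cache entry (struct dst_entry)."""

    def __init__(self):
        self.refcnt = 1      # reference held by the route cache itself
        self.atomic_ops = 0  # simulated atomic inc/dec on the shared entry


class Skb:
    """Stand-in for a packet carrying a non-refcounted (noref) dst."""

    def __init__(self, dst):
        self.dst = dst
        self.noref = True  # forwarding path attaches the dst without a reference


def skb_dst_force(skb):
    # A noref dst must be turned into a real reference: one atomic increment.
    if skb.dst is not None and skb.noref:
        skb.dst.refcnt += 1
        skb.dst.atomic_ops += 1
        skb.noref = False


def skb_dst_drop(skb):
    # Dropping a noref dst is free; dropping a real reference is an atomic decrement.
    if skb.dst is not None:
        if not skb.noref:
            skb.dst.refcnt -= 1
            skb.dst.atomic_ops += 1
        skb.dst = None


def dev_queue_xmit(skb, xmit_dst_release):
    # Mirrors the if/else on IFF_XMIT_DST_RELEASE quoted above.
    if xmit_dst_release:
        skb_dst_drop(skb)
    else:
        skb_dst_force(skb)


def forward(dst, via_vlan):
    skb = Skb(dst)
    if via_vlan:
        dev_queue_xmit(skb, xmit_dst_release=False)  # VLAN device pass (assumed)
    dev_queue_xmit(skb, xmit_dst_release=True)       # physical device pass


plain, vlan = Dst(), Dst()
forward(plain, via_vlan=False)
forward(vlan, via_vlan=True)
print("atomic ops per packet, no VLAN:", plain.atomic_ops)
print("atomic ops per packet, VLAN:   ", vlan.atomic_ops)
```

In this model the VLAN path costs two atomic operations per packet on a
route entry shared by all CPUs, while the non-VLAN path costs none, which
would be consistent with dst_release and skb_dst_force showing up at the
top of the profile.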



-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
