From: Jesper Dangaard Brouer <brouer@redhat.com>
To: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Rick Jones <rick.jones2@hpe.com>,
	netdev@vger.kernel.org, brouer@redhat.com,
	Saeed Mahameed <saeedm@mellanox.com>,
	Tariq Toukan <tariqt@mellanox.com>
Subject: Re: Netperf UDP issue with connected sockets
Date: Thu, 17 Nov 2016 19:30:21 +0100
Message-ID: <20161117193021.580589ae@redhat.com>
In-Reply-To: <1479399679.8455.255.camel@edumazet-glaptop3.roam.corp.google.com>

On Thu, 17 Nov 2016 08:21:19 -0800
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On Thu, 2016-11-17 at 15:57 +0100, Jesper Dangaard Brouer wrote:
> > On Thu, 17 Nov 2016 06:17:38 -0800
> > Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >   
> > > On Thu, 2016-11-17 at 14:42 +0100, Jesper Dangaard Brouer wrote:
> > >   
> > > > I can see that qdisc layer does not activate xmit_more in this case.
> > > >     
> > > 
> > > Sure. Not enough pressure from the sender(s).
> > > 
> > > The bottleneck is not the NIC or qdisc in your case, meaning that BQL
> > > limit is kept at a small value.
> > > 
> > > (BTW not all NIC have expensive doorbells)  
> > 
> > I believe this NIC mlx5 (50G edition) does.
> > 
> > I'm seeing UDP TX of 1656017.55 pps, which is per packet:
> > 2414 cycles(tsc) 603.86 ns
> > 
> > Perf top shows (with my own udp_flood, that avoids __ip_select_ident):
> > 
> >  Samples: 56K of event 'cycles', Event count (approx.): 51613832267
> >    Overhead  Command        Shared Object        Symbol
> >  +    8.92%  udp_flood      [kernel.vmlinux]     [k] _raw_spin_lock
> >    - _raw_spin_lock
> >       + 90.78% __dev_queue_xmit
> >       + 7.83% dev_queue_xmit
> >       + 1.30% ___slab_alloc
> >  +    5.59%  udp_flood      [kernel.vmlinux]     [k] skb_set_owner_w  
> 
> Does TX completion happen on the same cpu?
> 
> >  +    4.77%  udp_flood      [mlx5_core]          [k] mlx5e_sq_xmit
> >  +    4.09%  udp_flood      [kernel.vmlinux]     [k] fib_table_lookup  
> 
> Why is fib_table_lookup() used with a connected UDP flow ???

Don't know. I would be interested in hints on how to avoid this!

I also see it with netperf, and my udp_flood code is here:
 https://github.com/netoptimizer/network-testing/blob/master/src/udp_flood.c
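
For readers unfamiliar with the term, a "connected" UDP socket is one where
connect() has recorded the peer address once, so each later send() carries
no destination argument. The sketch below only illustrates that difference
at the socket API level; it is not the udp_flood.c code, and the function
names and the loopback receiver are assumptions made so that it runs
without a second host.

```python
import socket


def send_connected(dest, payload, count):
    """Send `count` datagrams over a connected UDP socket; return bytes sent."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.connect(dest)  # peer address is recorded once, up front
        return sum(s.send(payload) for _ in range(count))


def send_unconnected(dest, payload, count):
    """Send `count` datagrams, passing the destination on every call."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        return sum(s.sendto(payload, dest) for _ in range(count))


if __name__ == "__main__":
    # Loopback receiver bound to an ephemeral port (assumption for the demo).
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as rx:
        rx.bind(("127.0.0.1", 0))
        dest = rx.getsockname()
        print("connected bytes:  ", send_connected(dest, b"x" * 18, 4))
        print("unconnected bytes:", send_unconnected(dest, b"x" * 18, 4))
```

Since the destination is fixed in the connected case, a full route lookup
per packet is not what one would expect to see in the profile.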


> >  +    4.00%  swapper        [mlx5_core]          [k] mlx5e_poll_tx_cq
> >  +    3.11%  udp_flood      [kernel.vmlinux]     [k] __ip_route_output_key_hash  
> 
> Same here, this is suspect.

It is the function calling fib_table_lookup(), and it is called by ip_route_output_flow().
 
> >  +    2.49%  swapper        [kernel.vmlinux]     [k] __slab_free
> > 
> > In this setup the spinlock in __dev_queue_xmit should be uncontended.
> > An uncontended spin_lock+unlock costs 32 cycles(tsc) 8.198 ns on this system.
> > 
> > But 8.92% of the time is spent on it, which corresponds to a cost of 215
> > cycles (2414*0.0892).  This cost is too high, thus something else is
> > going on... I claim this mysterious extra cost is the tailptr/doorbell.  
> 
> Well, with no pressure, doorbell is triggered for each packet.
> 
> Since we can not predict that your application is going to send yet
> another packet one usec later, we can not avoid this.

The point is that I can see a socket Send-Q forming, so we do know the
application has something to send. Thus, there is a possibility for
non-opportunistic bulking. Allowing/implementing bulk enqueue from the
socket layer into the qdisc layer should be fairly simple (and the rest
of xmit_more is already in place).
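
As a cross-check of the cycle accounting quoted earlier in this mail, the
numbers can be re-derived in a few lines. All four constants are taken from
the measurements quoted in this thread; the function and variable names are
only illustrative.

```python
PPS = 1656017.55         # measured UDP TX packets per second
CYCLES_PER_PKT = 2414    # measured TSC cycles per packet
SPINLOCK_SHARE = 0.0892  # perf share attributed to _raw_spin_lock
LOCK_BASELINE = 32       # measured cycles for a plain spin_lock+unlock


def ns_per_packet(pps):
    """Per-packet time budget in nanoseconds at a given packet rate."""
    return 1e9 / pps


def cycles_attributed(cycles_per_pkt, share):
    """Cycles per packet that a perf overhead share corresponds to."""
    return cycles_per_pkt * share


if __name__ == "__main__":
    lock_cycles = cycles_attributed(CYCLES_PER_PKT, SPINLOCK_SHARE)
    print(f"{ns_per_packet(PPS):.2f} ns per packet")
    print(f"{lock_cycles:.0f} cycles attributed to _raw_spin_lock")
    print(f"{lock_cycles - LOCK_BASELINE:.0f} cycles above the lock baseline")
```

This gives about 215 cycles charged to the lock, roughly 183 more than the
32-cycle baseline, which is the gap being attributed to the doorbell.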


> Note that with the patches I am working on (busypolling extensions),
> we could decide to let the busypoller do the doorbells, say one every
> 10 usec. (or after ~16 packets were queued)

Sounds interesting, but it is still an opportunistic approach.
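
To get a feel for what such a policy would save, here is a toy model of the
"one doorbell every 10 usec or after ~16 packets" idea. It is a sketch
only: it is not kernel code, the function name and parameters are
hypothetical, and the 600 ns packet spacing is an approximation of the
per-packet time measured above.

```python
def count_doorbells(arrival_times_ns, max_batch=16, max_delay_ns=10_000):
    """Count doorbells under a toy batching policy.

    A doorbell is rung once `max_batch` packets are queued, or once the
    oldest queued packet has waited `max_delay_ns` (checked at the next
    arrival). Any packets still queued at the end get one final doorbell.
    """
    doorbells = 0
    queued = 0
    oldest = 0
    for t in arrival_times_ns:
        # Deadline expired before this packet arrived: flush what is queued.
        if queued and t - oldest >= max_delay_ns:
            doorbells += 1
            queued = 0
        if queued == 0:
            oldest = t
        queued += 1
        if queued >= max_batch:
            doorbells += 1
            queued = 0
    if queued:
        doorbells += 1
    return doorbells


if __name__ == "__main__":
    arrivals = [i * 600 for i in range(1600)]  # one packet every 600 ns
    print("per-packet doorbells:", count_doorbells(arrivals, max_batch=1))
    print("batched doorbells:   ", count_doorbells(arrivals))
```

With a steady sender the batch limit is reached before the deadline, so in
this model the doorbell count drops by the batch size; a sparse sender
still gets one doorbell per packet.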

> But mlx4 uses two different NAPI for TX and RX, maybe mlx5 has the same
> strategy .

It is a possibility that TX completions were happening on another CPU
(but I don't think so for mlx5).

To rule that out, I reran the experiment making sure to pin everything
to CPU-0 and the results are the same.

sudo ethtool -L mlx5p2 combined 1

sudo sh -c '\
 for x in /proc/irq/*/mlx5*/../smp_affinity; do \
   echo 01 > "$x"; grep . -H "$x"; done \
'

$ taskset -c 0 ./udp_flood --sendto 198.18.50.1 --count $((10**9))
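
The "01" written to smp_affinity above is a hexadecimal CPU bitmask in
which bit N selects CPU N, so it means CPU 0 only. The sketch below shows
how such a mask is formed and how a process can pin itself, similar in
effect to taskset. The helper names are illustrative, and
os.sched_setaffinity() is available on Linux only.

```python
import os


def cpus_to_affinity_mask(cpus):
    """Return the hex bitmask string for CPU numbers (bit N = CPU N)."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return format(mask, "x")


def pin_to_cpu(cpu):
    """Restrict the calling process to one CPU and return the new CPU set."""
    os.sched_setaffinity(0, {cpu})  # pid 0 means the calling process
    return os.sched_getaffinity(0)


if __name__ == "__main__":
    print(cpus_to_affinity_mask([0]))  # mask for CPU 0 only
    # Use the first CPU this process is allowed on, which may not be CPU 0
    # inside a restricted cpuset.
    first = min(os.sched_getaffinity(0))
    print(pin_to_cpu(first))
```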

Perf report validating CPU in use:

$ perf report -g --no-children --sort cpu,comm,dso,symbol --stdio --call-graph none

# Overhead  CPU  Command         Shared Object        Symbol                                   
# ........  ...  ..............  ...................  .........................................
#
     9.97%  000  udp_flood       [kernel.vmlinux]     [k] _raw_spin_lock                       
     4.37%  000  udp_flood       [kernel.vmlinux]     [k] fib_table_lookup                     
     3.97%  000  udp_flood       [mlx5_core]          [k] mlx5e_sq_xmit                        
     3.06%  000  udp_flood       [kernel.vmlinux]     [k] __ip_route_output_key_hash           
     2.51%  000  udp_flood       [mlx5_core]          [k] mlx5e_poll_tx_cq                     
     2.48%  000  udp_flood       [kernel.vmlinux]     [k] copy_user_enhanced_fast_string       
     2.47%  000  udp_flood       [kernel.vmlinux]     [k] entry_SYSCALL_64                     
     2.42%  000  udp_flood       [kernel.vmlinux]     [k] udp_sendmsg                          
     2.39%  000  udp_flood       [kernel.vmlinux]     [k] __ip_append_data.isra.47             
     2.29%  000  udp_flood       [kernel.vmlinux]     [k] sock_def_write_space                 
     2.19%  000  udp_flood       [mlx5_core]          [k] mlx5e_get_cqe                        
     1.95%  000  udp_flood       [kernel.vmlinux]     [k] __ip_make_skb                        
     1.94%  000  udp_flood       [kernel.vmlinux]     [k] __alloc_skb                          
     1.90%  000  udp_flood       [kernel.vmlinux]     [k] sock_wfree                           
     1.85%  000  udp_flood       [kernel.vmlinux]     [k] skb_set_owner_w                      
     1.62%  000  udp_flood       [kernel.vmlinux]     [k] ip_finish_output2                    
     1.61%  000  udp_flood       [kernel.vmlinux]     [k] kfree                                
     1.54%  000  udp_flood       [kernel.vmlinux]     [k] entry_SYSCALL_64_fastpath            
     1.35%  000  udp_flood       libc-2.17.so         [.] __sendmsg_nocancel                   
     1.26%  000  udp_flood       [kernel.vmlinux]     [k] __kmalloc_node_track_caller          
     1.24%  000  udp_flood       [kernel.vmlinux]     [k] __rcu_read_unlock                    
     1.23%  000  udp_flood       [kernel.vmlinux]     [k] __local_bh_enable_ip                 
     1.22%  000  udp_flood       [kernel.vmlinux]     [k] ip_idents_reserve         
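
A listing like the one above can be checked mechanically, both for the set
of CPUs that appear and for the overhead summed per shared object. This is
a sketch that assumes the column layout shown here (overhead, CPU, command,
shared object, symbol); the sample rows are copied from the report above
and the helper names are illustrative.

```python
import re
from collections import defaultdict

# overhead%  cpu  command  shared-object  [type] symbol
LINE_RE = re.compile(
    r"^\s*(\d+\.\d+)%\s+(\d+)\s+(\S+)\s+(\S+)\s+\[(.)\]\s+(\S+)"
)


def parse_rows(report_text):
    """Yield (overhead, cpu, dso, symbol) for each data row of the report."""
    for line in report_text.splitlines():
        m = LINE_RE.match(line)
        if m:
            yield float(m.group(1)), m.group(2), m.group(4), m.group(6)


def cpus_seen(report_text):
    """Return the set of CPU ids that appear in the report."""
    return {cpu for _, cpu, _, _ in parse_rows(report_text)}


def overhead_by_dso(report_text):
    """Sum the overhead percentages per shared object."""
    totals = defaultdict(float)
    for overhead, _, dso, _ in parse_rows(report_text):
        totals[dso] += overhead
    return dict(totals)


SAMPLE = """\
     9.97%  000  udp_flood       [kernel.vmlinux]     [k] _raw_spin_lock
     4.37%  000  udp_flood       [kernel.vmlinux]     [k] fib_table_lookup
     3.97%  000  udp_flood       [mlx5_core]          [k] mlx5e_sq_xmit
     3.06%  000  udp_flood       [kernel.vmlinux]     [k] __ip_route_output_key_hash
     2.51%  000  udp_flood       [mlx5_core]          [k] mlx5e_poll_tx_cq
     2.19%  000  udp_flood       [mlx5_core]          [k] mlx5e_get_cqe
"""

if __name__ == "__main__":
    print(cpus_seen(SAMPLE))
    for dso, total in overhead_by_dso(SAMPLE).items():
        print(f"{dso:20s} {total:.2f}%")
```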

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer
