From: Jesper Dangaard Brouer <brouer@redhat.com>
To: Tom Barbette <barbette@kth.se>
Cc: xdp-newbies@vger.kernel.org, brouer@redhat.com,
	"Toke Høiland-Jørgensen" <toke@redhat.com>
Subject: Re: Bad XDP performance with mlx5
Date: Thu, 30 May 2019 09:40:53 +0200
Message-ID: <20190530094053.364b1147@carbon>
In-Reply-To: <0836bd30-828a-9126-5d99-1d35b931e3ab@kth.se>

On Wed, 29 May 2019 20:16:46 +0200
Tom Barbette <barbette@kth.se> wrote:

> On 2019-05-29 19:16, Jesper Dangaard Brouer wrote:
> > On Wed, 29 May 2019 18:03:08 +0200
> > Tom Barbette <barbette@kth.se> wrote:
> >   
> >> Hi all,
> >>
> >> I've got a very simple eBPF program that counts packets per queue in a
> >> per-cpu map.  
> > 
> > Like xdp_rxq_info --dev mlx5p1 --action XDP_PASS ?  
> 
> Even simpler.
> 
> >   
> >> I use IPerf in TCP mode, I limit the CPU cores to 2 so performance is
> >> limited by CPU (always at 100%).
> >>
> >> With a XL710 NIC 40G link, with the XDP program loaded, I get 32.5.
> >> Without I get ~33.3Gbps. Pretty similar, somehow expected.
> >>
> >> With a ConnectX 5 100G link, I get ~33.3Gbps without the XDP program but
> >> ~26 with it. The behavior seems similar with a simple XDP_PASS program.  
> > 
> > Are you sure?  
> 
> 
> xdp_pass.c:
> ---
> #include <linux/bpf.h>
> 
> #ifndef __section
> # define __section(NAME)                  \
>     __attribute__((section(NAME), used))
> #endif
> 
> __section("prog")
> int xdp_drop(struct xdp_md *ctx) {
>      return XDP_PASS;
> }
> 
> char __license[] __section("license") = "GPL";
> ---
> clang -O2 -target bpf -c xdp_pass.c -o xdp_pass.o
> 
> Then see results with netperf below.
> 
> > 
> > My test on a ConnectX 5 100G link shows:
> >   - 33.8 Gbits/sec - with no XDP prog
> >   - 34.5 Gbits/sec - with xdp_rxq_info
> >   
> 
> Even faster? :p
> 
> >> Any idea why MLX5 driver behaves like this? perf top is not conclusive
> >> at first glance. I'd say check_object_size and
> >> copy_user_enhanced_fast_string rise up but the stack is unclear from where.  
> >   
> > It is possible to get very different and varying TCP bandwidth results,
> > depending on if TCP-server-process is running on the same CPU as the
> > NAPI-RX loop.  If they share the CPU then results are worse, as
> > process-context scheduling is setting a limit.  
> 
> IPerf has one instance per core, with SO_REUSEPORT and a BPF filter to 
> map queues <-> CPUs 1:1, with irqbalance killed and set_affinity*sh.
> So the setup in that regard is the same between tests, and the variance 
> does not come from different assignments.
> That is not what you're advising, but it ensures a similar per-core 
> "pipeline" and test reproducibility. It's a side question, but any link 
> on this L1/L2 cache misses vs scheduling question is welcome.
> 
> > 
> > This is easiest to demonstrate with netperf option -Tn,n:
> > 
> > $ netperf -H 198.18.1.1 -D1 -T2,2 -t TCP_STREAM -l 120
> > MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 198.18.1.1 () port 0 AF_INET : histogram : demo : cpu bind
> > Interim result: 35344.39 10^6bits/s over 1.002 seconds ending at 1559149724.219
> > Interim result: 35294.66 10^6bits/s over 1.001 seconds ending at 1559149725.221
> > Interim result: 36112.09 10^6bits/s over 1.002 seconds ending at 1559149726.222
> > Interim result: 36301.13 10^6bits/s over 1.000 seconds ending at 1559149727.222
> > ^CInterim result: 36146.78 10^6bits/s over 0.507 seconds ending at 1559149727.730
> > Recv   Send    Send
> > Socket Socket  Message  Elapsed
> > Size   Size    Size     Time     Throughput
> > bytes  bytes   bytes    secs.    10^6bits/sec
> > 
> > 131072  16384  16384    4.51     35801.94
> >   
> 
> server$ sudo service netperf start
> server$ sudo killall -9 irqbalance
> server$ sudo ethtool -X dpdk1 equal 2

Interesting use of ethtool -X (set the RX flow hash indirection table);
I could use that myself in some of my tests.  I usually change the
number of RX queues via ethtool -L (or --set-channels), which the
i40e/XL710 has issues with...
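
To make the difference between the two approaches concrete, here is a
minimal sketch.  The function names and the RUN variable are made up for
illustration; the ethtool options themselves (-X ... equal N,
-X ... default, -L ... combined N) are the standard ones.  By default it
only prints the commands, so nothing is changed on the NIC:

```shell
#!/bin/sh
# Dry run by default: commands are printed, not executed.
# Use RUN=sudo to really apply them (hypothetical convention of this sketch).
RUN=${RUN:-echo}

# Keep all RX queues allocated, but spread the RSS indirection table
# equally over the first N queues only.
limit_rss_queues() {   # usage: limit_rss_queues DEV N
    $RUN ethtool -X "$1" equal "$2"
}

# Change the number of combined channels (queues) itself.
set_channels() {       # usage: set_channels DEV N
    $RUN ethtool -L "$1" combined "$2"
}

# Restore the driver's default indirection table.
reset_rss() {          # usage: reset_rss DEV
    $RUN ethtool -X "$1" default
}
```

E.g. "limit_rss_queues dpdk1 2" prints the same command you ran above,
while "set_channels dpdk1 2" is what I would normally have used.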


> server$ sudo ip link set dev dpdk1 xdp off
> client$ netperf -H 10.220.0.5 -D1 -T2,2 -t TCP_STREAM -l 120
> MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.220.0.5 () port 0 AF_INET : demo : cpu bind
> Interim result: 37221.90 10^6bits/s over 1.015 seconds ending at 1559151699.433
> Interim result: 37811.52 10^6bits/s over 1.003 seconds ending at 1559151700.436
> Interim result: 38195.47 10^6bits/s over 1.001 seconds ending at 1559151701.437
> Interim result: 41089.18 10^6bits/s over 1.000 seconds ending at 1559151702.437
> Interim result: 38005.40 10^6bits/s over 1.081 seconds ending at 1559151703.518
> Interim result: 34419.33 10^6bits/s over 1.104 seconds ending at 1559151704.622
> ^CInterim result: 40634.33 10^6bits/s over 0.198 seconds ending at 1559151704.820
> Recv   Send    Send
> Socket Socket  Message  Elapsed
> Size   Size    Size     Time     Throughput
> bytes  bytes   bytes    secs.    10^6bits/sec
> 
> 131072  16384  16384    6.41     37758.53
> 
> server$ sudo ip link set dev dpdk1 xdp obj xdp_pass.o
> client$ netperf -H 10.220.0.5 -D1 -T2,2 -t TCP_STREAM -l 120
> MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.220.0.5 () port 0 AF_INET : demo : cpu bind
> Interim result: 31669.02 10^6bits/s over 1.021 seconds ending at 1559151575.906
> Interim result: 31164.97 10^6bits/s over 1.016 seconds ending at 1559151576.923
> Interim result: 31525.57 10^6bits/s over 1.001 seconds ending at 1559151577.924
> Interim result: 28835.03 10^6bits/s over 1.093 seconds ending at 1559151579.017
> Interim result: 36336.89 10^6bits/s over 1.000 seconds ending at 1559151580.017
> Interim result: 31021.22 10^6bits/s over 1.171 seconds ending at 1559151581.188
> Interim result: 37469.64 10^6bits/s over 1.000 seconds ending at 1559151582.189
> ^CInterim result: 33209.38 10^6bits/s over 0.403 seconds ending at 1559151582.591
> Recv   Send    Send
> Socket Socket  Message  Elapsed
> Size   Size    Size     Time     Throughput
> bytes  bytes   bytes    secs.    10^6bits/sec
> 
> 131072  16384  16384    7.71     32518.84
> 
> server$ sudo ip link set dev dpdk1 xdp off
> server$ sudo ip link set dev dpdk1 xdp obj xdp_count.o
> client$ netperf -H 10.220.0.5 -D1 -T2,2 -t TCP_STREAM -l 120
> MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.220.0.5 () port 0 AF_INET : demo : cpu bind
> Interim result: 33090.36 10^6bits/s over 1.019 seconds ending at 1559151856.741
> Interim result: 32823.68 10^6bits/s over 1.008 seconds ending at 1559151857.749
> Interim result: 34766.21 10^6bits/s over 1.000 seconds ending at 1559151858.749
> Interim result: 36246.28 10^6bits/s over 1.034 seconds ending at 1559151859.784
> Interim result: 34757.19 10^6bits/s over 1.043 seconds ending at 1559151860.826
> Interim result: 29434.22 10^6bits/s over 1.181 seconds ending at 1559151862.007
> Interim result: 32619.29 10^6bits/s over 1.004 seconds ending at 1559151863.011
> ^CInterim result: 36102.22 10^6bits/s over 0.448 seconds ending at 1559151863.459
> Recv   Send    Send
> Socket Socket  Message  Elapsed
> Size   Size    Size     Time     Throughput
> bytes  bytes   bytes    secs.    10^6bits/sec
> 
> 131072  16384  16384    7.74     33470.75
> 
> There is a higher variance than in my iperf test (50 flows), but without 
> XDP it is always around 40, while with XDP it ranges from 32 to 37, 
> mostly 32. What I'm more sure of is that the XL710 does not exhibit this 
> behavior, with netperf either:
> 
> server$ sudo ip link set dev enp213s0f0 xdp off
> client$ netperf -H 10.230.0.1 -D1 -T2,2 -t TCP_STREAM -l 120
> MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.230.0.1 () port 0 AF_INET : demo : cpu bind
> Interim result: 18358.39 10^6bits/s over 1.001 seconds ending at 1559152311.334
> Interim result: 18635.27 10^6bits/s over 1.001 seconds ending at 1559152312.334
> Interim result: 18393.82 10^6bits/s over 1.013 seconds ending at 1559152313.348
> Interim result: 18741.75 10^6bits/s over 1.000 seconds ending at 1559152314.348
> Interim result: 18700.84 10^6bits/s over 1.002 seconds ending at 1559152315.350
> ^CInterim result: 18059.26 10^6bits/s over 0.307 seconds ending at 1559152315.657
> Recv   Send    Send
> Socket Socket  Message  Elapsed
> Size   Size    Size     Time     Throughput
> bytes  bytes   bytes    secs.    10^6bits/sec
> 
> 131072  16384  16384    5.33     18523.59
> 
> server$ sudo ip link set dev enp213s0f0 xdp obj xdp_pass.o
> client$ netperf -H 10.230.0.1 -D1 -T2,2 -t TCP_STREAM -l 120
> MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.230.0.1 () port 0 AF_INET : demo : cpu bind
> Interim result: 17867.08 10^6bits/s over 1.001 seconds ending at 1559152387.230
> Interim result: 18444.22 10^6bits/s over 1.000 seconds ending at 1559152388.230
> Interim result: 18226.31 10^6bits/s over 1.012 seconds ending at 1559152389.242
> Interim result: 18411.24 10^6bits/s over 1.001 seconds ending at 1559152390.243
> Interim result: 18420.69 10^6bits/s over 1.001 seconds ending at 1559152391.244
> Interim result: 18236.47 10^6bits/s over 1.010 seconds ending at 1559152392.254
> Interim result: 18026.38 10^6bits/s over 1.012 seconds ending at 1559152393.265
> ^CInterim result: 18390.50 10^6bits/s over 0.465 seconds ending at 1559152393.730
> Recv   Send    Send
> Socket Socket  Message  Elapsed
> Size   Size    Size     Time     Throughput
> bytes  bytes   bytes    secs.    10^6bits/sec
> 
> 131072  16384  16384    7.50     18236.5
> 
> For some reason, everything happens on the same core with the XL710, but 
> not mlx5 which uses 2 cores (one interrupt/napi and one netserver). Any 
> idea why? TX affinity working with XL710 but not mlx5? Anyway my iperf 
> test would not set that, so the problem does not lie there.
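
To put the netperf totals quoted above side by side, a small sketch that
computes the relative change between two throughput figures.  The helper
name rel_change is made up for this mail; the inputs are simply the final
"Throughput" values from your runs:

```shell
#!/bin/sh
# rel_change BASE NEW: print the change from BASE to NEW in percent.
rel_change() {
    awk -v base="$1" -v new="$2" \
        'BEGIN { printf "%.1f\n", (new - base) / base * 100 }'
}

rel_change 37758.53 32518.84   # mlx5:  no XDP vs xdp_pass.o
rel_change 37758.53 33470.75   # mlx5:  no XDP vs xdp_count.o
rel_change 18523.59 18236.5    # XL710: no XDP vs xdp_pass.o
```

That gives roughly -13.9% and -11.4% for mlx5 against about -1.5% for the
XL710.  These are single short runs with a lot of variance between the
interim results, so treat the percentages as indicative only.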

What SMP affinity script are you using?

The Mellanox drivers use a different "layout"/naming scheme
in /proc/irq/*/*name*/../smp_affinity_list.

For normal Intel-based NICs I use this:

echo " --- Align IRQs ---"
# I've named my NICs ixgbe1 + ixgbe2
for F in /proc/irq/*/ixgbe*-TxRx-*/../smp_affinity_list; do
   # Extract irqname e.g. "ixgbe2-TxRx-2"
   irqname=$(basename "$(dirname "$(dirname "$F")")")
   # Substring pattern removal: strip everything up to the second '-',
   # leaving the hardware queue number, e.g. "2"
   hwq_nr=${irqname#*-*-}
   # Pin the IRQ of queue N to CPU N
   echo "$hwq_nr" > "$F"
   #grep . -H "$F";
done
grep -H . /proc/irq/*/ixgbe*/../smp_affinity_list
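
The only non-obvious part is the ${irqname#*-*-} expansion.  A tiny
standalone sketch (the function name is made up for illustration) shows
what it does to an IRQ name of the "<nic>-TxRx-<queue>" form:

```shell
#!/bin/sh
# Strip the shortest prefix matching "*-*-", i.e. everything up to and
# including the second '-', leaving only the queue number.
hwq_from_irqname() {
    printf '%s\n' "${1#*-*-}"
}

hwq_from_irqname "ixgbe2-TxRx-2"    # prints 2
hwq_from_irqname "ixgbe1-TxRx-11"   # prints 11
```

This only works because the NIC names themselves contain no '-'.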

But for Mellanox I had to use this:

echo " --- Align IRQs : mlx5 ---"
for F in /proc/irq/*/mlx5_comp*/../smp_affinity; do
        dir=$(dirname "$F")
        # Apply the driver's suggested CPU mask (affinity_hint) as the
        # actual IRQ affinity
        cat "$dir/affinity_hint" > "$F"
done
grep -H . /proc/irq/*/mlx5_comp*/../smp_affinity_list
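
To check what you ended up with, something along these lines lists each
matching IRQ together with the CPUs it is allowed to run on.  This is
only a sketch: the function name and its two optional arguments (the
root directory and the IRQ name pattern) are my own invention, added so
it can also be pointed at a copy of /proc/irq:

```shell
#!/bin/sh
# show_irq_affinity [ROOT] [PATTERN]
# Print "<irqname> irq=<nr> cpus=<list>" for every IRQ whose action name
# matches PATTERN (default: mlx5_comp*) under ROOT (default: /proc/irq).
show_irq_affinity() {
    root=${1:-/proc/irq}
    pattern=${2:-mlx5_comp*}
    # $pattern is left unquoted on purpose so the shell expands the glob
    for d in "$root"/*/$pattern; do
        [ -d "$d" ] || continue          # no match: glob stays literal
        irqdir=$(dirname "$d")
        printf '%s irq=%s cpus=%s\n' \
            "$(basename "$d")" \
            "$(basename "$irqdir")" \
            "$(cat "$irqdir/smp_affinity_list")"
    done
}
```

Run "show_irq_affinity" for mlx5, or e.g.
"show_irq_affinity /proc/irq 'ixgbe*-TxRx-*'" for the Intel naming, and
compare the cpus= column with the queue <-> CPU mapping you expect.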


> > $ netperf -H 198.18.1.1 -D1 -T1,1 -t TCP_STREAM -l 120
> > MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 198.18.1.1 () port 0 AF_INET : histogram : demo : cpu bind
> > Interim result: 26990.45 10^6bits/s over 1.000 seconds ending at 1559149733.554
> > Interim result: 27730.35 10^6bits/s over 1.000 seconds ending at 1559149734.554
> > Interim result: 27725.76 10^6bits/s over 1.000 seconds ending at 1559149735.554
> > Interim result: 27513.39 10^6bits/s over 1.008 seconds ending at 1559149736.561
> > Interim result: 27421.46 10^6bits/s over 1.003 seconds ending at 1559149737.565
> > ^CInterim result: 27523.62 10^6bits/s over 0.580 seconds ending at 1559149738.145
> > Recv   Send    Send
> > Socket Socket  Message  Elapsed
> > Size   Size    Size     Time     Throughput
> > bytes  bytes   bytes    secs.    10^6bits/sec
> > 
> > 131072  16384  16384    5.59     27473.50
> >
> >   
> >> I use 5.1-rc3, compiled myself using Ubuntu 18.04's latest .config file.  
> > 
> > I use 5.1.0-bpf-next (with some patches on top of commit 35c99ffa20).
> >   
> I'm rebasing on 5.1.5, I do not wish to go too leading edge on this 
> project (unless needed).
>
> I do have one patch to copy the RSS hash in the xdp_buff, but the field 
> is read even if xdp is disabled.

What is your use case for this?

Upstream will likely request that this be added as xdp_buff->metadata,
using the BTF format... but that is a longer project, see [1], and is
currently scheduled as a "medium-term" task... let us know if you want
to work on this...

[1] https://github.com/xdp-project/xdp-project/blob/master/xdp-project.org#metadata-available-to-programs
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
