Bad XDP performance with mlx5


* Bad XDP performance with mlx5
@ 2019-05-29 16:03 Tom Barbette
  2019-05-29 17:16 ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 13+ messages in thread
From: Tom Barbette @ 2019-05-29 16:03 UTC (permalink / raw)
  To: xdp-newbies

Hi all,

I've got a very simple eBPF program that counts packets per queue in a 
per-cpu map.

I use IPerf in TCP mode, I limit the CPU cores to 2 so performance is 
limited by CPU (always at 100%).

With an XL710 NIC (40G link), I get ~32.5 Gbps with the XDP program
loaded and ~33.3 Gbps without. Pretty similar, which is more or less
expected.

With a ConnectX 5 (100G link), I get ~33.3 Gbps without the XDP program
but only ~26 Gbps with it. The behavior seems similar with a simple
XDP_PASS program.

Any idea why the mlx5 driver behaves like this? perf top is not
conclusive at first glance. I'd say check_object_size and 
copy_user_enhanced_fast_string go up, but it is unclear from the stack
where they are called from.

I use 5.1-rc3, compiled myself using Ubuntu 18.04's latest .config file.

Thanks,
Tom


* Re: Bad XDP performance with mlx5
  2019-05-29 16:03 Bad XDP performance with mlx5 Tom Barbette
@ 2019-05-29 17:16 ` Jesper Dangaard Brouer
  2019-05-29 18:16   ` Tom Barbette
  0 siblings, 1 reply; 13+ messages in thread
From: Jesper Dangaard Brouer @ 2019-05-29 17:16 UTC (permalink / raw)
  To: Tom Barbette; +Cc: xdp-newbies, brouer

On Wed, 29 May 2019 18:03:08 +0200
Tom Barbette <barbette@kth.se> wrote:

> Hi all,
> 
> I've got a very simple eBPF program that counts packets per queue in a 
> per-cpu map.

Like xdp_rxq_info --dev mlx5p1 --action XDP_PASS ?

> I use IPerf in TCP mode, I limit the CPU cores to 2 so performance is 
> limited by CPU (always at 100%).
> 
> With a XL710 NIC 40G link, with the XDP program loaded, I get 32.5. 
> Without I get ~33.3Gbps. Pretty similar, somehow expected.
> 
> With a ConnectX 5 100G link, I get ~33.3Gbps without the XDP program but 
> ~26 with it. The behavior seems similar with a simple XDP_PASS program.

Are you sure?

My test on a ConnectX 5 100G link shows:
 - 33.8 Gbits/sec - with no XDP prog
 - 34.5 Gbits/sec - with xdp_rxq_info

> Any idea why MLX5 driver behaves like this? perf top is not conclusive 
> at first glance. I'd say check_object_size and 
> copy_user_enhanced_fast_string rise up but the stack is unclear from where.
 
It is possible to get very different and varying TCP bandwidth results,
depending on whether the TCP server process is running on the same CPU
as the NAPI-RX loop.  If they share the CPU, results are worse, because
process-context scheduling becomes the limiting factor.

This is easiest to demonstrate with netperf option -Tn,n:

$ netperf -H 198.18.1.1 -D1 -T2,2 -t TCP_STREAM -l 120
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 198.18.1.1 () port 0 AF_INET : histogram : demo : cpu bind
Interim result: 35344.39 10^6bits/s over 1.002 seconds ending at 1559149724.219
Interim result: 35294.66 10^6bits/s over 1.001 seconds ending at 1559149725.221
Interim result: 36112.09 10^6bits/s over 1.002 seconds ending at 1559149726.222
Interim result: 36301.13 10^6bits/s over 1.000 seconds ending at 1559149727.222
^CInterim result: 36146.78 10^6bits/s over 0.507 seconds ending at 1559149727.730
Recv   Send    Send                          
Socket Socket  Message  Elapsed              
Size   Size    Size     Time     Throughput  
bytes  bytes   bytes    secs.    10^6bits/sec  

131072  16384  16384    4.51     35801.94   


$ netperf -H 198.18.1.1 -D1 -T1,1 -t TCP_STREAM -l 120
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 198.18.1.1 () port 0 AF_INET : histogram : demo : cpu bind
Interim result: 26990.45 10^6bits/s over 1.000 seconds ending at 1559149733.554
Interim result: 27730.35 10^6bits/s over 1.000 seconds ending at 1559149734.554
Interim result: 27725.76 10^6bits/s over 1.000 seconds ending at 1559149735.554
Interim result: 27513.39 10^6bits/s over 1.008 seconds ending at 1559149736.561
Interim result: 27421.46 10^6bits/s over 1.003 seconds ending at 1559149737.565
^CInterim result: 27523.62 10^6bits/s over 0.580 seconds ending at 1559149738.145
Recv   Send    Send                          
Socket Socket  Message  Elapsed              
Size   Size    Size     Time     Throughput  
bytes  bytes   bytes    secs.    10^6bits/sec  

131072  16384  16384    5.59     27473.50   


> I use 5.1-rc3, compiled myself using Ubuntu 18.04's latest .config file.

I use 5.1.0-bpf-next (with some patches on top of commit 35c99ffa20).

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


* Re: Bad XDP performance with mlx5
  2019-05-29 17:16 ` Jesper Dangaard Brouer
@ 2019-05-29 18:16   ` Tom Barbette
  2019-05-30  7:40     ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 13+ messages in thread
From: Tom Barbette @ 2019-05-29 18:16 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: xdp-newbies

On 2019-05-29 19:16, Jesper Dangaard Brouer wrote:
> On Wed, 29 May 2019 18:03:08 +0200
> Tom Barbette <barbette@kth.se> wrote:
> 
>> Hi all,
>>
>> I've got a very simple eBPF program that counts packets per queue in a
>> per-cpu map.
> 
> Like xdp_rxq_info --dev mlx5p1 --action XDP_PASS ?

Even simpler.

> 
>> I use IPerf in TCP mode, I limit the CPU cores to 2 so performance is
>> limited by CPU (always at 100%).
>>
>> With a XL710 NIC 40G link, with the XDP program loaded, I get 32.5.
>> Without I get ~33.3Gbps. Pretty similar, somehow expected.
>>
>> With a ConnectX 5 100G link, I get ~33.3Gbps without the XDP program but
>> ~26 with it. The behavior seems similar with a simple XDP_PASS program.
> 
> Are you sure?


xdp_pass.c:
---
#include <linux/bpf.h>

#ifndef __section
# define __section(NAME)                  \
    __attribute__((section(NAME), used))
#endif

/* Despite its name, xdp_drop() passes every packet up the stack (XDP_PASS). */
__section("prog")
int xdp_drop(struct xdp_md *ctx) {
     return XDP_PASS;
}

char __license[] __section("license") = "GPL";
---
clang -O2 -target bpf -c xdp_pass.c -o xdp_pass.o

Then see results with netperf below.

> 
> My test on a ConnectX 5 100G link show:
>   - 33.8 Gbits/sec = with no-XDP prog
>   - 34.5 Gbits/sec - with xdp_rxq_info
> 

Even faster? :p

>> Any idea why MLX5 driver behaves like this? perf top is not conclusive
>> at first glance. I'd say check_object_size and
>> copy_user_enhanced_fast_string rise up but the stack is unclear from where.
>   
> It is possible to get very different and varying TCP bandwidth results,
> depending on if TCP-server-process is running on the same CPU as the
> NAPI-RX loop.  If they share the CPU then results are worse, as
> process-context scheduling is setting a limit.

iperf runs one instance per core, with SO_REUSEPORT and a BPF filter to 
map queues <-> CPUs 1:1, with irqbalance killed and set_affinity*sh.
So the setup in that regard is the same between tests, and the variance
does not come from different assignments.
This is not what you're advising, but it ensures a similar per-core 
"pipeline" and reproducible tests. It's a side question, but any link 
on this L1/L2 cache misses vs. scheduling question is welcome.

> 
> This is easiest to demonstrate with netperf option -Tn,n:
> 
> $ netperf -H 198.18.1.1 -D1 -T2,2 -t TCP_STREAM -l 120
> MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 198.18.1.1 () port 0 AF_INET : histogram : demo : cpu bind
> Interim result: 35344.39 10^6bits/s over 1.002 seconds ending at 1559149724.219
> Interim result: 35294.66 10^6bits/s over 1.001 seconds ending at 1559149725.221
> Interim result: 36112.09 10^6bits/s over 1.002 seconds ending at 1559149726.222
> Interim result: 36301.13 10^6bits/s over 1.000 seconds ending at 1559149727.222
> ^CInterim result: 36146.78 10^6bits/s over 0.507 seconds ending at 1559149727.730
> Recv   Send    Send
> Socket Socket  Message  Elapsed
> Size   Size    Size     Time     Throughput
> bytes  bytes   bytes    secs.    10^6bits/sec
> 
> 131072  16384  16384    4.51     35801.94
> 

server$ sudo service netperf start
server$ sudo killall -9 irqbalance
server$ sudo ethtool -X dpdk1 equal 2
server$ sudo ip link set dev dpdk1 xdp off
client$ netperf -H 10.220.0.5 -D1 -T2,2 -t TCP_STREAM -l 120
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.220.0.5 () port 0 AF_INET : demo : cpu bind
Interim result: 37221.90 10^6bits/s over 1.015 seconds ending at 1559151699.433
Interim result: 37811.52 10^6bits/s over 1.003 seconds ending at 1559151700.436
Interim result: 38195.47 10^6bits/s over 1.001 seconds ending at 1559151701.437
Interim result: 41089.18 10^6bits/s over 1.000 seconds ending at 1559151702.437
Interim result: 38005.40 10^6bits/s over 1.081 seconds ending at 1559151703.518
Interim result: 34419.33 10^6bits/s over 1.104 seconds ending at 1559151704.622
^CInterim result: 40634.33 10^6bits/s over 0.198 seconds ending at 1559151704.820
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

131072  16384  16384    6.41     37758.53

server$ sudo ip link set dev dpdk1 xdp obj xdp_pass.o
client$ netperf -H 10.220.0.5 -D1 -T2,2 -t TCP_STREAM -l 120
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.220.0.5 () port 0 AF_INET : demo : cpu bind
Interim result: 31669.02 10^6bits/s over 1.021 seconds ending at 1559151575.906
Interim result: 31164.97 10^6bits/s over 1.016 seconds ending at 1559151576.923
Interim result: 31525.57 10^6bits/s over 1.001 seconds ending at 1559151577.924
Interim result: 28835.03 10^6bits/s over 1.093 seconds ending at 1559151579.017
Interim result: 36336.89 10^6bits/s over 1.000 seconds ending at 1559151580.017
Interim result: 31021.22 10^6bits/s over 1.171 seconds ending at 1559151581.188
Interim result: 37469.64 10^6bits/s over 1.000 seconds ending at 1559151582.189
^CInterim result: 33209.38 10^6bits/s over 0.403 seconds ending at 1559151582.591
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

131072  16384  16384    7.71     32518.84

server$ sudo ip link set dev dpdk1 xdp off
server$ sudo ip link set dev dpdk1 xdp obj xdp_count.o
client$ netperf -H 10.220.0.5 -D1 -T2,2 -t TCP_STREAM -l 120
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.220.0.5 () port 0 AF_INET : demo : cpu bind
Interim result: 33090.36 10^6bits/s over 1.019 seconds ending at 1559151856.741
Interim result: 32823.68 10^6bits/s over 1.008 seconds ending at 1559151857.749
Interim result: 34766.21 10^6bits/s over 1.000 seconds ending at 1559151858.749
Interim result: 36246.28 10^6bits/s over 1.034 seconds ending at 1559151859.784
Interim result: 34757.19 10^6bits/s over 1.043 seconds ending at 1559151860.826
Interim result: 29434.22 10^6bits/s over 1.181 seconds ending at 1559151862.007
Interim result: 32619.29 10^6bits/s over 1.004 seconds ending at 1559151863.011
^CInterim result: 36102.22 10^6bits/s over 0.448 seconds ending at 1559151863.459
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

131072  16384  16384    7.74     33470.75

The variance is higher than in my iperf test (50 flows), but without
XDP the result is always around 40 Gbps, while with XDP it ranges from
32 to 37 Gbps, mostly 32. What I'm more sure of is that the XL710 does
not exhibit this behavior, with netperf either:

server$ sudo ip link set dev enp213s0f0 xdp off
client$ netperf -H 10.230.0.1 -D1 -T2,2 -t TCP_STREAM -l 120
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.230.0.1 () port 0 AF_INET : demo : cpu bind
Interim result: 18358.39 10^6bits/s over 1.001 seconds ending at 1559152311.334
Interim result: 18635.27 10^6bits/s over 1.001 seconds ending at 1559152312.334
Interim result: 18393.82 10^6bits/s over 1.013 seconds ending at 1559152313.348
Interim result: 18741.75 10^6bits/s over 1.000 seconds ending at 1559152314.348
Interim result: 18700.84 10^6bits/s over 1.002 seconds ending at 1559152315.350
^CInterim result: 18059.26 10^6bits/s over 0.307 seconds ending at 1559152315.657
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

131072  16384  16384    5.33     18523.59

server$ sudo ip link set dev enp213s0f0 xdp obj xdp_pass.o
client$ netperf -H 10.230.0.1 -D1 -T2,2 -t TCP_STREAM -l 120
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.230.0.1 () port 0 AF_INET : demo : cpu bind
Interim result: 17867.08 10^6bits/s over 1.001 seconds ending at 1559152387.230
Interim result: 18444.22 10^6bits/s over 1.000 seconds ending at 1559152388.230
Interim result: 18226.31 10^6bits/s over 1.012 seconds ending at 1559152389.242
Interim result: 18411.24 10^6bits/s over 1.001 seconds ending at 1559152390.243
Interim result: 18420.69 10^6bits/s over 1.001 seconds ending at 1559152391.244
Interim result: 18236.47 10^6bits/s over 1.010 seconds ending at 1559152392.254
Interim result: 18026.38 10^6bits/s over 1.012 seconds ending at 1559152393.265
^CInterim result: 18390.50 10^6bits/s over 0.465 seconds ending at 1559152393.730
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

131072  16384  16384    7.50     18236.5
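
As a side calculation (not part of the original measurements), the
relative drop between the netperf averages above can be worked out with
a small awk helper. The function name pct_drop is made up for
illustration; the inputs are the final Throughput figures printed above.

```shell
#!/bin/sh
# pct_drop BASE VALUE: relative drop of VALUE against BASE, in percent.
# Hypothetical helper, only re-uses the throughput numbers from this mail.
pct_drop() {
    LC_ALL=C awk -v base="$1" -v val="$2" \
        'BEGIN { printf "%.1f\n", (base - val) / base * 100 }'
}

pct_drop 37758.53 32518.84   # mlx5:  xdp off vs xdp_pass.o
pct_drop 37758.53 33470.75   # mlx5:  xdp off vs xdp_count.o
pct_drop 18523.59 18236.5    # XL710: xdp off vs xdp_pass.o
```

With these single runs, that is roughly a 14% and 11% drop on the
ConnectX 5 against under 2% on the XL710; given the variance between
interim results, the exact percentages should not be over-interpreted.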

For some reason, everything happens on the same core with the XL710, but 
not mlx5 which uses 2 cores (one interrupt/napi and one netserver). Any 
idea why? TX affinity working with XL710 but not mlx5? Anyway my iperf 
test would not set that, so the problem does not lie there.

> 
> $ netperf -H 198.18.1.1 -D1 -T1,1 -t TCP_STREAM -l 120
> MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 198.18.1.1 () port 0 AF_INET : histogram : demo : cpu bind
> Interim result: 26990.45 10^6bits/s over 1.000 seconds ending at 1559149733.554
> Interim result: 27730.35 10^6bits/s over 1.000 seconds ending at 1559149734.554
> Interim result: 27725.76 10^6bits/s over 1.000 seconds ending at 1559149735.554
> Interim result: 27513.39 10^6bits/s over 1.008 seconds ending at 1559149736.561
> Interim result: 27421.46 10^6bits/s over 1.003 seconds ending at 1559149737.565
> ^CInterim result: 27523.62 10^6bits/s over 0.580 seconds ending at 1559149738.145
> Recv   Send    Send
> Socket Socket  Message  Elapsed
> Size   Size    Size     Time     Throughput
> bytes  bytes   bytes    secs.    10^6bits/sec
> 
> 131072  16384  16384    5.59     27473.50
>
> 
>> I use 5.1-rc3, compiled myself using Ubuntu 18.04's latest .config file.
> 
> I use 5.1.0-bpf-next (with some patches on top of commit 35c99ffa20).
> 
I'm rebasing on 5.1.5; I do not wish to go too bleeding edge on this 
project (unless needed).
I do have one patch that copies the RSS hash into the xdp_buff, but the
field is read even if XDP is disabled.

Thanks for the help !

Tom


* Re: Bad XDP performance with mlx5
  2019-05-29 18:16   ` Tom Barbette
@ 2019-05-30  7:40     ` Jesper Dangaard Brouer
  2019-05-30  8:55       ` Tom Barbette
  0 siblings, 1 reply; 13+ messages in thread
From: Jesper Dangaard Brouer @ 2019-05-30  7:40 UTC (permalink / raw)
  To: Tom Barbette; +Cc: xdp-newbies, brouer, Toke Høiland-Jørgensen

On Wed, 29 May 2019 20:16:46 +0200
Tom Barbette <barbette@kth.se> wrote:

> On 2019-05-29 19:16, Jesper Dangaard Brouer wrote:
> > On Wed, 29 May 2019 18:03:08 +0200
> > Tom Barbette <barbette@kth.se> wrote:
> >   
> >> Hi all,
> >>
> >> I've got a very simple eBPF program that counts packets per queue in a
> >> per-cpu map.  
> > 
> > Like xdp_rxq_info --dev mlx5p1 --action XDP_PASS ?  
> 
> Even simpler.
> 
> >   
> >> I use IPerf in TCP mode, I limit the CPU cores to 2 so performance is
> >> limited by CPU (always at 100%).
> >>
> >> With a XL710 NIC 40G link, with the XDP program loaded, I get 32.5.
> >> Without I get ~33.3Gbps. Pretty similar, somehow expected.
> >>
> >> With a ConnectX 5 100G link, I get ~33.3Gbps without the XDP program but
> >> ~26 with it. The behavior seems similar with a simple XDP_PASS program.  
> > 
> > Are you sure?  
> 
> 
> xdp_pass.c:
> ---
> #include <linux/bpf.h>
> 
> #ifndef __section
> # define __section(NAME)                  \
>     __attribute__((section(NAME), used))
> #endif
> 
> __section("prog")
> int xdp_drop(struct xdp_md *ctx) {
>      return XDP_PASS;
> }
> 
> char __license[] __section("license") = "GPL";
> ---
> clang -O2 -target bpf -c xdp_pass.c -o xdp_pass.o
> 
> Then see results with netperf below.
> 
> > 
> > My test on a ConnectX 5 100G link show:
> >   - 33.8 Gbits/sec = with no-XDP prog
> >   - 34.5 Gbits/sec - with xdp_rxq_info
> >   
> 
> Even faster? :p
> 
> >> Any idea why MLX5 driver behaves like this? perf top is not conclusive
> >> at first glance. I'd say check_object_size and
> >> copy_user_enhanced_fast_string rise up but the stack is unclear from where.  
> >   
> > It is possible to get very different and varying TCP bandwidth results,
> > depending on if TCP-server-process is running on the same CPU as the
> > NAPI-RX loop.  If they share the CPU then results are worse, as
> > process-context scheduling is setting a limit.  
> 
> IPerf has one instance per-core, with SO_REUSEPORT and a BPF filter to 
> map queues <-> CPU in 1:1 with irqbalance killed and set_affinity*sh.
> So the setup on that regard is similar between tests and the variance do 
> not come from different assignments.
> Which is not what you're advising but ensure a similar per-core 
> "pipeline" and tests reproducibility. It's a side question but any link 
> on this L1/L2 cache misses vs scheduling question is welcome.
> 
> > 
> > This is easiest to demonstrate with netperf option -Tn,n:
> > 
> > $ netperf -H 198.18.1.1 -D1 -T2,2 -t TCP_STREAM -l 120
> > MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 198.18.1.1 () port 0 AF_INET : histogram : demo : cpu bind
> > Interim result: 35344.39 10^6bits/s over 1.002 seconds ending at 1559149724.219
> > Interim result: 35294.66 10^6bits/s over 1.001 seconds ending at 1559149725.221
> > Interim result: 36112.09 10^6bits/s over 1.002 seconds ending at 1559149726.222
> > Interim result: 36301.13 10^6bits/s over 1.000 seconds ending at 1559149727.222
> > ^CInterim result: 36146.78 10^6bits/s over 0.507 seconds ending at 1559149727.730
> > Recv   Send    Send
> > Socket Socket  Message  Elapsed
> > Size   Size    Size     Time     Throughput
> > bytes  bytes   bytes    secs.    10^6bits/sec
> > 
> > 131072  16384  16384    4.51     35801.94
> >   
> 
> server$ sudo service netperf start
> server$ sudo killall -9 irqbalance
> server$ sudo ethtool -X dpdk1 equal 2

Interesting use of ethtool -X (Set Rx flow hash indirection table), I
could use that myself in some of my tests.  I usually change the number
of RX queues via ethtool -L (or --set-channels), which the i40e/XL710
has issues with...


> server$ sudo ip link set dev dpdk1 xdp off
> client$ netperf -H 10.220.0.5 -D1 -T2,2 -t TCP_STREAM -l 120
> MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.220.0.5 () port 0 AF_INET : demo : cpu bind
> Interim result: 37221.90 10^6bits/s over 1.015 seconds ending at 1559151699.433
> Interim result: 37811.52 10^6bits/s over 1.003 seconds ending at 1559151700.436
> Interim result: 38195.47 10^6bits/s over 1.001 seconds ending at 1559151701.437
> Interim result: 41089.18 10^6bits/s over 1.000 seconds ending at 1559151702.437
> Interim result: 38005.40 10^6bits/s over 1.081 seconds ending at 1559151703.518
> Interim result: 34419.33 10^6bits/s over 1.104 seconds ending at 1559151704.622
> ^CInterim result: 40634.33 10^6bits/s over 0.198 seconds ending at 1559151704.820
> Recv   Send    Send
> Socket Socket  Message  Elapsed
> Size   Size    Size     Time     Throughput
> bytes  bytes   bytes    secs.    10^6bits/sec
> 
> 131072  16384  16384    6.41     37758.53
> 
> server$ sudo ip link set dev dpdk1 xdp obj xdp_pass.o
> client$ netperf -H 10.220.0.5 -D1 -T2,2 -t TCP_STREAM -l 120
> MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.220.0.5 () port 0 AF_INET : demo : cpu bind
> Interim result: 31669.02 10^6bits/s over 1.021 seconds ending at 1559151575.906
> Interim result: 31164.97 10^6bits/s over 1.016 seconds ending at 1559151576.923
> Interim result: 31525.57 10^6bits/s over 1.001 seconds ending at 1559151577.924
> Interim result: 28835.03 10^6bits/s over 1.093 seconds ending at 1559151579.017
> Interim result: 36336.89 10^6bits/s over 1.000 seconds ending at 1559151580.017
> Interim result: 31021.22 10^6bits/s over 1.171 seconds ending at 1559151581.188
> Interim result: 37469.64 10^6bits/s over 1.000 seconds ending at 1559151582.189
> ^CInterim result: 33209.38 10^6bits/s over 0.403 seconds ending at 1559151582.591
> Recv   Send    Send
> Socket Socket  Message  Elapsed
> Size   Size    Size     Time     Throughput
> bytes  bytes   bytes    secs.    10^6bits/sec
> 
> 131072  16384  16384    7.71     32518.84
> 
> server$ sudo ip link set dev dpdk1 xdp off
> server$ sudo ip link set dev dpdk1 xdp obj xdp_count.o
> netperf -H 10.220.0.5 -D1 -T2,2 -t TCP_STREAM -l 120
> MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.220.0.5 () port 0 AF_INET : demo : cpu bind
> Interim result: 33090.36 10^6bits/s over 1.019 seconds ending at 1559151856.741
> Interim result: 32823.68 10^6bits/s over 1.008 seconds ending at 1559151857.749
> Interim result: 34766.21 10^6bits/s over 1.000 seconds ending at 1559151858.749
> Interim result: 36246.28 10^6bits/s over 1.034 seconds ending at 1559151859.784
> Interim result: 34757.19 10^6bits/s over 1.043 seconds ending at 1559151860.826
> Interim result: 29434.22 10^6bits/s over 1.181 seconds ending at 1559151862.007
> Interim result: 32619.29 10^6bits/s over 1.004 seconds ending at 1559151863.011
> ^CInterim result: 36102.22 10^6bits/s over 0.448 seconds ending at 1559151863.459
> Recv   Send    Send
> Socket Socket  Message  Elapsed
> Size   Size    Size     Time     Throughput
> bytes  bytes   bytes    secs.    10^6bits/sec
> 
> 131072  16384  16384    7.74     33470.75
> 
> There is a higher variance than my iperf test (50 flows) but without is 
> always around 40, while with is ranging from 32 to 37, mostly 32. What 
> I'm more sure of is that XL710 does not exhibit this behavior, with 
> netperf too :
> 
> server$ sudo ip link set dev enp213s0f0 xdp off
> client$ netperf -H 10.230.0.1 -D1 -T2,2 -t TCP_STREAM -l 120
> MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.230.0.1 () port 0 AF_INET : demo : cpu bind
> Interim result: 18358.39 10^6bits/s over 1.001 seconds ending at 1559152311.334
> Interim result: 18635.27 10^6bits/s over 1.001 seconds ending at 1559152312.334
> Interim result: 18393.82 10^6bits/s over 1.013 seconds ending at 1559152313.348
> Interim result: 18741.75 10^6bits/s over 1.000 seconds ending at 1559152314.348
> Interim result: 18700.84 10^6bits/s over 1.002 seconds ending at 1559152315.350
> ^CInterim result: 18059.26 10^6bits/s over 0.307 seconds ending at 1559152315.657
> Recv   Send    Send
> Socket Socket  Message  Elapsed
> Size   Size    Size     Time     Throughput
> bytes  bytes   bytes    secs.    10^6bits/sec
> 
> 131072  16384  16384    5.33     18523.59
> 
> server$ sudo ip link set dev enp213s0f0 xdp obj xdp_pass.o
> netperf -H 10.230.0.1 -D1 -T2,2 -t TCP_STREAM -l 120
> MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.230.0.1 () port 0 AF_INET : demo : cpu bind
> Interim result: 17867.08 10^6bits/s over 1.001 seconds ending at 1559152387.230
> Interim result: 18444.22 10^6bits/s over 1.000 seconds ending at 1559152388.230
> Interim result: 18226.31 10^6bits/s over 1.012 seconds ending at 1559152389.242
> Interim result: 18411.24 10^6bits/s over 1.001 seconds ending at 1559152390.243
> Interim result: 18420.69 10^6bits/s over 1.001 seconds ending at 1559152391.244
> Interim result: 18236.47 10^6bits/s over 1.010 seconds ending at 1559152392.254
> Interim result: 18026.38 10^6bits/s over 1.012 seconds ending at 1559152393.265
> ^CInterim result: 18390.50 10^6bits/s over 0.465 seconds ending at 1559152393.730
> Recv   Send    Send
> Socket Socket  Message  Elapsed
> Size   Size    Size     Time     Throughput
> bytes  bytes   bytes    secs.    10^6bits/sec
> 
> 131072  16384  16384    7.50     18236.5
> 
> For some reason, everything happens on the same core with the XL710, but 
> not mlx5 which uses 2 cores (one interrupt/napi and one netserver). Any 
> idea why? TX affinity working with XL710 but not mlx5? Anyway my iperf 
> test would not set that, so the problem does not lie there.

What SMP affinity script are you using?

The Mellanox drivers use a different "layout"/naming scheme
in /proc/irq/*/*name*/../smp_affinity_list.

For normal Intel-based NICs I use this:

echo " --- Align IRQs ---"
# I've named my NICs ixgbe1 + ixgbe2
for F in /proc/irq/*/ixgbe*-TxRx-*/../smp_affinity_list; do
   # Extract irqname e.g. "ixgbe2-TxRx-2"
   irqname=$(basename $(dirname $(dirname $F))) ;
   # Substring pattern removal
   hwq_nr=${irqname#*-*-}
   echo $hwq_nr > $F
   #grep . -H $F;
done
grep -H . /proc/irq/*/ixgbe*/../smp_affinity_list
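
The "substring pattern removal" step in the loop above relies on the
shell's ${var#pattern} expansion, which strips the shortest leading
match of *-*- and so leaves only the hardware queue number. A minimal
sketch on sample IRQ names; the helper name queue_of and the second
sample name are assumptions for illustration, nothing is read from
/proc:

```shell
#!/bin/sh
# queue_of IRQNAME: print the queue number at the end of a name such as
# "ixgbe2-TxRx-2" (hypothetical helper, same expansion as the loop above).
queue_of() {
    irqname=$1
    # "#" removes the shortest prefix matching *-*- (up to the 2nd dash)
    printf '%s\n' "${irqname#*-*-}"
}

queue_of "ixgbe2-TxRx-2"
queue_of "ixgbe1-TxRx-11"
```

The queue number is then written as-is into smp_affinity_list, so this
scheme assumes queue N should be served by CPU N.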

But for Mellanox I had to use this:

echo " --- Align IRQs : mlx5 ---"
for F in /proc/irq/*/mlx5_comp*/../smp_affinity; do
        dir=$(dirname $F) ;
        cat $dir/affinity_hint > $F
done
grep -H . /proc/irq/*/mlx5_comp*/../smp_affinity_list
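
To see what the mlx5 loop above does without touching real IRQ settings,
here is a sketch that runs the same copy against a throw-away directory
tree. The function name align_mlx5_irqs, the IRQ numbers, the directory
names and the mask values are all made up; on a real system the root
would be /proc/irq and the script would need root privileges.

```shell
#!/bin/sh
# align_mlx5_irqs ROOT: for every mlx5_comp* IRQ under ROOT, copy the
# driver-provided affinity_hint into smp_affinity (hypothetical wrapper
# around the loop above).
align_mlx5_irqs() {
    root=$1
    for F in "$root"/*/mlx5_comp*/../smp_affinity; do
        [ -e "$F" ] || continue      # glob matched nothing
        dir=$(dirname "$F")
        cat "$dir/affinity_hint" > "$F"
    done
}

# Dry run on a fake tree; nothing here reflects a real machine.
demo=$(mktemp -d)
mkdir -p "$demo/41/mlx5_comp0" "$demo/42/mlx5_comp1"
echo 01 > "$demo/41/affinity_hint"; echo ff > "$demo/41/smp_affinity"
echo 02 > "$demo/42/affinity_hint"; echo ff > "$demo/42/smp_affinity"
align_mlx5_irqs "$demo"
grep -H . "$demo"/*/smp_affinity
rm -rf "$demo"
```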


> > $ netperf -H 198.18.1.1 -D1 -T1,1 -t TCP_STREAM -l 120
> > MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 198.18.1.1 () port 0 AF_INET : histogram : demo : cpu bind
> > Interim result: 26990.45 10^6bits/s over 1.000 seconds ending at 1559149733.554
> > Interim result: 27730.35 10^6bits/s over 1.000 seconds ending at 1559149734.554
> > Interim result: 27725.76 10^6bits/s over 1.000 seconds ending at 1559149735.554
> > Interim result: 27513.39 10^6bits/s over 1.008 seconds ending at 1559149736.561
> > Interim result: 27421.46 10^6bits/s over 1.003 seconds ending at 1559149737.565
> > ^CInterim result: 27523.62 10^6bits/s over 0.580 seconds ending at 1559149738.145
> > Recv   Send    Send
> > Socket Socket  Message  Elapsed
> > Size   Size    Size     Time     Throughput
> > bytes  bytes   bytes    secs.    10^6bits/sec
> > 
> > 131072  16384  16384    5.59     27473.50
> >
> >   
> >> I use 5.1-rc3, compiled myself using Ubuntu 18.04's latest .config file.  
> > 
> > I use 5.1.0-bpf-next (with some patches on top of commit 35c99ffa20).
> >   
> I'm rebasing on 5.1.5, I do not wish to go too leading edge on this 
> project (unless needed).
>
> I do have one patch to copy the RSS hash in the xdp_buff, but the field 
> is read even if xdp is disabled.

What is your use-case for this?

Upstream will likely request that this is added as xdp_buff->metadata
and using BTF format... but it is a longer project see[1], and is
currently scheduled as a "medium-term" task... let us know if you want
to work on this...

[1] https://github.com/xdp-project/xdp-project/blob/master/xdp-project.org#metadata-available-to-programs
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


* Re: Bad XDP performance with mlx5
  2019-05-30  7:40     ` Jesper Dangaard Brouer
@ 2019-05-30  8:55       ` Tom Barbette
  2019-05-31  6:51         ` Tom Barbette
  0 siblings, 1 reply; 13+ messages in thread
From: Tom Barbette @ 2019-05-30  8:55 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: xdp-newbies, Toke Høiland-Jørgensen

On 30/05/2019 at 09:40, Jesper Dangaard Brouer wrote:
>> server$ sudo ethtool -X dpdk1 equal 2
> 
> Interesting use of ethtool -X (Set Rx flow hash indirection table), I
> could use that myself in some of my tests.  I usually change the number
> of RX-queue via ethtool -L (or --set-channels), which the i40e/XL710
> have issues with...

Yes and it doesn't kill the existing queues for multiple seconds, and 
keeps the affinity of the IRQs.

> 
> What SMP affinity script are you using?
> 
> The mellanox drivers uses another "layout"/name-scheme
> in /proc/irq/*/*name*/../smp_affinity_list.
> 
> Normal Intel based nics I use this:
> 
> echo " --- Align IRQs ---"
> # I've named my NICs ixgbe1 + ixgbe2
> for F in /proc/irq/*/ixgbe*-TxRx-*/../smp_affinity_list; do
>     # Extract irqname e.g. "ixgbe2-TxRx-2"
>     irqname=$(basename $(dirname $(dirname $F))) ;
>     # Substring pattern removal
>     hwq_nr=${irqname#*-*-}
>     echo $hwq_nr > $F
>     #grep . -H $F;
> done
> grep -H . /proc/irq/*/ixgbe*/../smp_affinity_list
> 
> But for Mellanox I had to use this:
> 
> echo " --- Align IRQs : mlx5 ---"
> for F in /proc/irq/*/mlx5_comp*/../smp_affinity; do
>          dir=$(dirname $F) ;
>          cat $dir/affinity_hint > $F
> done
> grep -H . /proc/irq/*/mlx5_comp*/../smp_affinity_list
> 

Correct, I used the script installed by the Mellanox OFED for all NICs. 
I think that one works for all of them. Anyway, this could explain why
the netperf case went to the same core with i40e, while with my iperf
test the 50 flows were distributed all around.

I made a video (enable subtitles). I just recompiled with a clean 5.1.5 
(no RSS modification or anything), and it's the same thing that you can
see in the video. The iperf in the video is the normal iperf2. Enabling 
the xdp_pass program creates a huge CPU increase with the CX5. With the
XL710 I get only a 1 or 2% increase per CPU.

https://www.youtube.com/watch?v=o5hlJZbN4Tk&feature=youtu.be

>>
>> I do have one patch to copy the RSS hash in the xdp_buff, but the field
>> is read even if xdp is disabled.
> 
> What is you use-case for this?

Load balancing. No need to re-compute a hash in SW if HW did it...

> 
> Upstream will likely request that this is added as xdp_buff->metadata
> and using BTF format... but it is a longer project see[1], and is
> currently scheduled as a "medium-term" task... let us know if you want
> to work on this...
> 
> [1] https://github.com/xdp-project/xdp-project/blob/master/xdp-project.org#metadata-available-to-programs
>
I would be happy to help. But we would be building an sk_buff all over
again if we start throwing into the buff everything that someone may
need for one use case. Already, the first time I looked at XDP it was
"data + len", and now there are something like 10 fields extracted,
along with the queue info.

On the other hand, I found the BPF resolver (name? the thing that 
translates the BPF offset to the real struct offset) to be super neat. 
Wouldn't a driver be able to expose a specific per-driver resolver that 
knows how to fetch all this random information from the descriptors,
and therefore does so only if needed?

Tom


* Re: Bad XDP performance with mlx5
  2019-05-30  8:55       ` Tom Barbette
@ 2019-05-31  6:51         ` Tom Barbette
  2019-05-31 16:18           ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 13+ messages in thread
From: Tom Barbette @ 2019-05-31  6:51 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: xdp-newbies, Toke Høiland-Jørgensen, Saeed Mahameed,
	Leon Romanovsky, Tariq Toukan

CCing mlx5 maintainers and committers of bce2b2b. TL;DR: there is a huge 
CPU increase on CX5 when introducing an XDP program. See 
https://www.youtube.com/watch?v=o5hlJZbN4Tk&feature=youtu.be around 
0:40. We're talking something like 15% while it's near 0 for other 
drivers. The machine is a recent Skylake. For us it makes XDP unusable. 
Is that a known problem?

I wonder if it doesn't simply come from mlx5/en_main.c:

rq->buff.map_dir = rq->xdp_prog ? DMA_BIDIRECTIONAL : DMA_FROM_DEVICE;

Which would be in line with my observation that memory access seems
heavier. I guess this is for the XDP_TX case.

If this is indeed the problem, any chance we can:
a) detect automatically that a program will not return XDP_TX (I'm not
quite sure what the BPF limitations allow us to guess in advance) or
b) add a flag such as XDP_FLAGS_NO_TX to avoid such a performance hit
when not needed?

Thanks,

Tom

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Bad XDP performance with mlx5
  2019-05-31  6:51         ` Tom Barbette
@ 2019-05-31 16:18           ` Jesper Dangaard Brouer
  2019-05-31 18:00             ` David Miller
  2019-05-31 18:06             ` Saeed Mahameed
  0 siblings, 2 replies; 13+ messages in thread
From: Jesper Dangaard Brouer @ 2019-05-31 16:18 UTC (permalink / raw)
  To: Tom Barbette
  Cc: xdp-newbies, Toke Høiland-Jørgensen, Saeed Mahameed,
	Leon Romanovsky, Tariq Toukan, brouer, netdev@vger.kernel.org


On Fri, 31 May 2019 08:51:43 +0200 Tom Barbette <barbette@kth.se> wrote:

> CCing mlx5 maintainers and committers of bce2b2b. TL;DR: there is a huge
> CPU increase on CX5 when introducing an XDP program.
>
> See https://www.youtube.com/watch?v=o5hlJZbN4Tk&feature=youtu.be
> around 0:40. We're talking something like 15% while it's near 0 for
> other drivers. The machine is a recent Skylake. For us it makes XDP
> unusable. Is that a known problem?

I have a similar test setup, and I can reproduce. I have found the
root cause, see below.  But on my system it was even worse: with an
XDP_PASS program loaded and iperf (6 parallel TCP flows) I would see
100% CPU usage and total 83.3 Gbits/sec. In the non-XDP case, I saw 58%
CPU (43% idle) and total 89.7 Gbits/sec.

 
> I wonder if it doesn't simply come from mlx5/en_main.c:
> rq->buff.map_dir = rq->xdp_prog ? DMA_BIDIRECTIONAL : DMA_FROM_DEVICE;
> 

Nope, that is not the problem.

> Which would be in line with my observation that memory access seems
> heavier. I guess this is for the XDP_TX case.
> 
> If this is indeed the problem, any chance we can:
> a) detect automatically that a program will not return XDP_TX (I'm not
> quite sure what the BPF limitations allow us to guess in advance) or
> b) add a flag such as XDP_FLAGS_NO_TX to avoid such a performance hit
> when not needed?

This was kind of hard to root-cause, but I solved it by increasing the TCP
socket size used by the iperf tool, like this (please reproduce):

$ iperf -s --window 4M
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size:  416 KByte (WARNING: requested 4.00 MByte)
------------------------------------------------------------
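
The WARNING line above suggests the kernel granted less than the requested 4 MByte. On Linux, SO_RCVBUF requests are capped by net.core.rmem_max and the value read back is double what was stored; 416 KByte equals 2 x 212992 bytes, which would match a common default for rmem_max (an assumption about this machine, not something stated in the thread). A minimal sketch to check what a given host grants, with a made-up helper name:

```python
import socket


def granted_rcvbuf(requested_bytes):
    """Request a receive buffer size and return what the kernel reports back."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested_bytes)
        return s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)


if __name__ == "__main__":
    want = 4 * 1024 * 1024  # same request as "iperf -s --window 4M"
    got = granted_rcvbuf(want)
    print(f"requested {want} bytes, kernel reports {got} bytes")
```

If the reported value stays well below the request, raising net.core.rmem_max (and net.core.wmem_max on the sender) is the usual way to let the larger window take effect.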

Given I could reproduce, I took a closer look at perf record/report stats,
and it was actually quite clear that this was related to stalling on getting
pages from the page allocator (function calls top#6 get_page_from_freelist
and free_pcppages_bulk).

Using my tool: ethtool_stats.pl
 https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl

It was clear that the mlx5 driver page-cache was not working:
 Ethtool(mlx5p1  ) stat:     6653761 (   6,653,761) <= rx_cache_busy /sec
 Ethtool(mlx5p1  ) stat:     6653732 (   6,653,732) <= rx_cache_full /sec
 Ethtool(mlx5p1  ) stat:      669481 (     669,481) <= rx_cache_reuse /sec
 Ethtool(mlx5p1  ) stat:           1 (           1) <= rx_congst_umr /sec
 Ethtool(mlx5p1  ) stat:     7323230 (   7,323,230) <= rx_csum_unnecessary /sec
 Ethtool(mlx5p1  ) stat:        1034 (       1,034) <= rx_discards_phy /sec
 Ethtool(mlx5p1  ) stat:     7323230 (   7,323,230) <= rx_packets /sec
 Ethtool(mlx5p1  ) stat:     7324244 (   7,324,244) <= rx_packets_phy /sec

While the non-XDP case looked like this:
 Ethtool(mlx5p1  ) stat:      298929 (     298,929) <= rx_cache_busy /sec
 Ethtool(mlx5p1  ) stat:      298971 (     298,971) <= rx_cache_full /sec
 Ethtool(mlx5p1  ) stat:     3548789 (   3,548,789) <= rx_cache_reuse /sec
 Ethtool(mlx5p1  ) stat:     7695476 (   7,695,476) <= rx_csum_complete /sec
 Ethtool(mlx5p1  ) stat:     7695476 (   7,695,476) <= rx_packets /sec
 Ethtool(mlx5p1  ) stat:     7695169 (   7,695,169) <= rx_packets_phy /sec
Manual consistency calc: 7695476-((3548789*2)+(298971*2)) = -44

With the increased TCP window size, the mlx5 driver cache is working better,
but not optimally, see below. I'm getting 88.0 Gbits/sec with 68% CPU usage.
 Ethtool(mlx5p1  ) stat:      894438 (     894,438) <= rx_cache_busy /sec
 Ethtool(mlx5p1  ) stat:      894453 (     894,453) <= rx_cache_full /sec
 Ethtool(mlx5p1  ) stat:     6638518 (   6,638,518) <= rx_cache_reuse /sec
 Ethtool(mlx5p1  ) stat:           6 (           6) <= rx_congst_umr /sec
 Ethtool(mlx5p1  ) stat:     7532983 (   7,532,983) <= rx_csum_unnecessary /sec
 Ethtool(mlx5p1  ) stat:         164 (         164) <= rx_discards_phy /sec
 Ethtool(mlx5p1  ) stat:     7532983 (   7,532,983) <= rx_packets /sec
 Ethtool(mlx5p1  ) stat:     7533193 (   7,533,193) <= rx_packets_phy /sec
Manual consistency calc: 7532983-(6638518+894453) = 12
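
The three counter sets above can be compared through a single number, the share of RX pages that came back from the driver cache. The sketch below assumes, as the consistency calcs do, that every page is accounted for either in rx_cache_reuse or in rx_cache_full; the dictionary keys are labels chosen for this example.

```python
# Per-second counters copied from the ethtool_stats.pl output above.
stats = {
    "XDP, default window": {"reuse": 669481, "full": 6653732},
    "non-XDP": {"reuse": 3548789, "full": 298971},
    "XDP, --window 4M": {"reuse": 6638518, "full": 894453},
}


def reuse_ratio(reuse, full):
    """Fraction of pages served from the driver page cache."""
    return reuse / (reuse + full)


if __name__ == "__main__":
    for name, s in stats.items():
        print(f"{name:22s} reuse ratio {reuse_ratio(s['reuse'], s['full']):.1%}")
```

With these numbers the ratio is roughly 9% for XDP with the default window, 92% for non-XDP and 88% for XDP with the larger window.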

To understand why this is happening, you first have to know the
difference between the two RX-memory modes used by mlx5 for non-XDP vs
XDP. With non-XDP two frames are stored per memory-page, while for XDP only
a single frame per page is used.  The number of packets available in the
RX-rings is actually the same, as the ring sizes are non-XDP=512 vs. XDP=1024.

I believe the real issue is that TCP uses the SKB->truesize (based on frame
size) for various memory pressure and window calculations, which is why
increasing the window size manually solved the issue.
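
A back-of-the-envelope sketch of that effect, assuming 4 KiB pages so that a frame is charged about half a page in the non-XDP mode and a full page in the XDP mode. Real truesize values also include SKB overhead, and TCP does not hand the whole buffer to the window, so treat the numbers as an illustration only. The 416 KByte figure is the window iperf reported above.

```python
PAGE_SIZE = 4096  # assumed page size in bytes


def frames_that_fit(rcvbuf_bytes, frames_per_page):
    """Number of frames a receive buffer can account for at a given truesize."""
    truesize = PAGE_SIZE // frames_per_page  # memory charged per frame
    return rcvbuf_bytes // truesize


if __name__ == "__main__":
    rcvbuf = 416 * 1024
    print("non-XDP (2 frames/page):", frames_that_fit(rcvbuf, 2))
    print("XDP     (1 frame/page): ", frames_that_fit(rcvbuf, 1))
```

Under these assumptions the same socket buffer accounts for half as many in-flight frames once XDP is loaded, which is consistent with a larger window hiding the problem.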

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Bad XDP performance with mlx5
  2019-05-31 16:18           ` Jesper Dangaard Brouer
@ 2019-05-31 18:00             ` David Miller
  2019-05-31 18:06             ` Saeed Mahameed
  1 sibling, 0 replies; 13+ messages in thread
From: David Miller @ 2019-05-31 18:00 UTC (permalink / raw)
  To: brouer; +Cc: barbette, xdp-newbies, toke, saeedm, leonro, tariqt, netdev

From: Jesper Dangaard Brouer <brouer@redhat.com>
Date: Fri, 31 May 2019 18:18:17 +0200

> On Fri, 31 May 2019 08:51:43 +0200 Tom Barbette <barbette@kth.se> wrote:
> 
>> I wonder if it doesn't simply come from mlx5/en_main.c:
>> rq->buff.map_dir = rq->xdp_prog ? DMA_BIDIRECTIONAL : DMA_FROM_DEVICE;
>> 
> 
> Nope, that is not the problem.

And it's easy to test this theory by forcing DMA_FROM_DEVICE.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Bad XDP performance with mlx5
  2019-05-31 16:18           ` Jesper Dangaard Brouer
  2019-05-31 18:00             ` David Miller
@ 2019-05-31 18:06             ` Saeed Mahameed
  2019-05-31 21:57               ` Jesper Dangaard Brouer
  2019-06-04  7:28               ` Tom Barbette
  1 sibling, 2 replies; 13+ messages in thread
From: Saeed Mahameed @ 2019-05-31 18:06 UTC (permalink / raw)
  To: barbette@kth.se, brouer@redhat.com
  Cc: toke@redhat.com, xdp-newbies@vger.kernel.org,
	netdev@vger.kernel.org, Leon Romanovsky, Tariq Toukan

On Fri, 2019-05-31 at 18:18 +0200, Jesper Dangaard Brouer wrote:
> On Fri, 31 May 2019 08:51:43 +0200 Tom Barbette <barbette@kth.se>
> wrote:
> 
> > CCing mlx5 maintainers and committers of bce2b2b. TL;DR: there is a
> > huge CPU increase on CX5 when introducing an XDP program.
> > 
> > See https://www.youtube.com/watch?v=o5hlJZbN4Tk&feature=youtu.be
> > around 0:40. We're talking something like 15% while it's near 0 for
> > other drivers. The machine is a recent Skylake. For us it makes XDP
> > unusable. Is that a known problem?
> 

The question is: at the same packet rate/bandwidth, do you see higher
CPU utilization on mlx5 compared to other drivers? You have to compare
apples to apples.


> I have a similar test setup, and I can reproduce. I have found the
> root-cause see below.  But on my system it was even worse, with an
> XDP_PASS program loaded, and iperf (6 parallel TCP flows) I would see
> 100% CPU usage and total 83.3 Gbits/sec. With non-XDP case, I saw 58%
> CPU (43% idle) and total 89.7 Gbits/sec.
> 
>  
> > I wonder if it doesn't simply come from mlx5/en_main.c:
> > rq->buff.map_dir = rq->xdp_prog ? DMA_BIDIRECTIONAL : DMA_FROM_DEVICE;
> > 
> 
> Nope, that is not the problem.
> 
> > Which would be in line with my observation that memory access seems
> > heavier. I guess this is for the XDP_TX case.
> > 
> > If this is indeed the problem, any chance we can:
> > a) detect automatically that a program will not return XDP_TX (I'm not
> > quite sure what the BPF limitations allow us to guess in advance) or
> > b) add a flag such as XDP_FLAGS_NO_TX to avoid such a performance hit
> > when not needed?
> 
> This was kind of hard to root-cause, but I solved it by increasing the TCP
> socket size used by the iperf tool, like this (please reproduce):
> 
> $ iperf -s --window 4M
> ------------------------------------------------------------
> Server listening on TCP port 5001
> TCP window size:  416 KByte (WARNING: requested 4.00 MByte)
> ------------------------------------------------------------
> 
> Given I could reproduce, I took a closer look at perf record/report stats,
> and it was actually quite clear that this was related to stalling on getting
> pages from the page allocator (function calls top#6 get_page_from_freelist
> and free_pcppages_bulk).
> 
> Using my tool: ethtool_stats.pl
>  https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl
> 
> It was clear that the mlx5 driver page-cache was not working:
>  Ethtool(mlx5p1  ) stat:     6653761 (   6,653,761) <= rx_cache_busy /sec
>  Ethtool(mlx5p1  ) stat:     6653732 (   6,653,732) <= rx_cache_full /sec
>  Ethtool(mlx5p1  ) stat:      669481 (     669,481) <= rx_cache_reuse /sec
>  Ethtool(mlx5p1  ) stat:           1 (           1) <= rx_congst_umr /sec
>  Ethtool(mlx5p1  ) stat:     7323230 (   7,323,230) <= rx_csum_unnecessary /sec
>  Ethtool(mlx5p1  ) stat:        1034 (       1,034) <= rx_discards_phy /sec
>  Ethtool(mlx5p1  ) stat:     7323230 (   7,323,230) <= rx_packets /sec
>  Ethtool(mlx5p1  ) stat:     7324244 (   7,324,244) <= rx_packets_phy /sec
> 
> While the non-XDP case looked like this:
>  Ethtool(mlx5p1  ) stat:      298929 (     298,929) <= rx_cache_busy /sec
>  Ethtool(mlx5p1  ) stat:      298971 (     298,971) <= rx_cache_full /sec
>  Ethtool(mlx5p1  ) stat:     3548789 (   3,548,789) <= rx_cache_reuse /sec
>  Ethtool(mlx5p1  ) stat:     7695476 (   7,695,476) <= rx_csum_complete /sec
>  Ethtool(mlx5p1  ) stat:     7695476 (   7,695,476) <= rx_packets /sec
>  Ethtool(mlx5p1  ) stat:     7695169 (   7,695,169) <= rx_packets_phy /sec
> Manual consistency calc: 7695476-((3548789*2)+(298971*2)) = -44
> 
> With the increased TCP window size, the mlx5 driver cache is working better,
> but not optimally, see below. I'm getting 88.0 Gbits/sec with 68% CPU usage.
>  Ethtool(mlx5p1  ) stat:      894438 (     894,438) <= rx_cache_busy /sec
>  Ethtool(mlx5p1  ) stat:      894453 (     894,453) <= rx_cache_full /sec
>  Ethtool(mlx5p1  ) stat:     6638518 (   6,638,518) <= rx_cache_reuse /sec
>  Ethtool(mlx5p1  ) stat:           6 (           6) <= rx_congst_umr /sec
>  Ethtool(mlx5p1  ) stat:     7532983 (   7,532,983) <= rx_csum_unnecessary /sec
>  Ethtool(mlx5p1  ) stat:         164 (         164) <= rx_discards_phy /sec
>  Ethtool(mlx5p1  ) stat:     7532983 (   7,532,983) <= rx_packets /sec
>  Ethtool(mlx5p1  ) stat:     7533193 (   7,533,193) <= rx_packets_phy /sec
> Manual consistency calc: 7532983-(6638518+894453) = 12
> 
> To understand why this is happening, you first have to know the
> difference between the two RX-memory modes used by mlx5 for non-XDP vs
> XDP. With non-XDP two frames are stored per memory-page, while for XDP only
> a single frame per page is used.  The number of packets available in the
> RX-rings is actually the same, as the ring sizes are non-XDP=512 vs. XDP=1024.
> 

Thanks Jesper! This was a well put together explanation.
I want to point out that some other drivers use the alloc_skb APIs,
which provide a good caching mechanism, even better than the mlx5
internal one (which uses the alloc_page APIs directly). This can
explain the difference, and your explanation shows the root cause of
the higher CPU utilization with XDP on mlx5, since the mlx5 page cache
works at half of its capacity when XDP is enabled.

Now, do we really need to keep this page per packet in mlx5 when XDP is
enabled? I think it is time to drop that...

> I believe the real issue is that TCP uses the SKB->truesize (based on frame
> size) for various memory pressure and window calculations, which is why
> increasing the window size manually solved the issue.
> 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Bad XDP performance with mlx5
  2019-05-31 18:06             ` Saeed Mahameed
@ 2019-05-31 21:57               ` Jesper Dangaard Brouer
  2019-06-04  7:28               ` Tom Barbette
  1 sibling, 0 replies; 13+ messages in thread
From: Jesper Dangaard Brouer @ 2019-05-31 21:57 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: barbette@kth.se, toke@redhat.com, xdp-newbies@vger.kernel.org,
	netdev@vger.kernel.org, Leon Romanovsky, Tariq Toukan, brouer


On Fri, 31 May 2019 18:06:01 +0000 Saeed Mahameed <saeedm@mellanox.com> wrote:

> On Fri, 2019-05-31 at 18:18 +0200, Jesper Dangaard Brouer wrote:
[...]
> > 
> > To understand why this is happening, you first have to know the
> > difference between the two RX-memory modes used by mlx5 for non-XDP
> > vs XDP. With non-XDP two frames are stored per memory-page, while
> > for XDP only a single frame per page is used.  The number of packets
> > available in the RX-rings is actually the same, as the ring sizes
> > are non-XDP=512 vs. XDP=1024.
> 
> Thanks Jesper! This was a well put together explanation.
> I want to point out that some other drivers use the alloc_skb APIs,
> which provide a good caching mechanism, even better than the mlx5
> internal one (which uses the alloc_page APIs directly). This can
> explain the difference, and your explanation shows the root cause of
> the higher CPU utilization with XDP on mlx5, since the mlx5 page cache
> works at half of its capacity when XDP is enabled.
> 
> Now, do we really need to keep this page per packet in mlx5 when XDP is
> enabled? I think it is time to drop that...

No, we need to keep the page per packet (at least until, I've solved
some corner-cases with page_pool, which could likely require getting a
page-flag).

> > I believe the real issue is that TCP uses the SKB->truesize (based
> > on frame size) for various memory pressure and window
> > calculations, which is why increasing the window size manually
> > solved the issue.

The TCP performance issue is not solely a SKB->truesize issue, but also
an issue with how the driver level page-cache works.  It is actually
very fragile, as a single page with an elevated refcnt can block the cache
(see mlx5e_rx_cache_get()).  This easily happens with TCP packets
that are waiting to be re-transmitted in case of loss.  This is
happening here, as indicated by the rx_cache_busy and rx_cache_full
being the same.
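
The blocking behaviour can be sketched with a toy model. This is not the driver code: it only assumes a FIFO cache in which reuse is attempted from the head alone and refused whenever that page is still referenced elsewhere, and all class and function names are invented for the illustration.

```python
from collections import deque


class Page:
    def __init__(self, name):
        self.name = name
        self.refcnt = 1  # 1 means only the driver cache holds the page


class PageCacheModel:
    """Toy FIFO page cache: only the head page is ever considered for reuse."""

    def __init__(self, size):
        self.size = size
        self.fifo = deque()
        self.stats = {"reuse": 0, "busy": 0, "full": 0, "empty": 0}

    def get(self):
        if not self.fifo:
            self.stats["empty"] += 1
            return None
        if self.fifo[0].refcnt != 1:
            self.stats["busy"] += 1  # head still in use elsewhere: give up
            return None
        self.stats["reuse"] += 1
        return self.fifo.popleft()

    def put(self, page):
        if len(self.fifo) >= self.size:
            self.stats["full"] += 1  # no room: page goes back to the allocator
            return False
        self.fifo.append(page)
        return True


def simulate(rounds, size=4):
    cache = PageCacheModel(size)
    for i in range(size):
        cache.put(Page(f"p{i}"))
    cache.fifo[0].refcnt = 2  # head page still held, e.g. by a queued TCP SKB
    for _ in range(rounds):
        page = cache.get() or Page("fresh")  # miss: take a fresh page instead
        cache.put(page)                      # try to recycle it after RX
    return cache.stats


if __name__ == "__main__":
    print(simulate(1000))
```

In this model one stuck head page makes every get fail as busy and every put fail as full, so the two counters advance in lockstep while reuse stays at zero, the same pattern as in the XDP counters.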

We (Ilias, Tariq and I) have been planning to remove this small driver
cache, and instead use the page_pool, and create a page-return path for
SKBs.  Which should make this problem go away.  I'm going to be working
on this the next couple of weeks (the tricky part is all the corner
cases).

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

On Fri, 31 May 2019 18:18:17 +0200
Jesper Dangaard Brouer <brouer@redhat.com> wrote:

> It was clear that the mlx5 driver page-cache was not working:
>  Ethtool(mlx5p1  ) stat:     6653761 (   6,653,761) <= rx_cache_busy /sec
>  Ethtool(mlx5p1  ) stat:     6653732 (   6,653,732) <= rx_cache_full /sec
>  Ethtool(mlx5p1  ) stat:      669481 (     669,481) <= rx_cache_reuse /sec
>  Ethtool(mlx5p1  ) stat:           1 (           1) <= rx_congst_umr /sec
>  Ethtool(mlx5p1  ) stat:     7323230 (   7,323,230) <= rx_csum_unnecessary /sec
>  Ethtool(mlx5p1  ) stat:        1034 (       1,034) <= rx_discards_phy /sec
>  Ethtool(mlx5p1  ) stat:     7323230 (   7,323,230) <= rx_packets /sec
>  Ethtool(mlx5p1  ) stat:     7324244 (   7,324,244) <= rx_packets_phy /sec

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Bad XDP performance with mlx5
  2019-05-31 18:06             ` Saeed Mahameed
  2019-05-31 21:57               ` Jesper Dangaard Brouer
@ 2019-06-04  7:28               ` Tom Barbette
  2019-06-04  9:15                 ` Jesper Dangaard Brouer
  2019-06-04 18:35                 ` Saeed Mahameed
  1 sibling, 2 replies; 13+ messages in thread
From: Tom Barbette @ 2019-06-04  7:28 UTC (permalink / raw)
  To: Saeed Mahameed, brouer@redhat.com
  Cc: toke@redhat.com, xdp-newbies@vger.kernel.org, Leon Romanovsky,
	Tariq Toukan

On 31/05/2019 at 20:06, Saeed Mahameed wrote:
> 
> The question is: at the same packet rate/bandwidth, do you see higher
> CPU utilization on mlx5 compared to other drivers? You have to compare
> apples to apples.
> 
I meant the relative increase. Of course at 40G the XL710 is using less CPU,
but activating XDP is nearly free. As XDP is purely per packet I would
expect the cost of it to be similar, e.g. a few instructions per packet.

Thanks Jesper for looking into this!

I don't think I will be of much further help on this matter. My takeaway
would be: as a first-time user looking into XDP after watching a dozen
XDP talks, I would have expected XDP default settings to be identical
to SKB, so I don't have to watch out for a per-driver parameter
checklist to avoid increasing my CPU consumption by 15% when inserting
"a super efficient and light BPF program". But I understand it's not
that easy...

Tom

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Bad XDP performance with mlx5
  2019-06-04  7:28               ` Tom Barbette
@ 2019-06-04  9:15                 ` Jesper Dangaard Brouer
  2019-06-04 18:35                 ` Saeed Mahameed
  1 sibling, 0 replies; 13+ messages in thread
From: Jesper Dangaard Brouer @ 2019-06-04  9:15 UTC (permalink / raw)
  To: Tom Barbette
  Cc: Saeed Mahameed, toke@redhat.com, xdp-newbies@vger.kernel.org,
	Leon Romanovsky, Tariq Toukan, brouer, Björn Töpel,
	Karlsson, Magnus, Jakub Kicinski, netdev@vger.kernel.org

On Tue, 4 Jun 2019 09:28:22 +0200
Tom Barbette <barbette@kth.se> wrote:

> Thanks Jesper for looking into this!
> 
> I don't think I will be of much further help on this matter. My
> takeaway would be: as a first-time user looking into XDP after
> watching a dozen XDP talks, I would have expected XDP default
> settings to be identical to SKB, so I don't have to watch out for a
> per-driver parameter checklist to avoid increasing my CPU consumption
> by 15% when inserting "a super efficient and light BPF program". But
> I understand it's not that easy...

The gap should not be this large, but as I demonstrated it was primarily
because you hit an unfortunate interaction between TCP and how the mlx5
driver does page-caching (p.s. we are working on removing this driver
local recycle-cache).
  When loading an XDP/eBPF-prog, the driver changes the underlying RX
memory model, which wastes memory to gain packets-per-sec speed, but TCP
sees this memory waste and gives us a penalty.

It is important to understand, that XDP is not optimized for TCP.  XDP
is designed and optimized for L2-L3 handling of packets (TCP is L4).
Before XDP these L2-L3 use-cases were "slow", because the kernel
netstack assumes a L4/socket use-case (full SKB), when less was really
needed.

This is actually another good example of why XDP programs per RX-queue,
will be useful (notice: which is not implemented upstream, yet...).

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Bad XDP performance with mlx5
  2019-06-04  7:28               ` Tom Barbette
  2019-06-04  9:15                 ` Jesper Dangaard Brouer
@ 2019-06-04 18:35                 ` Saeed Mahameed
  1 sibling, 0 replies; 13+ messages in thread
From: Saeed Mahameed @ 2019-06-04 18:35 UTC (permalink / raw)
  To: barbette@kth.se, Eran Ben Elisha, Maxim Mikityanskiy,
	brouer@redhat.com
  Cc: toke@redhat.com, xdp-newbies@vger.kernel.org, Leon Romanovsky,
	Tariq Toukan

On Tue, 2019-06-04 at 09:28 +0200, Tom Barbette wrote:
> On 31/05/2019 at 20:06, Saeed Mahameed wrote:
> > The question is: at the same packet rate/bandwidth, do you see higher
> > CPU utilization on mlx5 compared to other drivers? You have to compare
> > apples to apples.
> > 
> I meant the relative increase. Of course at 40G the XL710 is using less CPU,
> but activating XDP is nearly free. As XDP is purely per packet I would
> expect the cost of it to be similar, e.g. a few instructions per packet.
> 
> 
> Thanks Jesper for looking into this!
> 
> I don't think I will be of much further help on this matter. My takeaway
> would be: as a first-time user looking into XDP after watching a dozen
> XDP talks, I would have expected XDP default settings to be identical
> to SKB, so I don't have to watch out for a per-driver parameter
> checklist to avoid increasing my CPU consumption by 15% when inserting
> "a super efficient and light BPF program". But I understand it's not
> that easy...
> 

Hi Tom,

Don't give up so easily on XDP :)
Let me do a quick test and see how we can help here.

> Tom

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2019-06-04 18:36 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2019-05-29 16:03 Bad XDP performance with mlx5 Tom Barbette
2019-05-29 17:16 ` Jesper Dangaard Brouer
2019-05-29 18:16   ` Tom Barbette
2019-05-30  7:40     ` Jesper Dangaard Brouer
2019-05-30  8:55       ` Tom Barbette
2019-05-31  6:51         ` Tom Barbette
2019-05-31 16:18           ` Jesper Dangaard Brouer
2019-05-31 18:00             ` David Miller
2019-05-31 18:06             ` Saeed Mahameed
2019-05-31 21:57               ` Jesper Dangaard Brouer
2019-06-04  7:28               ` Tom Barbette
2019-06-04  9:15                 ` Jesper Dangaard Brouer
2019-06-04 18:35                 ` Saeed Mahameed
