* Linux network is damn fast, need more use XDP (Was: DC behaviors today)
From: Jesper Dangaard Brouer @ 2017-12-04 10:56 UTC
  To: Dave Taht
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
      bloat-JXvr2/1DY2fm6VMwtOF2vx4hnT+Y9+D1, Christina Jacob,
      Joel Wirāmu Pauling,
      cerowrt-devel-JXvr2/1DY2fm6VMwtOF2vx4hnT+Y9+D1@public.gmane.org,
      David Ahern, Tariq Toukan

On Sun, 03 Dec 2017 20:19:33 -0800 Dave Taht <dave@taht.net> wrote:

> Changing the topic, adding bloat.

Adding netdev, and also adjusting the topic to be a rant about how the
Linux kernel network stack is actually damn fast, and if you need
something faster then XDP can solve your needs...

> Joel Wirāmu Pauling <joel@aenertia.net> writes:
>
> > Just from a Telco/Industry perspective slant.
> >
> > Everything in DC has moved to SFP28 interfaces at 25Gbit as the server
> > port of interconnect. Everything TOR wise is now QSFP28 - 100Gbit.
> > Mellanox X5 cards are the current hotness, and their offload
> > enhancements (ASAP2 - which is sorta like DPDK on steroids) allows for
> > OVS flow rules programming into the card. We have a lot of customers
> > chomping at the bit for that feature (disclaimer I work for Nuage
> > Networks, and we are working on enhanced OVS to do just that) for NFV
> > workloads.
>
> What Jesper's been working on for ages has been to try and get linux's
> PPS up for small packets, which last I heard was hovering at about
> 4Gbits.

I hope you made a typo here, Dave; the normal Linux kernel is definitely
way beyond 4Gbit/s. You must have misunderstood something; maybe you
meant 40Gbit/s? (which is also too low)

Scaling up to more CPUs and TCP streams, Tariq[1] and I have shown that
the Linux kernel network stack scales to 94Gbit/s (line rate minus
overhead). But when the driver's page recycler fails, we hit bottlenecks
in the page allocator that cause negative scaling, down to around
43Gbit/s.

[1] http://lkml.kernel.org/r/cef85936-10b2-5d76-9f97-cb03b418fd94@mellanox.com

Linux has for a _long_ time been able to do a 10Gbit/s TCP stream easily
on a SINGLE CPU. This is mostly thanks to TSO/GRO aggregating packets,
but over the last couple of years the network stack has been optimized
(driven by UDP workloads), and as a result we can now do 10G without
TSO/GRO on a single CPU. This is "only" 812Kpps with MTU-size frames.

It is important to NOTICE that I'm mostly talking about SINGLE-CPU
performance. The Linux kernel scales very well to more CPUs, and you can
scale this up, although we are starting to hit scalability issues in
MM-land[1].

I've also demonstrated that the netdev community has optimized the
kernel's per-CPU processing power to around 2Mpps. What does this really
mean... well, with MTU-size packets 812Kpps was 10Gbit/s, thus 25Gbit/s
should be around 2Mpps. That implies Linux can do 25Gbit/s on a single
CPU without GRO (MTU-size frames). Do you need more, I ask?

> The route table lookup also really expensive on the main cpu.

Well, it used to be very expensive. Vincent Bernat wrote some excellent
blog posts[2][3] on the recent improvements across kernel versions, and
gave due credit to the people involved.
[2] https://vincent.bernat.im/en/blog/2017-performance-progression-ipv4-route-lookup-linux
[3] https://vincent.bernat.im/en/blog/2017-performance-progression-ipv6-route-lookup-linux

He measured around 25 to 35 nanoseconds of cost per route lookup. My own
recent measurements put the cost of fib_table_lookup at 36.9 ns.

> Does this stuff offload the route table lookup also?

If you have not heard, the netdev community has worked on something
called XDP (eXpress Data Path). This is a new layer in the network
stack that basically operates at the same "layer"/level as DPDK. Thus,
no surprise, we get the same performance numbers as DPDK. E.g. I can do
13.4 Mpps forwarding with ixgbe on a single CPU (more CPUs = 14.6 Mpps).

We can actually use XDP for (software) offloading of the Linux routing
table. There are two methods we are experimenting with:

(1) Externally monitor route changes from userspace and update BPF maps
to reflect them. That approach is already accepted upstream[4][5]. I'm
measuring 9,513,746 pps per CPU with that approach.

(2) Add a bpf helper to simply call fib_table_lookup() from the XDP
hook. These are still experimental patches (credit to David Ahern), and
I've measured 9,350,160 pps with this approach on a single CPU. Using
more CPUs we hit 14.6 Mpps (only 3 CPUs were used in that test).

[4] https://github.com/torvalds/linux/blob/master/samples/bpf/xdp_router_ipv4_user.c
[5] https://github.com/torvalds/linux/blob/master/samples/bpf/xdp_router_ipv4_kern.c

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
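As a rough illustration of method (1) above, a minimal XDP program of
this shape can be sketched as below. This is not the upstream sample in
[4][5]; the map layout, the names, and the "bpf_helpers.h"/"bpf_endian.h"
header paths (as found under the kernel's samples/selftests) are
assumptions for illustration. A userspace daemon is assumed to watch
netlink route updates and keep the LPM-trie map in sync.

/* Hypothetical sketch: XDP route "offload" via a BPF LPM-trie map
 * populated from userspace.  Illustrative only, not the upstream sample. */
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include "bpf_helpers.h"
#include "bpf_endian.h"

struct route_key {
	__u32 prefixlen;	/* LPM-trie keys start with the prefix length */
	__u32 daddr;		/* IPv4 destination, network byte order */
};

struct route_val {
	__u32 ifindex;		/* egress device chosen by the userspace daemon */
};

struct bpf_map_def SEC("maps") route_map = {
	.type		= BPF_MAP_TYPE_LPM_TRIE,
	.key_size	= sizeof(struct route_key),
	.value_size	= sizeof(struct route_val),
	.max_entries	= 1024,
	.map_flags	= BPF_F_NO_PREALLOC,	/* required for LPM tries */
};

SEC("xdp")
int xdp_route_sketch(struct xdp_md *ctx)
{
	void *data     = (void *)(long)ctx->data;
	void *data_end = (void *)(long)ctx->data_end;
	struct ethhdr *eth = data;
	struct iphdr *iph  = data + sizeof(*eth);
	struct route_key key;
	struct route_val *val;

	if ((void *)(iph + 1) > data_end)
		return XDP_PASS;
	if (eth->h_proto != bpf_htons(ETH_P_IP))
		return XDP_PASS;

	key.prefixlen = 32;		/* longest-prefix match on the full address */
	key.daddr     = iph->daddr;

	val = bpf_map_lookup_elem(&route_map, &key);
	if (!val)
		return XDP_PASS;	/* unknown route: fall back to the normal stack */

	/* A real forwarder would also rewrite the MAC addresses and
	 * decrement the TTL before redirecting. */
	return bpf_redirect(val->ifindex, 0);
}

char _license[] SEC("license") = "GPL";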
* Re: [Bloat] Linux network is damn fast, need more use XDP (Was: DC behaviors today)
From: Dave Taht @ 2017-12-04 17:00 UTC
  To: Jesper Dangaard Brouer
  Cc: netdev@vger.kernel.org, Christina Jacob, Joel Wirāmu Pauling,
      cerowrt-devel@lists.bufferbloat.net, bloat, David Ahern,
      Tariq Toukan

Jesper: I have a tendency to deal with netdev by itself and never cross
post there, as the bufferbloat.net servers (primarily to combat spam)
mandate starttls and vger doesn't support it at all, thus leading to
raising davem's blood pressure, which I'd rather not do. But moving on...

On Mon, Dec 4, 2017 at 2:56 AM, Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
>
> On Sun, 03 Dec 2017 20:19:33 -0800 Dave Taht <dave@taht.net> wrote:
>
>> Changing the topic, adding bloat.
>
> Adding netdev, and also adjusting the topic to be a rant about how the
> Linux kernel network stack is actually damn fast, and if you need
> something faster then XDP can solve your needs...
>
>> Joel Wirāmu Pauling <joel@aenertia.net> writes:
>>
>> > Just from a Telco/Industry perspective slant.
>> >
>> > Everything in DC has moved to SFP28 interfaces at 25Gbit as the server
>> > port of interconnect. Everything TOR wise is now QSFP28 - 100Gbit.
>> > Mellanox X5 cards are the current hotness, and their offload
>> > enhancements (ASAP2 - which is sorta like DPDK on steroids) allows for
>> > OVS flow rules programming into the card. We have a lot of customers
>> > chomping at the bit for that feature (disclaimer I work for Nuage
>> > Networks, and we are working on enhanced OVS to do just that) for NFV
>> > workloads.
>>
>> What Jesper's been working on for ages has been to try and get linux's
>> PPS up for small packets, which last I heard was hovering at about
>> 4Gbits.
>
> I hope you made a typo here, Dave; the normal Linux kernel is definitely
> way beyond 4Gbit/s. You must have misunderstood something; maybe you
> meant 40Gbit/s? (which is also too low)

The context here was PPS for *non-gro'd* tcp ack packets, in the further
context of the increasingly epic "benefits of ack filtering" thread on
the bloat list, in the context that for 50x1 end-user asymmetry we were
seeing 90% fewer acks with the new sch_cake ack-filter code, and double
the throughput... The kind of return traffic you see from data sent
outside the DC, with tons of flows. What's that number?

> Scaling up to more CPUs and TCP streams, Tariq[1] and I have shown that
> the Linux kernel network stack scales to 94Gbit/s (line rate minus
> overhead). But when the driver's page recycler fails, we hit bottlenecks
> in the page allocator that cause negative scaling, down to around
> 43Gbit/s.

So I divide 94 by ~22 and get 4Gbit for acks. Or I look at PPS * 66. Or?

> [1] http://lkml.kernel.org/r/cef85936-10b2-5d76-9f97-cb03b418fd94@mellanox.com
>
> Linux has for a _long_ time been able to do a 10Gbit/s TCP stream easily
> on a SINGLE CPU. This is mostly thanks to TSO/GRO aggregating packets,
> but over the last couple of years the network stack has been optimized
> (driven by UDP workloads), and as a result we can now do 10G without
> TSO/GRO on a single CPU. This is "only" 812Kpps with MTU-size frames.

acks.
> It is important to NOTICE that I'm mostly talking about SINGLE-CPU
> performance. The Linux kernel scales very well to more CPUs, and you can
> scale this up, although we are starting to hit scalability issues in
> MM-land[1].
>
> I've also demonstrated that the netdev community has optimized the
> kernel's per-CPU processing power to around 2Mpps. What does this really
> mean... well, with MTU-size packets 812Kpps was 10Gbit/s, thus 25Gbit/s
> should be around 2Mpps. That implies Linux can do 25Gbit/s on a single
> CPU without GRO (MTU-size frames). Do you need more, I ask?

The benchmark I had in mind was, say, 100k flows going out over the
internet, and the characteristics of the ack flows on the return path.

>> The route table lookup also really expensive on the main cpu.

To clarify the context here, I was asking specifically whether the X5
Mellanox card did routing table offload or only switching.

> Well, it used to be very expensive. Vincent Bernat wrote some excellent
> blog posts[2][3] on the recent improvements across kernel versions, and
> gave due credit to the people involved.
>
> [2] https://vincent.bernat.im/en/blog/2017-performance-progression-ipv4-route-lookup-linux
> [3] https://vincent.bernat.im/en/blog/2017-performance-progression-ipv6-route-lookup-linux
>
> He measured around 25 to 35 nanoseconds of cost per route lookup. My own
> recent measurements put the cost of fib_table_lookup at 36.9 ns.

On intel hw.

>> Does this stuff offload the route table lookup also?
>
> If you have not heard, the netdev community has worked on something
> called XDP (eXpress Data Path). This is a new layer in the network
> stack that basically operates at the same "layer"/level as DPDK. Thus,
> no surprise, we get the same performance numbers as DPDK. E.g. I can do
> 13.4 Mpps forwarding with ixgbe on a single CPU (more CPUs = 14.6 Mpps).
>
> We can actually use XDP for (software) offloading of the Linux routing
> table. There are two methods we are experimenting with:
>
> (1) Externally monitor route changes from userspace and update BPF maps
> to reflect them. That approach is already accepted upstream[4][5]. I'm
> measuring 9,513,746 pps per CPU with that approach.
>
> (2) Add a bpf helper to simply call fib_table_lookup() from the XDP
> hook. These are still experimental patches (credit to David Ahern), and
> I've measured 9,350,160 pps with this approach on a single CPU. Using
> more CPUs we hit 14.6 Mpps (only 3 CPUs were used in that test).

Neat. Perhaps trying XDP on the itty bitty routers I usually work on
would be a win. Quad ARM cores are increasingly common there.

> [4] https://github.com/torvalds/linux/blob/master/samples/bpf/xdp_router_ipv4_user.c
> [5] https://github.com/torvalds/linux/blob/master/samples/bpf/xdp_router_ipv4_kern.c

thx very much for the update.

> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer

-- 
Dave Täht
CEO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-669-226-2619
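To spell out the back-of-the-envelope arithmetic in this exchange (my
reading, assuming ~1514-byte MTU frames versus 66-byte pure TCP acks, a
ratio of roughly 23):

  94 Gbit/s of MTU-size frames is about 7.8 Mpps; the same packet rate
  carried as 66-byte acks is 7.8 Mpps * 66 B * 8 bit ~= 4.1 Gbit/s
  (Dave's "divide 94 by ~22").

  Per CPU without GRO: 812 Kpps * 66 B * 8 bit ~= 0.43 Gbit/s of acks;
  at the optimized ~2 Mpps per CPU: ~1.06 Gbit/s of acks ("PPS * 66").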
* Re: Linux network is damn fast, need more use XDP (Was: DC behaviors today)
From: Joel Wirāmu Pauling @ 2017-12-04 20:49 UTC
  To: Dave Taht
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Christina Jacob,
      cerowrt-devel-JXvr2/1DY2fm6VMwtOF2vx4hnT+Y9+D1@public.gmane.org,
      bloat, David Ahern, Tariq Toukan

On 5 December 2017 at 06:00, Dave Taht <dave.taht@gmail.com> wrote:
>>> The route table lookup also really expensive on the main cpu.
>
> To clarify the context here, I was asking specifically whether the X5
> Mellanox card did routing table offload or only switching.

To clarify what I know: the X5, using its smart offload engine, CAN do
L3 offload into the NIC - the X4's can't. So for the Nuage OVS ->
Eswitch magic (what Mellanox calls the flow programming) to happen and
be useful, we are going to need the X5.

Mark Iskra gave a talk at the OpenStack summit, which can be found here:
https://www.openstack.org/videos/sydney-2017/warp-speed-openvswitch-turbo-charge-vnfs-to-100gbps-in-nextgen-sdnnfv-datacenter

Slides here:
https://www.openstack.org/assets/presentation-media/OSS-Nov-2017-Warp-speed-Openvswitch-v6.pptx

Mark is local to you (Mountain View), is a nice guy, and is probably the
better person to answer the specifics.

-Joel
* Re: Linux network is damn fast, need more use XDP (Was: DC behaviors today)
From: Matthias Tafelmeier @ 2017-12-04 17:19 UTC
  To: Jesper Dangaard Brouer, Dave Taht
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Christina Jacob,
      Joel Wirāmu Pauling,
      cerowrt-devel-JXvr2/1DY2fm6VMwtOF2vx4hnT+Y9+D1@public.gmane.org,
      bloat-JXvr2/1DY2fm6VMwtOF2vx4hnT+Y9+D1, David Ahern, Tariq Toukan

Hello,

> Scaling up to more CPUs and TCP streams, Tariq[1] and I have shown that
> the Linux kernel network stack scales to 94Gbit/s (line rate minus
> overhead). But when the driver's page recycler fails, we hit bottlenecks
> in the page allocator that cause negative scaling, down to around
> 43Gbit/s.
>
> [1] http://lkml.kernel.org/r/cef85936-10b2-5d76-9f97-cb03b418fd94@mellanox.com
>
> Linux has for a _long_ time been able to do a 10Gbit/s TCP stream easily
> on a SINGLE CPU. This is mostly thanks to TSO/GRO aggregating packets,
> but over the last couple of years the network stack has been optimized
> (driven by UDP workloads), and as a result we can now do 10G without
> TSO/GRO on a single CPU. This is "only" 812Kpps with MTU-size frames.

I cannot find the reference anymore, but there was once a workshop held
by you during some netdev conference where you stated that you are
practically in rigorous exchange with NIC vendors about having them
tremendously increase the number of RX/TX rings (queues). Further, that
there are hardly any limits to that number other than FPGA magic and
physical HW; "up to millions is viable" was coined back then. May I ask
where this ended up? Wouldn't that be key for massive parallelization as
well, with a queue (producer) and a CPU (consumer), or vice versa, per
flow at the extreme? Did this end up in this SMART-NIC thingummy? The
latter is rather targeted at XDP, no?

-- 
Best regards,
Matthias Tafelmeier
* Re: [Bloat] Linux network is damn fast, need more use XDP (Was: DC behaviors today)
From: Jesper Dangaard Brouer @ 2017-12-07 8:33 UTC
  To: Matthias Tafelmeier
  Cc: Dave Taht, netdev@vger.kernel.org, Joel Wirāmu Pauling,
      David Ahern, Tariq Toukan, brouer, Björn Töpel

(Removed bloat-lists to avoid cross ML-posting)

On Mon, 4 Dec 2017 18:19:09 +0100
Matthias Tafelmeier <matthias.tafelmeier@gmx.net> wrote:

> Hello,
>
> > Scaling up to more CPUs and TCP streams, Tariq[1] and I have shown that
> > the Linux kernel network stack scales to 94Gbit/s (line rate minus
> > overhead). But when the driver's page recycler fails, we hit bottlenecks
> > in the page allocator that cause negative scaling, down to around
> > 43Gbit/s.
> >
> > [1] http://lkml.kernel.org/r/cef85936-10b2-5d76-9f97-cb03b418fd94@mellanox.com
> >
> > Linux has for a _long_ time been able to do a 10Gbit/s TCP stream easily
> > on a SINGLE CPU. This is mostly thanks to TSO/GRO aggregating packets,
> > but over the last couple of years the network stack has been optimized
> > (driven by UDP workloads), and as a result we can now do 10G without
> > TSO/GRO on a single CPU. This is "only" 812Kpps with MTU-size frames.
>
> I cannot find the reference anymore, but there was once a workshop held
> by you during some netdev conference where you stated that you are
> practically in rigorous exchange with NIC vendors about having them
> tremendously increase the number of RX/TX rings (queues).

You are mis-quoting me. I have not recommended tremendously increasing
the number of RX/TX rings (queues). Actually, we should likely decrease
the number of RX rings, per the recommendation of Eric Dumazet[1], to
increase the chance of packet aggregation/bulking during the NAPI loop,
and use something like CPUMAP[2] to re-distribute the load across CPUs.

[1] https://www.netdevconf.org/2.1/papers/BusyPollingNextGen.pdf
[2] https://git.kernel.org/torvalds/c/452606d6c9cd

You might have heard/seen me talk about increasing the ring queue size,
that is, the number of frames/pages available per RX-ring queue[3][4].
I generally don't recommend increasing that too much, as it hurts cache
usage. The real reason it sometimes helps to increase the RX-ring size
on the Intel-based NICs is that they intermix page recycling into their
RX-ring, for which I have now added a counter that shows when it
fails[5].

[3] http://netoptimizer.blogspot.dk/2014/10/unlocked-10gbps-tx-wirespeed-smallest.html
[4] http://netoptimizer.blogspot.dk/2014/06/pktgen-for-network-overload-testing.html
[5] https://git.kernel.org/torvalds/c/86e23494222f3

> Further, that there are hardly any limits to that number other than
> FPGA magic and physical HW; "up to millions is viable" was coined back
> then. May I ask where this ended up? Wouldn't that be key for massive
> parallelization as well, with a queue (producer) and a CPU (consumer),
> or vice versa, per flow at the extreme? Did this end up in this
> SMART-NIC thingummy? The latter is rather targeted at XDP, no?

I do have future plans for (wanting drivers to support) dynamically
adding more RX/TX queue pairs. The general idea is to have the NIC HW
filter packets per application into a specific NIC queue number, which
can be mapped directly into an application (and I want a queue pair, to
allow the app to TX also). I actually imagine that we can do the
application steering via XDP_REDIRECT.
And by having the application register user pages, like AF_PACKET V4,
we can achieve zero-copy into userspace from XDP. A subtle trick here is
that zero-copy only occurs if the RX-queue numbers match (XDP, operating
at the driver ring level, can know this), meaning that the NIC HW filter
setup can happen asynchronously (but pre-mapping the userspace pages
still has to happen upfront, before starting the app/socket).

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
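For readers unfamiliar with the CPUMAP idea referenced above, a minimal
sketch (my illustration, not Jesper's code; the toy flow hash, the map
size, and the "bpf_helpers.h" header path from the kernel samples are
assumptions) could look like the following: the XDP program picks a
destination CPU per flow and redirects the raw frame into that CPU's
cpumap queue, so SKB allocation and the rest of the stack run on the
chosen CPU. A userspace loader still has to populate each cpumap slot
with a queue size before the redirects take effect.

/* Sketch: spread RX processing across CPUs with BPF_MAP_TYPE_CPUMAP.
 * Illustrative only; flow hashing and sizing are assumptions. */
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include "bpf_helpers.h"

#define MAX_CPUS 64

struct bpf_map_def SEC("maps") cpu_map = {
	.type		= BPF_MAP_TYPE_CPUMAP,
	.key_size	= sizeof(__u32),
	.value_size	= sizeof(__u32),	/* per-CPU queue size, set by the loader */
	.max_entries	= MAX_CPUS,
};

SEC("xdp")
int xdp_cpu_spread(struct xdp_md *ctx)
{
	void *data     = (void *)(long)ctx->data;
	void *data_end = (void *)(long)ctx->data_end;
	struct ethhdr *eth = data;
	struct iphdr *iph  = data + sizeof(*eth);
	__u32 cpu;

	if ((void *)(iph + 1) > data_end)
		return XDP_PASS;

	/* Toy per-flow spreading: hash only on the IPv4 addresses.  Real
	 * code would check eth->h_proto and hash the full 5-tuple. */
	cpu = (iph->saddr ^ iph->daddr) % MAX_CPUS;

	/* Queue the frame to the chosen CPU's cpumap entry; the remote CPU
	 * builds the SKB and continues normal stack processing. */
	return bpf_redirect_map(&cpu_map, cpu, 0);
}

char _license[] SEC("license") = "GPL";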
* Re: [Bloat] Linux network is damn fast, need more use XDP (Was: DC behaviors today)
From: Matthias Tafelmeier @ 2017-12-07 18:50 UTC
  To: Jesper Dangaard Brouer
  Cc: Dave Taht, netdev@vger.kernel.org, Joel Wirāmu Pauling,
      David Ahern, Tariq Toukan, Björn Töpel

That's the discussion I meant:
https://www.youtube.com/watch?v=vsjxgOpv1n8

My apologies if I have twisted your words in any respect. Also, I
haven't rewatched it; I'm only putting it here for completeness' sake.

> You are mis-quoting me. I have not recommended tremendously increasing
> the number of RX/TX rings (queues). Actually, we should likely decrease
> the number of RX rings, per the recommendation of Eric Dumazet[1], to
> increase the chance of packet aggregation/bulking during the NAPI loop,
> and use something like CPUMAP[2] to re-distribute the load across CPUs.
>
> [1] https://www.netdevconf.org/2.1/papers/BusyPollingNextGen.pdf
> [2] https://git.kernel.org/torvalds/c/452606d6c9cd

Certainly so for throughput optimizations. Allow me to defensively
qualify that by saying that driving down latency still depends quite a
bit on scaling out rings, at least that's what experience tells me for
the NAPI-based approach. It should hold symmetrically for busy polling,
though I'm only theorizing there.

> You might have heard/seen me talk about increasing the ring queue size,
> that is, the number of frames/pages available per RX-ring queue[3][4].
> I generally don't recommend increasing that too much, as it hurts cache
> usage. The real reason it sometimes helps to increase the RX-ring size
> on the Intel-based NICs is that they intermix page recycling into their
> RX-ring, for which I have now added a counter that shows when it
> fails[5].
>
> [3] http://netoptimizer.blogspot.dk/2014/10/unlocked-10gbps-tx-wirespeed-smallest.html
> [4] http://netoptimizer.blogspot.dk/2014/06/pktgen-for-network-overload-testing.html
> [5] https://git.kernel.org/torvalds/c/86e23494222f3

Presumably, touching the ring length should be obsolete ever since DQL
and BQL anyway.

>> Further, that there are hardly any limits to that number other than
>> FPGA magic and physical HW; "up to millions is viable" was coined back
>> then. May I ask where this ended up? Wouldn't that be key for massive
>> parallelization as well, with a queue (producer) and a CPU (consumer),
>> or vice versa, per flow at the extreme? Did this end up in this
>> SMART-NIC thingummy? The latter is rather targeted at XDP, no?
>
> I do have future plans for (wanting drivers to support) dynamically
> adding more RX/TX queue pairs. The general idea is to have the NIC HW
> filter packets per application into a specific NIC queue number, which
> can be mapped directly into an application (and I want a queue pair, to
> allow the app to TX also). I actually imagine that we can do the
> application steering via XDP_REDIRECT. And by having the application
> register user pages, like AF_PACKET V4, we can achieve zero-copy into
> userspace from XDP.

I understand: working on a sort of in-kernel "virtual" TX/RX queue pair
per flow/application.

> A subtle trick here is that zero-copy only occurs if the RX-queue
> numbers match (XDP, operating at the driver ring level, can know this),
> meaning that the NIC HW filter setup can happen asynchronously (but
> pre-mapping the userspace pages still has to happen upfront, before
> starting the app/socket).
I see, a more sophisticated, flexible RPS then. That was overdue. Very
much appreciated, thanks!

-- 
Best regards,
Matthias Tafelmeier