* Linux network is damn fast, need more use XDP (Was: DC behaviors today)
From: Jesper Dangaard Brouer @ 2017-12-04 10:56 UTC
  To: Dave Taht
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
      bloat-JXvr2/1DY2fm6VMwtOF2vx4hnT+Y9+D1, Christina Jacob,
      Joel Wirāmu Pauling,
      cerowrt-devel-JXvr2/1DY2fm6VMwtOF2vx4hnT+Y9+D1@public.gmane.org,
      David Ahern, Tariq Toukan

On Sun, 03 Dec 2017 20:19:33 -0800 Dave Taht <dave@taht.net> wrote:

> Changing the topic, adding bloat.

Adding netdev, and also adjusting the topic to be a rant about how the
Linux kernel network stack is actually damn fast, and if you need
something faster then XDP can solve your needs...

> Joel Wirāmu Pauling <joel@aenertia.net> writes:
>
> > Just from a Telco/Industry perspective slant.
> >
> > Everything in DC has moved to SFP28 interfaces at 25Gbit as the server
> > port of interconnect. Everything TOR wise is now QSFP28 - 100Gbit.
> > Mellanox X5 cards are the current hotness, and their offload
> > enhancements (ASAP2 - which is sorta like DPDK on steroids) allows for
> > OVS flow rules programming into the card. We have a lot of customers
> > chomping at the bit for that feature (disclaimer I work for Nuage
> > Networks, and we are working on enhanced OVS to do just that) for NFV
> > workloads.
>
> What Jesper's been working on for ages has been to try and get linux's
> PPS up for small packets, which last I heard was hovering at about
> 4Gbits.

I hope you made a typo here, Dave; the normal Linux kernel is definitely
way beyond 4Gbit/s. You must have misunderstood something; maybe you
meant 40Gbit/s? (which is also too low)

Scaling up to more CPUs and TCP streams, Tariq[1] and I have shown that
the Linux kernel network stack scales to 94Gbit/s (line rate minus
overhead). But when the driver's page recycler fails, we hit bottlenecks
in the page allocator that cause negative scaling, down to around
43Gbit/s.

[1] http://lkml.kernel.org/r/cef85936-10b2-5d76-9f97-cb03b418fd94@mellanox.com

Linux has for a _long_ time been able to do a 10Gbit/s TCP stream easily
on a SINGLE CPU. This is mostly thanks to TSO/GRO aggregating packets,
but over the last couple of years the network stack has been optimized
(driven by UDP workloads), and as a result we can now do 10G without
TSO/GRO on a single CPU. This is "only" 812Kpps with MTU-size frames.

It is important to NOTICE that I'm mostly talking about SINGLE-CPU
performance. The Linux kernel scales very well to more CPUs, and you can
scale this up, although we are starting to hit scalability issues in
MM-land[1].

I've also demonstrated that the netdev community has optimized the
kernel's per-CPU processing power to around 2Mpps. What does this really
mean... well, with MTU-size packets 812Kpps was 10Gbit/s, thus 25Gbit/s
should be around 2Mpps. That implies Linux can do 25Gbit/s on a single
CPU without GRO (MTU-size frames). Do you need more, I ask?

> The route table lookup also really expensive on the main cpu.

Well, it used to be very expensive. Vincent Bernat wrote some excellent
blog posts[2][3] on the recent improvements across kernel versions, and
gave due credit to the people involved.
[2] https://vincent.bernat.im/en/blog/2017-performance-progression-ipv4-route-lookup-linux
[3] https://vincent.bernat.im/en/blog/2017-performance-progression-ipv6-route-lookup-linux

He measured around 25 to 35 nanoseconds of cost per route lookup. My own
recent measurements put the cost of fib_table_lookup at 36.9 ns.

> Does this stuff offload the route table lookup also?

If you have not heard, the netdev community has worked on something
called XDP (eXpress Data Path). This is a new layer in the network
stack that basically operates at the same "layer"/level as DPDK. Thus,
no surprise, we get the same performance numbers as DPDK. E.g. I can do
13.4 Mpps forwarding with ixgbe on a single CPU (more CPUs = 14.6 Mpps).

We can actually use XDP for (software) offloading of the Linux routing
table. There are two methods we are experimenting with:

(1) Externally monitor route changes from userspace and update BPF maps
to reflect them. That approach is already accepted upstream[4][5]. I'm
measuring 9,513,746 pps per CPU with that approach.

(2) Add a bpf helper to simply call fib_table_lookup() from the XDP
hook. These are still experimental patches (credit to David Ahern), and
I've measured 9,350,160 pps with this approach on a single CPU. Using
more CPUs we hit 14.6 Mpps (only 3 CPUs were used in that test).

[4] https://github.com/torvalds/linux/blob/master/samples/bpf/xdp_router_ipv4_user.c
[5] https://github.com/torvalds/linux/blob/master/samples/bpf/xdp_router_ipv4_kern.c

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
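As a rough illustration of method (1) above, a minimal XDP program of
this shape can be sketched as below. This is not the upstream sample in
[4][5]; the map layout, the names, and the "bpf_helpers.h"/"bpf_endian.h"
header paths (as found under the kernel's samples/selftests) are
assumptions for illustration. A userspace daemon is assumed to watch
netlink route updates and keep the LPM-trie map in sync.

/* Hypothetical sketch: XDP route "offload" via a BPF LPM-trie map
 * populated from userspace.  Illustrative only, not the upstream sample. */
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include "bpf_helpers.h"
#include "bpf_endian.h"

struct route_key {
	__u32 prefixlen;	/* LPM-trie keys start with the prefix length */
	__u32 daddr;		/* IPv4 destination, network byte order */
};

struct route_val {
	__u32 ifindex;		/* egress device chosen by the userspace daemon */
};

struct bpf_map_def SEC("maps") route_map = {
	.type		= BPF_MAP_TYPE_LPM_TRIE,
	.key_size	= sizeof(struct route_key),
	.value_size	= sizeof(struct route_val),
	.max_entries	= 1024,
	.map_flags	= BPF_F_NO_PREALLOC,	/* required for LPM tries */
};

SEC("xdp")
int xdp_route_sketch(struct xdp_md *ctx)
{
	void *data     = (void *)(long)ctx->data;
	void *data_end = (void *)(long)ctx->data_end;
	struct ethhdr *eth = data;
	struct iphdr *iph  = data + sizeof(*eth);
	struct route_key key;
	struct route_val *val;

	if ((void *)(iph + 1) > data_end)
		return XDP_PASS;
	if (eth->h_proto != bpf_htons(ETH_P_IP))
		return XDP_PASS;

	key.prefixlen = 32;		/* longest-prefix match on the full address */
	key.daddr     = iph->daddr;

	val = bpf_map_lookup_elem(&route_map, &key);
	if (!val)
		return XDP_PASS;	/* unknown route: fall back to the normal stack */

	/* A real forwarder would also rewrite the MAC addresses and
	 * decrement the TTL before redirecting. */
	return bpf_redirect(val->ifindex, 0);
}

char _license[] SEC("license") = "GPL";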
* Re: [Bloat] Linux network is damn fast, need more use XDP (Was: DC behaviors today)
From: Dave Taht @ 2017-12-04 17:00 UTC
  To: Jesper Dangaard Brouer
  Cc: netdev@vger.kernel.org, Christina Jacob, Joel Wirāmu Pauling,
      cerowrt-devel@lists.bufferbloat.net, bloat, David Ahern,
      Tariq Toukan

Jesper: I have a tendency to deal with netdev by itself and never cross
post there, as the bufferbloat.net servers (primarily to combat spam)
mandate starttls and vger doesn't support it at all, thus leading to
raising davem's blood pressure, which I'd rather not do. But moving on...

On Mon, Dec 4, 2017 at 2:56 AM, Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
>
> On Sun, 03 Dec 2017 20:19:33 -0800 Dave Taht <dave@taht.net> wrote:
>
>> Changing the topic, adding bloat.
>
> Adding netdev, and also adjusting the topic to be a rant about how the
> Linux kernel network stack is actually damn fast, and if you need
> something faster then XDP can solve your needs...
>
>> Joel Wirāmu Pauling <joel@aenertia.net> writes:
>>
>> > Just from a Telco/Industry perspective slant.
>> >
>> > Everything in DC has moved to SFP28 interfaces at 25Gbit as the server
>> > port of interconnect. Everything TOR wise is now QSFP28 - 100Gbit.
>> > Mellanox X5 cards are the current hotness, and their offload
>> > enhancements (ASAP2 - which is sorta like DPDK on steroids) allows for
>> > OVS flow rules programming into the card. We have a lot of customers
>> > chomping at the bit for that feature (disclaimer I work for Nuage
>> > Networks, and we are working on enhanced OVS to do just that) for NFV
>> > workloads.
>>
>> What Jesper's been working on for ages has been to try and get linux's
>> PPS up for small packets, which last I heard was hovering at about
>> 4Gbits.
>
> I hope you made a typo here, Dave; the normal Linux kernel is definitely
> way beyond 4Gbit/s. You must have misunderstood something; maybe you
> meant 40Gbit/s? (which is also too low)

The context here was PPS for *non-gro'd* tcp ack packets, in the further
context of the increasingly epic "benefits of ack filtering" thread on
the bloat list, in the context that for 50x1 end-user asymmetry we were
seeing 90% fewer acks with the new sch_cake ack-filter code, and double
the throughput... The kind of return traffic you see from data sent
outside the DC, with tons of flows. What's that number?

> Scaling up to more CPUs and TCP streams, Tariq[1] and I have shown that
> the Linux kernel network stack scales to 94Gbit/s (line rate minus
> overhead). But when the driver's page recycler fails, we hit bottlenecks
> in the page allocator that cause negative scaling, down to around
> 43Gbit/s.

So I divide 94 by ~22 and get 4Gbit for acks. Or I look at PPS * 66. Or?

> [1] http://lkml.kernel.org/r/cef85936-10b2-5d76-9f97-cb03b418fd94@mellanox.com
>
> Linux has for a _long_ time been able to do a 10Gbit/s TCP stream easily
> on a SINGLE CPU. This is mostly thanks to TSO/GRO aggregating packets,
> but over the last couple of years the network stack has been optimized
> (driven by UDP workloads), and as a result we can now do 10G without
> TSO/GRO on a single CPU. This is "only" 812Kpps with MTU-size frames.

acks.
> It is important to NOTICE that I'm mostly talking about SINGLE-CPU
> performance. The Linux kernel scales very well to more CPUs, and you can
> scale this up, although we are starting to hit scalability issues in
> MM-land[1].
>
> I've also demonstrated that the netdev community has optimized the
> kernel's per-CPU processing power to around 2Mpps. What does this really
> mean... well, with MTU-size packets 812Kpps was 10Gbit/s, thus 25Gbit/s
> should be around 2Mpps. That implies Linux can do 25Gbit/s on a single
> CPU without GRO (MTU-size frames). Do you need more, I ask?

The benchmark I had in mind was, say, 100k flows going out over the
internet, and the characteristics of the ack flows on the return path.

>> The route table lookup also really expensive on the main cpu.

To clarify the context here, I was asking specifically whether the X5
Mellanox card did routing table offload or only switching.

> Well, it used to be very expensive. Vincent Bernat wrote some excellent
> blog posts[2][3] on the recent improvements across kernel versions, and
> gave due credit to the people involved.
>
> [2] https://vincent.bernat.im/en/blog/2017-performance-progression-ipv4-route-lookup-linux
> [3] https://vincent.bernat.im/en/blog/2017-performance-progression-ipv6-route-lookup-linux
>
> He measured around 25 to 35 nanoseconds of cost per route lookup. My own
> recent measurements put the cost of fib_table_lookup at 36.9 ns.

On intel hw.

>> Does this stuff offload the route table lookup also?
>
> If you have not heard, the netdev community has worked on something
> called XDP (eXpress Data Path). This is a new layer in the network
> stack that basically operates at the same "layer"/level as DPDK. Thus,
> no surprise, we get the same performance numbers as DPDK. E.g. I can do
> 13.4 Mpps forwarding with ixgbe on a single CPU (more CPUs = 14.6 Mpps).
>
> We can actually use XDP for (software) offloading of the Linux routing
> table. There are two methods we are experimenting with:
>
> (1) Externally monitor route changes from userspace and update BPF maps
> to reflect them. That approach is already accepted upstream[4][5]. I'm
> measuring 9,513,746 pps per CPU with that approach.
>
> (2) Add a bpf helper to simply call fib_table_lookup() from the XDP
> hook. These are still experimental patches (credit to David Ahern), and
> I've measured 9,350,160 pps with this approach on a single CPU. Using
> more CPUs we hit 14.6 Mpps (only 3 CPUs were used in that test).

Neat. Perhaps trying XDP on the itty bitty routers I usually work on
would be a win. Quad ARM cores are increasingly common there.

> [4] https://github.com/torvalds/linux/blob/master/samples/bpf/xdp_router_ipv4_user.c
> [5] https://github.com/torvalds/linux/blob/master/samples/bpf/xdp_router_ipv4_kern.c

thx very much for the update.

> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer

-- 
Dave Täht
CEO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-669-226-2619
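To spell out the back-of-the-envelope arithmetic in this exchange (my
reading, assuming ~1514-byte MTU frames versus 66-byte pure TCP acks, a
ratio of roughly 23):

  94 Gbit/s of MTU-size frames is about 7.8 Mpps; the same packet rate
  carried as 66-byte acks is 7.8 Mpps * 66 B * 8 bit ~= 4.1 Gbit/s
  (Dave's "divide 94 by ~22").

  Per CPU without GRO: 812 Kpps * 66 B * 8 bit ~= 0.43 Gbit/s of acks;
  at the optimized ~2 Mpps per CPU: ~1.06 Gbit/s of acks ("PPS * 66").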
* Re: Linux network is damn fast, need more use XDP (Was: DC behaviors today)
From: Joel Wirāmu Pauling @ 2017-12-04 20:49 UTC
  To: Dave Taht
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Christina Jacob,
      cerowrt-devel-JXvr2/1DY2fm6VMwtOF2vx4hnT+Y9+D1@public.gmane.org,
      bloat, David Ahern, Tariq Toukan

On 5 December 2017 at 06:00, Dave Taht <dave.taht@gmail.com> wrote:
>>> The route table lookup also really expensive on the main cpu.
>
> To clarify the context here, I was asking specifically whether the X5
> Mellanox card did routing table offload or only switching.

To clarify what I know: the X5, using its smart offload engine, CAN do
L3 offload into the NIC - the X4's can't. So for the Nuage OVS ->
Eswitch magic (what Mellanox calls the flow programming) to happen and
be useful, we are going to need the X5.

Mark Iskra gave a talk at the OpenStack summit, which can be found here:
https://www.openstack.org/videos/sydney-2017/warp-speed-openvswitch-turbo-charge-vnfs-to-100gbps-in-nextgen-sdnnfv-datacenter

Slides here:
https://www.openstack.org/assets/presentation-media/OSS-Nov-2017-Warp-speed-Openvswitch-v6.pptx

Mark is local to you (Mountain View), is a nice guy, and is probably the
better person to answer the specifics.

-Joel
* Re: Linux network is damn fast, need more use XDP (Was: DC behaviors today)
From: Matthias Tafelmeier @ 2017-12-04 17:19 UTC
  To: Jesper Dangaard Brouer, Dave Taht
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Christina Jacob,
      Joel Wirāmu Pauling,
      cerowrt-devel-JXvr2/1DY2fm6VMwtOF2vx4hnT+Y9+D1@public.gmane.org,
      bloat-JXvr2/1DY2fm6VMwtOF2vx4hnT+Y9+D1, David Ahern, Tariq Toukan

Hello,

> Scaling up to more CPUs and TCP streams, Tariq[1] and I have shown that
> the Linux kernel network stack scales to 94Gbit/s (line rate minus
> overhead). But when the driver's page recycler fails, we hit bottlenecks
> in the page allocator that cause negative scaling, down to around
> 43Gbit/s.
>
> [1] http://lkml.kernel.org/r/cef85936-10b2-5d76-9f97-cb03b418fd94@mellanox.com
>
> Linux has for a _long_ time been able to do a 10Gbit/s TCP stream easily
> on a SINGLE CPU. This is mostly thanks to TSO/GRO aggregating packets,
> but over the last couple of years the network stack has been optimized
> (driven by UDP workloads), and as a result we can now do 10G without
> TSO/GRO on a single CPU. This is "only" 812Kpps with MTU-size frames.

I cannot find the reference anymore, but there was once a workshop held
by you during some netdev conference where you stated that you are
practically in rigorous exchange with NIC vendors about having them
tremendously increase the number of RX/TX rings (queues). Further, that
there are hardly any limits to that number other than FPGA magic and
physical HW; "up to millions is viable" was coined back then. May I ask
where this ended up? Wouldn't that be key for massive parallelization as
well, with a queue (producer) and a CPU (consumer), or vice versa, per
flow at the extreme? Did this end up in this SMART-NIC thingummy? The
latter is rather targeted at XDP, no?

-- 
Best regards,
Matthias Tafelmeier
* Re: [Bloat] Linux network is damn fast, need more use XDP (Was: DC behaviors today)
From: Jesper Dangaard Brouer @ 2017-12-07 8:33 UTC
  To: Matthias Tafelmeier
  Cc: Dave Taht, netdev@vger.kernel.org, Joel Wirāmu Pauling,
      David Ahern, Tariq Toukan, brouer, Björn Töpel

(Removed bloat-lists to avoid cross ML-posting)

On Mon, 4 Dec 2017 18:19:09 +0100
Matthias Tafelmeier <matthias.tafelmeier@gmx.net> wrote:

> Hello,
>
> > Scaling up to more CPUs and TCP streams, Tariq[1] and I have shown that
> > the Linux kernel network stack scales to 94Gbit/s (line rate minus
> > overhead). But when the driver's page recycler fails, we hit bottlenecks
> > in the page allocator that cause negative scaling, down to around
> > 43Gbit/s.
> >
> > [1] http://lkml.kernel.org/r/cef85936-10b2-5d76-9f97-cb03b418fd94@mellanox.com
> >
> > Linux has for a _long_ time been able to do a 10Gbit/s TCP stream easily
> > on a SINGLE CPU. This is mostly thanks to TSO/GRO aggregating packets,
> > but over the last couple of years the network stack has been optimized
> > (driven by UDP workloads), and as a result we can now do 10G without
> > TSO/GRO on a single CPU. This is "only" 812Kpps with MTU-size frames.
>
> I cannot find the reference anymore, but there was once a workshop held
> by you during some netdev conference where you stated that you are
> practically in rigorous exchange with NIC vendors about having them
> tremendously increase the number of RX/TX rings (queues).

You are mis-quoting me. I have not recommended tremendously increasing
the number of RX/TX rings (queues). Actually, we should likely decrease
the number of RX rings, per the recommendation of Eric Dumazet[1], to
increase the chance of packet aggregation/bulking during the NAPI loop,
and use something like CPUMAP[2] to re-distribute the load across CPUs.

[1] https://www.netdevconf.org/2.1/papers/BusyPollingNextGen.pdf
[2] https://git.kernel.org/torvalds/c/452606d6c9cd

You might have heard/seen me talk about increasing the ring queue size,
that is, the number of frames/pages available per RX-ring queue[3][4].
I generally don't recommend increasing that too much, as it hurts cache
usage. The real reason it sometimes helps to increase the RX-ring size
on the Intel-based NICs is that they intermix page recycling into their
RX-ring, for which I have now added a counter that shows when it
fails[5].

[3] http://netoptimizer.blogspot.dk/2014/10/unlocked-10gbps-tx-wirespeed-smallest.html
[4] http://netoptimizer.blogspot.dk/2014/06/pktgen-for-network-overload-testing.html
[5] https://git.kernel.org/torvalds/c/86e23494222f3

> Further, that there are hardly any limits to that number other than
> FPGA magic and physical HW; "up to millions is viable" was coined back
> then. May I ask where this ended up? Wouldn't that be key for massive
> parallelization as well, with a queue (producer) and a CPU (consumer),
> or vice versa, per flow at the extreme? Did this end up in this
> SMART-NIC thingummy? The latter is rather targeted at XDP, no?

I do have future plans for (wanting drivers to support) dynamically
adding more RX/TX queue pairs. The general idea is to have the NIC HW
filter packets per application into a specific NIC queue number, which
can be mapped directly into an application (and I want a queue pair, to
allow the app to TX also). I actually imagine that we can do the
application steering via XDP_REDIRECT.
And by having the application register user pages, like AF_PACKET V4,
we can achieve zero-copy into userspace from XDP. A subtle trick here is
that zero-copy only occurs if the RX-queue numbers match (XDP, operating
at the driver ring level, can know this), meaning that the NIC HW filter
setup can happen asynchronously (but pre-mapping the userspace pages
still has to happen upfront, before starting the app/socket).

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
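For readers unfamiliar with the CPUMAP idea referenced above, a minimal
sketch (my illustration, not Jesper's code; the toy flow hash, the map
size, and the "bpf_helpers.h" header path from the kernel samples are
assumptions) could look like the following: the XDP program picks a
destination CPU per flow and redirects the raw frame into that CPU's
cpumap queue, so SKB allocation and the rest of the stack run on the
chosen CPU. A userspace loader still has to populate each cpumap slot
with a queue size before the redirects take effect.

/* Sketch: spread RX processing across CPUs with BPF_MAP_TYPE_CPUMAP.
 * Illustrative only; flow hashing and sizing are assumptions. */
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include "bpf_helpers.h"

#define MAX_CPUS 64

struct bpf_map_def SEC("maps") cpu_map = {
	.type		= BPF_MAP_TYPE_CPUMAP,
	.key_size	= sizeof(__u32),
	.value_size	= sizeof(__u32),	/* per-CPU queue size, set by the loader */
	.max_entries	= MAX_CPUS,
};

SEC("xdp")
int xdp_cpu_spread(struct xdp_md *ctx)
{
	void *data     = (void *)(long)ctx->data;
	void *data_end = (void *)(long)ctx->data_end;
	struct ethhdr *eth = data;
	struct iphdr *iph  = data + sizeof(*eth);
	__u32 cpu;

	if ((void *)(iph + 1) > data_end)
		return XDP_PASS;

	/* Toy per-flow spreading: hash only on the IPv4 addresses.  Real
	 * code would check eth->h_proto and hash the full 5-tuple. */
	cpu = (iph->saddr ^ iph->daddr) % MAX_CPUS;

	/* Queue the frame to the chosen CPU's cpumap entry; the remote CPU
	 * builds the SKB and continues normal stack processing. */
	return bpf_redirect_map(&cpu_map, cpu, 0);
}

char _license[] SEC("license") = "GPL";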
* Re: [Bloat] Linux network is damn fast, need more use XDP (Was: DC behaviors today)
From: Matthias Tafelmeier @ 2017-12-07 18:50 UTC
  To: Jesper Dangaard Brouer
  Cc: Dave Taht, netdev@vger.kernel.org, Joel Wirāmu Pauling,
      David Ahern, Tariq Toukan, Björn Töpel

That's the discussion I meant:
https://www.youtube.com/watch?v=vsjxgOpv1n8

My apologies if I have twisted your words in any respect. Also, I
haven't rewatched it; I'm only putting it here for completeness' sake.

> You are mis-quoting me. I have not recommended tremendously increasing
> the number of RX/TX rings (queues). Actually, we should likely decrease
> the number of RX rings, per the recommendation of Eric Dumazet[1], to
> increase the chance of packet aggregation/bulking during the NAPI loop,
> and use something like CPUMAP[2] to re-distribute the load across CPUs.
>
> [1] https://www.netdevconf.org/2.1/papers/BusyPollingNextGen.pdf
> [2] https://git.kernel.org/torvalds/c/452606d6c9cd

Certainly so for throughput optimizations. Allow me to defensively
qualify that by saying that driving down latency still depends quite a
bit on scaling out rings, at least that's what experience tells me for
the NAPI-based approach. It should hold symmetrically for busy polling,
though I'm only theorizing there.

> You might have heard/seen me talk about increasing the ring queue size,
> that is, the number of frames/pages available per RX-ring queue[3][4].
> I generally don't recommend increasing that too much, as it hurts cache
> usage. The real reason it sometimes helps to increase the RX-ring size
> on the Intel-based NICs is that they intermix page recycling into their
> RX-ring, for which I have now added a counter that shows when it
> fails[5].
>
> [3] http://netoptimizer.blogspot.dk/2014/10/unlocked-10gbps-tx-wirespeed-smallest.html
> [4] http://netoptimizer.blogspot.dk/2014/06/pktgen-for-network-overload-testing.html
> [5] https://git.kernel.org/torvalds/c/86e23494222f3

Presumably, touching the ring length should be obsolete ever since DQL
and BQL anyway.

>> Further, that there are hardly any limits to that number other than
>> FPGA magic and physical HW; "up to millions is viable" was coined back
>> then. May I ask where this ended up? Wouldn't that be key for massive
>> parallelization as well, with a queue (producer) and a CPU (consumer),
>> or vice versa, per flow at the extreme? Did this end up in this
>> SMART-NIC thingummy? The latter is rather targeted at XDP, no?
>
> I do have future plans for (wanting drivers to support) dynamically
> adding more RX/TX queue pairs. The general idea is to have the NIC HW
> filter packets per application into a specific NIC queue number, which
> can be mapped directly into an application (and I want a queue pair, to
> allow the app to TX also). I actually imagine that we can do the
> application steering via XDP_REDIRECT. And by having the application
> register user pages, like AF_PACKET V4, we can achieve zero-copy into
> userspace from XDP.

I understand: working on a sort of in-kernel "virtual" TX/RX queue pair
per flow/application.

> A subtle trick here is that zero-copy only occurs if the RX-queue
> numbers match (XDP, operating at the driver ring level, can know this),
> meaning that the NIC HW filter setup can happen asynchronously (but
> pre-mapping the userspace pages still has to happen upfront, before
> starting the app/socket).
I see, a more sophisticated, flexible RPS then. That was overdue. Very
much appreciated, thanks!

-- 
Best regards,
Matthias Tafelmeier