From: Jesper Dangaard Brouer <brouer@redhat.com>
To: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Tom Herbert <tom@herbertland.com>,
Or Gerlitz <gerlitz.or@gmail.com>,
David Miller <davem@davemloft.net>,
Eric Dumazet <edumazet@google.com>,
Linux Netdev List <netdev@vger.kernel.org>,
Alexander Duyck <alexander.duyck@gmail.com>,
Alexei Starovoitov <alexei.starovoitov@gmail.com>,
Daniel Borkmann <borkmann@iogearbox.net>,
Marek Majkowski <marek@cloudflare.com>,
Hannes Frederic Sowa <hannes@stressinduktion.org>,
Florian Westphal <fw@strlen.de>, Paolo Abeni <pabeni@redhat.com>,
John Fastabend <john.r.fastabend@intel.com>,
Amir Vadai <amirva@gmail.com>,
brouer@redhat.com
Subject: Re: Optimizing instruction-cache, more packets at each stage
Date: Fri, 22 Jan 2016 13:33:41 +0100 [thread overview]
Message-ID: <20160122133341.0f174115@redhat.com> (raw)
In-Reply-To: <1453398516.1223.376.camel@edumazet-glaptop2.roam.corp.google.com>
On Thu, 21 Jan 2016 09:48:36 -0800
Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Thu, 2016-01-21 at 08:38 -0800, Tom Herbert wrote:
>
> > Sure, but the receive path is parallelized.
>
> This is true for multiqueue processing, assuming you can dedicate many
> cores to process RX.
>
> > Improving parallelism has
> > continuously shown to have much more impact than attempting to
> > optimize for cache misses. The primary goal is not to drive 100Gbps
> > with 64 packets from a single CPU. It is one benchmark of many we
> > should look at to measure efficiency of the data path, but I've yet to
> > see any real workload that requires that...
> >
> > Regardless of anything, we need to load packet headers into CPU cache
> > to do protocol processing. I'm not sure I see how trying to defer that
> > as long as possible helps except in cases where the packet is crossing
> > CPU cache boundaries and can eliminate cache misses completely (not
> > just move them around from one function to another).
>
> Note that some user space use multiple core (or hyper threads) to
> implement a pipeline, using a single RX queue.
>
> One thread can handle one stage (device RX drain) and prefetch data into
> shared L1/L2 (and/or shared L3 for pipelines with more than 2 threads)
>
> The second thread process packets with headers already in L1/L2
I agree. I've heard experiences where DPDK users use 2 core for RX, and
1 core for TX, and achieve 10G wirespeed (14Mpps) real IPv4 forwarding
with full Internet routing table look up.
One of the ideas behind my alf_queue, is that it can be used for
efficiently distributing object (pointers) between threads.
1. because it only transfers the pointers (not touching object), and
2. because it enqueue/dequeue multiple objects with a single locked cmpxchg.
Thus, lower in the message passing cost between threads.
> This way, the ~100 ns (or even more if you also consider skb
> allocations) penalty to bring packet headers do not hurt PPS.
I've studied the allocation cost in great detail, thus let me share my
numbers, 100 ns is too high:
Total cost of alloc+free for 256 byte objects (on CPU i7-4790K @ 4.00GHz).
The cycles count should be comparable with other CPUs, but that nanosec
measurement is affected by the very high clock freq of this CPU.
Kmem_cache fastpath "recycle" case:
SLUB => 44 cycles(tsc) 11.205 ns
SLAB => 96 cycles(tsc) 24.119 ns.
The problem is that real use-cases in the network stack, almost always
hit the slowpath in kmem_cache allocators.
Kmem_cache "slowpath" case:
SLUB => 117 cycles(tsc) 29.276 ns
SLAB => 101 cycles(tsc) 25.342 ns
I've addressed this "slowpath" problem in the SLUB and SLAB allocators,
by introducing a bulk API, which amortize the needed sync-mechanisms.
Kmem_cache using bulk API:
SLUB => 37 cycles(tsc) 9.280 ns
SLAB => 20 cycles(tsc) 5.035 ns
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
Author of http://www.iptv-analyzer.org
LinkedIn: http://www.linkedin.com/in/brouer
next prev parent reply other threads:[~2016-01-22 12:33 UTC|newest]
Thread overview: 59+ messages / expand[flat|nested] mbox.gz Atom feed top
2016-01-15 13:22 Optimizing instruction-cache, more packets at each stage Jesper Dangaard Brouer
2016-01-15 13:32 ` Hannes Frederic Sowa
2016-01-15 14:17 ` Jesper Dangaard Brouer
2016-01-15 13:36 ` David Laight
2016-01-15 14:00 ` Jesper Dangaard Brouer
2016-01-15 14:38 ` Felix Fietkau
2016-01-18 11:54 ` Jesper Dangaard Brouer
2016-01-18 17:01 ` Eric Dumazet
2016-01-25 0:08 ` Florian Fainelli
2016-01-15 20:47 ` David Miller
2016-01-18 10:27 ` Jesper Dangaard Brouer
2016-01-18 16:24 ` David Miller
2016-01-20 22:20 ` Or Gerlitz
2016-01-20 23:02 ` Eric Dumazet
2016-01-20 23:27 ` Tom Herbert
2016-01-21 11:27 ` Jesper Dangaard Brouer
2016-01-21 12:49 ` Or Gerlitz
2016-01-21 13:57 ` Jesper Dangaard Brouer
2016-01-21 18:56 ` David Miller
2016-01-21 22:45 ` Or Gerlitz
2016-01-21 22:59 ` David Miller
2016-01-21 16:38 ` Eric Dumazet
2016-01-21 18:54 ` David Miller
2016-01-24 14:28 ` Jesper Dangaard Brouer
2016-01-24 14:44 ` Michael S. Tsirkin
2016-01-24 17:28 ` John Fastabend
2016-01-25 13:15 ` Bypass at packet-page level (Was: Optimizing instruction-cache, more packets at each stage) Jesper Dangaard Brouer
2016-01-25 17:09 ` Tom Herbert
2016-01-25 17:50 ` John Fastabend
2016-01-25 21:32 ` Tom Herbert
2016-01-25 21:58 ` John Fastabend
2016-01-25 22:10 ` Jesper Dangaard Brouer
2016-01-27 20:47 ` Jesper Dangaard Brouer
2016-01-27 21:56 ` Alexei Starovoitov
2016-01-28 9:52 ` Jesper Dangaard Brouer
2016-01-28 12:54 ` Eric Dumazet
2016-01-28 13:25 ` Eric Dumazet
2016-01-28 16:43 ` Tom Herbert
2016-01-28 2:50 ` Tom Herbert
2016-01-28 9:25 ` Jesper Dangaard Brouer
2016-01-28 12:45 ` Eric Dumazet
2016-01-28 16:37 ` Tom Herbert
2016-01-28 16:43 ` Eric Dumazet
2016-01-28 17:04 ` Jesper Dangaard Brouer
2016-01-24 20:09 ` Optimizing instruction-cache, more packets at each stage Tom Herbert
2016-01-24 21:41 ` John Fastabend
2016-01-24 23:50 ` Tom Herbert
2016-01-21 12:23 ` Jesper Dangaard Brouer
2016-01-21 16:38 ` Tom Herbert
2016-01-21 17:48 ` Eric Dumazet
2016-01-22 12:33 ` Jesper Dangaard Brouer [this message]
2016-01-22 14:33 ` Eric Dumazet
2016-01-22 17:07 ` Tom Herbert
2016-01-22 17:17 ` Jesper Dangaard Brouer
2016-02-02 16:13 ` Or Gerlitz
2016-02-02 16:37 ` Eric Dumazet
2016-01-18 16:53 ` Eric Dumazet
2016-01-18 17:36 ` Tom Herbert
2016-01-18 17:49 ` Jesper Dangaard Brouer
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20160122133341.0f174115@redhat.com \
--to=brouer@redhat.com \
--cc=alexander.duyck@gmail.com \
--cc=alexei.starovoitov@gmail.com \
--cc=amirva@gmail.com \
--cc=borkmann@iogearbox.net \
--cc=davem@davemloft.net \
--cc=edumazet@google.com \
--cc=eric.dumazet@gmail.com \
--cc=fw@strlen.de \
--cc=gerlitz.or@gmail.com \
--cc=hannes@stressinduktion.org \
--cc=john.r.fastabend@intel.com \
--cc=marek@cloudflare.com \
--cc=netdev@vger.kernel.org \
--cc=pabeni@redhat.com \
--cc=tom@herbertland.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.