From: John Fastabend <john.fastabend@gmail.com>
To: Tom Herbert <tom@herbertland.com>,
Jesper Dangaard Brouer <brouer@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>,
David Miller <davem@davemloft.net>,
Eric Dumazet <eric.dumazet@gmail.com>,
Or Gerlitz <gerlitz.or@gmail.com>,
Eric Dumazet <edumazet@google.com>,
Linux Kernel Network Developers <netdev@vger.kernel.org>,
Alexander Duyck <alexander.duyck@gmail.com>,
Alexei Starovoitov <alexei.starovoitov@gmail.com>,
Daniel Borkmann <borkmann@iogearbox.net>,
Marek Majkowski <marek@cloudflare.com>,
Hannes Frederic Sowa <hannes@stressinduktion.org>,
Florian Westphal <fw@strlen.de>, Paolo Abeni <pabeni@redhat.com>,
John Fastabend <john.r.fastabend@intel.com>,
Amir Vadai <amirva@gmail.com>,
Daniel Borkmann <daniel@iogearbox.net>,
Vladislav Yasevich <vyasevich@gmail.com>
Subject: Re: Bypass at packet-page level (Was: Optimizing instruction-cache, more packets at each stage)
Date: Mon, 25 Jan 2016 09:50:16 -0800
Message-ID: <56A66058.1090308@gmail.com>
In-Reply-To: <CALx6S37EmUh7SXmVSjTN9DLMTicr6-VwL-i25VV1fM=V2E6giQ@mail.gmail.com>
On 16-01-25 09:09 AM, Tom Herbert wrote:
> On Mon, Jan 25, 2016 at 5:15 AM, Jesper Dangaard Brouer
> <brouer@redhat.com> wrote:
>>
>> After reading John's reply about perfect filters, I want to re-state
>> my idea, for this very early RX stage. And describe a packet-page
>> level bypass use-case, that John indirectly mentions.
>>
>>
>> There are two ideas, getting mixed up here. (1) bundling from the
>> RX-ring, (2) allowing to pick up the "packet-page" directly.
>>
>> Bundling (1) is something that seems natural, and which helps us
>> amortize the cost between layers (and utilizes icache better). Let's
>> keep that in another thread.
>>
>> This (2) direct forward of "packet-pages" is a fairly extreme idea,
>> BUT it has the potential of being a new integration point for
>> "selective" bypass-solutions and bringing RAW/af_packet (RX) up-to
>> speed with bypass-solutions.
>>
>>
>> Today, the bypass-solutions grab and control the entire NIC HW. In
>> many cases this is not very practical, if you also want to use the NIC
>> for something else.
>>
>> Solutions for bypassing only part of the traffic are starting to show
>> up. Both a netmap[1] and a DPDK[2] based approach.
>>
>> [1] https://blog.cloudflare.com/partial-kernel-bypass-merged-netmap/
>> [2] http://rhelblog.redhat.com/2015/10/02/getting-the-best-of-both-worlds-with-queue-splitting-bifurcated-driver/
>>
>> Both approaches install a HW filter in the NIC, and redirect packets
>> to a separate RX HW queue (via ethtool ntuple + flow-type). DPDK
>> needs PCI SR-IOV set up and then runs its own poll-mode driver on top.
>> Netmap patches the original ixgbe driver, and since CloudFlare/Gilberto's
>> changes[3] supports a single RX queue mode.
>>

FWIW, I wrote a version of the patch discussed in the queue-splitting
article that didn't require SR-IOV, and we also talked about it at the
last netconf in Ottawa. The problem is that without SR-IOV, if you map
a queue directly into userspace so you can run the poll-mode drivers
there, nothing protects the DMA engine, so userspace can put arbitrary
addresses in the descriptors. There is something called Process Address
Space ID (PASID), also part of the PCI-SIG spec, that could help here,
but I don't know of any hardware that supports it. The other option is
to use system calls and validate the descriptors in the kernel, but
this incurs some overhead; we measured it at 15% or so when I did the
numbers last year. However, I'm told there is some interesting work
going on around syscall overhead that may help.
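
To make the "validate the descriptors in the kernel" option concrete,
here is a minimal sketch in C. It is not the patch mentioned above:
struct dma_region, struct user_desc, desc_in_region() and
validate_descs() are hypothetical names, and the sketch assumes the
kernel has pinned one contiguous buffer region per process. The only
point is the bounds check that keeps a userspace-supplied address from
steering DMA outside that region.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: the single buffer region the kernel pinned for this process. */
struct dma_region {
    uint64_t base; /* bus address of the first byte */
    uint64_t len;  /* size of the region in bytes */
};

/* Hypothetical descriptor as filled in by userspace. */
struct user_desc {
    uint64_t addr; /* buffer address requested by userspace */
    uint32_t len;  /* buffer length in bytes */
};

/* True if [addr, addr + len) lies entirely inside the region.
 * Written with subtraction so that addr + len can never overflow. */
static bool desc_in_region(const struct dma_region *r,
                           const struct user_desc *d)
{
    uint64_t off;

    if (d->len == 0 || d->addr < r->base)
        return false;
    off = d->addr - r->base;
    return off < r->len && d->len <= r->len - off;
}

/* Check a batch of descriptors; returns how many leading descriptors
 * are safe to post to the hardware ring (stops at the first bad one). */
static size_t validate_descs(const struct dma_region *r,
                             const struct user_desc *d, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        if (!desc_in_region(r, &d[i]))
            break;
    return i;
}
```

A syscall handler would run something like validate_descs() on each
batch before touching the ring; the per-call cost of that check plus
the syscall entry is where the overhead mentioned above comes from.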

One thing to note is that SR-IOV limits the number of these
interfaces you can support to the maximum number of VFs, whereas the
queue mechanism, although slower with a function call, would be limited
only by the maximum number of queues. Also, busy polling will help here
if you are worried about pps.

Jesper, at least for your case (2), what are we missing with the
bifurcated/queue-splitting work? Are you really after systems
without SR-IOV support, or are you trying to get this on the order
of queues instead of VFs?

> Jesper, thanks for providing more specifics.
>
> One comment: If you intend to change core code paths or APIs for this,
> then I think that we should require up front that the associated HW
> support is protocol agnostic (i.e. HW filters must be programmable and
> generic). We don't want a promising feature like this to be
> undermined by protocol ossification.

At the moment we use ethtool ntuple filters, which basically means
adding a new set of enums and structures every time we need a new
protocol, so it's painful: you need your vendor to support you and you
need a new kernel.

The flow API was shot down (it would have gotten you to the point where
the user could specify the protocols for the driver to implement, e.g.
put_parse_graph), and the only new proposals I've seen are bpf
translations in drivers and 'tc'. I plan to take another shot at this in
net-next.
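
As a rough illustration of what "programmable and generic" could mean
in contrast to per-protocol ntuple structures, here is a small sketch
in C. It is not the flow API, not ethtool, and not any driver
interface: struct generic_match, match_field() and match_all() are
hypothetical names. Each rule is just (offset, length, mask, value)
over the raw frame, so matching a new protocol needs a new rule, not a
new enum.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical protocol-agnostic rule: compare up to 4 bytes at a
 * fixed offset in the frame against a masked value. */
struct generic_match {
    uint16_t offset; /* byte offset from the start of the frame */
    uint8_t  len;    /* field width in bytes, 1..4 */
    uint32_t mask;   /* bits that must match */
    uint32_t value;  /* expected value, in host order after big-endian load */
};

/* True if the field selected by the rule matches; false if the rule
 * is malformed or reaches past the end of the packet. */
static bool match_field(const uint8_t *pkt, size_t pkt_len,
                        const struct generic_match *m)
{
    uint32_t v = 0;
    uint8_t i;

    if (m->len == 0 || m->len > 4 || (size_t)m->offset + m->len > pkt_len)
        return false;
    for (i = 0; i < m->len; i++) /* big-endian (network order) load */
        v = (v << 8) | pkt[m->offset + i];
    return (v & m->mask) == (m->value & m->mask);
}

/* A filter is the AND of all its rules. */
static bool match_all(const uint8_t *pkt, size_t pkt_len,
                      const struct generic_match *rules, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        if (!match_field(pkt, pkt_len, &rules[i]))
            return false;
    return true;
}
```

For example, three rules at offsets 12, 23 and 36 (EtherType 0x0800,
IP protocol 17, UDP destination port) select IPv4/UDP traffic to one
port on an untagged frame with a 20-byte IP header, without the filter
code knowing anything about IPv4 or UDP.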
>
> Thanks,
> Tom
>
>> [3] https://github.com/luigirizzo/netmap/pull/87
>>