Re: Per-queue XDP programs, thoughts - Jesper Dangaard Brouer


From: Jesper Dangaard Brouer <brouer@redhat.com>
To: "Jonathan Lemon" <jonathan.lemon@gmail.com>
Cc: "Björn Töpel" <bjorn.topel@intel.com>,
	"Björn Töpel" <bjorn.topel@gmail.com>,
	ilias.apalodimas@linaro.org, toke@redhat.com,
	magnus.karlsson@intel.com, maciej.fijalkowski@intel.com,
	"Jason Wang" <jasowang@redhat.com>,
	"Alexei Starovoitov" <ast@fb.com>,
	"Daniel Borkmann" <borkmann@iogearbox.net>,
	"Jakub Kicinski" <jakub.kicinski@netronome.com>,
	"John Fastabend" <john.fastabend@gmail.com>,
	"David Miller" <davem@davemloft.net>,
	"Andy Gospodarek" <andy@greyhouse.net>,
	netdev@vger.kernel.org, bpf@vger.kernel.org,
	"Thomas Graf" <tgraf@suug.ch>,
	"Thomas Monjalon" <thomas@monjalon.net>,
	brouer@redhat.com
Subject: Re: Per-queue XDP programs, thoughts
Date: Tue, 16 Apr 2019 16:48:34 +0200
Message-ID: <20190416164834.2ce7e8ba@carbon>
In-Reply-To: <467AEB5A-DE90-4460-84EF-AFA33A7D6CD1@gmail.com>

On Mon, 15 Apr 2019 10:58:07 -0700
"Jonathan Lemon" <jonathan.lemon@gmail.com> wrote:

> On 15 Apr 2019, at 9:32, Jesper Dangaard Brouer wrote:
> 
> > On Mon, 15 Apr 2019 13:59:03 +0200 Björn Töpel 
> > <bjorn.topel@intel.com> wrote:
> >  
> >> Hi,
> >>
> >> As you probably can derive from the amount of time this is taking, I'm
> >> not really satisfied with the design of per-queue XDP program. (That,
> >> plus I'm a terribly slow hacker... ;-)) I'll try to expand my thinking
> >> in this mail!
> >>
> >> Beware, it's kind of a long post, and it's all over the place.  
> >
> > Cc'ing all the XDP-maintainers (and netdev).
> >  
> >> There are a number of ways of setting up flows in the kernel, e.g.
> >>
> >> * Connecting/accepting a TCP socket (in-band)
> >> * Using tc-flower (out-of-band)
> >> * ethtool (out-of-band)
> >> * ...
> >>
> >> The first acts on sockets, the second on netdevs. Then there's ethtool
> >> to configure RSS, and the RSS-on-steroids rxhash/ntuple that can steer
> >> to queues. Most users care about sockets and netdevices. Queues are
> >> more of an implementation detail of Rx or for QoS on the Tx side.  
> >
> > Let me first acknowledge that the current Linux tools to administer
> > HW filters are lacking (well, they suck).  We know the hardware is
> > capable, as DPDK has a full API for this called rte_flow[1]. If nothing
> > else you/we can use the DPDK API to create a program to configure the
> > hardware, examples here[2].
> >
> >  [1] https://doc.dpdk.org/guides/prog_guide/rte_flow.html
> >  [2] https://doc.dpdk.org/guides/howto/rte_flow.html
> >  
> >> XDP is something that we can attach to a netdevice. Again, very
> >> natural from a user perspective. As for XDP sockets, the current
> >> mechanism is that we attach to an existing netdevice queue. Ideally
> >> what we'd like is to *remove* the queue concept. A better approach
> >> would be creating the socket and setting it up -- but not binding it to a
> >> queue. Instead just binding it to a netdevice (or crazier just
> >> creating a socket without a netdevice).  
> >
> > Let me just remind everybody that the AF_XDP performance gains come
> > from binding the resource, which allows for lock-free semantics, as
> > explained here[3].
> >
> >  [3] https://github.com/xdp-project/xdp-tutorial/tree/master/advanced03-AF_XDP#where-does-af_xdp-performance-come-from
> >
> >  
> >> The socket is an endpoint, where I'd like data to end up (or get sent
> >> from). If the kernel can attach the socket to a hardware queue,
> >> there's zerocopy; if not, copy-mode. Ditto for Tx.  
> >
> > Well, XDP programs per RXQ are just a building block to achieve this.
> >
> > As Van Jacobson explains[4], sockets or applications "register" a
> > "transport signature", and get back a "channel".  In our case, the
> > netdev-global XDP program is our way to register/program these transport
> > signatures and redirect (e.g. into the AF_XDP socket).
> > This requires some work in software to parse and match transport
> > signatures to sockets.  XDP programs per RXQ are a way to get
> > hardware to perform this filtering for us.
> >
> >  [4] http://www.lemis.com/grog/Documentation/vj/lca06vj.pdf
> >
> >  
> >> Does a user (control plane) want/need to care about queues? Just
> >> create a flow to a socket (out-of-band or in-band) or to a netdevice
> >> (out-of-band).  
> >
> > A userspace "control-plane" program could hide the setup and use
> > whatever optimizations the system/hardware can provide.  VJ[4] e.g.
> > suggests that the "listen" socket first registers the transport
> > signature (with the driver) on "accept()".  If the HW supports the DPDK
> > rte_flow API we can register a 5-tuple (or create TC-HW rules) and load
> > our "transport-signature" XDP prog on the queue number we choose.  If
> > not, then our netdev-global XDP prog needs a hash-table of 5-tuples and
> > must do 5-tuple parsing.
> >
> > Creating netdevices via HW filter into queues is an interesting idea.
> > DPDK has an example here[5] of how to send packets per flow (via ethtool
> > filter setup even!) to queues that end up in SR-IOV devices.
> >
> >  [5] https://doc.dpdk.org/guides/howto/flow_bifurcation.html
> >
> >  
> >> Do we envision any other uses for per-queue XDP other than AF_XDP? If
> >> not, it would make *more* sense to attach the XDP program to the
> >> socket (e.g. if the endpoint would like to use kernel data structures
> >> via XDP).  
> >
> > As demonstrated in [5] you can use (ethtool) hardware filters to
> > redirect packets into VFs (Virtual Functions).
> >
> > I also want us to extend XDP to allow for redirect from a PF (Physical
> > Function) into a VF (Virtual Function).  First the netdev-global
> > XDP-prog needs to support this (maybe extend xdp_rxq_info with PF + VF
> > info).  Next configure HW filter to queue# and load XDP prog on that
> > queue# that only "redirect" to a single VF.  Now if driver+HW supports
> > it, it can "eliminate" the per-queue XDP-prog and do everything in HW.  
> 
> One thing I'd like to see is have RSS distribute incoming traffic
> across a set of queues.  The application would open a set of xsk's
> which are bound to those queues.

Yes. (Some) NIC hardware does support having RSS distribute incoming
traffic across a set of queues.  As you can see in [5] they have an
example of this:

 testpmd> flow isolate 0 true
 testpmd> flow create 0 ingress pattern eth / ipv4 / udp / vxlan vni is 42 / end \
          actions rss queues 0 1 2 3 end / end
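
To make the RSS step concrete, here is a rough Python sketch of how a NIC
might spread flows over queues 0-3: a Toeplitz hash over the flow tuple,
followed by a lookup in an indirection table. The key is the commonly
published default from Microsoft's RSS documentation. The helper names
(flow_input, build_indirection_table, select_queue), the choice of hashed
fields, and the table size of 128 are assumptions for illustration; real
hardware and drivers differ.

```python
import socket
import struct

# Commonly published 40-byte default RSS key (assumption: a real NIC or
# driver may be configured with a different or random key).
RSS_KEY = bytes.fromhex(
    "6d5a56da255b0ec24167253d43a38fb0"
    "d0ca2bcbae7b30b477cb2da38030f20c"
    "6a42b73bbeac01fa"
)


def toeplitz_hash(key: bytes, data: bytes) -> int:
    """Toeplitz hash: XOR a sliding 32-bit key window for every set input bit."""
    key_bits = len(key) * 8
    if len(data) * 8 + 32 > key_bits:
        raise ValueError("input too long for this key")
    key_int = int.from_bytes(key, "big")
    result = 0
    for i in range(len(data) * 8):
        if data[i // 8] & (0x80 >> (i % 8)):
            # 32-bit window of the key starting at bit offset i (MSB first).
            result ^= (key_int >> (key_bits - 32 - i)) & 0xFFFFFFFF
    return result


def flow_input(src_ip: str, dst_ip: str, src_port: int, dst_port: int) -> bytes:
    """Hypothetical hash input: IPv4 addresses and L4 ports, network order."""
    return (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip)
            + struct.pack("!HH", src_port, dst_port))


def build_indirection_table(queues, size=128):
    """Fill the indirection table round-robin with the allowed queue ids."""
    return [queues[i % len(queues)] for i in range(size)]


def select_queue(hash_value: int, table) -> int:
    """Use the low-order bits of the hash as an index into the table."""
    return table[hash_value % len(table)]


if __name__ == "__main__":
    table = build_indirection_table([0, 1, 2, 3])
    h = toeplitz_hash(RSS_KEY, flow_input("10.0.0.1", "10.0.0.2", 4789, 4789))
    print("hash=0x%08x queue=%d" % (h, select_queue(h, table)))
```

All packets of one flow hash to the same value and therefore land on the
same queue, which is what lets an xsk bound to that queue see the whole flow.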


> I'm not seeing how a transport signature would achieve this.  The
> current tooling seems to treat the queue as the basic building block,
> which seems generally appropriate.

After creating the N queues that your RSS hash distributes over, I imagine
that you load your per-queue XDP program on each of these N queues.  I
don't necessarily see a need for the kernel API to expose to userspace
an API/facility to load an XDP-prog on N queues in one go (you can just
iterate over them).
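
A minimal sketch of the "just iterate" approach, in Python. Per-queue XDP
attach is only a proposal in this thread, so attach_one below is a purely
hypothetical callback standing in for whatever per-queue attach facility
might exist; the interface name and program name are likewise made up.

```python
from typing import Callable, Dict, Iterable


def attach_to_queues(
    ifname: str,
    queue_ids: Iterable[int],
    prog_name: str,
    attach_one: Callable[[str, int, str], None],
) -> Dict[int, str]:
    """Attach the same program to each queue, one call per queue.

    attach_one(ifname, queue_id, prog_name) is a hypothetical per-queue
    attach operation supplied by the caller; it is not a real kernel API.
    """
    attached: Dict[int, str] = {}
    for qid in queue_ids:
        attach_one(ifname, qid, prog_name)  # may raise; earlier queues stay attached
        attached[qid] = prog_name
    return attached


if __name__ == "__main__":
    log = []
    # Stub attach operation that only records what it was asked to do.
    result = attach_to_queues("eth0", [0, 1, 2, 3], "xsk_redirect",
                              lambda dev, q, prog: log.append((dev, q, prog)))
    print(result)
```

With a loop like this the caller has to decide what to do when one attach
fails part-way through, which is the main thing an in-one-go facility would
otherwise handle.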

 
> Whittling things down (receiving packets only for a specific flow)
> could be achieved by creating a queue which only contains those
> packets which matched via some form of classification (or perhaps
> steered to a VF device), aka [5] above.   Exposing multiple queues
> allows load distribution for those apps which care about it.
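
For comparison, when the hardware cannot do this classification, the same
"only packets for a specific flow" selection has to be done in software with
a 5-tuple lookup. The Python sketch below only illustrates that logic; a real
netdev-global XDP program would be written in restricted C with a BPF hash
map. The names parse_five_tuple and classify, the table layout, and the
decision to ignore IPv6, VLAN tags and fragments are all assumptions.

```python
import socket
import struct
from typing import Dict, Optional, Tuple

# (protocol, src_ip, src_port, dst_ip, dst_port)
FiveTuple = Tuple[int, str, int, str, int]

ETH_HLEN = 14
ETH_P_IP = 0x0800


def parse_five_tuple(frame: bytes) -> Optional[FiveTuple]:
    """Extract the 5-tuple from an untagged Ethernet/IPv4 TCP or UDP frame."""
    if len(frame) < ETH_HLEN + 20:
        return None
    (ethertype,) = struct.unpack_from("!H", frame, 12)
    if ethertype != ETH_P_IP:
        return None
    ver_ihl = frame[ETH_HLEN]
    ihl = (ver_ihl & 0x0F) * 4  # IPv4 header length in bytes
    if ver_ihl >> 4 != 4 or ihl < 20:
        return None
    proto = frame[ETH_HLEN + 9]
    if proto not in (socket.IPPROTO_TCP, socket.IPPROTO_UDP):
        return None
    l4 = ETH_HLEN + ihl
    if len(frame) < l4 + 4:
        return None
    src_ip = socket.inet_ntoa(frame[ETH_HLEN + 12:ETH_HLEN + 16])
    dst_ip = socket.inet_ntoa(frame[ETH_HLEN + 16:ETH_HLEN + 20])
    src_port, dst_port = struct.unpack_from("!HH", frame, l4)
    return (proto, src_ip, src_port, dst_ip, dst_port)


def classify(frame: bytes, flow_table: Dict[FiveTuple, int]) -> Optional[int]:
    """Return the registered target (e.g. a socket index) or None for no match."""
    key = parse_five_tuple(frame)
    if key is None:
        return None
    return flow_table.get(key)
```

A None result corresponds to letting the packet continue to the normal
network stack; a match corresponds to redirecting it to the registered
endpoint.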

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

