netdev.vger.kernel.org archive mirror
From: John Fastabend <john.fastabend@gmail.com>
To: Alexander Duyck <alexander.duyck@gmail.com>
Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>,
	Eric Dumazet <eric.dumazet@gmail.com>,
	Jesper Dangaard Brouer <brouer@redhat.com>,
	Netdev <netdev@vger.kernel.org>,
	Tom Herbert <tom@herbertland.com>,
	Alexei Starovoitov <ast@kernel.org>,
	John Fastabend <john.r.fastabend@intel.com>,
	Daniel Borkmann <daniel@iogearbox.net>,
	David Miller <davem@davemloft.net>
Subject: Re: Questions on XDP
Date: Sat, 18 Feb 2017 19:48:25 -0800
Message-ID: <58A91589.6050404@gmail.com>
In-Reply-To: <CAKgT0UcpW_MMCgURnB=r5Tezdm9jJSNfOYcNJg8TR1sMKREzPg@mail.gmail.com>

On 17-02-18 06:16 PM, Alexander Duyck wrote:
> On Sat, Feb 18, 2017 at 3:48 PM, John Fastabend
> <john.fastabend@gmail.com> wrote:
>> On 17-02-18 03:31 PM, Alexei Starovoitov wrote:
>>> On Sat, Feb 18, 2017 at 10:18 AM, Alexander Duyck
>>> <alexander.duyck@gmail.com> wrote:
>>>>
>>>>> XDP_DROP does not require having one page per frame.
>>>>
>>>> Agreed.
>>>
>>> why do you think so?
>>> xdp_drop is targeting ddos where in good case
>>> all traffic is passed up and in bad case
>>> most of the traffic is dropped, but good traffic still needs
>>> to be serviced by the layers after. Like other xdp
>>> programs and the stack.
>>> Say ixgbe+xdp goes with 2k per packet,
>>> very soon we will have a bunch of half pages
>>> sitting in the stack and the other halves requiring
>>> complex refcounting, making the actual
>>> ddos mitigation ineffective and forcing the nic to drop packets
>>
>> I'm not seeing the distinction here. If it's a 4k page sitting in
>> the stack, the driver will get overrun as well.
>>
>>> because it runs out of buffers. Why complicate things?
>>
>> It doesn't seem complex to me and the driver already handles this
>> case so it actually makes the drivers simpler because there is only
>> a single buffer management path.
>>
>>> packet per page approach is simple and effective.
>>> virtio is different. there we don't have hw that needs
>>> to have buffers ready for dma.
>>>
>>>> Looking at the Mellanox way of doing it I am not entirely sure it is
>>>> useful.  It looks good for benchmarks but that is about it.  Also I
>>>
>>> it's the opposite. It already runs very nicely in production.
>>> In real life it's always a combination of xdp_drop, xdp_tx and
>>> xdp_pass actions.
>>> Sounds like ixgbe wants to do things differently because
>>> of not-invented-here. That new approach may turn
>>> out to be good or bad, but why risk it?
>>> mlx4 approach works.
>>> mlx5 has a few issues though, because page recycling
>>> was done too simplistically. A generic page pool/recycling
>>> layer that all drivers will use should solve that, I hope.
>>> Is the proposal to have a generic split-page recycler?
>>> How is that going to work?
>>>
>>
>> No, just give the driver a page when it asks for it. How the
>> driver uses the page is not the pool's concern.
>>
>>>> don't see it extending out to the point that we would be able to
>>>> exchange packets between interfaces which really seems like it should
>>>> be the ultimate goal for XDP_TX.
>>>
>>> we don't have a use case for multi-port xdp_tx,
>>> but I'm not objecting to doing it in general.
>>> Just right now I don't see a need to complicate
>>> drivers to do so.
>>
>> We are running our vswitch in userspace now for many workloads;
>> it would be nice to have these in kernel if possible.
>>
>>>
>>>> It seems like eventually we want to be able to peel off the buffer and
>>>> send it to something other than ourselves.  For example it seems like
>>>> it might be useful at some point to use XDP to do traffic
>>>> classification and have it route packets between multiple interfaces
>>>> on a host and it wouldn't make sense to have all of them map every
>>>> page as bidirectional because it starts becoming ridiculous if you
>>>> have dozens of interfaces in a system.
>>>
>>> A dozen interfaces? Like a single nic with a dozen ports?
>>> or many nics with many ports on the same system?
>>> are you trying to build a switch out of x86?
>>> I don't think it's realistic to have a multi-terabit x86 box.
>>> Is it all because of dpdk/6wind demos?
>>> I saw how dpdk was bragging that they can saturate
>>> the pcie bus. So? Why is this useful?
> 
> Actually I was thinking more of an OVS, bridge, or routing
> replacement.  Basically with a couple of physical interfaces and then
> either veth and/or vhost interfaces.
> 

Yep, valid use case for me. We would use this with Intel Clear Linux,
assuming we can sort it out and the perf metrics are good.

>>> Why would anyone care to put a bunch of nics
>>> into x86 and demonstrate that the bandwidth of pcie is now
>>> a limiting factor?
>>
>> Maybe Alex had something else in mind, but we have many virtual interfaces
>> plus physical interfaces in the vswitch use case. Possibly thousands.
> 
> I was thinking about the fact that the Mellanox driver is currently
> mapping pages as bidirectional, so I was sticking to the device to
> device case in regards to that discussion.  For virtual interfaces we
> don't even need the DMA mapping, it is just a copy to user space we
> have to deal with in the case of vhost.  In that regard I was thinking
> we need to start looking at taking XDP_TX one step further and
> possibly look at supporting the transmit of an xdp_buff on an unrelated
> netdev.  Although it looks like that means adding a netdev pointer to
> the xdp_buff in order to support returning it.
> 
> Anyway I am just running on conjecture at this point.  But it seems
> like if we want to make XDP capable of doing transmit we should
> support something other than bounce on the same port since that seems
> like a "just saturate the bus" use case more than anything.  I suppose
> you can do a one armed router, or have it do encap/decap for a tunnel,
> but that is about the limits of it.  If we allow it to do transmit on
> other netdevs then suddenly this has the potential to replace
> significant existing infrastructure.
> 
> Sorry if I am stirring the hornet's nest here.  I just finished the DMA
> API changes to allow DMA page reuse with writable pages on ixgbe, and
> igb/i40e/i40evf should be getting the same treatment shortly.  So now
> I am looking forward at XDP and just noticing a few things that didn't
> seem to make sense given the work I was doing to enable the API.
> 

Yep, good to push on it IMO. So, as I hinted, here is the
forward-to-another-port interface I've been looking at. I'm not claiming it's
the best possible solution, but it is the simplest thing I could come up with
that works. I was hoping to think about it more next week.

Here are the XDP extensions for redirect (they need to be rebased though),

 https://github.com/jrfastab/linux/commit/e78f5425d5e3c305b4170ddd85c61c2e15359fee

And here is a sample program,

 https://github.com/jrfastab/linux/commit/19d0a5de3f6e934baa8df23d95e766bab7f026d0

Probably the most relevant piece in the above patch is a new ndo op, as follows,

 +	void			(*ndo_xdp_xmit)(struct net_device *dev,
 +						struct xdp_buff *xdp);
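
For illustration, a driver would wire this up in its net_device_ops and do the
actual descriptor work in the handler. The sketch below is hypothetical (the
my_drv_* names are placeholders, not part of the patch) and elides the TX ring
details:

 static void my_drv_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp)
 {
 	/* Post xdp->data .. xdp->data_end as a single descriptor on a TX
 	 * ring reserved for XDP and ring the doorbell; locking, ring
 	 * selection and completion handling are omitted in this sketch.
 	 */
 }

 static const struct net_device_ops my_drv_netdev_ops = {
 	/* usual ndo_open/ndo_stop/ndo_start_xmit hooks go here */
 	.ndo_xdp_xmit	= my_drv_xdp_xmit,
 };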


Then there is support for redirect in the XDP eBPF helpers,

 +BPF_CALL_2(bpf_xdp_redirect, u32, ifindex, u64, flags)
 +{
 +	struct redirect_info *ri = this_cpu_ptr(&redirect_info);
 +
 +	if (unlikely(flags))
 +		return XDP_ABORTED;
 +
 +	ri->ifindex = ifindex;
 +	return XDP_REDIRECT;
 +}
 +
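
From the program side the idea is that the eBPF code just picks a destination
ifindex and returns the action. A minimal (untested) sketch, assuming the
helper is exposed to programs via BPF_FUNC_xdp_redirect from this patch and
using a made-up ifindex:

 #include <linux/bpf.h>
 #include "bpf_helpers.h"

 /* Declared by hand since the helper is not upstream yet. */
 static int (*bpf_xdp_redirect)(int ifindex, int flags) =
 	(void *) BPF_FUNC_xdp_redirect;

 SEC("xdp_redirect")
 int xdp_redirect_prog(struct xdp_md *ctx)
 {
 	/* ifindex 3 is just an example destination port */
 	return bpf_xdp_redirect(3, 0);
 }

 char _license[] SEC("license") = "GPL";
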

And then a routine for drivers to use to push packets around when they see the
XDP_REDIRECT action,

+static int __bpf_tx_xdp(struct net_device *dev, struct xdp_buff *xdp)
+{
+	if (dev->netdev_ops->ndo_xdp_xmit) {
+		dev->netdev_ops->ndo_xdp_xmit(dev, xdp);
+		return 0;
+	}
+	bpf_warn_invalid_xdp_redirect(dev->ifindex);
+	return -EOPNOTSUPP;
+}
+
+int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp)
+{
+	struct redirect_info *ri = this_cpu_ptr(&redirect_info);
+	u32 index = ri->ifindex;
+
+	/* Clear the per-cpu state up front so a stale ifindex never leaks
+	 * into the next packet, and keep a local copy so the warning below
+	 * reports the index that actually failed to resolve.
+	 */
+	ri->ifindex = 0;
+	dev = dev_get_by_index_rcu(dev_net(dev), index);
+	if (unlikely(!dev)) {
+		bpf_warn_invalid_xdp_redirect(index);
+		return -EINVAL;
+	}
+
+	return __bpf_tx_xdp(dev, xdp);
+}
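
On the receive side the driver then adds one more case to its verdict switch
after running the program. Again just a sketch, with placeholder names
(rx_ring, my_drv_xmit_back) for the driver-private pieces:

 	act = bpf_prog_run_xdp(xdp_prog, &xdp);
 	switch (act) {
 	case XDP_PASS:
 		break;			/* build an skb and hand it to the stack */
 	case XDP_TX:
 		my_drv_xmit_back(rx_ring, &xdp);	/* bounce out the same port */
 		break;
 	case XDP_REDIRECT:
 		/* looks up ri->ifindex and calls the target's ndo_xdp_xmit */
 		if (xdp_do_redirect(rx_ring->netdev, &xdp))
 			goto recycle;	/* target had no ndo_xdp_xmit: drop */
 		break;
 	default:
 		bpf_warn_invalid_xdp_action(act);
 		/* fall through */
 	case XDP_ABORTED:
 	case XDP_DROP:
 recycle:
 		/* give the page back to the RX ring */
 		break;
 	}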


Still thinking on it though, to see if I might have a better mechanism, and I
need benchmarks to show various metrics.

Thanks,
John

Thread overview: 21+ messages
2017-02-18 23:31 Questions on XDP Alexei Starovoitov
2017-02-18 23:48 ` John Fastabend
2017-02-18 23:59   ` Eric Dumazet
2017-02-19  2:16   ` Alexander Duyck
2017-02-19  3:48     ` John Fastabend [this message]
2017-02-20 20:06       ` Jakub Kicinski
2017-02-22  5:02         ` John Fastabend
2017-02-21  3:18     ` Alexei Starovoitov
2017-02-21  3:39       ` John Fastabend
2017-02-21  4:00         ` Alexander Duyck
2017-02-21  7:55           ` Alexei Starovoitov
2017-02-21 17:44             ` Alexander Duyck
2017-02-22 17:08               ` John Fastabend
2017-02-22 21:59                 ` Jesper Dangaard Brouer
  -- strict thread matches above, loose matches on Subject: below --
2017-02-18 23:59 Alexei Starovoitov
2017-02-16 20:41 Alexander Duyck
2017-02-16 22:36 ` John Fastabend
2017-02-18 16:34   ` Jesper Dangaard Brouer
2017-02-18 17:41     ` Eric Dumazet
2017-02-18 18:18       ` Alexander Duyck
2017-02-18 23:28         ` John Fastabend
