From: Toshiaki Makita <toshiaki.makita1@gmail.com>
To: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>,
netdev@vger.kernel.org, Tariq Toukan <tariqt@mellanox.com>,
Daniel Borkmann <borkmann@iogearbox.net>,
Alexei Starovoitov <alexei.starovoitov@gmail.com>,
Eran Ben Elisha <eranbe@mellanox.com>
Subject: Re: [PATCH RFC 6/9] veth: Add ndo_xdp_xmit
Date: Wed, 2 May 2018 12:33:47 +0900 [thread overview]
Message-ID: <874a1c42-bbdf-0ab4-b96e-3b07e6fb4aef@gmail.com> (raw)
In-Reply-To: <20180501101420.544ff225@redhat.com>
On 18/05/01 (火) 17:14, Jesper Dangaard Brouer wrote:
> On Tue, 1 May 2018 10:02:01 +0900 Toshiaki Makita
> <makita.toshiaki@lab.ntt.co.jp> wrote:
>
>> On 2018/05/01 2:27, Jesper Dangaard Brouer wrote:
>>> On Thu, 26 Apr 2018 19:52:40 +0900 Toshiaki Makita
>>> <makita.toshiaki@lab.ntt.co.jp> wrote:
>>>
>>>> On 2018/04/26 5:24, Jesper Dangaard Brouer wrote:
>>>>> On Tue, 24 Apr 2018 23:39:20 +0900 Toshiaki Makita
>>>>> <toshiaki.makita1@gmail.com> wrote:
>>>>>
>>>>>> +static int veth_xdp_xmit(struct net_device *dev, struct
>>>>>> xdp_frame *frame) +{ + struct veth_priv *rcv_priv, *priv =
>>>>>> netdev_priv(dev); + int headroom = frame->data - (void
>>>>>> *)frame; + struct net_device *rcv; + int err = 0; + + rcv
>>>>>> = rcu_dereference(priv->peer); + if (unlikely(!rcv)) +
>>>>>> return -ENXIO; + + rcv_priv = netdev_priv(rcv); + /*
>>>>>> xdp_ring is initialized on receive side? */ + if
>>>>>> (rcu_access_pointer(rcv_priv->xdp_prog)) { + err =
>>>>>> xdp_ok_fwd_dev(rcv, frame->len); + if (unlikely(err)) +
>>>>>> return err; + + err = veth_xdp_enqueue(rcv_priv,
>>>>>> veth_xdp_to_ptr(frame)); + } else { + struct sk_buff *skb;
>>>>>> + + skb = veth_build_skb(frame, headroom, frame->len, 0);
>>>>>> + if (unlikely(!skb)) + return -ENOMEM; + + /* Get page
>>>>>> ref in case skb is dropped in netif_rx. + * The caller is
>>>>>> responsible for freeing the page on error. + */ +
>>>>>> get_page(virt_to_page(frame->data));
>>>>>
>>>>> I'm not sure you can make this assumption, that xdp_frames
>>>>> coming from another device driver uses a refcnt based memory
>>>>> model. But maybe I'm confused, as this looks like an SKB
>>>>> receive path, but in the ndo_xdp_xmit().
>>>>
>>>> I find this path similar to cpumap, which creates skb from
>>>> redirected xdp frame. Once it is converted to skb, skb head is
>>>> freed by page_frag_free, so anyway I needed to get the
>>>> refcount here regardless of memory model.
>>>
>>> Yes I know, I wrote cpumap ;-)
>>>
>>> First of all, I don't want to see such xdp_frame to SKB
>>> conversion code in every driver. Because that increase the
>>> chances of errors. And when looking at the details, then it
>>> seems that you have made the mistake of making it possible to
>>> leak xdp_frame info to the SKB (which cpumap takes into
>>> account).
>>
>> Do you mean leaving xdp_frame in skb->head is leaking something?
>> how?
>
> Like commit 97e19cce05e5 ("bpf: reserve xdp_frame size in xdp
> headroom") and commit 6dfb970d3dbd ("xdp: avoid leaking info stored
> in frame data on page reuse").
Thanks for sharing the info.
> But this time, the concern is a bpf_prog attached at TC/bpf_cls
> level, that can that can adjust head via bpf_skb_change_head (for
> XDP it is bpf_xdp_adjust_head) into the area used by xdp_frame. As
> described in https://git.kernel.org/davem/net-next/c/6dfb970d3dbd in
> is not super critical at the moment, as this _currently_ runs as
> CAP_SYS_ADMIN, but we would like to move towards CAP_NET_ADMIN.
What I don't get is why special casing xdp_frame info. My assumption is
that the area above skb->mac_header is uninit kernel memory and should
not be readable by unprivileged users anyway. So I didn't clear the area
at this point.
>>> Second, I think the refcnt scheme here is wrong. The xdp_frame
>>> should be "owned" by XDP and have the proper refcnt to deliver
>>> it directly to the network stack.
>>>
>>> Third, if we choose that we want a fallback, in-case XDP is not
>>> enabled on egress dev (but it have an ndo_xdp_xmit), then it
>>> should be placed in the generic/core code. E.g.
>>> __bpf_tx_xdp_map() could look at the return code from
>>> dev->netdev_ops->ndo_xdp() and create an SKB. (Hint, this would
>>> make it easy to implement TX bulking towards the dev).
>>
>> Right, this is a much cleaner way.
>>
>> Although I feel like we should add this fallback for veth because
>> it requires something which is different from other drivers
>> (enabling XDP on the peer device of the egress device),
>
> (This is why I Cc'ed Tariq...)
>
> This is actually a general problem with the xdp "xmit" side (and not
> specific to veth driver). The problem exists for other drivers as
> well.
>
> The problem is that a driver can implement ndo_xdp_xmit(), but the
> driver might not have allocated the necessary XDP TX-queue resources
> yet (or it might not be possible due to system resource limits).
>
> The current "hack" is to load an XDP prog on the egress device, and
> then assume that this cause the driver to also allocate the XDP
> ndo_xdo_xmit side HW resources. This is IMHO a wrong assumption.
>
> We need a more generic way to test if a net_device is "ready/enabled"
> for handling xdp_frames via ndo_xdp_xmit. And Tariq had some ideas
> on how to implement this...
My assumption on REDIRECT requirement came from this.
https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=306da4e685b
I guess you are saying thing are changing, and having an XDP program
attached on the egress device is no longer generally sufficient. Looking
forward to Tariq's solution.
Toshiaki Makita
>
> My opinion is that it is a waste of (HW/mem) resources to always
> allocate resources for ndo_xdp_xmit when loading an XDP program.
> Because what if my use-cases are XDP_DROP DDoS filter, or CPUMAP
> redirect load-balancer, then I don't want/need ndo_xdp_xmit. E.g.
> today using ixgbe on machines with more than 96 CPUs, will fail due
> to limited TX queue resources. Thus, blocking the mentioned
> use-cases.
>
>
>> I'll drop the part for now. It should not be resolved in the driver
>> code.
>
> Thank you.
>
next prev parent reply other threads:[~2018-05-02 3:33 UTC|newest]
Thread overview: 23+ messages / expand[flat|nested] mbox.gz Atom feed top
2018-04-24 14:39 [PATCH RFC 0/9] veth: Driver XDP Toshiaki Makita
2018-04-24 14:39 ` [PATCH RFC 1/9] net: Export skb_headers_offset_update and skb_copy_header Toshiaki Makita
2018-04-24 14:39 ` [PATCH RFC 2/9] veth: Add driver XDP Toshiaki Makita
2018-04-25 20:39 ` Jesper Dangaard Brouer
2018-04-26 10:46 ` Toshiaki Makita
2018-04-24 14:39 ` [PATCH RFC 3/9] veth: Avoid drops by oversized packets when XDP is enabled Toshiaki Makita
2018-04-24 14:39 ` [PATCH RFC 4/9] veth: Use NAPI for XDP Toshiaki Makita
2018-05-01 7:50 ` Jesper Dangaard Brouer
2018-05-01 8:02 ` Toshiaki Makita
2018-05-01 8:43 ` Jesper Dangaard Brouer
2018-05-02 3:37 ` Toshiaki Makita
2018-04-24 14:39 ` [PATCH RFC 5/9] veth: Handle xdp_frame in xdp napi ring Toshiaki Makita
2018-04-24 14:39 ` [PATCH RFC 6/9] veth: Add ndo_xdp_xmit Toshiaki Makita
2018-04-25 20:24 ` Jesper Dangaard Brouer
2018-04-26 10:52 ` Toshiaki Makita
2018-04-30 17:27 ` Jesper Dangaard Brouer
2018-05-01 1:02 ` Toshiaki Makita
2018-05-01 8:14 ` Jesper Dangaard Brouer
2018-05-02 3:33 ` Toshiaki Makita [this message]
2018-05-02 5:56 ` Jesper Dangaard Brouer
2018-04-24 14:39 ` [PATCH RFC 7/9] veth: Add XDP TX and REDIRECT Toshiaki Makita
2018-04-24 14:39 ` [PATCH RFC 8/9] veth: Avoid per-packet spinlock of XDP napi ring on dequeueing Toshiaki Makita
2018-04-24 14:39 ` [PATCH RFC 9/9] veth: Avoid per-packet spinlock of XDP napi ring on enqueueing Toshiaki Makita
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=874a1c42-bbdf-0ab4-b96e-3b07e6fb4aef@gmail.com \
--to=toshiaki.makita1@gmail.com \
--cc=alexei.starovoitov@gmail.com \
--cc=borkmann@iogearbox.net \
--cc=brouer@redhat.com \
--cc=eranbe@mellanox.com \
--cc=makita.toshiaki@lab.ntt.co.jp \
--cc=netdev@vger.kernel.org \
--cc=tariqt@mellanox.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).