netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Jesper Dangaard Brouer <brouer@redhat.com>
To: Toshiaki Makita <toshiaki.makita1@gmail.com>
Cc: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>,
	netdev@vger.kernel.org, Tariq Toukan <tariqt@mellanox.com>,
	Daniel Borkmann <borkmann@iogearbox.net>,
	Alexei Starovoitov <alexei.starovoitov@gmail.com>,
	Eran Ben Elisha <eranbe@mellanox.com>,
	brouer@redhat.com
Subject: Re: [PATCH RFC 6/9] veth: Add ndo_xdp_xmit
Date: Wed, 2 May 2018 07:56:56 +0200	[thread overview]
Message-ID: <20180502075656.5f2e6bd1@redhat.com> (raw)
In-Reply-To: <874a1c42-bbdf-0ab4-b96e-3b07e6fb4aef@gmail.com>

On Wed, 2 May 2018 12:33:47 +0900
Toshiaki Makita <toshiaki.makita1@gmail.com> wrote:

> On 18/05/01 (火) 17:14, Jesper Dangaard Brouer wrote:
> > On Tue, 1 May 2018 10:02:01 +0900 Toshiaki Makita 
> > <makita.toshiaki@lab.ntt.co.jp> wrote:
> >   
> >> On 2018/05/01 2:27, Jesper Dangaard Brouer wrote:  
> >>> On Thu, 26 Apr 2018 19:52:40 +0900 Toshiaki Makita 
> >>> <makita.toshiaki@lab.ntt.co.jp> wrote:
> >>>   
> >>>> On 2018/04/26 5:24, Jesper Dangaard Brouer wrote:  
> >>>>> On Tue, 24 Apr 2018 23:39:20 +0900 Toshiaki Makita 
> >>>>> <toshiaki.makita1@gmail.com> wrote:
> >>>>>   
> >>>>>> +static int veth_xdp_xmit(struct net_device *dev, struct 
[...]
> >>>>> 
> >>>>> I'm not sure you can make this assumption, that xdp_frames 
> >>>>> coming from another device driver uses a refcnt based memory 
> >>>>> model. But maybe I'm confused, as this looks like an SKB 
> >>>>> receive path, but in the ndo_xdp_xmit().  
> >>>> 
> >>>> I find this path similar to cpumap, which creates skb from 
> >>>> redirected xdp frame. Once it is converted to skb, skb head is 
> >>>> freed by page_frag_free, so anyway I needed to get the
> >>>> refcount here regardless of memory model.  
> >>> 
> >>> Yes I know, I wrote cpumap ;-)
> >>> 
> >>> First of all, I don't want to see such xdp_frame to SKB 
> >>> conversion code in every driver.  Because that increase the 
> >>> chances of errors.  And when looking at the details, then it 
> >>> seems that you have made the mistake of making it possible to 
> >>> leak xdp_frame info to the SKB (which cpumap takes into 
> >>> account).  
> >> 
> >> Do you mean leaving xdp_frame in skb->head is leaking something? 
> >> how?  
> > 
> > Like commit 97e19cce05e5 ("bpf: reserve xdp_frame size in xdp 
> > headroom") and commit 6dfb970d3dbd ("xdp: avoid leaking info stored 
> > in frame data on page reuse").  
> 
> Thanks for sharing the info.
> 
> > But this time, the concern is a bpf_prog attached at TC/bpf_cls 
> > level, that can that can adjust head via bpf_skb_change_head (for
> > XDP it is bpf_xdp_adjust_head) into the area used by xdp_frame.  As 
> > described in https://git.kernel.org/davem/net-next/c/6dfb970d3dbd in 
> > is not super critical at the moment, as this _currently_ runs as 
> > CAP_SYS_ADMIN, but we would like to move towards CAP_NET_ADMIN.  
> 
> What I don't get is why special casing xdp_frame info. My assumption is
> that the area above skb->mac_header is uninit kernel memory and should
> not be readable by unprivileged users anyway. So I didn't clear the area
> at this point.

This is also my understanding. But Alexei explicitly asked me to handle
this xdp_frame info case.  I assume that more work is required for the
transition from CAP_SYS_ADMIN to CAP_NET_ADMIN, we just don't want to
add more/new code that makes this more difficult.


> >>> Second, I think the refcnt scheme here is wrong. The xdp_frame 
> >>> should be "owned" by XDP and have the proper refcnt to deliver
> >>> it directly to the network stack.
> >>> 
> >>> Third, if we choose that we want a fallback, in-case XDP is not 
> >>> enabled on egress dev (but it have an ndo_xdp_xmit), then it 
> >>> should be placed in the generic/core code.  E.g. 
> >>> __bpf_tx_xdp_map() could look at the return code from 
> >>> dev->netdev_ops->ndo_xdp() and create an SKB.  (Hint, this would 
> >>> make it easy to implement TX bulking towards the dev).  
> >> 
> >> Right, this is a much cleaner way.
> >> 
> >> Although I feel like we should add this fallback for veth because 
> >> it requires something which is different from other drivers 
> >> (enabling XDP on the peer device of the egress device),  
> > 
> > (This is why I Cc'ed Tariq...)
> > 
> > This is actually a general problem with the xdp "xmit" side (and not
> >  specific to veth driver). The problem exists for other drivers as 
> > well.
> > 
> > The problem is that a driver can implement ndo_xdp_xmit(), but the 
> > driver might not have allocated the necessary XDP TX-queue resources
> >  yet (or it might not be possible due to system resource limits).
> > 
> > The current "hack" is to load an XDP prog on the egress device, and 
> > then assume that this cause the driver to also allocate the XDP 
> > ndo_xdo_xmit side HW resources.  This is IMHO a wrong assumption.
>  >
> > We need a more generic way to test if a net_device is "ready/enabled"
> > for handling xdp_frames via ndo_xdp_xmit.  And Tariq had some ideas
> > on how to implement this...  
> 
> My assumption on REDIRECT requirement came from this.
> https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=306da4e685b

Yes, I wrote that.

> I guess you are saying thing are changing, and having an XDP program 
> attached on the egress device is no longer generally sufficient. Looking 
> forward to Tariq's solution.

Yes, (hopefully) things are changing. Loading a dummy XDP program to
enable ndo_xdp_xmit, turned out to be a bad model, for all the reasons
I listed.

I hope Tariq find some time to work on this ... ;-)


> Toshiaki Makita
> 
> > 
> > My opinion is that it is a waste of (HW/mem) resources to always 
> > allocate resources for ndo_xdp_xmit when loading an XDP program. 
> > Because what if my use-cases are XDP_DROP DDoS filter, or CPUMAP 
> > redirect load-balancer, then I don't want/need ndo_xdp_xmit.  E.g. 
> > today using ixgbe on machines with more than 96 CPUs, will fail due 
> > to limited TX queue resources. Thus, blocking the mentioned 
> > use-cases.
> > 

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

  reply	other threads:[~2018-05-02  5:57 UTC|newest]

Thread overview: 23+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-04-24 14:39 [PATCH RFC 0/9] veth: Driver XDP Toshiaki Makita
2018-04-24 14:39 ` [PATCH RFC 1/9] net: Export skb_headers_offset_update and skb_copy_header Toshiaki Makita
2018-04-24 14:39 ` [PATCH RFC 2/9] veth: Add driver XDP Toshiaki Makita
2018-04-25 20:39   ` Jesper Dangaard Brouer
2018-04-26 10:46     ` Toshiaki Makita
2018-04-24 14:39 ` [PATCH RFC 3/9] veth: Avoid drops by oversized packets when XDP is enabled Toshiaki Makita
2018-04-24 14:39 ` [PATCH RFC 4/9] veth: Use NAPI for XDP Toshiaki Makita
2018-05-01  7:50   ` Jesper Dangaard Brouer
2018-05-01  8:02     ` Toshiaki Makita
2018-05-01  8:43       ` Jesper Dangaard Brouer
2018-05-02  3:37         ` Toshiaki Makita
2018-04-24 14:39 ` [PATCH RFC 5/9] veth: Handle xdp_frame in xdp napi ring Toshiaki Makita
2018-04-24 14:39 ` [PATCH RFC 6/9] veth: Add ndo_xdp_xmit Toshiaki Makita
2018-04-25 20:24   ` Jesper Dangaard Brouer
2018-04-26 10:52     ` Toshiaki Makita
2018-04-30 17:27       ` Jesper Dangaard Brouer
2018-05-01  1:02         ` Toshiaki Makita
2018-05-01  8:14           ` Jesper Dangaard Brouer
2018-05-02  3:33             ` Toshiaki Makita
2018-05-02  5:56               ` Jesper Dangaard Brouer [this message]
2018-04-24 14:39 ` [PATCH RFC 7/9] veth: Add XDP TX and REDIRECT Toshiaki Makita
2018-04-24 14:39 ` [PATCH RFC 8/9] veth: Avoid per-packet spinlock of XDP napi ring on dequeueing Toshiaki Makita
2018-04-24 14:39 ` [PATCH RFC 9/9] veth: Avoid per-packet spinlock of XDP napi ring on enqueueing Toshiaki Makita

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20180502075656.5f2e6bd1@redhat.com \
    --to=brouer@redhat.com \
    --cc=alexei.starovoitov@gmail.com \
    --cc=borkmann@iogearbox.net \
    --cc=eranbe@mellanox.com \
    --cc=makita.toshiaki@lab.ntt.co.jp \
    --cc=netdev@vger.kernel.org \
    --cc=tariqt@mellanox.com \
    --cc=toshiaki.makita1@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).