All of lore.kernel.org
 help / color / mirror / Atom feed
From: Jesper Dangaard Brouer <brouer@redhat.com>
To: Toshiaki Makita <toshiaki.makita1@gmail.com>
Cc: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>,
	netdev@vger.kernel.org, Tariq Toukan <tariqt@mellanox.com>,
	Daniel Borkmann <borkmann@iogearbox.net>,
	Alexei Starovoitov <alexei.starovoitov@gmail.com>,
	Eran Ben Elisha <eranbe@mellanox.com>,
	brouer@redhat.com
Subject: Re: [PATCH RFC 6/9] veth: Add ndo_xdp_xmit
Date: Wed, 2 May 2018 07:56:56 +0200	[thread overview]
Message-ID: <20180502075656.5f2e6bd1@redhat.com> (raw)
In-Reply-To: <874a1c42-bbdf-0ab4-b96e-3b07e6fb4aef@gmail.com>

On Wed, 2 May 2018 12:33:47 +0900
Toshiaki Makita <toshiaki.makita1@gmail.com> wrote:

> On 18/05/01 (火) 17:14, Jesper Dangaard Brouer wrote:
> > On Tue, 1 May 2018 10:02:01 +0900 Toshiaki Makita 
> > <makita.toshiaki@lab.ntt.co.jp> wrote:
> >   
> >> On 2018/05/01 2:27, Jesper Dangaard Brouer wrote:  
> >>> On Thu, 26 Apr 2018 19:52:40 +0900 Toshiaki Makita 
> >>> <makita.toshiaki@lab.ntt.co.jp> wrote:
> >>>   
> >>>> On 2018/04/26 5:24, Jesper Dangaard Brouer wrote:  
> >>>>> On Tue, 24 Apr 2018 23:39:20 +0900 Toshiaki Makita 
> >>>>> <toshiaki.makita1@gmail.com> wrote:
> >>>>>   
> >>>>>> +static int veth_xdp_xmit(struct net_device *dev, struct 
[...]
> >>>>> 
> >>>>> I'm not sure you can make this assumption, that xdp_frames 
> >>>>> coming from another device driver uses a refcnt based memory 
> >>>>> model. But maybe I'm confused, as this looks like an SKB 
> >>>>> receive path, but in the ndo_xdp_xmit().  
> >>>> 
> >>>> I find this path similar to cpumap, which creates skb from 
> >>>> redirected xdp frame. Once it is converted to skb, skb head is 
> >>>> freed by page_frag_free, so anyway I needed to get the
> >>>> refcount here regardless of memory model.  
> >>> 
> >>> Yes I know, I wrote cpumap ;-)
> >>> 
> >>> First of all, I don't want to see such xdp_frame to SKB 
> >>> conversion code in every driver.  Because that increase the 
> >>> chances of errors.  And when looking at the details, then it 
> >>> seems that you have made the mistake of making it possible to 
> >>> leak xdp_frame info to the SKB (which cpumap takes into 
> >>> account).  
> >> 
> >> Do you mean leaving xdp_frame in skb->head is leaking something? 
> >> how?  
> > 
> > Like commit 97e19cce05e5 ("bpf: reserve xdp_frame size in xdp 
> > headroom") and commit 6dfb970d3dbd ("xdp: avoid leaking info stored 
> > in frame data on page reuse").  
> 
> Thanks for sharing the info.
> 
> > But this time, the concern is a bpf_prog attached at TC/bpf_cls 
> > level, that can that can adjust head via bpf_skb_change_head (for
> > XDP it is bpf_xdp_adjust_head) into the area used by xdp_frame.  As 
> > described in https://git.kernel.org/davem/net-next/c/6dfb970d3dbd in 
> > is not super critical at the moment, as this _currently_ runs as 
> > CAP_SYS_ADMIN, but we would like to move towards CAP_NET_ADMIN.  
> 
> What I don't get is why special casing xdp_frame info. My assumption is
> that the area above skb->mac_header is uninit kernel memory and should
> not be readable by unprivileged users anyway. So I didn't clear the area
> at this point.

This is also my understanding. But Alexei explicitly asked me to handle
this xdp_frame info case.  I assume that more work is required for the
transition from CAP_SYS_ADMIN to CAP_NET_ADMIN, we just don't want to
add more/new code that makes this more difficult.


> >>> Second, I think the refcnt scheme here is wrong. The xdp_frame 
> >>> should be "owned" by XDP and have the proper refcnt to deliver
> >>> it directly to the network stack.
> >>> 
> >>> Third, if we choose that we want a fallback, in-case XDP is not 
> >>> enabled on egress dev (but it have an ndo_xdp_xmit), then it 
> >>> should be placed in the generic/core code.  E.g. 
> >>> __bpf_tx_xdp_map() could look at the return code from 
> >>> dev->netdev_ops->ndo_xdp() and create an SKB.  (Hint, this would 
> >>> make it easy to implement TX bulking towards the dev).  
> >> 
> >> Right, this is a much cleaner way.
> >> 
> >> Although I feel like we should add this fallback for veth because 
> >> it requires something which is different from other drivers 
> >> (enabling XDP on the peer device of the egress device),  
> > 
> > (This is why I Cc'ed Tariq...)
> > 
> > This is actually a general problem with the xdp "xmit" side (and not
> >  specific to veth driver). The problem exists for other drivers as 
> > well.
> > 
> > The problem is that a driver can implement ndo_xdp_xmit(), but the 
> > driver might not have allocated the necessary XDP TX-queue resources
> >  yet (or it might not be possible due to system resource limits).
> > 
> > The current "hack" is to load an XDP prog on the egress device, and 
> > then assume that this cause the driver to also allocate the XDP 
> > ndo_xdo_xmit side HW resources.  This is IMHO a wrong assumption.
>  >
> > We need a more generic way to test if a net_device is "ready/enabled"
> > for handling xdp_frames via ndo_xdp_xmit.  And Tariq had some ideas
> > on how to implement this...  
> 
> My assumption on REDIRECT requirement came from this.
> https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=306da4e685b

Yes, I wrote that.

> I guess you are saying thing are changing, and having an XDP program 
> attached on the egress device is no longer generally sufficient. Looking 
> forward to Tariq's solution.

Yes, (hopefully) things are changing. Loading a dummy XDP program to
enable ndo_xdp_xmit, turned out to be a bad model, for all the reasons
I listed.

I hope Tariq find some time to work on this ... ;-)


> Toshiaki Makita
> 
> > 
> > My opinion is that it is a waste of (HW/mem) resources to always 
> > allocate resources for ndo_xdp_xmit when loading an XDP program. 
> > Because what if my use-cases are XDP_DROP DDoS filter, or CPUMAP 
> > redirect load-balancer, then I don't want/need ndo_xdp_xmit.  E.g. 
> > today using ixgbe on machines with more than 96 CPUs, will fail due 
> > to limited TX queue resources. Thus, blocking the mentioned 
> > use-cases.
> > 

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

  reply	other threads:[~2018-05-02  5:57 UTC|newest]

Thread overview: 23+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-04-24 14:39 [PATCH RFC 0/9] veth: Driver XDP Toshiaki Makita
2018-04-24 14:39 ` [PATCH RFC 1/9] net: Export skb_headers_offset_update and skb_copy_header Toshiaki Makita
2018-04-24 14:39 ` [PATCH RFC 2/9] veth: Add driver XDP Toshiaki Makita
2018-04-25 20:39   ` Jesper Dangaard Brouer
2018-04-26 10:46     ` Toshiaki Makita
2018-04-24 14:39 ` [PATCH RFC 3/9] veth: Avoid drops by oversized packets when XDP is enabled Toshiaki Makita
2018-04-24 14:39 ` [PATCH RFC 4/9] veth: Use NAPI for XDP Toshiaki Makita
2018-05-01  7:50   ` Jesper Dangaard Brouer
2018-05-01  8:02     ` Toshiaki Makita
2018-05-01  8:43       ` Jesper Dangaard Brouer
2018-05-02  3:37         ` Toshiaki Makita
2018-04-24 14:39 ` [PATCH RFC 5/9] veth: Handle xdp_frame in xdp napi ring Toshiaki Makita
2018-04-24 14:39 ` [PATCH RFC 6/9] veth: Add ndo_xdp_xmit Toshiaki Makita
2018-04-25 20:24   ` Jesper Dangaard Brouer
2018-04-26 10:52     ` Toshiaki Makita
2018-04-30 17:27       ` Jesper Dangaard Brouer
2018-05-01  1:02         ` Toshiaki Makita
2018-05-01  8:14           ` Jesper Dangaard Brouer
2018-05-02  3:33             ` Toshiaki Makita
2018-05-02  5:56               ` Jesper Dangaard Brouer [this message]
2018-04-24 14:39 ` [PATCH RFC 7/9] veth: Add XDP TX and REDIRECT Toshiaki Makita
2018-04-24 14:39 ` [PATCH RFC 8/9] veth: Avoid per-packet spinlock of XDP napi ring on dequeueing Toshiaki Makita
2018-04-24 14:39 ` [PATCH RFC 9/9] veth: Avoid per-packet spinlock of XDP napi ring on enqueueing Toshiaki Makita

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20180502075656.5f2e6bd1@redhat.com \
    --to=brouer@redhat.com \
    --cc=alexei.starovoitov@gmail.com \
    --cc=borkmann@iogearbox.net \
    --cc=eranbe@mellanox.com \
    --cc=makita.toshiaki@lab.ntt.co.jp \
    --cc=netdev@vger.kernel.org \
    --cc=tariqt@mellanox.com \
    --cc=toshiaki.makita1@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.