From: Jason Wang <jasowang@redhat.com>
To: "Prashant Bhole" <prashantbhole.linux@gmail.com>,
"Toke Høiland-Jørgensen" <toke@redhat.com>,
"Alexei Starovoitov" <alexei.starovoitov@gmail.com>,
"Jesper Dangaard Brouer" <jbrouer@redhat.com>
Cc: "David S . Miller" <davem@davemloft.net>,
"Michael S . Tsirkin" <mst@redhat.com>,
Alexei Starovoitov <ast@kernel.org>,
Daniel Borkmann <daniel@iogearbox.net>,
Jesper Dangaard Brouer <hawk@kernel.org>,
David Ahern <dsahern@gmail.com>,
Jakub Kicinski <jakub.kicinski@netronome.com>,
John Fastabend <john.fastabend@gmail.com>,
Toshiaki Makita <toshiaki.makita1@gmail.com>,
Martin KaFai Lau <kafai@fb.com>, Song Liu <songliubraving@fb.com>,
Yonghong Song <yhs@fb.com>, Andrii Nakryiko <andriin@fb.com>,
netdev@vger.kernel.org,
Ilias Apalodimas <ilias.apalodimas@linaro.org>
Subject: Re: [RFC net-next 11/14] tun: run XDP program in tx path
Date: Fri, 20 Dec 2019 15:36:50 +0800 [thread overview]
Message-ID: <dd1f048f-3c51-79d0-73ff-a6b0c71f7925@redhat.com> (raw)
In-Reply-To: <3654a205-b3fd-b531-80ac-42823e089b39@gmail.com>
On 2019/12/20 下午12:46, Prashant Bhole wrote:
>
>
> On 12/20/19 12:24 PM, Jason Wang wrote:
>>
>> On 2019/12/20 上午8:07, Prashant Bhole wrote:
>>> Note: Resending my last response. It was not delivered to netdev list
>>> due to some problem.
>>>
>>> On 12/19/19 7:15 PM, Toke Høiland-Jørgensen wrote:
>>>> Prashant Bhole <prashantbhole.linux@gmail.com> writes:
>>>>
>>>>> On 12/19/19 3:19 AM, Alexei Starovoitov wrote:
>>>>>> On Wed, Dec 18, 2019 at 12:48:59PM +0100, Toke Høiland-Jørgensen
>>>>>> wrote:
>>>>>>> Jesper Dangaard Brouer <jbrouer@redhat.com> writes:
>>>>>>>
>>>>>>>> On Wed, 18 Dec 2019 17:10:47 +0900
>>>>>>>> Prashant Bhole <prashantbhole.linux@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> +static u32 tun_do_xdp_tx(struct tun_struct *tun, struct
>>>>>>>>> tun_file *tfile,
>>>>>>>>> + struct xdp_frame *frame)
>>>>>>>>> +{
>>>>>>>>> + struct bpf_prog *xdp_prog;
>>>>>>>>> + struct tun_page tpage;
>>>>>>>>> + struct xdp_buff xdp;
>>>>>>>>> + u32 act = XDP_PASS;
>>>>>>>>> + int flush = 0;
>>>>>>>>> +
>>>>>>>>> + xdp_prog = rcu_dereference(tun->xdp_tx_prog);
>>>>>>>>> + if (xdp_prog) {
>>>>>>>>> + xdp.data_hard_start = frame->data - frame->headroom;
>>>>>>>>> + xdp.data = frame->data;
>>>>>>>>> + xdp.data_end = xdp.data + frame->len;
>>>>>>>>> + xdp.data_meta = xdp.data - frame->metasize;
>>>>>>>>
>>>>>>>> You have not configured xdp.rxq, thus a BPF-prog accessing this
>>>>>>>> will crash.
>>>>>>>>
>>>>>>>> For an XDP TX hook, I want us to provide/give BPF-prog access
>>>>>>>> to some
>>>>>>>> more information about e.g. the current tx-queue length, or
>>>>>>>> TC-q number.
>>>>>>>>
>>>>>>>> Question to Daniel or Alexei, can we do this and still keep
>>>>>>>> BPF_PROG_TYPE_XDP?
>>>>>>>> Or is it better to introduce a new BPF prog type (enum
>>>>>>>> bpf_prog_type)
>>>>>>>> for XDP TX-hook ?
>>>>>>>
>>>>>>> I think a new program type would make the most sense. If/when we
>>>>>>> introduce an XDP TX hook[0], it should have different semantics
>>>>>>> than the
>>>>>>> regular XDP hook. I view the XDP TX hook as a hook that executes
>>>>>>> as the
>>>>>>> very last thing before packets leave the interface. It should have
>>>>>>> access to different context data as you say, but also I don't
>>>>>>> think it
>>>>>>> makes sense to have XDP_TX and XDP_REDIRECT in an XDP_TX hook.
>>>>>>> And we
>>>>>>> may also want to have a "throttle" return code; or maybe that
>>>>>>> could be
>>>>>>> done via a helper?
>>>>>>>
>>>>>>> In any case, I don't think this "emulated RX hook on the other
>>>>>>> end of a
>>>>>>> virtual device" model that this series introduces is the right
>>>>>>> semantics
>>>>>>> for an XDP TX hook. I can see what you're trying to do, and for
>>>>>>> virtual
>>>>>>> point-to-point links I think it may make sense to emulate the RX
>>>>>>> hook of
>>>>>>> the "other end" on TX. However, form a UAPI perspective, I don't
>>>>>>> think
>>>>>>> we should be calling this a TX hook; logically, it's still an RX
>>>>>>> hook
>>>>>>> on the receive end.
>>>>>>>
>>>>>>> If you guys are up for evolving this design into a "proper" TX
>>>>>>> hook (as
>>>>>>> outlined above an in [0]), that would be awesome, of course. But
>>>>>>> not
>>>>>>> sure what constraints you have on your original problem? Do you
>>>>>>> specifically need the "emulated RX hook for unmodified XDP
>>>>>>> programs"
>>>>>>> semantics, or could your problem be solved with a TX hook with
>>>>>>> different
>>>>>>> semantics?
>>>>>>
>>>>>> I agree with above.
>>>>>> It looks more like existing BPF_PROG_TYPE_XDP, but attached to
>>>>>> egress
>>>>>> of veth/tap interface. I think only attachment point makes a
>>>>>> difference.
>>>>>> May be use expected_attach_type ?
>>>>>> Then there will be no need to create new program type.
>>>>>> BPF_PROG_TYPE_XDP will be able to access different fields depending
>>>>>> on expected_attach_type. Like rx-queue length that Jesper is
>>>>>> suggesting
>>>>>> will be available only in such case and not for all
>>>>>> BPF_PROG_TYPE_XDP progs.
>>>>>> It can be reduced too. Like if there is no xdp.rxq concept for
>>>>>> egress side
>>>>>> of virtual device the access to that field can disallowed by the
>>>>>> verifier.
>>>>>> Could you also call it XDP_EGRESS instead of XDP_TX?
>>>>>> I would like to reserve XDP_TX name to what Toke describes as
>>>>>> XDP_TX.
>>>>>>
>>>>>
>>>>> From the discussion over this set, it makes sense to have new
>>>>> type of
>>>>> program. As David suggested it will make a way for changes specific
>>>>> to egress path.
>>>>> On the other hand, XDP offload with virtio-net implementation is
>>>>> based
>>>>> on "emulated RX hook". How about having this special behavior with
>>>>> expected_attach_type?
>>>>
>>>> Another thought I had re: this was that for these "special" virtual
>>>> point-to-point devices we could extend the API to have an ATTACH_PEER
>>>> flag. So if you have a pair of veth devices (veth0,veth1)
>>>> connecting to
>>>> each other, you could do either of:
>>>>
>>>> bpf_set_link_xdp_fd(ifindex(veth0), prog_fd, 0);
>>>> bpf_set_link_xdp_fd(ifindex(veth1), prog_fd, ATTACH_PEER);
>>>>
>>>> to attach to veth0, and:
>>>>
>>>> bpf_set_link_xdp_fd(ifindex(veth1), prog_fd, 0);
>>>> bpf_set_link_xdp_fd(ifindex(veth0), prog_fd, ATTACH_PEER);
>>>>
>>>> to attach to veth0.
>>>>
>>>> This would allow to attach to a device without having the "other end"
>>>> visible, and keep the "XDP runs on RX" semantics clear to userspace.
>>>> Internally in the kernel we could then turn the "attach to peer"
>>>> operation for a tun device into the "emulate on TX" thing you're
>>>> already
>>>> doing?
>>>>
>>>> Would this work for your use case, do you think?
>>>>
>>>> -Toke
>>>>
>>>
>>> This is nice from UAPI point of view. It may work for veth case but
>>> not for XDP offload with virtio-net. Please see the sequence when
>>> a user program in the guest wants to offload a program to tun.
>>>
>>> * User program wants to loads the program by setting offload flag and
>>> ifindex:
>>>
>>> - map_offload_ops->alloc()
>>> virtio-net sends map info to qemu and it creates map on the host.
>>> - prog_offload_ops->setup()
>>> New callback just to have a copy of unmodified program. It contains
>>> original map fds. We replace map fds with fds from the host side.
>>> Check the program for unsupported helpers calls.
>>> - prog_offload_ops->finalize()
>>> Send the program to qemu and it loads the program to the host.
>>>
>>> * User program calls bpf_set_link_xdp_fd()
>>> virtio-net handles XDP_PROG_SETUP_HW by sending a request to qemu.
>>> Qemu then attaches host side program fd to respective tun device by
>>> calling bpf_set_link_xdp_fd()
>>>
>>> In above sequence there is no chance to use.
>>
>>
>> For VM, I think what Toke meant is to consider virtio-net as a peer
>> of TAP and we can do something like the following in qemu:
>>
>> bpf_set_link_xdp_fd(ifindex(tap0), prog_fd, ATTACH_PEER);
>>
>> in this case. And the behavior of XDP_TX could be kept as if the XDP
>> was attached to the peer of TAP (actually a virtio-net inside the
>> guest).
>
> I think he meant actually attaching the program to the peer. Most
> probably referring the use case I mentioned in the cover letter.
Interesting, actually this is kind of XDP offloading. In the case of
veth, we had some discussion with Toshiaki and I remember he said it
doesn't help much.
Thanks
>
> "It can improve container networking where veth pair links the host and
> the container. Host can set ACL by setting tx path XDP to the veth
> iface."
>
> Toke, Can you please clarify?
>
> Thanks!
>
>>
>> Thanks
>>
>>
>>>
>>> Here is how other ideas from this discussion can be used:
>>>
>>> - Introduce BPF_PROG_TYPE_TX_XDP for egress path. Have a special
>>> behavior of emulating RX XDP using expected_attach_type flag.
>>> - The emulated RX XDP will be restrictive in terms of helper calls.
>>> - In offload case qemu will load the program BPF_PROG_TYPE_TX_XDP and
>>> set expected_attach_type.
>>>
>>> What is your opinion about it? Does the driver implementing egress
>>> XDP needs to know what kind of XDP program it is running?
>>>
>>> Thanks,
>>> Prashant
>>>
>>
>
next prev parent reply other threads:[~2019-12-20 7:37 UTC|newest]
Thread overview: 36+ messages / expand[flat|nested] mbox.gz Atom feed top
2019-12-18 8:10 [RFC net-next 00/14] XDP in tx path Prashant Bhole
2019-12-18 8:10 ` [RFC net-next 01/14] net: add tx path XDP support Prashant Bhole
2019-12-18 8:10 ` [RFC net-next 02/14] tools: sync kernel uapi/linux/if_link.h header Prashant Bhole
2019-12-18 8:10 ` [RFC net-next 03/14] libbpf: API for tx path XDP support Prashant Bhole
2019-12-18 18:20 ` Alexei Starovoitov
2019-12-18 8:10 ` [RFC net-next 04/14] samples/bpf: xdp1, add XDP tx support Prashant Bhole
2019-12-18 8:10 ` [RFC net-next 05/14] net: core: rename netif_receive_generic_xdp() to do_generic_xdp_core() Prashant Bhole
2019-12-18 8:10 ` [RFC net-next 06/14] net: core: export do_xdp_generic_core() Prashant Bhole
2019-12-18 8:10 ` [RFC net-next 07/14] tuntap: check tun_msg_ctl type at necessary places Prashant Bhole
2019-12-18 8:10 ` [RFC net-next 08/14] vhost_net: user tap recvmsg api to access ptr ring Prashant Bhole
2019-12-18 8:10 ` [RFC net-next 09/14] tuntap: remove usage of ptr ring in vhost_net Prashant Bhole
2019-12-18 8:10 ` [RFC net-next 10/14] tun: set tx path XDP program Prashant Bhole
2019-12-18 8:10 ` [RFC net-next 11/14] tun: run XDP program in tx path Prashant Bhole
2019-12-18 10:07 ` Jesper Dangaard Brouer
2019-12-18 11:48 ` Toke Høiland-Jørgensen
2019-12-18 16:33 ` David Ahern
2019-12-19 2:44 ` Jason Wang
2019-12-18 18:19 ` Alexei Starovoitov
2019-12-19 2:34 ` Prashant Bhole
2019-12-19 10:15 ` Toke Høiland-Jørgensen
2019-12-20 0:07 ` Prashant Bhole
2019-12-20 3:24 ` Jason Wang
2019-12-20 4:46 ` Prashant Bhole
2019-12-20 7:36 ` Jason Wang [this message]
2019-12-20 10:11 ` Toke Høiland-Jørgensen
2019-12-20 16:11 ` David Ahern
2019-12-20 22:17 ` Prashant Bhole
2019-12-23 6:05 ` Jason Wang
2019-12-23 8:09 ` Prashant Bhole
2019-12-23 8:34 ` Jason Wang
2019-12-23 11:06 ` Prashant Bhole
2019-12-18 16:29 ` David Ahern
2019-12-19 1:47 ` Prashant Bhole
2019-12-18 8:10 ` [RFC net-next 12/14] tun: add a way to inject tx path packet into Rx path Prashant Bhole
2019-12-18 8:10 ` [RFC net-next 13/14] tun: handle XDP_TX action of tx path XDP program Prashant Bhole
2019-12-18 8:10 ` [RFC net-next 14/14] tun: run xdp prog when tun is read from file interface Prashant Bhole
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=dd1f048f-3c51-79d0-73ff-a6b0c71f7925@redhat.com \
--to=jasowang@redhat.com \
--cc=alexei.starovoitov@gmail.com \
--cc=andriin@fb.com \
--cc=ast@kernel.org \
--cc=daniel@iogearbox.net \
--cc=davem@davemloft.net \
--cc=dsahern@gmail.com \
--cc=hawk@kernel.org \
--cc=ilias.apalodimas@linaro.org \
--cc=jakub.kicinski@netronome.com \
--cc=jbrouer@redhat.com \
--cc=john.fastabend@gmail.com \
--cc=kafai@fb.com \
--cc=mst@redhat.com \
--cc=netdev@vger.kernel.org \
--cc=prashantbhole.linux@gmail.com \
--cc=songliubraving@fb.com \
--cc=toke@redhat.com \
--cc=toshiaki.makita1@gmail.com \
--cc=yhs@fb.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).