public inbox for netdev@vger.kernel.org
 help / color / mirror / Atom feed
From: Kumar Kartikeya Dwivedi <memxor@gmail.com>
To: Cong Wang <xiyou.wangcong@gmail.com>
Cc: bpf <bpf@vger.kernel.org>, "Alexei Starovoitov" <ast@kernel.org>,
	"Daniel Borkmann" <daniel@iogearbox.net>,
	"Andrii Nakryiko" <andrii@kernel.org>,
	"Martin KaFai Lau" <kafai@fb.com>,
	"Song Liu" <songliubraving@fb.com>, "Yonghong Song" <yhs@fb.com>,
	"John Fastabend" <john.fastabend@gmail.com>,
	"KP Singh" <kpsingh@kernel.org>,
	"Jamal Hadi Salim" <jhs@mojatatu.com>,
	"Vlad Buslov" <vladbu@nvidia.com>,
	"Jiri Pirko" <jiri@resnulli.us>,
	"David S. Miller" <davem@davemloft.net>,
	"Jakub Kicinski" <kuba@kernel.org>,
	"Joe Stringer" <joe@cilium.io>,
	"Quentin Monnet" <quentin@isovalent.com>,
	"Jesper Dangaard Brouer" <brouer@redhat.com>,
	"Toke Høiland-Jørgensen" <toke@redhat.com>,
	"Linux Kernel Network Developers" <netdev@vger.kernel.org>
Subject: Re: [PATCH RFC bpf-next 0/7] Add bpf_link based TC-BPF API
Date: Tue, 8 Jun 2021 12:49:08 +0530	[thread overview]
Message-ID: <20210608071908.sos275adj3gunewo@apollo> (raw)
In-Reply-To: <CAM_iQpWA=SXNR3Ya8_L2aoVJGP_uaRP8EYCpDrnq3y8Uf6qu=g@mail.gmail.com>

On Tue, Jun 08, 2021 at 07:30:40AM IST, Cong Wang wrote:
> On Sun, Jun 6, 2021 at 11:08 PM Kumar Kartikeya Dwivedi
> <memxor@gmail.com> wrote:
> >
> > On Mon, Jun 07, 2021 at 10:48:04AM IST, Cong Wang wrote:
> > > On Sun, Jun 6, 2021 at 8:38 PM Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote:
> > > >
> > > > On Mon, Jun 07, 2021 at 05:07:28AM IST, Cong Wang wrote:
> > > > > On Fri, May 28, 2021 at 1:00 PM Kumar Kartikeya Dwivedi
> > > > > <memxor@gmail.com> wrote:
> > > > > >
> > > > > > This is the first RFC version.
> > > > > >
> > > > > > This adds a bpf_link path to create TC filters tied to cls_bpf classifier, and
> > > > > > introduces fd based ownership for such TC filters. Netlink cannot delete or
> > > > > > replace such filters, but the bpf_link is severed on indirect destruction of the
> > > > > > filter (backing qdisc being deleted, or chain being flushed, etc.). To ensure
> > > > > > that filters remain attached beyond process lifetime, the usual bpf_link fd
> > > > > > pinning approach can be used.
> > > > >
> > > > > I have some troubles understanding this. So... why TC filter is so special
> > > > > here that it deserves such a special treatment?
> > > > >
> > > >
> > > > So the motivation behind this was automatic cleanup of filters installed by some
> > > > program. Usually from the userspace side you need a bunch of things (handle,
> > > > priority, protocol, chain_index, etc.) to be able to delete a filter without
> > > > stepping on others' toes. Also, there is no gurantee that filter hasn't been
> > > > replaced, so you need to check some other way (either tag or prog_id, but these
> > > > are also not perfect).
> > > >
> > > > bpf_link provides isolation from netlink and fd-based lifetime management. As
> > > > for why it needs special treatment (by which I guess you mean why it _creates_
> > > > an object instead of simply attaching to one, see below):
> > >
> > > Are you saying TC filter is not independent? IOW, it has to rely on TC qdisc
> > > to exist. This is true, and is of course different with netns/cgroup.
> > > This is perhaps
> > > not hard to solve, because TC actions are already independent, we can perhaps
> > > convert TC filters too (the biggest blocker is compatibility).
> > >
> >
> > True, but that would mean you need some way to create a detached TC filter, correct?
> > Can you give some ideas on how the setup would look like from userspace side?
> >
> > IIUC you mean
> >
> > RTM_NEWTFILTER (with kind == bpf) parent == SOME_MAGIC_DETACHED ifindex == INVALID
> >
> > then bpf_link comes in and creates the binding to the qdisc, parent, prio,
> > chain, handle ... ?
>
> Roughly yes, except creation is still done by netlink, not bpf_link. It is
> pretty much similar to those unbound TC actions.
>

Right, thanks for explaining. I will try to work on this and see if it works out.

> >
> > > Or do you just need an ephemeral representation of a TC filter which only exists
> > > for a process? If so, see below.
> > >
> > > >
> > > > > The reason why I ask is that none of other bpf links actually create any
> > > > > object, they merely attach bpf program to an existing object. For example,
> > > > > netns bpf_link does not create an netns, cgroup bpf_link does not create
> > > > > a cgroup either. So, why TC filter is so lucky to be the first one requires
> > > > > creating an object?
> > > > >
> > > >
> > > > They are created behind the scenes, but are also fairly isolated from netlink
> > > > (i.e.  can only be introspected, so not hidden in that sense, but are
> > > > effectively locked for replace/delete).
> > > >
> > > > The problem would be (of not creating a filter during attach) is that a typical
> > > > 'attach point' for TC exists in form of tcf_proto. If we take priority (protocol
> > > > is fixed) out of the equation, it becomes possible to attach just a single BPF
> > > > prog, but that seems like a needless limitation when TC already supports list of
> > > > filters at each 'attach point'.
> > > >
> > > > My point is that the created object (the tcf_proto under the 'chain' object) is
> > > > the attach point, and since there can be so many, keeping them around at all
> > > > times doesn't make sense, so the refcounted attach locations are created as
> > > > needed.  Both netlink and bpf_link owned filters can be attached under the same
> > > > location, with different ownership story in userspace.
> > >
> > > I do not understand "created behind the scenes". These are all created
> > > independent of bpf_link, right? For example, we create a cgroup with
> > > mkdir, then open it and pass the fd to bpf_link. Clearly, cgroup is not
> > > created by bpf_link or any bpf syscall.
> >
> > Sorry, that must be confusing. I was only referring to what this patch does.
> > Indeed, as far as implementation is concerned this has no precedence.
> >
> > >
> > > The only thing different is fd, or more accurately, an identifier to locate
> > > these objects. For example, ifindex can also be used to locate a netdev.
> > > We can certainly locate a TC filter with (prio,proto,handle) but this has to
> > > be passed via netlink. So if you need some locator, I think we can
> > > introduce a kernel API which takes all necessary parameters to locate
> > > a TC filter and return it to you. For a quick example, like this:
> > >
> > > struct tcf_proto *tcf_get_proto(struct net *net, int ifindex,
> > >                                 int parent, char* kind, int handle...);
> > >
> >
> > I think this already exists in some way, i.e. you can just ignore if filter
> > handle from tp->ops->get doesn't exist (reusing the exsiting code) that walks
> > from qdisc/block -> chain -> tcf_proto during creation.
>
> Right, except currently it requires a few API's to reach TC filters
> (first netdev,,
> then qdisc, finally filters). So, I think providing one API could at
> least address
> your "stepping on others toes" concern?
>
> >
> > > (Note, it can grab a refcnt in case of being deleted by others.)
> > >
> > > With this, you can simply call it in bpf_link, and attach bpf prog to tcf_proto
> > > (of course, only cls_bpf succeeds here).
> > >
> >
> > So IIUC, you are proposing to first create a filter normally using netlink, then
> > attach it using bpf_link to the proper parent? I.e. your main contention point
> > is to not create filter from bpf_link, instead take a filter and attach it to a
> > parent with bpf_link representing this attachment?
>
> Yes, to me I don't see a reason we want to create it from bpf_link.
>
> >
> > But then the created filter stays with refcount of 1 until RTM_DELTFILTER, i.e.
> > the lifetime of the attachment may be managed by bpf_link (in that we can detach
> > the filter from parent) but the filter itself will not be cleaned up. One of the
> > goals of tying TC filter to fd was to bind lifetime of filter itself, along with
> > attachment. Separating both doesn't seem to buy anything here. You always create
> > a filter to attach somewhere.
>
> This is really odd, for two reasons:
>
> 1) Why netdev does not have such problem? bpf_xdp_link_attach() uses
> ifindex to locate a netdev, without creating it or cleaning it either.
> So, why do we
> never want to bind a netdev to an fd? IOW, what makes TC filters' lifetime so
> different from netdev?
>

I think I tried to explain the difference, but I may have failed.

netdev does not have this problem because netdev is to XDP prog what qdisc is to
a SCHED_CLS prog. The filter is merely a way to hook into the qdisc. So we bind
the attachment's lifetime to the filter's lifetime, which in turn is controlled
by the bpf_link fd. When the filter is gone, the attachment to the qdisc is gone.

So we're not really creating a qdisc here, we're just tying the filter (which in
the current semantics exists only while attached) to the bpf_link. The filter is
the attachment, so tying its lifetime to bpf_link makes sense. When you destroy
the bpf_link, the filter goes away too, which means classification at that
hook (parent/class) in the qdisc stops working. This is why creating the filter
from the bpf_link made sense to me.

I hope you can see where I was going with this now.  Introducing a new kind of
method to attach to qdisc didn't seem wise to me, given all the infrastructure
already exists.

> 2) All existing bpf_link targets, except netdev, are fs based, hence an fd makes
> sense for them naturally. TC filters, or any other netlink based
> things, are not even
> related to fs, hence fd does not make sense here, like we never bind a netdev
> to a fd.
>

Yes, none of them create any objects. It is only a side effect of current
semantics that you are able to control the filter's lifetime using the bpf_link
as filter creation is also accompanied with its attachment to the qdisc.

Your unbound filter idea just separates the two. One will still end up creating
a cls_bpf_prog object internally in the kernel, just that it will now be
refcounted and be linked into multiple tcf_proto (based on how many bpf_link's
are attached).

Another additional responsibility of the user space is to now clean up these
unbound filters when it is done using them (either right after making a bpf_link
attachment so that it is removed on bpf_link destruction, or later), because
they don't sit under any chain etc. so a full flush of filters won't remove
them.

> >
> > With actions, things are different, you may create one action but bind it to
> > multiple filters, so actions existing as their own thing makes sense. A single
> > action can serve multiple filters, and save on memory.
> >
> > You could argue that even with filters this is true, as you may want to attach
> > the same filter to multiple qdiscs, but we already have a facility to do that
> > (shared tcf_block with block->q == NULL). However that is not as flexible as
> > what you are proposing.
>
> True. I think making TC filters as standalone as TC actions is a right
> direction,
> if it helps you too.
>
> >
> > It may be odd from the kernel side but to userspace a parent, prio, handle (we
> > don't let user choose anything else for now) is itself the attach point, how
> > bpf_link manages the attachment internally isn't really that interesting. It
> > does so now by way of creating an object that represents a certain hook, then
> > binding the BPF prog to it. I consider this mostly an implementation detail.
> > What you are really attaching to is the qdisc/block, which is the resource
> > analogous to cgroup fd, netns fd, and ifindex, and 'where' is described by other
> > attributes.
>
> How do you establish the analogy here? cgroup and netns are fs based,
> having an fd is natural. ifindex is not an fd, it is a locator for netdev. Plus,
> current bpf_link code does not create any of them.
>
> Thanks.

--
Kartikeya

  reply	other threads:[~2021-06-08  7:21 UTC|newest]

Thread overview: 45+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-05-28 19:59 [PATCH RFC bpf-next 0/7] Add bpf_link based TC-BPF API Kumar Kartikeya Dwivedi
2021-05-28 19:59 ` [PATCH RFC bpf-next 1/7] net: sched: refactor cls_bpf creation code Kumar Kartikeya Dwivedi
2021-05-28 19:59 ` [PATCH RFC bpf-next 2/7] bpf: export bpf_link functions for modules Kumar Kartikeya Dwivedi
2021-05-28 19:59 ` [PATCH RFC bpf-next 3/7] net: sched: add bpf_link API for bpf classifier Kumar Kartikeya Dwivedi
2021-06-02 20:56   ` Andrii Nakryiko
2021-05-28 19:59 ` [PATCH RFC bpf-next 4/7] net: sched: add lightweight update path for cls_bpf Kumar Kartikeya Dwivedi
2021-05-28 19:59 ` [PATCH RFC bpf-next 5/7] tools: bpf.h: sync with kernel sources Kumar Kartikeya Dwivedi
2021-05-28 19:59 ` [PATCH RFC bpf-next 6/7] libbpf: add bpf_link based TC-BPF management API Kumar Kartikeya Dwivedi
2021-06-02 21:03   ` Andrii Nakryiko
2021-05-28 19:59 ` [PATCH RFC bpf-next 7/7] libbpf: add selftest for " Kumar Kartikeya Dwivedi
2021-06-02 21:09 ` [PATCH RFC bpf-next 0/7] Add bpf_link based TC-BPF API Andrii Nakryiko
2021-06-02 21:45   ` Kumar Kartikeya Dwivedi
2021-06-02 23:50     ` Alexei Starovoitov
2021-06-04  6:43       ` Kumar Kartikeya Dwivedi
2021-06-06 23:37 ` Cong Wang
2021-06-07  3:37   ` Kumar Kartikeya Dwivedi
2021-06-07  5:18     ` Cong Wang
2021-06-07  6:07       ` Kumar Kartikeya Dwivedi
2021-06-08  2:00         ` Cong Wang
2021-06-08  7:19           ` Kumar Kartikeya Dwivedi [this message]
2021-06-08 15:39             ` Alexei Starovoitov
2021-06-11  2:10               ` Cong Wang
2021-06-11  2:00             ` Cong Wang
2021-06-13  2:53               ` Kumar Kartikeya Dwivedi
2021-06-13 20:27                 ` Jamal Hadi Salim
2021-06-13 20:34                   ` Kumar Kartikeya Dwivedi
2021-06-13 21:10                     ` Jamal Hadi Salim
2021-06-14 13:03                       ` Marcelo Ricardo Leitner
2021-06-15 23:07                       ` Daniel Borkmann
2021-06-16 14:40                         ` Jamal Hadi Salim
2021-06-16 15:32                           ` Kumar Kartikeya Dwivedi
2021-06-16 16:00                             ` Daniel Borkmann
2021-06-18 11:40                               ` Jamal Hadi Salim
2021-06-18 14:38                                 ` Alexei Starovoitov
2021-06-18 14:50                                   ` Jamal Hadi Salim
2021-06-18 16:23                                     ` Alexei Starovoitov
2021-06-18 16:41                                       ` Jamal Hadi Salim
2021-06-18 22:42                                 ` Daniel Borkmann
2021-06-21 13:55                                   ` Jamal Hadi Salim
2021-06-15  4:33                 ` Cong Wang
2021-06-15 11:54                   ` Toke Høiland-Jørgensen
2021-06-15 23:44                     ` Daniel Borkmann
2021-06-16 12:03                       ` Toke Høiland-Jørgensen
2021-06-16 15:33                       ` Jamal Hadi Salim
2021-06-13  3:08               ` Kumar Kartikeya Dwivedi

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20210608071908.sos275adj3gunewo@apollo \
    --to=memxor@gmail.com \
    --cc=andrii@kernel.org \
    --cc=ast@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=brouer@redhat.com \
    --cc=daniel@iogearbox.net \
    --cc=davem@davemloft.net \
    --cc=jhs@mojatatu.com \
    --cc=jiri@resnulli.us \
    --cc=joe@cilium.io \
    --cc=john.fastabend@gmail.com \
    --cc=kafai@fb.com \
    --cc=kpsingh@kernel.org \
    --cc=kuba@kernel.org \
    --cc=netdev@vger.kernel.org \
    --cc=quentin@isovalent.com \
    --cc=songliubraving@fb.com \
    --cc=toke@redhat.com \
    --cc=vladbu@nvidia.com \
    --cc=xiyou.wangcong@gmail.com \
    --cc=yhs@fb.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox