netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: sdf@google.com
To: "Toke Høiland-Jørgensen" <toke@redhat.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>,
	Alexei Starovoitov <alexei.starovoitov@gmail.com>,
	bpf <bpf@vger.kernel.org>,
	Nikolay Aleksandrov <razor@blackwall.org>,
	Alexei Starovoitov <ast@kernel.org>,
	Andrii Nakryiko <andrii@kernel.org>,
	Martin KaFai Lau <martin.lau@linux.dev>,
	John Fastabend <john.fastabend@gmail.com>,
	Joanne Koong <joannelkoong@gmail.com>,
	Kumar Kartikeya Dwivedi <memxor@gmail.com>,
	Joe Stringer <joe@cilium.io>,
	Network Development <netdev@vger.kernel.org>
Subject: Re: [PATCH bpf-next 01/10] bpf: Add initial fd-based API to attach tc BPF programs
Date: Fri, 7 Oct 2022 09:55:33 -0700	[thread overview]
Message-ID: <Y0BaBUWeTj18V5Xp@google.com> (raw)
In-Reply-To: <875ygvemau.fsf@toke.dk>

On 10/07, Toke H�iland-J�rgensen wrote:
> Daniel Borkmann <daniel@iogearbox.net> writes:

> > On 10/7/22 1:28 AM, Alexei Starovoitov wrote:
> >> On Thu, Oct 6, 2022 at 2:29 PM Daniel Borkmann <daniel@iogearbox.net>  
> wrote:
> >>> On 10/6/22 7:00 AM, Alexei Starovoitov wrote:
> >>>> On Wed, Oct 05, 2022 at 01:11:34AM +0200, Daniel Borkmann wrote:
> >>> [...]
> >>>>
> >>>> I cannot help but feel that prio logic copy-paste from old tc,  
> netfilter and friends
> >>>> is done because "that's how things were done in the past".
> >>>> imo it was a well intentioned mistake and all networking things (tc,  
> netfilter, etc)
> >>>> copy-pasted that cumbersome and hard to use concept.
> >>>> Let's throw away that baggage?
> >>>> In good set of cases the bpf prog inserter cares whether the prog is  
> first or not.
> >>>> Since the first prog returning anything but TC_NEXT will be final.
> >>>> I think prog insertion flags: 'I want to run first' vs 'I don't care  
> about order'
> >>>> is good enough in practice. Any complex scheme should probably be  
> programmable
> >>>> as any policy should. For example in Meta we have 'xdp chainer'  
> logic that is similar
> >>>> to libxdp chaining, but we added a feature that allows a prog to  
> jump over another
> >>>> prog and continue the chain. Priority concept cannot express that.
> >>>> Since we'd have to add some "policy program" anyway for use cases  
> like this
> >>>> let's keep things as simple as possible?
> >>>> Then maybe we can adopt this "as-simple-as-possible" to XDP hooks ?
> >>>> And allow bpf progs chaining in the kernel with "run_me_first"  
> vs "run_me_anywhere"
> >>>> in both tcx and xdp ?
> >>>> Naturally "run_me_first" prog will be the only one. No need for  
> F_REPLACE flags, etc.
> >>>> The owner of "run_me_first" will update its prog through  
> bpf_link_update.
> >>>> "run_me_anywhere" will add to the end of the chain.
> >>>> In XDP for compatibility reasons "run_me_first" will be the default.
> >>>> Since only one prog can be enqueued with such flag it will match  
> existing single prog behavior.
> >>>> Well behaving progs will use (like xdp-tcpdump or monitoring progs)  
> will use "run_me_anywhere".
> >>>> I know it's far from covering plenty of cases that we've discussed  
> for long time,
> >>>> but prio concept isn't really covering them either.
> >>>> We've struggled enough with single xdp prog, so certainly not  
> advocating for that.
> >>>> Another alternative is to do: "queue_at_head" vs "queue_at_tail".  
> Just as simple.
> >>>> Both simple versions have their pros and cons and don't cover  
> everything,
> >>>> but imo both are better than prio.
> >>>
> >>> Yeah, it's kind of tricky, imho. The 'run_me_first'  
> vs 'run_me_anywhere' are two
> >>> use cases that should be covered (and actually we kind of do this in  
> this set, too,
> >>> with the prios via prio=x vs prio=0). Given users will only be  
> consuming the APIs
> >>> via libs like libbpf, this can also be abstracted this way w/o users  
> having to be
> >>> aware of prios.
> >>
> >> but the patchset tells different story.
> >> Prio gets exposed everywhere in uapi all the way to bpftool
> >> when it's right there for users to understand.
> >> And that's the main problem with it.
> >> The user don't want to and don't need to be aware of it,
> >> but uapi forces them to pick the priority.
> >>
> >>> Anyway, where it gets tricky would be when things depend on ordering,
> >>> e.g. you have BPF progs doing: policy, monitoring, lb, monitoring,  
> encryption, which
> >>> would be sth you can build today via tc BPF: so policy one acts as a  
> prefilter for
> >>> various cidr ranges that should be blocked no matter what, then  
> monitoring to sample
> >>> what goes into the lb, then lb itself which does snat/dnat, then  
> monitoring to see what
> >>> the corresponding pkt looks that goes to backend, and maybe  
> encryption to e.g. send
> >>> the result to wireguard dev, so it's encrypted from lb node to  
> backend.
> >>
> >> That's all theory. Your cover letter example proves that in
> >> real life different service pick the same priority.
> >> They simply don't know any better.
> >> prio is an unnecessary magic that apps _have_ to pick,
> >> so they just copy-paste and everyone ends up using the same.
> >>
> >>> For such
> >>> example, you'd need prios as the 'run_me_anywhere' doesn't guarantee  
> order, so there's
> >>> a case for both scenarios (concrete layout vs loose one), and for  
> latter we could
> >>> start off with and internal prio around x (e.g. 16k), so there's room  
> to attach in
> >>> front via fixed prio, but also append to end for 'don't care', and  
> that could be
> >>> from lib pov the default/main API whereas prio would be some kind of  
> extended one.
> >>> Thoughts?
> >>
> >> If prio was not part of uapi, like kernel internal somehow,
> >> and there was a user space daemon, systemd, or another bpf prog,
> >> module, whatever that users would interface to then
> >> the proposed implementation of prio would totally make sense.
> >> prio as uapi is not that.
> >
> > A good analogy to this issue might be systemd's unit files.. you  
> specify dependencies
> > for your own <unit> file via 'Wants=<unitA>', and ordering  
> via 'Before=<unitB>' and
> > 'After=<unitC>' and they refer to other unit files. I think that is  
> generally okay,
> > you don't deal with prio numbers, but rather some kind textual  
> representation. However
> > user/operator will have to deal with dependencies/ordering one way or  
> another, the
> > problem here is that we deal with kernel and loader talks to kernel  
> directly so it
> > has no awareness of what else is running or could be running, so apps  
> needs to deal
> > with it somehow (and it cannot without external help).

> I was thinking a little about how this might work; i.e., how can the
> kernel expose the required knobs to allow a system policy to be
> implemented without program loading having to talk to anything other
> than the syscall API?

> How about we only expose prepend/append in the prog attach UAPI, and
> then have a kernel function that does the sorting like:

> int bpf_add_new_tcx_prog(struct bpf_prog *progs, size_t num_progs, struct  
> bpf_prog *new_prog, bool append)

> where the default implementation just appends/prepends to the array in
> progs depending on the value of 'appen'.

> And then use the __weak linking trick (or maybe struct_ops with a member
> for TXC, another for XDP, etc?) to allow BPF to override the function
> wholesale and implement whatever ordering it wants? I.e., allow it can
> to just shift around the order of progs in the 'progs' array whenever a
> program is loaded/unloaded?

> This way, a userspace daemon can implement any policy it wants by just
> attaching to that hook, and keeping things like how to express
> dependencies as a userspace concern?

What if we do the above, but instead of simple global 'attach first/last',
the default api would be:

- attach before <target_fd>
- attach after <target_fd>
- attach before target_fd=-1 == first
- attach after target_fd=-1 == last

?

That might be flexible enough by default to allow users to
append/prepend to any existing program in the chain (say, for
monitoring). Flexible enough for some central daemons to do
systemd-style policy. And, with bpf_add_new_tcx_prog, flexible
enough to implement any policy?

  reply	other threads:[~2022-10-07 16:55 UTC|newest]

Thread overview: 62+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-10-04 23:11 [PATCH bpf-next 00/10] BPF link support for tc BPF programs Daniel Borkmann
2022-10-04 23:11 ` [PATCH bpf-next 01/10] bpf: Add initial fd-based API to attach " Daniel Borkmann
2022-10-05  0:55   ` sdf
2022-10-05 10:50     ` Toke Høiland-Jørgensen
2022-10-05 14:48       ` Daniel Borkmann
2022-10-05 12:35     ` Daniel Borkmann
2022-10-05 17:56       ` sdf
2022-10-05 18:21         ` Daniel Borkmann
2022-10-05 10:33   ` Toke Høiland-Jørgensen
2022-10-05 12:47     ` Daniel Borkmann
2022-10-05 14:32       ` Toke Høiland-Jørgensen
2022-10-05 14:53         ` Daniel Borkmann
2022-10-05 19:04   ` Jamal Hadi Salim
2022-10-06 20:49     ` Daniel Borkmann
2022-10-07 15:36       ` Jamal Hadi Salim
2022-10-06  0:22   ` Andrii Nakryiko
2022-10-06  5:00   ` Alexei Starovoitov
2022-10-06 14:40     ` Jamal Hadi Salim
2022-10-06 23:29       ` Alexei Starovoitov
2022-10-07 15:43         ` Jamal Hadi Salim
2022-10-06 21:29     ` Daniel Borkmann
2022-10-06 23:28       ` Alexei Starovoitov
2022-10-07 13:26         ` Daniel Borkmann
2022-10-07 14:32           ` Toke Høiland-Jørgensen
2022-10-07 16:55             ` sdf [this message]
2022-10-07 17:20               ` Toke Høiland-Jørgensen
2022-10-07 18:11                 ` sdf
2022-10-07 19:06                   ` Daniel Borkmann
2022-10-07 18:59                 ` Alexei Starovoitov
2022-10-07 19:37                   ` Daniel Borkmann
2022-10-07 22:45                     ` sdf
2022-10-07 23:41                       ` Alexei Starovoitov
2022-10-07 23:34                     ` Alexei Starovoitov
2022-10-08 11:38                       ` Toke Høiland-Jørgensen
2022-10-08 20:38                         ` Alexei Starovoitov
2022-10-13 18:30                           ` Andrii Nakryiko
2022-10-14 15:38                             ` Alexei Starovoitov
2022-10-27  9:01                               ` Daniel Xu
2022-10-06 20:15   ` Martin KaFai Lau
2022-10-06 20:54   ` Martin KaFai Lau
2022-10-04 23:11 ` [PATCH bpf-next 02/10] bpf: Implement BPF link handling for " Daniel Borkmann
2022-10-06  3:19   ` Andrii Nakryiko
2022-10-06 20:54     ` Daniel Borkmann
2022-10-06 17:56   ` Martin KaFai Lau
2022-10-06 20:10   ` Martin KaFai Lau
2022-10-04 23:11 ` [PATCH bpf-next 03/10] bpf: Implement link update for tc BPF link programs Daniel Borkmann
2022-10-06  3:19   ` Andrii Nakryiko
2022-10-04 23:11 ` [PATCH bpf-next 04/10] bpf: Implement link introspection " Daniel Borkmann
2022-10-06  3:19   ` Andrii Nakryiko
2022-10-06 23:14   ` Martin KaFai Lau
2022-10-04 23:11 ` [PATCH bpf-next 05/10] bpf: Implement link detach " Daniel Borkmann
2022-10-06  3:19   ` Andrii Nakryiko
2022-10-06 23:24   ` Martin KaFai Lau
2022-10-04 23:11 ` [PATCH bpf-next 06/10] libbpf: Change signature of bpf_prog_query Daniel Borkmann
2022-10-06  3:19   ` Andrii Nakryiko
2022-10-04 23:11 ` [PATCH bpf-next 07/10] libbpf: Add extended attach/detach opts Daniel Borkmann
2022-10-06  3:19   ` Andrii Nakryiko
2022-10-04 23:11 ` [PATCH bpf-next 08/10] libbpf: Add support for BPF tc link Daniel Borkmann
2022-10-06  3:19   ` Andrii Nakryiko
2022-10-04 23:11 ` [PATCH bpf-next 09/10] bpftool: Add support for tc fd-based attach types Daniel Borkmann
2022-10-04 23:11 ` [PATCH bpf-next 10/10] bpf, selftests: Add various BPF tc link selftests Daniel Borkmann
2022-10-06  3:19   ` Andrii Nakryiko

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Y0BaBUWeTj18V5Xp@google.com \
    --to=sdf@google.com \
    --cc=alexei.starovoitov@gmail.com \
    --cc=andrii@kernel.org \
    --cc=ast@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=daniel@iogearbox.net \
    --cc=joannelkoong@gmail.com \
    --cc=joe@cilium.io \
    --cc=john.fastabend@gmail.com \
    --cc=martin.lau@linux.dev \
    --cc=memxor@gmail.com \
    --cc=netdev@vger.kernel.org \
    --cc=razor@blackwall.org \
    --cc=toke@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).