From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 7713EC433FE for ; Fri, 7 Oct 2022 17:20:51 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229495AbiJGRUt (ORCPT ); Fri, 7 Oct 2022 13:20:49 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:44376 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229509AbiJGRUt (ORCPT ); Fri, 7 Oct 2022 13:20:49 -0400 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 10F90C8208 for ; Fri, 7 Oct 2022 10:20:47 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1665163247; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=e8/cc3rA0x7qRcg5j7SUr5NIuevAztPGZBNjEBn3naY=; b=FIV/z48NsPmmNQa7NXbcUTf5or/DkdLgVAKSnv5HYcMFdWotCXEiZWkIn9kkcfK0JlzZ67 f4R4YpLwciNcRmbJqatjkuJsBndHWLHucM6e88Fk9VijbkmuGexY8rS4FQ/p0IK12V6u6p 6sd84QnfbWWtwk8kZFvCyQxL2/3ccXc= Received: from mail-ej1-f72.google.com (mail-ej1-f72.google.com [209.85.218.72]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_128_GCM_SHA256) id us-mta-411-zF3PqA-hNTCXHxzhf5o12Q-1; Fri, 07 Oct 2022 13:20:45 -0400 X-MC-Unique: zF3PqA-hNTCXHxzhf5o12Q-1 Received: by mail-ej1-f72.google.com with SMTP id qw31-20020a1709066a1f00b00783d9fd7df2so3143836ejc.17 for ; Fri, 07 Oct 2022 10:20:45 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=content-transfer-encoding:mime-version:message-id:date:references :in-reply-to:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=e8/cc3rA0x7qRcg5j7SUr5NIuevAztPGZBNjEBn3naY=; b=OEJOP48Ngwr9U3gLkf6kTkr6xbotmWmUdJs9AV+wadRRzbuImiMgxjOp/WrJrKSqEO chNP2TfHs+f539/cI9Vo9FzQUIa1ntD4v/aD1cHtJkTajlY28d2QwQfqvOTqXJQKuH4G sMNLBe2SKhtrPkfHdHcSzKrKAX/k5Tz6DqZK8vTHPb+fZRQy7BR6pjU3vtY3aJrAZ251 RYtC11lHlQDpktqIOvNfz5D90KM6hbV2PeKdvuYXczPe2drBq+lbBAth8LwlQyy+oyKY uussu3eqybwF9E0dWRBu/1V6xyxabZhSxg6+v3vWQaPSk5+IcNHH6aQQf9kmRAMHm5c5 XmOA== X-Gm-Message-State: ACrzQf2RX4c81dYB56h2KCxk6XA+K9vcqjaz3S0F6sPfEklb6xcBWvm6 IWst3VLAerEfJ9cik700oVNKJXGDkNtQPUZ4OF6/T1+n+IqsJo+vxYlIMjYpaV2FXr0ecn/n7Jf QHufsLRg1Z/Jh X-Received: by 2002:aa7:c956:0:b0:43b:206d:c283 with SMTP id h22-20020aa7c956000000b0043b206dc283mr5478816edt.381.1665163244098; Fri, 07 Oct 2022 10:20:44 -0700 (PDT) X-Google-Smtp-Source: AMsMyM4DHf7zqccBgUo5jcwC5NhJfg0rUqOGAnV5SmPRzvLCg8EvG4sRaprT7t6k+hX/RytvDliRlQ== X-Received: by 2002:aa7:c956:0:b0:43b:206d:c283 with SMTP id h22-20020aa7c956000000b0043b206dc283mr5478645edt.381.1665163241576; Fri, 07 Oct 2022 10:20:41 -0700 (PDT) Received: from alrua-x1.borgediget.toke.dk ([2a0c:4d80:42:443::2]) by smtp.gmail.com with ESMTPSA id v29-20020a50d09d000000b004580296bb0bsm1831561edd.83.2022.10.07.10.20.40 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 07 Oct 2022 10:20:40 -0700 (PDT) Received: by alrua-x1.borgediget.toke.dk (Postfix, from userid 1000) id 29FE064EF1A; Fri, 7 Oct 2022 19:20:40 +0200 (CEST) From: Toke =?utf-8?Q?H=C3=B8iland-J=C3=B8rgensen?= To: sdf@google.com Cc: Daniel Borkmann , Alexei Starovoitov , bpf , Nikolay Aleksandrov , Alexei Starovoitov , Andrii Nakryiko , Martin KaFai Lau , John Fastabend , Joanne Koong , Kumar Kartikeya Dwivedi , Joe Stringer , Network Development Subject: Re: [PATCH bpf-next 01/10] bpf: Add initial fd-based API to attach tc BPF programs In-Reply-To: References: <20221004231143.19190-1-daniel@iogearbox.net> <20221004231143.19190-2-daniel@iogearbox.net> <20221006050053.pbwo72xtzoza6gfl@macbook-pro-4.dhcp.thefacebook.com> <14f368eb-9158-68bc-956c-c8371cfcb531@iogearbox.net> <875ygvemau.fsf@toke.dk> X-Clacks-Overhead: GNU Terry Pratchett Date: Fri, 07 Oct 2022 19:20:40 +0200 Message-ID: <87tu4fczyv.fsf@toke.dk> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: bpf@vger.kernel.org sdf@google.com writes: > On 10/07, Toke H=EF=BF=BDiland-J=EF=BF=BDrgensen wrote: >> Daniel Borkmann writes: > >> > On 10/7/22 1:28 AM, Alexei Starovoitov wrote: >> >> On Thu, Oct 6, 2022 at 2:29 PM Daniel Borkmann = =20=20 >> wrote: >> >>> On 10/6/22 7:00 AM, Alexei Starovoitov wrote: >> >>>> On Wed, Oct 05, 2022 at 01:11:34AM +0200, Daniel Borkmann wrote: >> >>> [...] >> >>>> >> >>>> I cannot help but feel that prio logic copy-paste from old tc,=20=20 >> netfilter and friends >> >>>> is done because "that's how things were done in the past". >> >>>> imo it was a well intentioned mistake and all networking things (tc= ,=20=20 >> netfilter, etc) >> >>>> copy-pasted that cumbersome and hard to use concept. >> >>>> Let's throw away that baggage? >> >>>> In good set of cases the bpf prog inserter cares whether the prog i= s=20=20 >> first or not. >> >>>> Since the first prog returning anything but TC_NEXT will be final. >> >>>> I think prog insertion flags: 'I want to run first' vs 'I don't car= e=20=20 >> about order' >> >>>> is good enough in practice. Any complex scheme should probably be= =20=20 >> programmable >> >>>> as any policy should. For example in Meta we have 'xdp chainer'=20= =20 >> logic that is similar >> >>>> to libxdp chaining, but we added a feature that allows a prog to=20= =20 >> jump over another >> >>>> prog and continue the chain. Priority concept cannot express that. >> >>>> Since we'd have to add some "policy program" anyway for use cases= =20=20 >> like this >> >>>> let's keep things as simple as possible? >> >>>> Then maybe we can adopt this "as-simple-as-possible" to XDP hooks ? >> >>>> And allow bpf progs chaining in the kernel with "run_me_first"=20=20 >> vs "run_me_anywhere" >> >>>> in both tcx and xdp ? >> >>>> Naturally "run_me_first" prog will be the only one. No need for=20= =20 >> F_REPLACE flags, etc. >> >>>> The owner of "run_me_first" will update its prog through=20=20 >> bpf_link_update. >> >>>> "run_me_anywhere" will add to the end of the chain. >> >>>> In XDP for compatibility reasons "run_me_first" will be the default. >> >>>> Since only one prog can be enqueued with such flag it will match=20= =20 >> existing single prog behavior. >> >>>> Well behaving progs will use (like xdp-tcpdump or monitoring progs)= =20=20 >> will use "run_me_anywhere". >> >>>> I know it's far from covering plenty of cases that we've discussed= =20=20 >> for long time, >> >>>> but prio concept isn't really covering them either. >> >>>> We've struggled enough with single xdp prog, so certainly not=20=20 >> advocating for that. >> >>>> Another alternative is to do: "queue_at_head" vs "queue_at_tail".= =20=20 >> Just as simple. >> >>>> Both simple versions have their pros and cons and don't cover=20=20 >> everything, >> >>>> but imo both are better than prio. >> >>> >> >>> Yeah, it's kind of tricky, imho. The 'run_me_first'=20=20 >> vs 'run_me_anywhere' are two >> >>> use cases that should be covered (and actually we kind of do this in= =20=20 >> this set, too, >> >>> with the prios via prio=3Dx vs prio=3D0). Given users will only be= =20=20 >> consuming the APIs >> >>> via libs like libbpf, this can also be abstracted this way w/o users= =20=20 >> having to be >> >>> aware of prios. >> >> >> >> but the patchset tells different story. >> >> Prio gets exposed everywhere in uapi all the way to bpftool >> >> when it's right there for users to understand. >> >> And that's the main problem with it. >> >> The user don't want to and don't need to be aware of it, >> >> but uapi forces them to pick the priority. >> >> >> >>> Anyway, where it gets tricky would be when things depend on ordering, >> >>> e.g. you have BPF progs doing: policy, monitoring, lb, monitoring,= =20=20 >> encryption, which >> >>> would be sth you can build today via tc BPF: so policy one acts as a= =20=20 >> prefilter for >> >>> various cidr ranges that should be blocked no matter what, then=20=20 >> monitoring to sample >> >>> what goes into the lb, then lb itself which does snat/dnat, then=20= =20 >> monitoring to see what >> >>> the corresponding pkt looks that goes to backend, and maybe=20=20 >> encryption to e.g. send >> >>> the result to wireguard dev, so it's encrypted from lb node to=20=20 >> backend. >> >> >> >> That's all theory. Your cover letter example proves that in >> >> real life different service pick the same priority. >> >> They simply don't know any better. >> >> prio is an unnecessary magic that apps _have_ to pick, >> >> so they just copy-paste and everyone ends up using the same. >> >> >> >>> For such >> >>> example, you'd need prios as the 'run_me_anywhere' doesn't guarantee= =20=20 >> order, so there's >> >>> a case for both scenarios (concrete layout vs loose one), and for=20= =20 >> latter we could >> >>> start off with and internal prio around x (e.g. 16k), so there's roo= m=20=20 >> to attach in >> >>> front via fixed prio, but also append to end for 'don't care', and= =20=20 >> that could be >> >>> from lib pov the default/main API whereas prio would be some kind of= =20=20 >> extended one. >> >>> Thoughts? >> >> >> >> If prio was not part of uapi, like kernel internal somehow, >> >> and there was a user space daemon, systemd, or another bpf prog, >> >> module, whatever that users would interface to then >> >> the proposed implementation of prio would totally make sense. >> >> prio as uapi is not that. >> > >> > A good analogy to this issue might be systemd's unit files.. you=20=20 >> specify dependencies >> > for your own file via 'Wants=3D', and ordering=20=20 >> via 'Before=3D' and >> > 'After=3D' and they refer to other unit files. I think that is= =20=20 >> generally okay, >> > you don't deal with prio numbers, but rather some kind textual=20=20 >> representation. However >> > user/operator will have to deal with dependencies/ordering one way or= =20=20 >> another, the >> > problem here is that we deal with kernel and loader talks to kernel=20= =20 >> directly so it >> > has no awareness of what else is running or could be running, so apps= =20=20 >> needs to deal >> > with it somehow (and it cannot without external help). > >> I was thinking a little about how this might work; i.e., how can the >> kernel expose the required knobs to allow a system policy to be >> implemented without program loading having to talk to anything other >> than the syscall API? > >> How about we only expose prepend/append in the prog attach UAPI, and >> then have a kernel function that does the sorting like: > >> int bpf_add_new_tcx_prog(struct bpf_prog *progs, size_t num_progs, struc= t=20=20 >> bpf_prog *new_prog, bool append) > >> where the default implementation just appends/prepends to the array in >> progs depending on the value of 'appen'. > >> And then use the __weak linking trick (or maybe struct_ops with a member >> for TXC, another for XDP, etc?) to allow BPF to override the function >> wholesale and implement whatever ordering it wants? I.e., allow it can >> to just shift around the order of progs in the 'progs' array whenever a >> program is loaded/unloaded? > >> This way, a userspace daemon can implement any policy it wants by just >> attaching to that hook, and keeping things like how to express >> dependencies as a userspace concern? > > What if we do the above, but instead of simple global 'attach first/last', > the default api would be: > > - attach before > - attach after > - attach before target_fd=3D-1 =3D=3D first > - attach after target_fd=3D-1 =3D=3D last > > ? Hmm, the problem with that is that applications don't generally have an fd to another application's BPF programs; and obtaining them from an ID is a privileged operation (CAP_SYS_ADMIN). We could have it be "attach before target *ID*" instead, which could work I guess? But then the problem becomes that it's racy: the ID you're targeting could get detached before you attach, so you'll need to be prepared to check that and retry; and I'm almost certain that applications won't test for this, so it'll just lead to hard-to-debug heisenbugs. Or am I being too pessimistic here? -Toke