From: Stanislav Fomichev <sdf@google.com>
To: Daniel Borkmann <daniel@iogearbox.net>
Cc: ast@kernel.org, andrii@kernel.org, martin.lau@linux.dev,
razor@blackwall.org, john.fastabend@gmail.com, kuba@kernel.org,
dxu@dxuuu.xyz, joe@cilium.io, toke@kernel.org,
davem@davemloft.net, bpf@vger.kernel.org,
netdev@vger.kernel.org
Subject: Re: [PATCH bpf-next v3 1/8] bpf: Add generic attach/detach/query API for multi-progs
Date: Mon, 10 Jul 2023 11:18:11 -0700
Message-ID: <ZKxLY3onuOHepOxt@google.com>
In-Reply-To: <d67ca0f4-4753-e86f-f8ca-dd515f941ea5@iogearbox.net>

On 07/10, Daniel Borkmann wrote:
> On 7/7/23 11:27 PM, Stanislav Fomichev wrote:
> > On 07/07, Daniel Borkmann wrote:
> [...]
> > > +static inline struct bpf_mprog_entry *
> > > +bpf_mprog_create(const size_t size, const off_t off)
> > > +{
> > > + struct bpf_mprog_bundle *bundle;
> > > + void *ptr;
> > > +
> > > + BUILD_BUG_ON(size < sizeof(*bundle) + off);
> > > + BUILD_BUG_ON(sizeof(bundle->a.fp_items[0]) > sizeof(u64));
> > > + BUILD_BUG_ON(ARRAY_SIZE(bundle->a.fp_items) !=
> > > + ARRAY_SIZE(bundle->cp_items));
> > > +
> > > + ptr = kzalloc(size, GFP_KERNEL);
> > > + if (ptr) {
> > > + bundle = ptr + off;
> > > + atomic64_set(&bundle->revision, 1);
> > > + bundle->off = off;
> > > + bundle->a.parent = bundle;
> > > + bundle->b.parent = bundle;
> > > + return &bundle->a;
> > > + }
> > > + return NULL;
> > > +}
> > > +
> > > +void bpf_mprog_free_rcu(struct rcu_head *rcu);
> > > +
> > > +static inline void bpf_mprog_free(struct bpf_mprog_entry *entry)
> > > +{
> > > + struct bpf_mprog_bundle *bundle = entry->parent;
> > > +
> > > + call_rcu(&bundle->rcu, bpf_mprog_free_rcu);
> > > +}
> >
> > Any reason we're doing allocation here? Why not do
> > bpf_mprog_init(struct bpf_mprog_bundle *) instead, which simply initializes
> > the fields? Then we can move the allocation/free part to the caller (tcx)
> > along with the rcu_head.
> > Feels like it would be a bit more conventional/readable? bpf_mprog_free{,_rcu}
> > will also become tcx_free{,_rcu}..
> >
> > I guess the current approach works, but it took me a while to figure it out..
> > (maybe it's just me)
>
> I found this approach quite useful for the tcx case since we only fetch the
> bpf_mprog_entry for tcx_link_prog_attach et al, but I can take a look to
> see if this looks better, and if it does I'll include it.
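For reference, what I had in mind is roughly the following (untested sketch;
the tcx_entry container and its bundle member are made-up names here):

static inline void bpf_mprog_init(struct bpf_mprog_bundle *bundle)
{
	/* Same invariants that bpf_mprog_create() establishes today,
	 * minus the allocation and the offset handling.
	 */
	atomic64_set(&bundle->revision, 1);
	bundle->a.parent = bundle;
	bundle->b.parent = bundle;
}

with the tcx side then doing something like:

	struct tcx_entry *tcx = kzalloc(sizeof(*tcx), GFP_KERNEL);

	if (!tcx)
		return -ENOMEM;
	bpf_mprog_init(&tcx->bundle);

and tcx_free{,_rcu} owning the rcu_head plus the kfree().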
>
> > > +static inline void bpf_mprog_mark_ref(struct bpf_mprog_entry *entry,
> > > + struct bpf_tuple *tuple)
> > > +{
> > > + WARN_ON_ONCE(entry->parent->ref);
> > > + if (!tuple->link)
> > > + entry->parent->ref = tuple->prog;
> > > +}
> > > +
> > > +static inline void bpf_mprog_inc(struct bpf_mprog_entry *entry)
> > > +{
> > > + entry->parent->count++;
> > > +}
> > > +
> > > +static inline void bpf_mprog_dec(struct bpf_mprog_entry *entry)
> > > +{
> > > + entry->parent->count--;
> > > +}
> > > +
> > > +static inline int bpf_mprog_max(void)
> > > +{
> > > + return ARRAY_SIZE(((struct bpf_mprog_entry *)NULL)->fp_items) - 1;
> > > +}
> > > +
> > > +static inline int bpf_mprog_total(struct bpf_mprog_entry *entry)
> > > +{
> > > + int total = entry->parent->count;
> > > +
> > > + WARN_ON_ONCE(total > bpf_mprog_max());
> > > + return total;
> > > +}
> > > +
> > > +static inline bool bpf_mprog_exists(struct bpf_mprog_entry *entry,
> > > + struct bpf_prog *prog)
> > > +{
> > > + const struct bpf_mprog_fp *fp;
> > > + const struct bpf_prog *tmp;
> > > +
> > > + bpf_mprog_foreach_prog(entry, fp, tmp) {
> > > + if (tmp == prog)
> > > + return true;
> > > + }
> > > + return false;
> > > +}
> > > +
> > > +static inline bool bpf_mprog_swap_entries(const int code)
> > > +{
> > > + return code == BPF_MPROG_SWAP ||
> > > + code == BPF_MPROG_FREE;
> > > +}
> > > +
> > > +static inline void bpf_mprog_commit(struct bpf_mprog_entry *entry)
> > > +{
> > > + atomic64_inc(&entry->parent->revision);
> > > + synchronize_rcu();
> >
> > Maybe add a comment on why we need to synchronize_rcu here? In general,
> > I don't think I have a good grasp of that ->ref member.
>
> Yeap, will add a comment. For the case where we delete the prog, we mark
> it in bpf_mprog_detach, but we can only drop the reference once the user
> has swapped the bpf_mprog_entry and ensured that there are no in-flight
> users, hence both happen in bpf_mprog_commit.
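Got it, thanks. So IIUC, with the comment added, the commit path would read
roughly like below (sketch based on your description, assuming ->ref holds
the prog marked for release in bpf_mprog_detach):

static inline void bpf_mprog_commit(struct bpf_mprog_entry *entry)
{
	struct bpf_mprog_bundle *bundle = entry->parent;

	/* Publish the new revision and wait until no in-flight users
	 * of the old entry remain before dropping the marked prog.
	 */
	atomic64_inc(&bundle->revision);
	synchronize_rcu();
	if (bundle->ref) {
		bpf_prog_put(bundle->ref);
		bundle->ref = NULL;
	}
}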
>
> [...]
> > > +static int bpf_mprog_prog(struct bpf_tuple *tuple,
> > > + u32 object, u32 flags,
> > > + enum bpf_prog_type type)
> > > +{
> > > + bool id = flags & BPF_F_ID;
> > > + struct bpf_prog *prog;
> > > +
> > > + if (id)
> > > + prog = bpf_prog_by_id(object);
> > > + else
> > > + prog = bpf_prog_get(object);
> > > + if (IS_ERR(prog)) {
> >
> > [..]
> >
> > > + if (!object && !id)
> > > + return 0;
> >
> > What's the reason behind this?
>
> If the passed fd is 0 (and BPF_F_ID was not set), then we don't error out
> but treat it as if no fd was passed.
Is this new API an opportunity to fix that fd==0 case and always treat it as
valid? Or do we have some other constraints elsewhere?
> > > + return PTR_ERR(prog);
> > > + }
> > > + if (type && prog->type != type) {
> > > + bpf_prog_put(prog);
> > > + return -EINVAL;
> > > + }
> > > +
> > > + tuple->link = NULL;
> > > + tuple->prog = prog;
> > > + return 0;
> > > +}
> [...]
> > > +static int bpf_mprog_pos_before(struct bpf_mprog_entry *entry,
> > > + struct bpf_tuple *tuple)
> > > +{
> > > + struct bpf_mprog_fp *fp;
> > > + struct bpf_mprog_cp *cp;
> > > + int i;
> > > +
> > > + for (i = 0; i < bpf_mprog_total(entry); i++) {
> > > + bpf_mprog_read(entry, i, &fp, &cp);
> > > + if (tuple->prog == READ_ONCE(fp->prog) &&
> >
> > Both attach/detach happen under rtnl, why do we need READ_ONCE? I'm assuming
> > even going forward, attach/detach from non-tcx places will happen
> > under lock?
> >
> > (same for bpf_mprog_pos_before/bpf_mprog_pos_after)
> >
> > Feels like the only place where we need WRITE_ONCE is the replace (in-place)
> > and READ_ONCE during fast-path. Why do we need the rest?
>
> Yes, the replace case is via WRITE_ONCE, hence the READ_ONCE annotations. You
> are saying that for the cases where we are under lock we should just drop the
> READ_ONCE annotations? I can do that ofc, but I thought the general convention
> was to do {READ,WRITE}_ONCE consistently for the purpose of documenting
> fp->prog access.
I see, then let's maybe keep them. I was a bit confused because those
READ_ONCE are within a locked section, so I wasn't sure whether I was
missing something or it's working as intended :-)
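FWIW, spelling out the pairing just to document my own understanding
(fragments only, not a complete diff):

	/* updater side, in-place replace, serialized by rtnl: */
	WRITE_ONCE(fp->prog, nprog);

	/* reader side, fast path, no lock held: */
	prog = READ_ONCE(fp->prog);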
> > > + (!tuple->link || tuple->link == cp->link))
> > > + return i - 1;
> > > + }
> > > + return tuple->prog ? -ENOENT : -1;
> > > +}
> > > +
> > > +static int bpf_mprog_pos_after(struct bpf_mprog_entry *entry,
> > > + struct bpf_tuple *tuple)
> > > +{
> > > + struct bpf_mprog_fp *fp;
> > > + struct bpf_mprog_cp *cp;
> > > + int i;
> > > +
> > > + for (i = 0; i < bpf_mprog_total(entry); i++) {
> > > + bpf_mprog_read(entry, i, &fp, &cp);
> > > + if (tuple->prog == READ_ONCE(fp->prog) &&
> > > + (!tuple->link || tuple->link == cp->link))
> > > + return i + 1;
> > > + }
> > > + return tuple->prog ? -ENOENT : bpf_mprog_total(entry);
> > > +}
> > > +
> > > +int bpf_mprog_attach(struct bpf_mprog_entry *entry, struct bpf_prog *prog_new,
> > > + struct bpf_link *link, struct bpf_prog *prog_old,
> > > + u32 flags, u32 object, u64 revision)
> > > +{
> > > + struct bpf_tuple rtuple, ntuple = {
> > > + .prog = prog_new,
> > > + .link = link,
> > > + }, otuple = {
> > > + .prog = prog_old,
> > > + .link = link,
> > > + };
> > > + int ret, idx = -2, tidx;
> > > +
> > > + if (revision && revision != bpf_mprog_revision(entry))
> > > + return -ESTALE;
> > > + if (bpf_mprog_exists(entry, prog_new))
> > > + return -EEXIST;
> > > + ret = bpf_mprog_tuple_relative(&rtuple, object,
> > > + flags & ~BPF_F_REPLACE,
> > > + prog_new->type);
> > > + if (ret)
> > > + return ret;
> > > + if (flags & BPF_F_REPLACE) {
> > > + tidx = bpf_mprog_pos_exact(entry, &otuple);
> > > + if (tidx < 0) {
> > > + ret = tidx;
> > > + goto out;
> > > + }
> > > + idx = tidx;
> > > + }
> >
> > [..]
> >
> > > + if (flags & BPF_F_BEFORE) {
> > > + tidx = bpf_mprog_pos_before(entry, &rtuple);
> > > + if (tidx < -1 || (idx >= -1 && tidx != idx)) {
> > > + ret = tidx < -1 ? tidx : -EDOM;
> > > + goto out;
> > > + }
> > > + idx = tidx;
> > > + }
> > > + if (flags & BPF_F_AFTER) {
> > > + tidx = bpf_mprog_pos_after(entry, &rtuple);
> > > + if (tidx < 0 || (idx >= -1 && tidx != idx)) {
> > > + ret = tidx < 0 ? tidx : -EDOM;
> > > + goto out;
> > > + }
> > > + idx = tidx;
> > > + }
> >
> > There still seems to be some inter-dependency between F_BEFORE and F_AFTER?
> > IOW, looks like I can pass F_BEFORE|F_AFTER|F_REPLACE. Do we need that?
> > Why not make the cases exclusive?
>
> I reworked this as per Andrii's suggestion/preference from v2 [0], iow, to
> calculate the target index and bail out if the request cannot be resolved
> into a common index.
>
> [0] https://lore.kernel.org/bpf/CAEf4BzbsUMnP7WMm3OmJznvD2b03B1qASFRNiDoVAU6XvvTZNA@mail.gmail.com/
SG! Let's maybe put a summary in the header of what the expectation is when
combining them?
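Maybe something along these lines (my reading of the v3 code, please correct
me if it's off):

/* Combining BPF_F_REPLACE, BPF_F_BEFORE and/or BPF_F_AFTER is allowed
 * only if every flag resolves the request to the same target index;
 * otherwise the attach fails with -EDOM.
 */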