Re: [LSF/MM/BPF TOPIC] bpf qdisc - Toke Høiland-Jørgensen

BPF List
 help / color / mirror / Atom feed

From: "Toke Høiland-Jørgensen" <toke@redhat.com>
To: Amery Hung <ameryhung@gmail.com>
Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>,
	lsf-pc@lists.linux-foundation.org, bpf@vger.kernel.org
Subject: Re: [LSF/MM/BPF TOPIC] bpf qdisc
Date: Fri, 01 Mar 2024 16:06:32 +0100	[thread overview]
Message-ID: <87a5ni9ekn.fsf@toke.dk> (raw)
In-Reply-To: <CAMB2axOvfVfFFrmAkJanpJN8-W1j+XmuJcsgzvd-9WRWeqrCEw@mail.gmail.com>

Amery Hung <ameryhung@gmail.com> writes:

> On Fri, Mar 1, 2024 at 6:08 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
>> Amery Hung <ameryhung@gmail.com> writes:
>>
>> > On Wed, Feb 28, 2024 at 6:36 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>> >>
>> >> Kumar Kartikeya Dwivedi <memxor@gmail.com> writes:
>> >>
>> >> > On Mon, 26 Feb 2024 at 19:04, Amery Hung <ameryhung@gmail.com> wrote:
>> >> >>
>> >> >> Hi all,
>> >> >>
>> >> >> I would like to discuss bpf qdisc in the BPF track. As we now try to
>> >> >> support bpf qdisc using struct_ops, we found some limitations of
>> >> >> bpf/struct_ops. While some have been discussed briefly on the mailing
>> >> >> list, we can discuss in more detail to make struct_ops a more
>> >> >> generic/palatable approach to replace kernel functions.
>> >> >>
>> >> >> In addition, I would like to discuss supporting adding kernel objects
>> >> >> to bpf_list/rbtree, which may have performance benefits in some
>> >> >> applications and can improve the programming experience. The current
>> >> >> bpf fq in the RFC has a 6% throughput loss compared to the native
>> >> >> counterpart due to memory allocation in enqueue() to store skb kptr.
>> >> >> With a POC I wrote that allows adding skb to bpf_list, the throughput
>> >> >> becomes comparable. We can discuss the approach and other potential
>> >> >> use cases.
>> >> >>
>> >> >
>> >> > When discussing this with Toke (Cc'd) long ago for the XDP queueing
>> >> > patch set, we discussed the same thing, in that the sk_buff already
>> >> > has space for a list or rbnode due to it getting queued in other
>> >> > layers (TCP OoO queue, qdiscs, etc.) so it would make sense to teach
>> >> > the verifier that it is a valid bpf_list_node and bpf_rb_node and
>> >> > allow inserting it as an element into a BPF list or rbtree. Back then
>> >> > we didn't add that as the posting only used the PIFO map.
>> >> >
>> >> > I think not only sk_buff, you can do a similar thing with xdp_buff as
>> >> > well.
>> >>
>> >> Yeah, I agree that allowing skbs to be inserted directly into a BPF
>> >> rbtree would make a lot of sense if it can be done safely. I am less
>> >> sure about xdp_frames, mostly for performance reasons, but if it does
>> >> turn out to be useful whichever mechanism we add for skbs should be
>> >> fairly straight forward to reuse.
>> >>
>> >> > The verifier side changes should be fairly minimal, just allowing the
>> >> > use of a known kernel type as the contained object in a list or
>> >> > rbtree, and the field pointing to this allowlisted list or rbnode.
>> >>
>> >> I think one additional concern here is how we ensure that an skb has
>> >> been correctly removed from any rbtrees it sits in before it is being
>> >> transmitted to another part of the stack?
>> >
>> > I think one solution is to disallow shared ownership of skb in
>> > multiple lists or rbtrees. That is, users should not be able to
>> > acquire the reference of an skb from the ctx more than once in
>> > ".enqueue" or using bpf_refcount_acquire().
>>
>> Can the verifier enforce this, even across multiple enqueue/dequeue
>> calls? Not sure if acquiring a refcount ensures that the rbtree entry
>> has been cleared?
>>
>> Basically, I'm worried about a dequeue() op that does something like:
>>
>> skb = rbtree_head();
>> // skb->rbnode is not cleared
>> return skb; // stack will keep processing it
>>
>> I'm a little fuzzy on how the bpf rbtree stuff works, though, so maybe
>> the verifier is already ensuring that a node cannot be read from a tree
>> without being properly cleared from it?
>>
>
> I see what you are saying now, and thanks Kumar for the clarification!
>
> I was thinking about how to prevent an skb from being added to lists
> and rbtrees at the same time, since list and rbnode share the same
> space. Hence the suggestion.

Ah, yes, good point, that is also a concern, certainly!

-Toke

     prev parent reply	other threads:[~2024-03-01 15:06 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-02-26 18:03 [LSF/MM/BPF TOPIC] bpf qdisc Amery Hung
2024-02-26 18:10 ` Kumar Kartikeya Dwivedi
2024-02-28 14:36   ` Toke Høiland-Jørgensen
2024-02-28 23:01     ` Amery Hung
2024-03-01 14:08       ` Toke Høiland-Jørgensen
2024-03-01 14:11         ` Kumar Kartikeya Dwivedi
2024-03-01 14:23           ` Toke Høiland-Jørgensen
2024-03-01 15:00         ` Amery Hung
2024-03-01 15:06           ` Kumar Kartikeya Dwivedi
2024-03-01 19:28             ` Amery Hung
2024-03-01 20:07               ` Kumar Kartikeya Dwivedi
2024-03-01 23:30                 ` Amery Hung
2024-03-01 15:06           ` Toke Høiland-Jørgensen [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87a5ni9ekn.fsf@toke.dk \
    --to=toke@redhat.com \
    --cc=ameryhung@gmail.com \
    --cc=bpf@vger.kernel.org \
    --cc=lsf-pc@lists.linux-foundation.org \
    --cc=memxor@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox