From: "Toke Høiland-Jørgensen" <toke@redhat.com>
To: Amery Hung <ameryhung@gmail.com>
Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>,
	lsf-pc@lists.linux-foundation.org, bpf@vger.kernel.org
Subject: Re: [LSF/MM/BPF TOPIC] bpf qdisc
Date: Fri, 01 Mar 2024 16:06:32 +0100
Message-ID: <87a5ni9ekn.fsf@toke.dk>
In-Reply-To: <CAMB2axOvfVfFFrmAkJanpJN8-W1j+XmuJcsgzvd-9WRWeqrCEw@mail.gmail.com>

Amery Hung <ameryhung@gmail.com> writes:

> On Fri, Mar 1, 2024 at 6:08 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
>> Amery Hung <ameryhung@gmail.com> writes:
>>
>> > On Wed, Feb 28, 2024 at 6:36 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>> >>
>> >> Kumar Kartikeya Dwivedi <memxor@gmail.com> writes:
>> >>
>> >> > On Mon, 26 Feb 2024 at 19:04, Amery Hung <ameryhung@gmail.com> wrote:
>> >> >>
>> >> >> Hi all,
>> >> >>
>> >> >> I would like to discuss bpf qdisc in the BPF track. As we now try to
>> >> >> support bpf qdisc using struct_ops, we found some limitations of
>> >> >> bpf/struct_ops. While some have been discussed briefly on the mailing
>> >> >> list, we can discuss in more detail to make struct_ops a more
>> >> >> generic/palatable approach to replace kernel functions.
>> >> >>
>> >> >> In addition, I would like to discuss supporting adding kernel objects
>> >> >> to bpf_list/rbtree, which may have performance benefits in some
>> >> >> applications and can improve the programming experience. The current
>> >> >> bpf fq in the RFC has a 6% throughput loss compared to the native
>> >> >> counterpart due to memory allocation in enqueue() to store skb kptr.
>> >> >> With a POC I wrote that allows adding skb to bpf_list, the throughput
>> >> >> becomes comparable. We can discuss the approach and other potential
>> >> >> use cases.
>> >> >>
>> >> >
>> >> > When discussing this with Toke (Cc'd) long ago for the XDP queueing
>> >> > patch set, we discussed the same thing, in that the sk_buff already
>> >> > has space for a list or rbnode due to it getting queued in other
>> >> > layers (TCP OoO queue, qdiscs, etc.) so it would make sense to teach
>> >> > the verifier that it is a valid bpf_list_node and bpf_rb_node and
>> >> > allow inserting it as an element into a BPF list or rbtree. Back then
>> >> > we didn't add that as the posting only used the PIFO map.
>> >> >
>> >> > I think not only sk_buff, you can do a similar thing with xdp_buff as
>> >> > well.
>> >>
>> >> Yeah, I agree that allowing skbs to be inserted directly into a BPF
>> >> rbtree would make a lot of sense if it can be done safely. I am less
>> >> sure about xdp_frames, mostly for performance reasons, but if it does
>> >> turn out to be useful, whichever mechanism we add for skbs should be
>> >> fairly straightforward to reuse.
>> >>
>> >> > The verifier side changes should be fairly minimal, just allowing the
>> >> > use of a known kernel type as the contained object in a list or
>> >> > rbtree, and the field pointing to this allowlisted list or rbnode.
>> >>
>> >> I think one additional concern here is how we ensure that an skb has
>> >> been correctly removed from any rbtrees it sits in before it is
>> >> transmitted to another part of the stack?
>> >
>> > I think one solution is to disallow shared ownership of skb in
>> > multiple lists or rbtrees. That is, users should not be able to
>> > acquire the reference of an skb from the ctx more than once in
>> > ".enqueue" or using bpf_refcount_acquire().
>>
>> Can the verifier enforce this, even across multiple enqueue/dequeue
>> calls? Not sure if acquiring a refcount ensures that the rbtree entry
>> has been cleared?
>>
>> Basically, I'm worried about a dequeue() op that does something like:
>>
>> skb = rbtree_head();
>> // skb->rbnode is not cleared
>> return skb; // stack will keep processing it
>>
>> I'm a little fuzzy on how the bpf rbtree stuff works, though, so maybe
>> the verifier is already ensuring that a node cannot be read from a tree
>> without being properly cleared from it?
>>
>
> I see what you are saying now, and thanks Kumar for the clarification!
>
> I was thinking about how to prevent an skb from being added to lists
> and rbtrees at the same time, since list and rbnode share the same
> space. Hence the suggestion.

Ah, yes, good point, that is also a concern, certainly!

-Toke


