BPF List
* [LSF/MM/BPF TOPIC] bpf qdisc
@ 2024-02-26 18:03 Amery Hung
  2024-02-26 18:10 ` Kumar Kartikeya Dwivedi
  0 siblings, 1 reply; 13+ messages in thread
From: Amery Hung @ 2024-02-26 18:03 UTC (permalink / raw)
  To: lsf-pc; +Cc: bpf

Hi all,

I would like to discuss bpf qdisc in the BPF track. As we try to
support bpf qdisc using struct_ops, we have found some limitations of
bpf/struct_ops. While some of them have been discussed briefly on the
mailing list, we can discuss them in more detail here to make
struct_ops a more generic/palatable approach for replacing kernel
functions.

In addition, I would like to discuss support for adding kernel
objects to bpf_list/rbtree, which may bring performance benefits in
some applications and can improve the programming experience. The
current bpf fq in the RFC has a 6% throughput loss compared to its
native counterpart, due to the memory allocation in enqueue() needed
to store the skb kptr. With a POC I wrote that allows adding an skb
to a bpf_list directly, the throughput becomes comparable. We can
discuss the approach and other potential use cases.

Thanks,
Amery

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [LSF/MM/BPF TOPIC] bpf qdisc
  2024-02-26 18:03 [LSF/MM/BPF TOPIC] bpf qdisc Amery Hung
@ 2024-02-26 18:10 ` Kumar Kartikeya Dwivedi
  2024-02-28 14:36   ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 13+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2024-02-26 18:10 UTC (permalink / raw)
  To: Amery Hung, Toke Høiland-Jørgensen; +Cc: lsf-pc, bpf

On Mon, 26 Feb 2024 at 19:04, Amery Hung <ameryhung@gmail.com> wrote:
>
> Hi all,
>
> I would like to discuss bpf qdisc in the BPF track. As we now try to
> support bpf qdisc using struct_ops, we found some limitations of
> bpf/struct_ops. While some have been discussed briefly on the mailing
> list, we can discuss in more detail to make struct_ops a more
> generic/palatable approach to replace kernel functions.
>
> In addition, I would like to discuss supporting adding kernel objects
> to bpf_list/rbtree, which may have performance benefits in some
> applications and can improve the programming experience. The current
> bpf fq in the RFC has a 6% throughput loss compared to the native
> counterpart due to memory allocation in enqueue() to store skb kptr.
> With a POC I wrote that allows adding skb to bpf_list, the throughput
> becomes comparable. We can discuss the approach and other potential
> use cases.
>

When discussing this with Toke (Cc'd) long ago for the XDP queueing
patch set, we discussed the same thing: the sk_buff already has space
for a list node or rbnode, because it gets queued in other layers (TCP
OoO queue, qdiscs, etc.), so it would make sense to teach the verifier
that this is a valid bpf_list_node and bpf_rb_node and allow inserting
the skb as an element into a BPF list or rbtree. Back then we didn't
add that, as the posting only used the PIFO map.

I think this applies not only to sk_buff; you can do a similar thing
with xdp_buff as well.

The verifier-side changes should be fairly minimal: just allow a known
kernel type as the contained object in a list or rbtree, and recognize
the field of that type that serves as the allowlisted list node or
rbnode.
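As a concrete illustration (a sketch, not the RFC patches: `skb_cell`
is a made-up name, `queue` is assumed to be a `struct bpf_list_head`
declared with the usual `__contains(skb_cell, node)` annotation and
protected by a `bpf_spin_lock`, and the direct-push form at the end
assumes the allowlisting described above, which does not exist yet):

```c
/* Today: the skb kptr is stored in a separately allocated cell
 * (one bpf_obj_new() per enqueue), and the cell carries the
 * bpf_list_node. */
struct skb_cell {
	struct sk_buff __kptr *skb;
	struct bpf_list_node node;
};

struct skb_cell *cell = bpf_obj_new(typeof(*cell));
if (!cell)
	return NET_XMIT_DROP;
/* ...move the skb kptr into cell->skb, then, under the lock: */
bpf_list_push_back(&queue, &cell->node);

/* Proposed (hypothetical): with sk_buff allowlisted as a list/rbtree
 * element, the node already embedded in the skb could be linked
 * directly, with no allocation on the enqueue path:
 *
 *	bpf_list_push_back(&queue, &skb->bpf_list_node);
 */
```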

Thanks

> Thanks,
> Amery
>


* Re: [LSF/MM/BPF TOPIC] bpf qdisc
  2024-02-26 18:10 ` Kumar Kartikeya Dwivedi
@ 2024-02-28 14:36   ` Toke Høiland-Jørgensen
  2024-02-28 23:01     ` Amery Hung
  0 siblings, 1 reply; 13+ messages in thread
From: Toke Høiland-Jørgensen @ 2024-02-28 14:36 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi, Amery Hung; +Cc: lsf-pc, bpf

Kumar Kartikeya Dwivedi <memxor@gmail.com> writes:

> On Mon, 26 Feb 2024 at 19:04, Amery Hung <ameryhung@gmail.com> wrote:
>>
>> Hi all,
>>
>> I would like to discuss bpf qdisc in the BPF track. As we now try to
>> support bpf qdisc using struct_ops, we found some limitations of
>> bpf/struct_ops. While some have been discussed briefly on the mailing
>> list, we can discuss in more detail to make struct_ops a more
>> generic/palatable approach to replace kernel functions.
>>
>> In addition, I would like to discuss supporting adding kernel objects
>> to bpf_list/rbtree, which may have performance benefits in some
>> applications and can improve the programming experience. The current
>> bpf fq in the RFC has a 6% throughput loss compared to the native
>> counterpart due to memory allocation in enqueue() to store skb kptr.
>> With a POC I wrote that allows adding skb to bpf_list, the throughput
>> becomes comparable. We can discuss the approach and other potential
>> use cases.
>>
>
> When discussing this with Toke (Cc'd) long ago for the XDP queueing
> patch set, we discussed the same thing, in that the sk_buff already
> has space for a list or rbnode due to it getting queued in other
> layers (TCP OoO queue, qdiscs, etc.) so it would make sense to teach
> the verifier that it is a valid bpf_list_node and bpf_rb_node and
> allow inserting it as an element into a BPF list or rbtree. Back then
> we didn't add that as the posting only used the PIFO map.
>
> I think not only sk_buff, you can do a similar thing with xdp_buff as
> well.

Yeah, I agree that allowing skbs to be inserted directly into a BPF
rbtree would make a lot of sense if it can be done safely. I am less
sure about xdp_frames, mostly for performance reasons, but if it does
turn out to be useful, whichever mechanism we add for skbs should be
fairly straightforward to reuse.

> The verifier side changes should be fairly minimal, just allowing the
> use of a known kernel type as the contained object in a list or
> rbtree, and the field pointing to this allowlisted list or rbnode.

I think one additional concern here is how we ensure that an skb has
been correctly removed from any rbtree it sits in before it is
transmitted to another part of the stack?

-Toke



* Re: [LSF/MM/BPF TOPIC] bpf qdisc
  2024-02-28 14:36   ` Toke Høiland-Jørgensen
@ 2024-02-28 23:01     ` Amery Hung
  2024-03-01 14:08       ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 13+ messages in thread
From: Amery Hung @ 2024-02-28 23:01 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen; +Cc: Kumar Kartikeya Dwivedi, lsf-pc, bpf

On Wed, Feb 28, 2024 at 6:36 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Kumar Kartikeya Dwivedi <memxor@gmail.com> writes:
>
> > On Mon, 26 Feb 2024 at 19:04, Amery Hung <ameryhung@gmail.com> wrote:
> >>
> >> Hi all,
> >>
> >> I would like to discuss bpf qdisc in the BPF track. As we now try to
> >> support bpf qdisc using struct_ops, we found some limitations of
> >> bpf/struct_ops. While some have been discussed briefly on the mailing
> >> list, we can discuss in more detail to make struct_ops a more
> >> generic/palatable approach to replace kernel functions.
> >>
> >> In addition, I would like to discuss supporting adding kernel objects
> >> to bpf_list/rbtree, which may have performance benefits in some
> >> applications and can improve the programming experience. The current
> >> bpf fq in the RFC has a 6% throughput loss compared to the native
> >> counterpart due to memory allocation in enqueue() to store skb kptr.
> >> With a POC I wrote that allows adding skb to bpf_list, the throughput
> >> becomes comparable. We can discuss the approach and other potential
> >> use cases.
> >>
> >
> > When discussing this with Toke (Cc'd) long ago for the XDP queueing
> > patch set, we discussed the same thing, in that the sk_buff already
> > has space for a list or rbnode due to it getting queued in other
> > layers (TCP OoO queue, qdiscs, etc.) so it would make sense to teach
> > the verifier that it is a valid bpf_list_node and bpf_rb_node and
> > allow inserting it as an element into a BPF list or rbtree. Back then
> > we didn't add that as the posting only used the PIFO map.
> >
> > I think not only sk_buff, you can do a similar thing with xdp_buff as
> > well.
>
> Yeah, I agree that allowing skbs to be inserted directly into a BPF
> rbtree would make a lot of sense if it can be done safely. I am less
> sure about xdp_frames, mostly for performance reasons, but if it does
> turn out to be useful whichever mechanism we add for skbs should be
> fairly straight forward to reuse.
>
> > The verifier side changes should be fairly minimal, just allowing the
> > use of a known kernel type as the contained object in a list or
> > rbtree, and the field pointing to this allowlisted list or rbnode.
>
> I think one additional concern here is how we ensure that an skb has
> been correctly removed from any rbtrees it sits in before it is being
> transmitted to another part of the stack?

I think one solution is to disallow shared ownership of an skb across
multiple lists or rbtrees. That is, users should not be able to
acquire a reference to an skb from the ctx more than once in
".enqueue", nor via bpf_refcount_acquire().

>
> -Toke
>


* Re: [LSF/MM/BPF TOPIC] bpf qdisc
  2024-02-28 23:01     ` Amery Hung
@ 2024-03-01 14:08       ` Toke Høiland-Jørgensen
  2024-03-01 14:11         ` Kumar Kartikeya Dwivedi
  2024-03-01 15:00         ` Amery Hung
  0 siblings, 2 replies; 13+ messages in thread
From: Toke Høiland-Jørgensen @ 2024-03-01 14:08 UTC (permalink / raw)
  To: Amery Hung; +Cc: Kumar Kartikeya Dwivedi, lsf-pc, bpf

Amery Hung <ameryhung@gmail.com> writes:

> On Wed, Feb 28, 2024 at 6:36 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
>> Kumar Kartikeya Dwivedi <memxor@gmail.com> writes:
>>
>> > On Mon, 26 Feb 2024 at 19:04, Amery Hung <ameryhung@gmail.com> wrote:
>> >>
>> >> Hi all,
>> >>
>> >> I would like to discuss bpf qdisc in the BPF track. As we now try to
>> >> support bpf qdisc using struct_ops, we found some limitations of
>> >> bpf/struct_ops. While some have been discussed briefly on the mailing
>> >> list, we can discuss in more detail to make struct_ops a more
>> >> generic/palatable approach to replace kernel functions.
>> >>
>> >> In addition, I would like to discuss supporting adding kernel objects
>> >> to bpf_list/rbtree, which may have performance benefits in some
>> >> applications and can improve the programming experience. The current
>> >> bpf fq in the RFC has a 6% throughput loss compared to the native
>> >> counterpart due to memory allocation in enqueue() to store skb kptr.
>> >> With a POC I wrote that allows adding skb to bpf_list, the throughput
>> >> becomes comparable. We can discuss the approach and other potential
>> >> use cases.
>> >>
>> >
>> > When discussing this with Toke (Cc'd) long ago for the XDP queueing
>> > patch set, we discussed the same thing, in that the sk_buff already
>> > has space for a list or rbnode due to it getting queued in other
>> > layers (TCP OoO queue, qdiscs, etc.) so it would make sense to teach
>> > the verifier that it is a valid bpf_list_node and bpf_rb_node and
>> > allow inserting it as an element into a BPF list or rbtree. Back then
>> > we didn't add that as the posting only used the PIFO map.
>> >
>> > I think not only sk_buff, you can do a similar thing with xdp_buff as
>> > well.
>>
>> Yeah, I agree that allowing skbs to be inserted directly into a BPF
>> rbtree would make a lot of sense if it can be done safely. I am less
>> sure about xdp_frames, mostly for performance reasons, but if it does
>> turn out to be useful whichever mechanism we add for skbs should be
>> fairly straight forward to reuse.
>>
>> > The verifier side changes should be fairly minimal, just allowing the
>> > use of a known kernel type as the contained object in a list or
>> > rbtree, and the field pointing to this allowlisted list or rbnode.
>>
>> I think one additional concern here is how we ensure that an skb has
>> been correctly removed from any rbtrees it sits in before it is being
>> transmitted to another part of the stack?
>
> I think one solution is to disallow shared ownership of skb in
> multiple lists or rbtrees. That is, users should not be able to
> acquire the reference of an skb from the ctx more than once in
> ".enqueue" or using bpf_refcount_acquire().

Can the verifier enforce this, even across multiple enqueue/dequeue
calls? Not sure if acquiring a refcount ensures that the rbtree entry
has been cleared?

Basically, I'm worried about a dequeue() op that does something like:

skb = rbtree_head();
// skb->rbnode is not cleared
return skb; // stack will keep processing it

I'm a little fuzzy on how the bpf rbtree stuff works, though, so maybe
the verifier is already ensuring that a node cannot be read from a tree
without being properly cleared from it?

-Toke



* Re: [LSF/MM/BPF TOPIC] bpf qdisc
  2024-03-01 14:08       ` Toke Høiland-Jørgensen
@ 2024-03-01 14:11         ` Kumar Kartikeya Dwivedi
  2024-03-01 14:23           ` Toke Høiland-Jørgensen
  2024-03-01 15:00         ` Amery Hung
  1 sibling, 1 reply; 13+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2024-03-01 14:11 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen; +Cc: Amery Hung, lsf-pc, bpf

On Fri, 1 Mar 2024 at 15:08, Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Amery Hung <ameryhung@gmail.com> writes:
>
> > On Wed, Feb 28, 2024 at 6:36 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> >>
> >> Kumar Kartikeya Dwivedi <memxor@gmail.com> writes:
> >>
> >> > On Mon, 26 Feb 2024 at 19:04, Amery Hung <ameryhung@gmail.com> wrote:
> >> >>
> >> >> Hi all,
> >> >>
> >> >> I would like to discuss bpf qdisc in the BPF track. As we now try to
> >> >> support bpf qdisc using struct_ops, we found some limitations of
> >> >> bpf/struct_ops. While some have been discussed briefly on the mailing
> >> >> list, we can discuss in more detail to make struct_ops a more
> >> >> generic/palatable approach to replace kernel functions.
> >> >>
> >> >> In addition, I would like to discuss supporting adding kernel objects
> >> >> to bpf_list/rbtree, which may have performance benefits in some
> >> >> applications and can improve the programming experience. The current
> >> >> bpf fq in the RFC has a 6% throughput loss compared to the native
> >> >> counterpart due to memory allocation in enqueue() to store skb kptr.
> >> >> With a POC I wrote that allows adding skb to bpf_list, the throughput
> >> >> becomes comparable. We can discuss the approach and other potential
> >> >> use cases.
> >> >>
> >> >
> >> > When discussing this with Toke (Cc'd) long ago for the XDP queueing
> >> > patch set, we discussed the same thing, in that the sk_buff already
> >> > has space for a list or rbnode due to it getting queued in other
> >> > layers (TCP OoO queue, qdiscs, etc.) so it would make sense to teach
> >> > the verifier that it is a valid bpf_list_node and bpf_rb_node and
> >> > allow inserting it as an element into a BPF list or rbtree. Back then
> >> > we didn't add that as the posting only used the PIFO map.
> >> >
> >> > I think not only sk_buff, you can do a similar thing with xdp_buff as
> >> > well.
> >>
> >> Yeah, I agree that allowing skbs to be inserted directly into a BPF
> >> rbtree would make a lot of sense if it can be done safely. I am less
> >> sure about xdp_frames, mostly for performance reasons, but if it does
> >> turn out to be useful whichever mechanism we add for skbs should be
> >> fairly straight forward to reuse.
> >>
> >> > The verifier side changes should be fairly minimal, just allowing the
> >> > use of a known kernel type as the contained object in a list or
> >> > rbtree, and the field pointing to this allowlisted list or rbnode.
> >>
> >> I think one additional concern here is how we ensure that an skb has
> >> been correctly removed from any rbtrees it sits in before it is being
> >> transmitted to another part of the stack?
> >
> > I think one solution is to disallow shared ownership of skb in
> > multiple lists or rbtrees. That is, users should not be able to
> > acquire the reference of an skb from the ctx more than once in
> > ".enqueue" or using bpf_refcount_acquire().
>
> Can the verifier enforce this, even across multiple enqueue/dequeue
> calls? Not sure if acquiring a refcount ensures that the rbtree entry
> has been cleared?
>
> Basically, I'm worried about a dequeue() op that does something like:
>
> skb = rbtree_head();
> // skb->rbnode is not cleared
> return skb; // stack will keep processing it
>
> I'm a little fuzzy on how the bpf rbtree stuff works, though, so maybe
> the verifier is already ensuring that a node cannot be read from a tree
> without being properly cleared from it?
>

I think it should be ok. I'm not exactly sure what ownership scheme
the BPF qdisc stuff will follow, but the verifier is able to
distinguish between the case where an skb is merely viewed while
inserted in an rbtree, and the case where it is fully owned by the
program and not in the rbtree (say, when removed from it). Therefore,
it should be possible to ensure unique ownership at any point in time.

> -Toke
>
>


* Re: [LSF/MM/BPF TOPIC] bpf qdisc
  2024-03-01 14:11         ` Kumar Kartikeya Dwivedi
@ 2024-03-01 14:23           ` Toke Høiland-Jørgensen
  0 siblings, 0 replies; 13+ messages in thread
From: Toke Høiland-Jørgensen @ 2024-03-01 14:23 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi; +Cc: Amery Hung, lsf-pc, bpf

Kumar Kartikeya Dwivedi <memxor@gmail.com> writes:

> On Fri, 1 Mar 2024 at 15:08, Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
>> Amery Hung <ameryhung@gmail.com> writes:
>>
>> > On Wed, Feb 28, 2024 at 6:36 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>> >>
>> >> Kumar Kartikeya Dwivedi <memxor@gmail.com> writes:
>> >>
>> >> > On Mon, 26 Feb 2024 at 19:04, Amery Hung <ameryhung@gmail.com> wrote:
>> >> >>
>> >> >> Hi all,
>> >> >>
>> >> >> I would like to discuss bpf qdisc in the BPF track. As we now try to
>> >> >> support bpf qdisc using struct_ops, we found some limitations of
>> >> >> bpf/struct_ops. While some have been discussed briefly on the mailing
>> >> >> list, we can discuss in more detail to make struct_ops a more
>> >> >> generic/palatable approach to replace kernel functions.
>> >> >>
>> >> >> In addition, I would like to discuss supporting adding kernel objects
>> >> >> to bpf_list/rbtree, which may have performance benefits in some
>> >> >> applications and can improve the programming experience. The current
>> >> >> bpf fq in the RFC has a 6% throughput loss compared to the native
>> >> >> counterpart due to memory allocation in enqueue() to store skb kptr.
>> >> >> With a POC I wrote that allows adding skb to bpf_list, the throughput
>> >> >> becomes comparable. We can discuss the approach and other potential
>> >> >> use cases.
>> >> >>
>> >> >
>> >> > When discussing this with Toke (Cc'd) long ago for the XDP queueing
>> >> > patch set, we discussed the same thing, in that the sk_buff already
>> >> > has space for a list or rbnode due to it getting queued in other
>> >> > layers (TCP OoO queue, qdiscs, etc.) so it would make sense to teach
>> >> > the verifier that it is a valid bpf_list_node and bpf_rb_node and
>> >> > allow inserting it as an element into a BPF list or rbtree. Back then
>> >> > we didn't add that as the posting only used the PIFO map.
>> >> >
>> >> > I think not only sk_buff, you can do a similar thing with xdp_buff as
>> >> > well.
>> >>
>> >> Yeah, I agree that allowing skbs to be inserted directly into a BPF
>> >> rbtree would make a lot of sense if it can be done safely. I am less
>> >> sure about xdp_frames, mostly for performance reasons, but if it does
>> >> turn out to be useful whichever mechanism we add for skbs should be
>> >> fairly straight forward to reuse.
>> >>
>> >> > The verifier side changes should be fairly minimal, just allowing the
>> >> > use of a known kernel type as the contained object in a list or
>> >> > rbtree, and the field pointing to this allowlisted list or rbnode.
>> >>
>> >> I think one additional concern here is how we ensure that an skb has
>> >> been correctly removed from any rbtrees it sits in before it is being
>> >> transmitted to another part of the stack?
>> >
>> > I think one solution is to disallow shared ownership of skb in
>> > multiple lists or rbtrees. That is, users should not be able to
>> > acquire the reference of an skb from the ctx more than once in
>> > ".enqueue" or using bpf_refcount_acquire().
>>
>> Can the verifier enforce this, even across multiple enqueue/dequeue
>> calls? Not sure if acquiring a refcount ensures that the rbtree entry
>> has been cleared?
>>
>> Basically, I'm worried about a dequeue() op that does something like:
>>
>> skb = rbtree_head();
>> // skb->rbnode is not cleared
>> return skb; // stack will keep processing it
>>
>> I'm a little fuzzy on how the bpf rbtree stuff works, though, so maybe
>> the verifier is already ensuring that a node cannot be read from a tree
>> without being properly cleared from it?
>>
>
> I think it should be ok. I'm not exactly sure what ownership scheme
> the BPF qdisc stuff will follow, but the verifier is able to
> distinguish between the cases where an skb would just be viewed but
> inserted in an rbtree, vs when it is fully owned by the program and
> not in the rbtree (say when removed from it). Therefore, it should be
> possible to ensure unique ownership at any point of time.

Alright, cool. Thanks for clarifying! :)

-Toke



* Re: [LSF/MM/BPF TOPIC] bpf qdisc
  2024-03-01 14:08       ` Toke Høiland-Jørgensen
  2024-03-01 14:11         ` Kumar Kartikeya Dwivedi
@ 2024-03-01 15:00         ` Amery Hung
  2024-03-01 15:06           ` Kumar Kartikeya Dwivedi
  2024-03-01 15:06           ` Toke Høiland-Jørgensen
  1 sibling, 2 replies; 13+ messages in thread
From: Amery Hung @ 2024-03-01 15:00 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen; +Cc: Kumar Kartikeya Dwivedi, lsf-pc, bpf

On Fri, Mar 1, 2024 at 6:08 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Amery Hung <ameryhung@gmail.com> writes:
>
> > On Wed, Feb 28, 2024 at 6:36 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> >>
> >> Kumar Kartikeya Dwivedi <memxor@gmail.com> writes:
> >>
> >> > On Mon, 26 Feb 2024 at 19:04, Amery Hung <ameryhung@gmail.com> wrote:
> >> >>
> >> >> Hi all,
> >> >>
> >> >> I would like to discuss bpf qdisc in the BPF track. As we now try to
> >> >> support bpf qdisc using struct_ops, we found some limitations of
> >> >> bpf/struct_ops. While some have been discussed briefly on the mailing
> >> >> list, we can discuss in more detail to make struct_ops a more
> >> >> generic/palatable approach to replace kernel functions.
> >> >>
> >> >> In addition, I would like to discuss supporting adding kernel objects
> >> >> to bpf_list/rbtree, which may have performance benefits in some
> >> >> applications and can improve the programming experience. The current
> >> >> bpf fq in the RFC has a 6% throughput loss compared to the native
> >> >> counterpart due to memory allocation in enqueue() to store skb kptr.
> >> >> With a POC I wrote that allows adding skb to bpf_list, the throughput
> >> >> becomes comparable. We can discuss the approach and other potential
> >> >> use cases.
> >> >>
> >> >
> >> > When discussing this with Toke (Cc'd) long ago for the XDP queueing
> >> > patch set, we discussed the same thing, in that the sk_buff already
> >> > has space for a list or rbnode due to it getting queued in other
> >> > layers (TCP OoO queue, qdiscs, etc.) so it would make sense to teach
> >> > the verifier that it is a valid bpf_list_node and bpf_rb_node and
> >> > allow inserting it as an element into a BPF list or rbtree. Back then
> >> > we didn't add that as the posting only used the PIFO map.
> >> >
> >> > I think not only sk_buff, you can do a similar thing with xdp_buff as
> >> > well.
> >>
> >> Yeah, I agree that allowing skbs to be inserted directly into a BPF
> >> rbtree would make a lot of sense if it can be done safely. I am less
> >> sure about xdp_frames, mostly for performance reasons, but if it does
> >> turn out to be useful whichever mechanism we add for skbs should be
> >> fairly straight forward to reuse.
> >>
> >> > The verifier side changes should be fairly minimal, just allowing the
> >> > use of a known kernel type as the contained object in a list or
> >> > rbtree, and the field pointing to this allowlisted list or rbnode.
> >>
> >> I think one additional concern here is how we ensure that an skb has
> >> been correctly removed from any rbtrees it sits in before it is being
> >> transmitted to another part of the stack?
> >
> > I think one solution is to disallow shared ownership of skb in
> > multiple lists or rbtrees. That is, users should not be able to
> > acquire the reference of an skb from the ctx more than once in
> > ".enqueue" or using bpf_refcount_acquire().
>
> Can the verifier enforce this, even across multiple enqueue/dequeue
> calls? Not sure if acquiring a refcount ensures that the rbtree entry
> has been cleared?
>
> Basically, I'm worried about a dequeue() op that does something like:
>
> skb = rbtree_head();
> // skb->rbnode is not cleared
> return skb; // stack will keep processing it
>
> I'm a little fuzzy on how the bpf rbtree stuff works, though, so maybe
> the verifier is already ensuring that a node cannot be read from a tree
> without being properly cleared from it?
>

I see what you are saying now, and thanks, Kumar, for the
clarification!

I was thinking about how to prevent an skb from being added to a list
and an rbtree at the same time, since the list node and rbnode share
the same space in sk_buff. Hence the suggestion.

> -Toke
>


* Re: [LSF/MM/BPF TOPIC] bpf qdisc
  2024-03-01 15:00         ` Amery Hung
@ 2024-03-01 15:06           ` Kumar Kartikeya Dwivedi
  2024-03-01 19:28             ` Amery Hung
  2024-03-01 15:06           ` Toke Høiland-Jørgensen
  1 sibling, 1 reply; 13+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2024-03-01 15:06 UTC (permalink / raw)
  To: Amery Hung; +Cc: Toke Høiland-Jørgensen, lsf-pc, bpf

On Fri, 1 Mar 2024 at 16:01, Amery Hung <ameryhung@gmail.com> wrote:
>
> On Fri, Mar 1, 2024 at 6:08 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> >
> > Amery Hung <ameryhung@gmail.com> writes:
> >
> > > On Wed, Feb 28, 2024 at 6:36 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> > >>
> > >> Kumar Kartikeya Dwivedi <memxor@gmail.com> writes:
> > >>
> > >> > On Mon, 26 Feb 2024 at 19:04, Amery Hung <ameryhung@gmail.com> wrote:
> > >> >>
> > >> >> Hi all,
> > >> >>
> > >> >> I would like to discuss bpf qdisc in the BPF track. As we now try to
> > >> >> support bpf qdisc using struct_ops, we found some limitations of
> > >> >> bpf/struct_ops. While some have been discussed briefly on the mailing
> > >> >> list, we can discuss in more detail to make struct_ops a more
> > >> >> generic/palatable approach to replace kernel functions.
> > >> >>
> > >> >> In addition, I would like to discuss supporting adding kernel objects
> > >> >> to bpf_list/rbtree, which may have performance benefits in some
> > >> >> applications and can improve the programming experience. The current
> > >> >> bpf fq in the RFC has a 6% throughput loss compared to the native
> > >> >> counterpart due to memory allocation in enqueue() to store skb kptr.
> > >> >> With a POC I wrote that allows adding skb to bpf_list, the throughput
> > >> >> becomes comparable. We can discuss the approach and other potential
> > >> >> use cases.
> > >> >>
> > >> >
> > >> > When discussing this with Toke (Cc'd) long ago for the XDP queueing
> > >> > patch set, we discussed the same thing, in that the sk_buff already
> > >> > has space for a list or rbnode due to it getting queued in other
> > >> > layers (TCP OoO queue, qdiscs, etc.) so it would make sense to teach
> > >> > the verifier that it is a valid bpf_list_node and bpf_rb_node and
> > >> > allow inserting it as an element into a BPF list or rbtree. Back then
> > >> > we didn't add that as the posting only used the PIFO map.
> > >> >
> > >> > I think not only sk_buff, you can do a similar thing with xdp_buff as
> > >> > well.
> > >>
> > >> Yeah, I agree that allowing skbs to be inserted directly into a BPF
> > >> rbtree would make a lot of sense if it can be done safely. I am less
> > >> sure about xdp_frames, mostly for performance reasons, but if it does
> > >> turn out to be useful whichever mechanism we add for skbs should be
> > >> fairly straight forward to reuse.
> > >>
> > >> > The verifier side changes should be fairly minimal, just allowing the
> > >> > use of a known kernel type as the contained object in a list or
> > >> > rbtree, and the field pointing to this allowlisted list or rbnode.
> > >>
> > >> I think one additional concern here is how we ensure that an skb has
> > >> been correctly removed from any rbtrees it sits in before it is being
> > >> transmitted to another part of the stack?
> > >
> > > I think one solution is to disallow shared ownership of skb in
> > > multiple lists or rbtrees. That is, users should not be able to
> > > acquire the reference of an skb from the ctx more than once in
> > > ".enqueue" or using bpf_refcount_acquire().
> >
> > Can the verifier enforce this, even across multiple enqueue/dequeue
> > calls? Not sure if acquiring a refcount ensures that the rbtree entry
> > has been cleared?
> >
> > Basically, I'm worried about a dequeue() op that does something like:
> >
> > skb = rbtree_head();
> > // skb->rbnode is not cleared
> > return skb; // stack will keep processing it
> >
> > I'm a little fuzzy on how the bpf rbtree stuff works, though, so maybe
> > the verifier is already ensuring that a node cannot be read from a tree
> > without being properly cleared from it?
> >
>
> I see what you are saying now, and thanks Kumar for the clarification!
>
> I was thinking about how to prevent an skb from being added to lists
> and rbtrees at the same time, since list and rbnode share the same
> space. Hence the suggestion.
>

In BPF qdisc programs, you could teach the verifier that the skb has
reference semantics (ref_obj_id > 0). In that case, once you push it
into a list or rbtree, the program loses ownership of the skb, and
all pointers aliasing the skb are marked invalid. You could use some
peek helper to look at it, but you will never have an skb the program
owns until you pop it back from a list or rbtree.

In the XDP queueing series, we taught the verifier to give xdp_md
reference semantics in the dequeue program, and to allow returning
such a pointer from the program back to the kernel. The changes to
allow PTR_TO_PACKET accesses were also fairly simple: the verifier
just needs to know that comparisons of data and data_end can only be
done for packet pointers coming from the same xdp_md (as there can be
multiple xdp_md objects in the same program at a time).


* Re: [LSF/MM/BPF TOPIC] bpf qdisc
  2024-03-01 15:00         ` Amery Hung
  2024-03-01 15:06           ` Kumar Kartikeya Dwivedi
@ 2024-03-01 15:06           ` Toke Høiland-Jørgensen
  1 sibling, 0 replies; 13+ messages in thread
From: Toke Høiland-Jørgensen @ 2024-03-01 15:06 UTC (permalink / raw)
  To: Amery Hung; +Cc: Kumar Kartikeya Dwivedi, lsf-pc, bpf

Amery Hung <ameryhung@gmail.com> writes:

> On Fri, Mar 1, 2024 at 6:08 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
>> Amery Hung <ameryhung@gmail.com> writes:
>>
>> > On Wed, Feb 28, 2024 at 6:36 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>> >>
>> >> Kumar Kartikeya Dwivedi <memxor@gmail.com> writes:
>> >>
>> >> > On Mon, 26 Feb 2024 at 19:04, Amery Hung <ameryhung@gmail.com> wrote:
>> >> >>
>> >> >> Hi all,
>> >> >>
>> >> >> I would like to discuss bpf qdisc in the BPF track. As we now try to
>> >> >> support bpf qdisc using struct_ops, we found some limitations of
>> >> >> bpf/struct_ops. While some have been discussed briefly on the mailing
>> >> >> list, we can discuss in more detail to make struct_ops a more
>> >> >> generic/palatable approach to replace kernel functions.
>> >> >>
>> >> >> In addition, I would like to discuss supporting adding kernel objects
>> >> >> to bpf_list/rbtree, which may have performance benefits in some
>> >> >> applications and can improve the programming experience. The current
>> >> >> bpf fq in the RFC has a 6% throughput loss compared to the native
>> >> >> counterpart due to memory allocation in enqueue() to store skb kptr.
>> >> >> With a POC I wrote that allows adding skb to bpf_list, the throughput
>> >> >> becomes comparable. We can discuss the approach and other potential
>> >> >> use cases.
>> >> >>
>> >> >
>> >> > When discussing this with Toke (Cc'd) long ago for the XDP queueing
>> >> > patch set, we discussed the same thing, in that the sk_buff already
>> >> > has space for a list or rbnode due to it getting queued in other
>> >> > layers (TCP OoO queue, qdiscs, etc.) so it would make sense to teach
>> >> > the verifier that it is a valid bpf_list_node and bpf_rb_node and
>> >> > allow inserting it as an element into a BPF list or rbtree. Back then
>> >> > we didn't add that as the posting only used the PIFO map.
>> >> >
>> >> > I think not only sk_buff, you can do a similar thing with xdp_buff as
>> >> > well.
>> >>
>> >> Yeah, I agree that allowing skbs to be inserted directly into a BPF
>> >> rbtree would make a lot of sense if it can be done safely. I am less
>> >> sure about xdp_frames, mostly for performance reasons, but if it does
>> >> turn out to be useful whichever mechanism we add for skbs should be
>> >> fairly straight forward to reuse.
>> >>
>> >> > The verifier side changes should be fairly minimal, just allowing the
>> >> > use of a known kernel type as the contained object in a list or
>> >> > rbtree, and the field pointing to this allowlisted list or rbnode.
>> >>
>> >> I think one additional concern here is how we ensure that an skb has
>> >> been correctly removed from any rbtrees it sits in before it is being
>> >> transmitted to another part of the stack?
>> >
>> > I think one solution is to disallow shared ownership of skb in
>> > multiple lists or rbtrees. That is, users should not be able to
>> > acquire the reference of an skb from the ctx more than once in
>> > ".enqueue" or using bpf_refcount_acquire().
>>
>> Can the verifier enforce this, even across multiple enqueue/dequeue
>> calls? Not sure if acquiring a refcount ensures that the rbtree entry
>> has been cleared?
>>
>> Basically, I'm worried about a dequeue() op that does something like:
>>
>> skb = rbtree_head();
>> // skb->rbnode is not cleared
>> return skb; // stack will keep processing it
>>
>> I'm a little fuzzy on how the bpf rbtree stuff works, though, so maybe
>> the verifier is already ensuring that a node cannot be read from a tree
>> without being properly cleared from it?
>>
>
> I see what you are saying now, and thanks Kumar for the clarification!
>
> I was thinking about how to prevent an skb from being added to lists
> and rbtrees at the same time, since list and rbnode share the same
> space. Hence the suggestion.

Ah, yes, good point, that is also a concern, certainly!

-Toke



* Re: [LSF/MM/BPF TOPIC] bpf qdisc
  2024-03-01 15:06           ` Kumar Kartikeya Dwivedi
@ 2024-03-01 19:28             ` Amery Hung
  2024-03-01 20:07               ` Kumar Kartikeya Dwivedi
  0 siblings, 1 reply; 13+ messages in thread
From: Amery Hung @ 2024-03-01 19:28 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi; +Cc: Toke Høiland-Jørgensen, lsf-pc, bpf

On Fri, Mar 1, 2024 at 7:06 AM Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote:
>
> On Fri, 1 Mar 2024 at 16:01, Amery Hung <ameryhung@gmail.com> wrote:
> >
> > On Fri, Mar 1, 2024 at 6:08 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> > >
> > > Amery Hung <ameryhung@gmail.com> writes:
> > >
> > > > On Wed, Feb 28, 2024 at 6:36 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> > > >>
> > > >> Kumar Kartikeya Dwivedi <memxor@gmail.com> writes:
> > > >>
> > > >> > On Mon, 26 Feb 2024 at 19:04, Amery Hung <ameryhung@gmail.com> wrote:
> > > >> >>
> > > >> >> Hi all,
> > > >> >>
> > > >> >> I would like to discuss bpf qdisc in the BPF track. As we now try to
> > > >> >> support bpf qdisc using struct_ops, we found some limitations of
> > > >> >> bpf/struct_ops. While some have been discussed briefly on the mailing
> > > >> >> list, we can discuss in more detail to make struct_ops a more
> > > >> >> generic/palatable approach to replace kernel functions.
> > > >> >>
> > > >> >> In addition, I would like to discuss supporting adding kernel objects
> > > >> >> to bpf_list/rbtree, which may have performance benefits in some
> > > >> >> applications and can improve the programming experience. The current
> > > >> >> bpf fq in the RFC has a 6% throughput loss compared to the native
> > > >> >> counterpart due to memory allocation in enqueue() to store skb kptr.
> > > >> >> With a POC I wrote that allows adding skb to bpf_list, the throughput
> > > >> >> becomes comparable. We can discuss the approach and other potential
> > > >> >> use cases.
> > > >> >>
> > > >> >
> > > >> > When discussing this with Toke (Cc'd) long ago for the XDP queueing
> > > >> > patch set, we discussed the same thing, in that the sk_buff already
> > > >> > has space for a list or rbnode due to it getting queued in other
> > > >> > layers (TCP OoO queue, qdiscs, etc.) so it would make sense to teach
> > > >> > the verifier that it is a valid bpf_list_node and bpf_rb_node and
> > > >> > allow inserting it as an element into a BPF list or rbtree. Back then
> > > >> > we didn't add that as the posting only used the PIFO map.
> > > >> >
> > > >> > I think not only sk_buff, you can do a similar thing with xdp_buff as
> > > >> > well.
> > > >>
> > > >> Yeah, I agree that allowing skbs to be inserted directly into a BPF
> > > >> rbtree would make a lot of sense if it can be done safely. I am less
> > > >> sure about xdp_frames, mostly for performance reasons, but if it does
> > > >> turn out to be useful whichever mechanism we add for skbs should be
> > > >> fairly straight forward to reuse.
> > > >>
> > > >> > The verifier side changes should be fairly minimal, just allowing the
> > > >> > use of a known kernel type as the contained object in a list or
> > > >> > rbtree, and the field pointing to this allowlisted list or rbnode.
> > > >>
> > > >> I think one additional concern here is how we ensure that an skb has
> > > >> been correctly removed from any rbtrees it sits in before it is being
> > > >> transmitted to another part of the stack?
> > > >
> > > > I think one solution is to disallow shared ownership of skb in
> > > > multiple lists or rbtrees. That is, users should not be able to
> > > > acquire the reference of an skb from the ctx more than once in
> > > > ".enqueue" or using bpf_refcount_acquire().
> > >
> > > Can the verifier enforce this, even across multiple enqueue/dequeue
> > > calls? Not sure if acquiring a refcount ensures that the rbtree entry
> > > has been cleared?
> > >
> > > Basically, I'm worried about a dequeue() op that does something like:
> > >
> > > skb = rbtree_head();
> > > // skb->rbnode is not cleared
> > > return skb; // stack will keep processing it
> > >
> > > I'm a little fuzzy on how the bpf rbtree stuff works, though, so maybe
> > > the verifier is already ensuring that a node cannot be read from a tree
> > > without being properly cleared from it?
> > >
> >
> > I see what you are saying now, and thanks Kumar for the clarification!
> >
> > I was thinking about how to prevent an skb from being added to lists
> > and rbtrees at the same time, since list and rbnode share the same
> > space. Hence the suggestion.
> >
>
> In BPF qdisc programs, you could teach the verifier that the skb has
> reference semantics (ref_obj_id > 0),
> in such a case once you push it into a list or rbtree, the program
> will lose ownership of the skb and all pointers same as the skb will
> be marked invalid.
> You could use some peek helper to look at it, but will never have an
> skb with program ownership until you pop it back from a list or
> rbtree.
>

This part makes sense. In the enqueue() op of bpf qdisc, I use a kfunc
to acquire an skb kptr (ref_obj_id > 0) from the skb in ctx for now.
Martin suggested tracking reads from the ctx and assigning a
ref_obj_id instead.

However, either way, if users can do this multiple times in one
enqueue() call like below, they can acquire multiple references to the
same skb and put them on different lists/rbtrees. This is what I'd
like to avoid.

SEC("struct_ops/bpf_fifo_enqueue")
int BPF_PROG(bpf_fifo_enqueue, struct sk_buff *skb, struct Qdisc *sch,
             struct bpf_sk_buff_ptr *to_free)
{
        ...
        skb_kptr_a = bpf_skb_acquire(skb);
        skb_kptr_b = bpf_skb_acquire(skb);

        bpf_list_push_back(&list_1, &skb_kptr_a->bpf_list);
        bpf_list_push_back(&list_2, &skb_kptr_b->bpf_list);
        ...

Thanks,
Amery

> In the XDP queueing series, we taught the verifier to have reference
> semantics for xdp_md in the dequeue program, and then return such a
> pointer from the program back to the kernel.
> The changes to allow PTR_TO_PACKET accesses were also fairly simple,
> the verifier just needs to know that comparison of data, data_end can
> only be done for pkt pointers coming from the same xdp_md (as there
> can be multiple in the same program at a time).


* Re: [LSF/MM/BPF TOPIC] bpf qdisc
  2024-03-01 19:28             ` Amery Hung
@ 2024-03-01 20:07               ` Kumar Kartikeya Dwivedi
  2024-03-01 23:30                 ` Amery Hung
  0 siblings, 1 reply; 13+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2024-03-01 20:07 UTC (permalink / raw)
  To: Amery Hung, Martin KaFai Lau
  Cc: Toke Høiland-Jørgensen, lsf-pc, bpf

On Fri, 1 Mar 2024 at 20:28, Amery Hung <ameryhung@gmail.com> wrote:
>
> On Fri, Mar 1, 2024 at 7:06 AM Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote:
> >
> > On Fri, 1 Mar 2024 at 16:01, Amery Hung <ameryhung@gmail.com> wrote:
> > >
> > > On Fri, Mar 1, 2024 at 6:08 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> > > >
> > > > Amery Hung <ameryhung@gmail.com> writes:
> > > >
> > > > > On Wed, Feb 28, 2024 at 6:36 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> > > > >>
> > > > >> Kumar Kartikeya Dwivedi <memxor@gmail.com> writes:
> > > > >>
> > > > >> > On Mon, 26 Feb 2024 at 19:04, Amery Hung <ameryhung@gmail.com> wrote:
> > > > >> >>
> > > > >> >> Hi all,
> > > > >> >>
> > > > >> >> I would like to discuss bpf qdisc in the BPF track. As we now try to
> > > > >> >> support bpf qdisc using struct_ops, we found some limitations of
> > > > >> >> bpf/struct_ops. While some have been discussed briefly on the mailing
> > > > >> >> list, we can discuss in more detail to make struct_ops a more
> > > > >> >> generic/palatable approach to replace kernel functions.
> > > > >> >>
> > > > >> >> In addition, I would like to discuss supporting adding kernel objects
> > > > >> >> to bpf_list/rbtree, which may have performance benefits in some
> > > > >> >> applications and can improve the programming experience. The current
> > > > >> >> bpf fq in the RFC has a 6% throughput loss compared to the native
> > > > >> >> counterpart due to memory allocation in enqueue() to store skb kptr.
> > > > >> >> With a POC I wrote that allows adding skb to bpf_list, the throughput
> > > > >> >> becomes comparable. We can discuss the approach and other potential
> > > > >> >> use cases.
> > > > >> >>
> > > > >> >
> > > > >> > When discussing this with Toke (Cc'd) long ago for the XDP queueing
> > > > >> > patch set, we discussed the same thing, in that the sk_buff already
> > > > >> > has space for a list or rbnode due to it getting queued in other
> > > > >> > layers (TCP OoO queue, qdiscs, etc.) so it would make sense to teach
> > > > >> > the verifier that it is a valid bpf_list_node and bpf_rb_node and
> > > > >> > allow inserting it as an element into a BPF list or rbtree. Back then
> > > > >> > we didn't add that as the posting only used the PIFO map.
> > > > >> >
> > > > >> > I think not only sk_buff, you can do a similar thing with xdp_buff as
> > > > >> > well.
> > > > >>
> > > > >> Yeah, I agree that allowing skbs to be inserted directly into a BPF
> > > > >> rbtree would make a lot of sense if it can be done safely. I am less
> > > > >> sure about xdp_frames, mostly for performance reasons, but if it does
> > > > >> turn out to be useful whichever mechanism we add for skbs should be
> > > > >> fairly straight forward to reuse.
> > > > >>
> > > > >> > The verifier side changes should be fairly minimal, just allowing the
> > > > >> > use of a known kernel type as the contained object in a list or
> > > > >> > rbtree, and the field pointing to this allowlisted list or rbnode.
> > > > >>
> > > > >> I think one additional concern here is how we ensure that an skb has
> > > > >> been correctly removed from any rbtrees it sits in before it is being
> > > > >> transmitted to another part of the stack?
> > > > >
> > > > > I think one solution is to disallow shared ownership of skb in
> > > > > multiple lists or rbtrees. That is, users should not be able to
> > > > > acquire the reference of an skb from the ctx more than once in
> > > > > ".enqueue" or using bpf_refcount_acquire().
> > > >
> > > > Can the verifier enforce this, even across multiple enqueue/dequeue
> > > > calls? Not sure if acquiring a refcount ensures that the rbtree entry
> > > > has been cleared?
> > > >
> > > > Basically, I'm worried about a dequeue() op that does something like:
> > > >
> > > > skb = rbtree_head();
> > > > // skb->rbnode is not cleared
> > > > return skb; // stack will keep processing it
> > > >
> > > > I'm a little fuzzy on how the bpf rbtree stuff works, though, so maybe
> > > > the verifier is already ensuring that a node cannot be read from a tree
> > > > without being properly cleared from it?
> > > >
> > >
> > > I see what you are saying now, and thanks Kumar for the clarification!
> > >
> > > I was thinking about how to prevent an skb from being added to lists
> > > and rbtrees at the same time, since list and rbnode share the same
> > > space. Hence the suggestion.
> > >
> >
> > In BPF qdisc programs, you could teach the verifier that the skb has
> > reference semantics (ref_obj_id > 0),
> > in such a case once you push it into a list or rbtree, the program
> > will lose ownership of the skb and all pointers same as the skb will
> > be marked invalid.
> > You could use some peek helper to look at it, but will never have an
> > skb with program ownership until you pop it back from a list or
> > rbtree.
> >
>
> This part makes sense. In the enqueue() op of bpf qdisc, I use a kfunc
> to acquire an skb kptr (ref_obj_id > 0) from the skb in ctx for now.
> Martin suggested tracking reads from ctx and assigning ref_obj_id.
>
> However, either way, if users can do this multiple times in one
> enqueue() call like below, they can acquire multiple references to the
> same skb and put them on different lists/rbtrees. This is what I'd
> like to avoid.
>
> SEC("struct_ops/bpf_fifo_enqueue")
> int BPF_PROG(bpf_fifo_enqueue, struct sk_buff *skb, struct Qdisc *sch,
> struct bpf_sk_buff_ptr *to_free)
> {
>         ...
>         skb_kptr_a = bpf_skb_acquire(skb);
>         skb_kptr_b = bpf_skb_acquire(skb);

Yeah, this acquire kfunc is the root cause. It basically signals that
'multiple references are ok' even though that's not the case.
It would have been the least intrusive way to allow skb kptrs, but it
doesn't work well for enforcing unique ownership of the skb.

Instead of this, we could teach the verifier that a struct_ops
argument can be an acquired pointer passed as a parameter to the BPF
program.
Then, on entry into the program, a reference state would be created
for it, and the corresponding ref_obj_id would be assigned when it is
loaded from the ctx.
Once the reference state goes away, loading it again from the ctx
will just mark it as an unknown scalar or return an error.

Then, you can only push the skb into a list or rbtree once.

>
>         bpf_list_push_back(&list_1, skb_kptr_a->bpf_list);
>         bpf_list_push_back(&list_2, skb_kptr_b->bpf_list);
>         ...
>
> Thanks,
> Amery
>
> > In the XDP queueing series, we taught the verifier to have reference
> > semantics for xdp_md in the dequeue program, and then return such a
> > pointer from the program back to the kernel.
> > The changes to allow PTR_TO_PACKET accesses were also fairly simple,
> > the verifier just needs to know that comparison of data, data_end can
> > only be done for pkt pointers coming from the same xdp_md (as there
> > can be multiple in the same program at a time).


* Re: [LSF/MM/BPF TOPIC] bpf qdisc
  2024-03-01 20:07               ` Kumar Kartikeya Dwivedi
@ 2024-03-01 23:30                 ` Amery Hung
  0 siblings, 0 replies; 13+ messages in thread
From: Amery Hung @ 2024-03-01 23:30 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: Martin KaFai Lau, Toke Høiland-Jørgensen, lsf-pc, bpf

On Fri, Mar 1, 2024 at 12:08 PM Kumar Kartikeya Dwivedi
<memxor@gmail.com> wrote:
>
> On Fri, 1 Mar 2024 at 20:28, Amery Hung <ameryhung@gmail.com> wrote:
> >
> > On Fri, Mar 1, 2024 at 7:06 AM Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote:
> > >
> > > On Fri, 1 Mar 2024 at 16:01, Amery Hung <ameryhung@gmail.com> wrote:
> > > >
> > > > On Fri, Mar 1, 2024 at 6:08 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> > > > >
> > > > > Amery Hung <ameryhung@gmail.com> writes:
> > > > >
> > > > > > On Wed, Feb 28, 2024 at 6:36 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> > > > > >>
> > > > > >> Kumar Kartikeya Dwivedi <memxor@gmail.com> writes:
> > > > > >>
> > > > > >> > On Mon, 26 Feb 2024 at 19:04, Amery Hung <ameryhung@gmail.com> wrote:
> > > > > >> >>
> > > > > >> >> Hi all,
> > > > > >> >>
> > > > > >> >> I would like to discuss bpf qdisc in the BPF track. As we now try to
> > > > > >> >> support bpf qdisc using struct_ops, we found some limitations of
> > > > > >> >> bpf/struct_ops. While some have been discussed briefly on the mailing
> > > > > >> >> list, we can discuss in more detail to make struct_ops a more
> > > > > >> >> generic/palatable approach to replace kernel functions.
> > > > > >> >>
> > > > > >> >> In addition, I would like to discuss supporting adding kernel objects
> > > > > >> >> to bpf_list/rbtree, which may have performance benefits in some
> > > > > >> >> applications and can improve the programming experience. The current
> > > > > >> >> bpf fq in the RFC has a 6% throughput loss compared to the native
> > > > > >> >> counterpart due to memory allocation in enqueue() to store skb kptr.
> > > > > >> >> With a POC I wrote that allows adding skb to bpf_list, the throughput
> > > > > >> >> becomes comparable. We can discuss the approach and other potential
> > > > > >> >> use cases.
> > > > > >> >>
> > > > > >> >
> > > > > >> > When discussing this with Toke (Cc'd) long ago for the XDP queueing
> > > > > >> > patch set, we discussed the same thing, in that the sk_buff already
> > > > > >> > has space for a list or rbnode due to it getting queued in other
> > > > > >> > layers (TCP OoO queue, qdiscs, etc.) so it would make sense to teach
> > > > > >> > the verifier that it is a valid bpf_list_node and bpf_rb_node and
> > > > > >> > allow inserting it as an element into a BPF list or rbtree. Back then
> > > > > >> > we didn't add that as the posting only used the PIFO map.
> > > > > >> >
> > > > > >> > I think not only sk_buff, you can do a similar thing with xdp_buff as
> > > > > >> > well.
> > > > > >>
> > > > > >> Yeah, I agree that allowing skbs to be inserted directly into a BPF
> > > > > >> rbtree would make a lot of sense if it can be done safely. I am less
> > > > > >> sure about xdp_frames, mostly for performance reasons, but if it does
> > > > > >> turn out to be useful whichever mechanism we add for skbs should be
> > > > > >> fairly straight forward to reuse.
> > > > > >>
> > > > > >> > The verifier side changes should be fairly minimal, just allowing the
> > > > > >> > use of a known kernel type as the contained object in a list or
> > > > > >> > rbtree, and the field pointing to this allowlisted list or rbnode.
> > > > > >>
> > > > > >> I think one additional concern here is how we ensure that an skb has
> > > > > >> been correctly removed from any rbtrees it sits in before it is being
> > > > > >> transmitted to another part of the stack?
> > > > > >
> > > > > > I think one solution is to disallow shared ownership of skb in
> > > > > > multiple lists or rbtrees. That is, users should not be able to
> > > > > > acquire the reference of an skb from the ctx more than once in
> > > > > > ".enqueue" or using bpf_refcount_acquire().
> > > > >
> > > > > Can the verifier enforce this, even across multiple enqueue/dequeue
> > > > > calls? Not sure if acquiring a refcount ensures that the rbtree entry
> > > > > has been cleared?
> > > > >
> > > > > Basically, I'm worried about a dequeue() op that does something like:
> > > > >
> > > > > skb = rbtree_head();
> > > > > // skb->rbnode is not cleared
> > > > > return skb; // stack will keep processing it
> > > > >
> > > > > I'm a little fuzzy on how the bpf rbtree stuff works, though, so maybe
> > > > > the verifier is already ensuring that a node cannot be read from a tree
> > > > > without being properly cleared from it?
> > > > >
> > > >
> > > > I see what you are saying now, and thanks Kumar for the clarification!
> > > >
> > > > I was thinking about how to prevent an skb from being added to lists
> > > > and rbtrees at the same time, since list and rbnode share the same
> > > > space. Hence the suggestion.
> > > >
> > >
> > > In BPF qdisc programs, you could teach the verifier that the skb has
> > > reference semantics (ref_obj_id > 0),
> > > in such a case once you push it into a list or rbtree, the program
> > > will lose ownership of the skb and all pointers same as the skb will
> > > be marked invalid.
> > > You could use some peek helper to look at it, but will never have an
> > > skb with program ownership until you pop it back from a list or
> > > rbtree.
> > >
> >
> > This part makes sense. In the enqueue() op of bpf qdisc, I use a kfunc
> > to acquire an skb kptr (ref_obj_id > 0) from the skb in ctx for now.
> > Martin suggested tracking reads from ctx and assigning ref_obj_id.
> >
> > However, either way, if users can do this multiple times in one
> > enqueue() call like below, they can acquire multiple references to the
> > same skb and put them on different lists/rbtrees. This is what I'd
> > like to avoid.
> >
> > SEC("struct_ops/bpf_fifo_enqueue")
> > int BPF_PROG(bpf_fifo_enqueue, struct sk_buff *skb, struct Qdisc *sch,
> > struct bpf_sk_buff_ptr *to_free)
> > {
> >         ...
> >         skb_kptr_a = bpf_skb_acquire(skb);
> >         skb_kptr_b = bpf_skb_acquire(skb);
>
> Yeah, this acquire kfunc is the root cause. That basically is a sign
> that 'multiple references are ok' even though that's not the case.
> It would have been the least intrusive way to allow skb kptrs, but it
> wouldn't work well to enforce unique ownership of the skb.
>
> Instead of this, we could teach the verifier that a struct ops
> argument can be an acquired pointer being passed as a parameter to the
> BPF program.
> Then, on entry into the program, it would have a reference state
> created for it and the corresponding ref_obj_id be assigned to that
> when loading it from ctx.
> When the reference state goes away, loading it again from ctx will
> just mark it as an unknown scalar or return an error.

This seems like a more natural way to enforce such semantics than
hacking it into kfuncs. I will try to incorporate this into the next
RFC. Thank you!

-Amery

>
> Then, you can only push the skb into a list or rbtree once.
>
> >
> >         bpf_list_push_back(&list_1, skb_kptr_a->bpf_list);
> >         bpf_list_push_back(&list_2, skb_kptr_b->bpf_list);
> >         ...
> >
> > Thanks,
> > Amery
> >
> > > In the XDP queueing series, we taught the verifier to have reference
> > > semantics for xdp_md in the dequeue program, and then return such a
> > > pointer from the program back to the kernel.
> > > The changes to allow PTR_TO_PACKET accesses were also fairly simple,
> > > the verifier just needs to know that comparison of data, data_end can
> > > only be done for pkt pointers coming from the same xdp_md (as there
> > > can be multiple in the same program at a time).


end of thread, other threads:[~2024-03-01 23:30 UTC | newest]

Thread overview: 13+ messages
2024-02-26 18:03 [LSF/MM/BPF TOPIC] bpf qdisc Amery Hung
2024-02-26 18:10 ` Kumar Kartikeya Dwivedi
2024-02-28 14:36   ` Toke Høiland-Jørgensen
2024-02-28 23:01     ` Amery Hung
2024-03-01 14:08       ` Toke Høiland-Jørgensen
2024-03-01 14:11         ` Kumar Kartikeya Dwivedi
2024-03-01 14:23           ` Toke Høiland-Jørgensen
2024-03-01 15:00         ` Amery Hung
2024-03-01 15:06           ` Kumar Kartikeya Dwivedi
2024-03-01 19:28             ` Amery Hung
2024-03-01 20:07               ` Kumar Kartikeya Dwivedi
2024-03-01 23:30                 ` Amery Hung
2024-03-01 15:06           ` Toke Høiland-Jørgensen
