From: sdf@google.com
To: Cong Wang <xiyou.wangcong@gmail.com>
Cc: netdev@vger.kernel.org, bpf@vger.kernel.org,
"Cong Wang" <cong.wang@bytedance.com>,
"Toke Høiland-Jørgensen" <toke@redhat.com>,
"Jamal Hadi Salim" <jhs@mojatatu.com>,
"Jiri Pirko" <jiri@resnulli.us>
Subject: Re: [RFC Patch v5 0/5] net_sched: introduce eBPF based Qdisc
Date: Fri, 24 Jun 2022 13:51:40 -0700 [thread overview]
Message-ID: <YrYj3LPaHV7thgJW@google.com> (raw)
In-Reply-To: <20220602041028.95124-1-xiyou.wangcong@gmail.com>
On 06/01, Cong Wang wrote:
> From: Cong Wang <cong.wang@bytedance.com>
> This *incomplete* patchset introduces a programmable Qdisc with eBPF.
> There are a few use cases:
> 1. Allow customizing Qdiscs in an easier way, so that people don't
> have to write a complete Qdisc kernel module just to experiment
> with new queueing theory.
> 2. Solve EDT's problem. EDT calculates the "tokens" in clsact, which
> runs before enqueue, so it is impossible to adjust those "tokens"
> after packets get dropped at enqueue. With an eBPF Qdisc, this is
> easily solved with a map shared between clsact and sch_bpf.
> 3. Replace qevents, as now the user gains much more control over the
> skb and queues.
> 4. Provide a new way to reuse TC filters. Currently TC relies on filter
> chains and blocks to reuse TC filters, but they are too complicated
> to understand. With the eBPF helper bpf_skb_tc_classify(), we can
> invoke TC filters on _any_ Qdisc (even on a different netdev) to do
> the classification.
> 5. Potentially pave a way for ingress to queue packets, although
> current implementation is still only for egress.
> 6. Possibly pave a way for handling the TCP protocol in TC, as the
> rbtree itself is already used by TCP to handle retransmissions.
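To make point 2 concrete, here is a minimal sketch of how a token
budget could be shared between a clsact EDT program and the sch_bpf
enqueue program. The map layout, the "sch_bpf/enqueue" section name
and the enqueue return values are assumptions for illustration, not
taken from the patches:

    /* Sketch: refund EDT "tokens" when the eBPF Qdisc drops a packet. */
    #include <linux/bpf.h>
    #include <linux/pkt_cls.h>
    #include <bpf/bpf_helpers.h>

    struct {
            __uint(type, BPF_MAP_TYPE_ARRAY);
            __uint(max_entries, 1);
            __type(key, __u32);
            __type(value, __u64);          /* remaining budget in bytes */
    } edt_tokens SEC(".maps");

    SEC("tc")                              /* clsact egress */
    int edt_charge(struct __sk_buff *skb)
    {
            __u32 key = 0;
            __u64 *budget = bpf_map_lookup_elem(&edt_tokens, &key);

            if (budget)
                    __sync_fetch_and_sub(budget, (__u64)skb->len);
            /* ... compute and set skb->tstamp for EDT pacing here ... */
            return TC_ACT_OK;
    }

    SEC("sch_bpf/enqueue")                 /* section name assumed */
    int enqueue(struct __sk_buff *skb)
    {
            __u32 key = 0;
            __u64 *budget = bpf_map_lookup_elem(&edt_tokens, &key);
            int over_limit = 0;            /* real queue-limit check elided */

            if (over_limit) {
                    /* Give back what clsact already charged for this skb. */
                    if (budget)
                            __sync_fetch_and_add(budget, (__u64)skb->len);
                    return 0;              /* "drop" verdict, placeholder */
            }
            return 1;                      /* "queued" verdict, placeholder */
    }

    char _license[] SEC("license") = "GPL";

The only piece the two programs have to agree on is the shared map;
the refund on drop is exactly what clsact alone cannot do today.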
> The goal here is to make this Qdisc as programmable as possible,
> that is, to replace as many existing Qdiscs as we can, whether in
> tree or out of tree. This is why I gave up on PIFO, which has
> serious limitations on programmability.
> Here is a summary of design decisions I made:
> 1. Avoid eBPF struct_ops, as it would be really hard to program
> a Qdisc with this approach: literally all of struct Qdisc_ops
> and struct Qdisc_class_ops would need to be implemented, which is
> almost as hard as writing a Qdisc kernel module.
> 2. Introduce an skb map, which will allow other eBPF programs to store
> skbs too.
> a) As eBPF maps are not directly visible to the kernel, we have to
> dump the stats via the eBPF map APIs instead of netlink.
> b) User space is not allowed to read entire packets; only __sk_buff
> itself is readable, because we don't have such a use case yet and it
> would require a different API to read the data, as map values have a
> fixed length.
> c) Two eBPF helpers are introduced for skb map operations:
> bpf_skb_map_push() and bpf_skb_map_pop(). Normal map update is
> not allowed.
> d) Multi-queue support is implemented via map-in-map, in a similar
> push/pop fashion.
> e) Use the netdevice notifier to reset the packets inside the skb map
> upon a NETDEV_DOWN event.
> 3. Integrate with existing TC infra. For example, if the user doesn't
> want to implement her own filters (e.g. a flow dissector), she should
> be able to re-use the existing TC filters. Another helper,
> bpf_skb_tc_classify(), is introduced for this purpose.
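Points 2c, 2d and 3 together suggest an enqueue program along the
lines of the sketch below, which classifies with existing TC filters
and then queues the skb by priority. The helper prototypes and IDs,
the map type name and the section name are all assumed here; the real
definitions live in the patches themselves:

    /* Sketch: reuse existing TC filters for classification, then queue
     * the skb into an skb map keyed by priority (smallest key first).
     */
    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    /* Prototypes and helper IDs assumed for illustration only. */
    static long (*bpf_skb_map_push)(void *map, struct __sk_buff *skb,
                                    __u64 key) = (void *)210;
    static long (*bpf_skb_tc_classify)(struct __sk_buff *skb, int ifindex,
                                       __u32 handle) = (void *)211;

    struct {
            __uint(type, BPF_MAP_TYPE_SKB_MAP);   /* type name assumed */
            __uint(max_entries, 1024);
            __type(key, __u64);
            __type(value, __u32);                 /* placeholder value type */
    } pq SEC(".maps");

    SEC("sch_bpf/enqueue")                        /* section name assumed */
    int enqueue(struct __sk_buff *skb)
    {
            /* Ask the filters already attached to this device for a
             * class id and use it as the priority key.
             */
            __u64 prio = bpf_skb_tc_classify(skb, skb->ifindex, 0);

            if (bpf_skb_map_push(&pq, skb, prio) < 0)
                    return 0;                     /* "drop", placeholder */
            return 1;                             /* "queued", placeholder */
    }

    char _license[] SEC("license") = "GPL";

A dequeue program would then pop the smallest key with
bpf_skb_map_pop(); with map-in-map (2d) the same pattern extends to
one inner skb map per band or per hardware queue.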
> Any high-level feedback is welcome. Please kindly do not review any
> coding details until the RFC tag is removed.
> TODO:
> 1. actually test it
Can you try to implement some existing qdisc using your new mechanism?
For BPF-CC, Martin showcased how dctcp/cubic can be reimplemented;
I feel like this patch series (even as an RFC) should also have a good
example to show that a BPF qdisc is on par and can be used to at least
implement existing policies. fq/fq_codel/cake are good candidates.
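For instance, even a trivial pfifo equivalent would already exercise
the enqueue hook, the skb map and tail-drop. With the same caveats as
above (assumed map type, helper prototype, section name and return
values), it could look roughly like this:

    /* Sketch: pfifo-like behaviour via a monotonically increasing key. */
    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    /* Prototype and helper ID assumed for illustration only. */
    static long (*bpf_skb_map_push)(void *map, struct __sk_buff *skb,
                                    __u64 key) = (void *)210;

    struct {
            __uint(type, BPF_MAP_TYPE_SKB_MAP);   /* type name assumed */
            __uint(max_entries, 1000);            /* pfifo "limit" */
            __type(key, __u64);
            __type(value, __u32);                 /* placeholder value type */
    } fifo SEC(".maps");

    SEC("sch_bpf/enqueue")                        /* section name assumed */
    int pfifo_enqueue(struct __sk_buff *skb)
    {
            static __u64 seq;                     /* rising keys => FIFO order */
            __u64 key = __sync_fetch_and_add(&seq, 1);

            /* A failed push when the map is full gives tail-drop for free. */
            if (bpf_skb_map_push(&fifo, skb, key) < 0)
                    return 0;                     /* "drop", placeholder */
            return 1;                             /* "queued", placeholder */
    }

    char _license[] SEC("license") = "GPL";

The dequeue side would just pop the smallest key. If that much works
end to end, fq/fq_codel/cake become a matter of choosing keys and
per-flow state, which would make the comparison with existing
policies much easier to evaluate.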
Thread overview: 11+ messages
2022-06-02 4:10 [RFC Patch v5 0/5] net_sched: introduce eBPF based Qdisc Cong Wang
2022-06-02 4:10 ` [RFC Patch v5 1/5] net: introduce skb_rbtree_walk_safe() Cong Wang
2022-06-02 4:10 ` [RFC Patch v5 2/5] bpf: move map in map declarations to bpf.h Cong Wang
2022-06-02 4:10 ` [RFC Patch v5 3/5] bpf: introduce skb map and flow map Cong Wang
2022-06-02 4:10 ` [RFC Patch v5 4/5] net_sched: introduce eBPF based Qdisc Cong Wang
2022-06-03 23:14 ` Cong Wang
2022-06-02 4:10 ` [RFC Patch v5 5/5] net_sched: introduce helper bpf_skb_tc_classify() Cong Wang
2022-06-24 16:52 ` [RFC Patch v5 0/5] net_sched: introduce eBPF based Qdisc Toke Høiland-Jørgensen
2022-06-24 17:35 ` Dave Taht
2022-06-24 20:51 ` sdf [this message]