All of lore.kernel.org
 help / color / mirror / Atom feed
From: sdf@google.com
To: Cong Wang <xiyou.wangcong@gmail.com>
Cc: netdev@vger.kernel.org, bpf@vger.kernel.org,
	"Cong Wang" <cong.wang@bytedance.com>,
	"Toke Høiland-Jørgensen" <toke@redhat.com>,
	"Jamal Hadi Salim" <jhs@mojatatu.com>,
	"Jiri Pirko" <jiri@resnulli.us>
Subject: Re: [RFC Patch v5 0/5] net_sched: introduce eBPF based Qdisc
Date: Fri, 24 Jun 2022 13:51:40 -0700	[thread overview]
Message-ID: <YrYj3LPaHV7thgJW@google.com> (raw)
In-Reply-To: <20220602041028.95124-1-xiyou.wangcong@gmail.com>

On 06/01, Cong Wang wrote:
> From: Cong Wang <cong.wang@bytedance.com>

> This *incomplete* patchset introduces a programmable Qdisc with eBPF.

> There are a few use cases:

> 1. Allow customizing Qdisc's in an easier way. So that people don't
>     have to write a complete Qdisc kernel module just to experiment
>     some new queuing theory.

> 2. Solve EDT's problem. EDT calcuates the "tokens" in clsact which
>     is before enqueue, it is impossible to adjust those "tokens" after
>     packets get dropped in enqueue. With eBPF Qdisc, it is easy to
>     be solved with a shared map between clsact and sch_bpf.

> 3. Replace qevents, as now the user gains much more control over the
>     skb and queues.

> 4. Provide a new way to reuse TC filters. Currently TC relies on filter
>     chain and block to reuse the TC filters, but they are too complicated
>     to understand. With eBPF helper bpf_skb_tc_classify(), we can invoke
>     TC filters on _any_ Qdisc (even on a different netdev) to do the
>     classification.

> 5. Potentially pave a way for ingress to queue packets, although
>     current implementation is still only for egress.

> 6. Possibly pave a way for handling TCP protocol in TC, as rbtree itself
>     is already used by TCP to handle TCP retransmission.

> The goal here is to make this Qdisc as programmable as possible,
> that is, to replace as many existing Qdisc's as we can, no matter
> in tree or out of tree. This is why I give up on PIFO which has
> serious limitations on the programmablity.

> Here is a summary of design decisions I made:

> 1. Avoid eBPF struct_ops, as it would be really hard to program
>     a Qdisc with this approach, literally all the struct Qdisc_ops
>     and struct Qdisc_class_ops are needed to implement. This is almost
>     as hard as programming a Qdisc kernel module.

> 2. Introduce skb map, which will allow other eBPF programs to store skb's
>     too.

>     a) As eBPF maps are not directly visible to the kernel, we have to
>     dump the stats via eBPF map API's instead of netlink.

>     b) The user-space is not allowed to read the entire packets, only  
> __sk_buff
>     itself is readable, because we don't have such a use case yet and it  
> would
>     require a different API to read the data, as map values have fixed  
> length.

>     c) Two eBPF helpers are introduced for skb map operations:
>     bpf_skb_map_push() and bpf_skb_map_pop(). Normal map update is
>     not allowed.

>     d) Multi-queue support is implemented via map-in-map, in a similar
>     push/pop fasion.

>     e) Use the netdevice notifier to reset the packets inside skb map upon
>     NETDEV_DOWN event.

> 3. Integrate with existing TC infra. For example, if the user doesn't want
>     to implement her own filters (e.g. a flow dissector), she should be  
> able
>     to re-use the existing TC filters. Another helper  
> bpf_skb_tc_classify() is
>     introduced for this purpose.

> Any high-level feedback is welcome. Please kindly do not review any coding
> details until RFC tag is removed.

> TODO:
> 1. actually test it

Can you try to implement some existing qdisc using your new mechanism?
For BPF-CC, Martin showcased how dctcp/cubic can be reimplemented;
I feel like this patch series (even rfc), should also have a good example
to show that bpf qdisc is on par and can be used to at least implement
existing policies. fq/fq_codel/cake are good candidates.

  parent reply	other threads:[~2022-06-24 20:51 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-06-02  4:10 [RFC Patch v5 0/5] net_sched: introduce eBPF based Qdisc Cong Wang
2022-06-02  4:10 ` [RFC Patch v5 1/5] net: introduce skb_rbtree_walk_safe() Cong Wang
2022-06-02  4:10 ` [RFC Patch v5 2/5] bpf: move map in map declarations to bpf.h Cong Wang
2022-06-02  4:10 ` [RFC Patch v5 3/5] bpf: introduce skb map and flow map Cong Wang
2022-06-02  4:10 ` [RFC Patch v5 4/5] net_sched: introduce eBPF based Qdisc Cong Wang
2022-06-03 23:14   ` Cong Wang
2022-06-02  4:10 ` [RFC Patch v5 5/5] net_sched: introduce helper bpf_skb_tc_classify() Cong Wang
2022-06-24 16:52 ` [RFC Patch v5 0/5] net_sched: introduce eBPF based Qdisc Toke Høiland-Jørgensen
2022-06-24 17:35   ` Dave Taht
2022-06-24 20:51 ` sdf [this message]
  -- strict thread matches above, loose matches on Subject: below --
2022-09-22  7:46 杨培灏

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=YrYj3LPaHV7thgJW@google.com \
    --to=sdf@google.com \
    --cc=bpf@vger.kernel.org \
    --cc=cong.wang@bytedance.com \
    --cc=jhs@mojatatu.com \
    --cc=jiri@resnulli.us \
    --cc=netdev@vger.kernel.org \
    --cc=toke@redhat.com \
    --cc=xiyou.wangcong@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.