All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Toke Høiland-Jørgensen" <toke@redhat.com>
To: Martin KaFai Lau <kafai@fb.com>,
	Cong Wang <xiyou.wangcong@gmail.com>,
	John Fastabend <john.fastabend@gmail.com>
Cc: Linux Kernel Network Developers <netdev@vger.kernel.org>,
	bpf <bpf@vger.kernel.org>, Cong Wang <cong.wang@bytedance.com>,
	Jamal Hadi Salim <jhs@mojatatu.com>,
	Jiri Pirko <jiri@resnulli.us>
Subject: Re: [RFC Patch net-next] net_sched: introduce eBPF based Qdisc
Date: Thu, 02 Sep 2021 18:57:17 +0200	[thread overview]
Message-ID: <871r66ud8y.fsf@toke.dk> (raw)
In-Reply-To: <20210901174543.xukawl7ylkqzbuax@kafai-mbp.dhcp.thefacebook.com>

Martin KaFai Lau <kafai@fb.com> writes:

> On Wed, Sep 01, 2021 at 12:42:03PM +0200, Toke Høiland-Jørgensen wrote:
>> John Fastabend <john.fastabend@gmail.com> writes:
>> 
>> > Cong Wang wrote:
>> >> On Tue, Aug 24, 2021 at 4:47 PM Martin KaFai Lau <kafai@fb.com> wrote:
>> >> > Please explain more on this.  What is currently missing
>> >> > to make qdisc in struct_ops possible?
>> >> 
>> >> I think you misunderstand this point. The reason why I avoid it is
>> >> _not_ anything is missing, quite oppositely, it is because it requires
>> >> a lot of work to implement a Qdisc with struct_ops approach, literally
>> >> all those struct Qdisc_ops (not to mention struct Qdisc_class_ops).
>> >> WIth current approach, programmers only need to implement two
>> >> eBPF programs (enqueue and dequeue).
> _if_ it is using as a qdisc object/interface,
> the patch "looks" easier because it obscures some of the ops/interface
> from the bpf user.  The user will eventually ask for more flexibility
> and then an on-par interface as the kernel's qdisc.  If there are some
> common 'ops', the common bpf code can be shared as a library in userspace
> or there is also kfunc call to call into the kernel implementation.
> For existing kernel qdisc author,  it will be easier to use the same
> interface also.

The question is if it's useful to provide the full struct_ops for
qdiscs? Having it would allow a BPF program to implement that interface
towards userspace (things like statistics, classes etc), but the
question is if anyone is going to bother with that given the wealth of
BPF-specific introspection tools already available?

My hope is that we can (longer term) develop some higher-level tools to
express queueing policies that can then generate the BPF code needed to
implement them. Or as a start just some libraries to make this easier,
which I think is also what you're hinting at here? :)

>> > Another idea. Rather than work with qdisc objects which creates all
>> > these issues with how to work with existing interfaces, filters, etc.
>> > Why not create an sk_buff map? Then this can be used from the existing
>> > egress/ingress hooks independent of the actual qdisc being used.
>> 
>> I agree. In fact, I'm working on doing just this for XDP, and I see no
>> reason why the map type couldn't be reused for skbs as well. Doing it
>> this way has a couple of benefits:
>> 
>> - It leaves more flexibility to BPF: want a simple FIFO queue? just
>>   implement that with a single queue map. Or do you want to build a full
>>   hierarchical queueing structure? Just instantiate as many queue maps
>>   as you need to achieve this. Etc.
> Agree.  Regardless how the interface may look like,
> I even think being able to queue/dequeue an skb into different bpf maps
> should be the first thing to do here.  Looking forward to your patches.

Thanks! Guess I should go work on them, then :D

>> - The behaviour is defined entirely by BPF program behaviour, and does
>>   not require setting up a qdisc hierarchy in addition to writing BPF
>>   code.
> Interesting idea.  If it does not need to use the qdisc object/interface
> and be able to do the qdisc hierarchy setup in a programmable way, it may
> be nice.  It will be useful for the future patches to come with some
> bpf prog examples to do that.

Absolutely; we plan to include example algorithm implementations as well!

>> - It should be possible to structure the hooks in a way that allows
>>   reusing queueing algorithm implementations between the qdisc and XDP
>>   layers.
>> 
>> > You mention skb should not be exposed to userspace? Why? Whats the
>> > reason for this? Anyways we can make kernel only maps if we want or
>> > scrub the data before passing it to userspace. We do this already in
>> > some cases.
>> 
>> Yup, that's my approach as well.
>> 
>> > IMO it seems cleaner and more general to allow sk_buffs
>> > to be stored in maps and pulled back out later for enqueue/dequeue.
>> 
>> FWIW there's some gnarly details here (for instance, we need to make
>> sure the BPF program doesn't leak packet references after they are
>> dequeued from the map). My idea is to use a scheme similar to what we do
>> for XDP_REDIRECT, where a helper sets some hidden variables and doesn't
>> actually remove the packet from the queue until the BPF program exits
>> (so the kernel can make sure things are accounted correctly).
> The verifier is tracking the sk's references.  Can it be reused to
> track the skb's reference?

I was vaguely aware that it does this, but have not looked at the
details. Would be great if this was possible; will see how far I get
with it, and iterate from there (with your help, hopefully :))

-Toke


  parent reply	other threads:[~2021-09-02 16:57 UTC|newest]

Thread overview: 20+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-08-21  1:02 [RFC Patch net-next] net_sched: introduce eBPF based Qdisc Cong Wang
2021-08-24 23:47 ` Martin KaFai Lau
2021-09-01  4:39   ` Cong Wang
2021-09-01  5:45     ` John Fastabend
2021-09-01 10:42       ` Toke Høiland-Jørgensen
2021-09-01 17:45         ` Martin KaFai Lau
2021-09-01 18:03           ` Alexei Starovoitov
2021-09-02 16:57           ` Toke Høiland-Jørgensen [this message]
2021-09-02 20:40             ` John Fastabend
2021-09-02 22:27               ` Toke Høiland-Jørgensen
2021-09-02 23:35                 ` Martin KaFai Lau
2021-09-03 14:44                   ` Toke Høiland-Jørgensen
2021-09-03 15:33                     ` Jamal Hadi Salim
2021-09-10  6:55                     ` Martin KaFai Lau
2021-09-10 11:31                       ` Toke Høiland-Jørgensen
2021-09-04  1:09           ` Cong Wang
2021-09-17  4:19             ` Martin KaFai Lau
2021-09-04  1:30         ` Cong Wang
2021-09-06 11:45           ` Toke Høiland-Jørgensen
2021-09-04  1:05       ` Cong Wang

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=871r66ud8y.fsf@toke.dk \
    --to=toke@redhat.com \
    --cc=bpf@vger.kernel.org \
    --cc=cong.wang@bytedance.com \
    --cc=jhs@mojatatu.com \
    --cc=jiri@resnulli.us \
    --cc=john.fastabend@gmail.com \
    --cc=kafai@fb.com \
    --cc=netdev@vger.kernel.org \
    --cc=xiyou.wangcong@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.