Re: [RFC PATCH v7 1/8] net_sched: Introduce eBPF based Qdisc

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Kui-Feng Lee <sinquersw@gmail.com>
To: Martin KaFai Lau <martin.lau@linux.dev>,
	Amery Hung <ameryhung@gmail.com>
Cc: bpf@vger.kernel.org, yangpeihao@sjtu.edu.cn, toke@redhat.com,
	jhs@mojatatu.com, jiri@resnulli.us, sdf@google.com,
	xiyou.wangcong@gmail.com, yepeilin.cs@gmail.com,
	netdev@vger.kernel.org, Kui-Feng Lee <thinker.li@gmail.com>
Subject: Re: [RFC PATCH v7 1/8] net_sched: Introduce eBPF based Qdisc
Date: Tue, 30 Jan 2024 09:49:26 -0800	[thread overview]
Message-ID: <845df264-adb3-4e00-bb8e-2a0ac1d331ae@gmail.com> (raw)
In-Reply-To: <8c00bd63-2d00-401e-af6d-1b6aebac4701@linux.dev>



On 1/29/24 22:39, Martin KaFai Lau wrote:
> On 1/26/24 5:17 PM, Amery Hung wrote:
>> On Thu, Jan 25, 2024 at 6:22 PM Martin KaFai Lau 
>> <martin.lau@linux.dev> wrote:
>>>
>>> On 1/23/24 9:22 PM, Amery Hung wrote:
>>>>> I looked at the high level of the patchset. The major ops that it 
>>>>> wants to be
>>>>> programmable in bpf is the ".enqueue" and ".dequeue" (+ ".init" and 
>>>>> ".reset" in
>>>>> patch 4 and patch 5).
>>>>>
>>>>> This patch adds a new prog type BPF_PROG_TYPE_QDISC, four attach 
>>>>> types (each for
>>>>> ".enqueue", ".dequeue", ".init", and ".reset"), and a new 
>>>>> "bpf_qdisc_ctx" in the
>>>>> uapi. It is no long an acceptable way to add new bpf extension.
>>>>>
>>>>> Can the ".enqueue", ".dequeue", ".init", and ".reset" be completely 
>>>>> implemented
>>>>> in bpf (with the help of new kfuncs if needed)? Then a struct_ops 
>>>>> for Qdisc_ops
>>>>> can be created. The bpf Qdisc_ops can be loaded through the 
>>>>> existing struct_ops api.
>>>>>
>>>> Partially. If using struct_ops, I think we'll need another structure
>>>> like the following in bpf qdisc to be implemented with struct_ops bpf:
>>>>
>>>> struct bpf_qdisc_ops {
>>>>       int (*enqueue) (struct sk_buff *skb)
>>>>       void (*dequeue) (void)
>>>>       void (*init) (void)
>>>>       void (*reset) (void)
>>>> };
>>>>
>>>> Then, Qdisc_ops will wrap around them to handle things that cannot be
>>>> implemented with bpf (e.g., sch_tree_lock, returning a skb ptr).
>>>
>>> We can see how those limitations (calling sch_tree_lock() and 
>>> returning a ptr)
>>> can be addressed in bpf. This will also help other similar use cases.
>>>
>>
>> For kptr, I wonder if we can support the following semantics in bpf if
>> they make sense:
> 
> I think they are useful but they are not fully supported now.
> 
> Some thoughts below.
> 
>> 1. Passing a referenced kptr into a bpf program, which will also need
>> to be released, or exchanged into maps or allocated objects.
> 
> "enqueue" should be the one considering here:
> 
> struct Qdisc_ops {
>      /* ... */
>      int                     (*enqueue)(struct sk_buff *skb,
>                         struct Qdisc *sch,
>                         struct sk_buff **to_free);
> 
> };
> 
> The verifier only marks the skb as a trusted kptr but does not mark its 
> reg->ref_obj_id. Take a look at btf_ctx_access(). In particular:
> 
>      if (prog_args_trusted(prog))
>          info->reg_type |= PTR_TRUSTED;
> 
> The verifier does not know the skb ownership is passed into the 
> ".enqueue" ops and does not know the bpf prog needs to release it or 
> store it in a map.
> 
> The verifier tracks the reference state when a KF_ACQUIRE kfunc is 
> called (just an example, not saying we need to use KF_ACQUIRE kfunc). 
> Take a look at acquire_reference_state() which is the useful one here.
> 
> Whenever the verifier is loading the ".enqueue" bpf_prog, the verifier 
> can always acquire_reference_state() for the "struct sk_buff *skb" 
> argument.
> 
> Take a look at a recent RFC: 
> https://lore.kernel.org/bpf/20240122212217.1391878-1-thinker.li@gmail.com/
> which is tagging the argument of an ops (e.g. ".enqueue" here). That RFC 
> patch is tagging the argument could be NULL by appending "__nullable" to 
> the argument name. The verifier will enforce that the bpf prog must 
> check for NULL first.
> 
> The similar idea can be used here but with a different tagging (for 
> example, "__must_release", admittedly not a good name). While the RFC 
> patch is in-progress, for now, may be hardcode for the ".enqueue" ops in 
> check_struct_ops_btf_id() and always acquire_reference_state() for the 
> skb. This part can be adjusted later once the RFC patch will be in shape.
> 
> 
> Then one more thing is to track when the struct_ops bpf prog is actually 
> reading the value of the skb pointer. One thing is worth to mention 
> here, e.g. a struct_ops prog for enqueue:
> 
> SEC("struct_ops")
> int BPF_PROG(bpf_dropall_enqueue, struct sk_buff *skb, struct Qdisc *sch,
>           struct sk_buff **to_free)
> {
>      return bpf_qdisc_drop(skb, sch, to_free);
> }
> 
> Take a look at the BPF_PROG macro, the bpf prog is getting a pointer to 
> an array of __u64 as the only argument. The skb is actually in ctx[0], 
> sch is in ctx[1]...etc. When ctx[0] is read to get the skb pointer (e.g. 
> r1 = ctx[0]), btf_ctx_access() marks the reg_type to PTR_TRUSTED. It 
> needs to also initialize the reg->ref_obj_id by the id obtained earlier 
> from acquire_reference_state() during check_struct_ops_btf_id() somehow.
> 
> 
>> 2. Returning a kptr from a program and treating it as releasing the 
>> reference.
> 
> e.g. for dequeue:
> 
> struct Qdisc_ops {
>      /* ... */
>      struct sk_buff *        (*dequeue)(struct Qdisc *);
> };
> 
> 
> Right now the verifier should complain on check_reference_leak() if the 
> struct_ops bpf prog is returning a referenced kptr.
> 
> Unlike an argument, the return type of a function does not have a name 
> to tag. It is the first case that a struct_ops bpf_prog returning a 

We may tag the stub functions instead, right?
Is the purpose here to return a referenced pointer from a struct_ops
operator without verifier complaining?

> pointer. One idea is to assume it must be a trusted pointer 
> (PTR_TRUSTED) and the verifier should check it is indeed with 
> PTR_TRUSTED flag.
> 
> May be release_reference_state() can be called to assume the kernel will 
> release it as long as the return pointer type is PTR_TRUSTED and the 
> type matches the return type of the ops. Take a look at 
> check_return_code().
>

next prev parent reply	other threads:[~2024-01-30 17:49 UTC|newest]

Thread overview: 33+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-01-17 21:56 [RFC PATCH v7 0/8] net_sched: Introduce eBPF based Qdisc Amery Hung
2024-01-17 21:56 ` [RFC PATCH v7 1/8] " Amery Hung
2024-01-23 23:51   ` Martin KaFai Lau
2024-01-24  5:22     ` Amery Hung
2024-01-26  2:22       ` Martin KaFai Lau
2024-01-27  1:17         ` Amery Hung
2024-01-30  6:39           ` Martin KaFai Lau
2024-01-30 17:49             ` Kui-Feng Lee [this message]
2024-01-31  1:01               ` Martin KaFai Lau
2024-01-31 16:49                 ` Kui-Feng Lee
2024-01-31 16:59                   ` Amery Hung
2024-01-31 16:23             ` Amery Hung
2024-02-02  1:47               ` Martin KaFai Lau
2024-02-09 20:14                 ` Amery Hung
2024-01-17 21:56 ` [RFC PATCH v7 2/8] net_sched: Add kfuncs for working with skb Amery Hung
2024-01-17 21:56 ` [RFC PATCH v7 3/8] net_sched: Introduce kfunc bpf_skb_tc_classify() Amery Hung
2024-01-17 21:56 ` [RFC PATCH v7 4/8] net_sched: Add reset program Amery Hung
2024-01-17 21:56 ` [RFC PATCH v7 5/8] net_sched: Add init program Amery Hung
2024-01-17 21:56 ` [RFC PATCH v7 6/8] tools/libbpf: Add support for BPF_PROG_TYPE_QDISC Amery Hung
2024-01-23  0:17   ` Andrii Nakryiko
2024-01-23 19:40     ` Amery Hung
2024-01-17 21:56 ` [RFC PATCH v7 7/8] samples/bpf: Add an example of bpf fq qdisc Amery Hung
2024-01-24 10:29   ` Daniel Borkmann
2024-01-26 19:49     ` Amery Hung
2024-01-17 21:56 ` [RFC PATCH v7 8/8] samples/bpf: Add an example of bpf netem qdisc Amery Hung
2024-01-23 21:13 ` [RFC PATCH v7 0/8] net_sched: Introduce eBPF based Qdisc Stanislav Fomichev
2024-01-24 10:10   ` Daniel Borkmann
2024-01-24 12:09   ` Jamal Hadi Salim
2024-01-24 13:07     ` Daniel Borkmann
2024-01-24 14:11       ` Jamal Hadi Salim
2024-01-24 15:26         ` Daniel Borkmann
2024-01-24 21:26           ` Amery Hung
2024-01-25 11:57             ` Daniel Borkmann

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=845df264-adb3-4e00-bb8e-2a0ac1d331ae@gmail.com \
    --to=sinquersw@gmail.com \
    --cc=ameryhung@gmail.com \
    --cc=bpf@vger.kernel.org \
    --cc=jhs@mojatatu.com \
    --cc=jiri@resnulli.us \
    --cc=martin.lau@linux.dev \
    --cc=netdev@vger.kernel.org \
    --cc=sdf@google.com \
    --cc=thinker.li@gmail.com \
    --cc=toke@redhat.com \
    --cc=xiyou.wangcong@gmail.com \
    --cc=yangpeihao@sjtu.edu.cn \
    --cc=yepeilin.cs@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).