From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from mail-yb1-f172.google.com (mail-yb1-f172.google.com [209.85.219.172]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 2CAA514E2C6; Tue, 30 Jan 2024 17:49:29 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.219.172 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1706636971; cv=none; b=IMjAhaHisUO1ccTbg01jxkmAGdr3haUT81Y7TyFBpLVYFgAHUPg/Scsd4NDRtiNwUaGJYUyhdjA58G5m0+2nZUWW4f/+HXXbpaCayMtZDAwMJb5EsOAbDWrICzGiGizE7LF712XICBzO0Udfi1MAGlEsknFL1PMdOKYQWA2bVm8= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1706636971; c=relaxed/simple; bh=OYumdxXhyXFNrpTnjPuyVJJC5+19XtuMmzsdSuo82Go=; h=Message-ID:Date:MIME-Version:Subject:To:Cc:References:From: In-Reply-To:Content-Type; b=mpi7iEJlBq+oJvgTEgiaB1waikarswQlgLsKZTwQZtyH3D9cZWVFchJzfUkj80CwzMQ6PLTfeE2seRDwUXjrchOwfkr03o9agWVO4nVzzXyl6jCj6gVAqwDUn6FMSQSdE4uyhgjTNVQ8AvIIk7hKqExt+EhNw5pj5EEDuRk+lRo= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=SFA4WTT/; arc=none smtp.client-ip=209.85.219.172 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="SFA4WTT/" Received: by mail-yb1-f172.google.com with SMTP id 3f1490d57ef6-dc6432ee799so4289813276.0; Tue, 30 Jan 2024 09:49:29 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1706636969; x=1707241769; darn=vger.kernel.org; h=content-transfer-encoding:in-reply-to:from:references:cc:to :content-language:subject:user-agent:mime-version:date:message-id :from:to:cc:subject:date:message-id:reply-to; bh=ELR7zHCCPfH1FtIz7LAHyE/HzjDEzixfdKr1qjQ9D4Y=; b=SFA4WTT/UtFsZyJ6vmnlUV9w4MIonBtGYuLsrn/c0Ep/zoCh4TVS+2N5sWtmYZ1eCc vf4Hm1ZCXXNF2QoiBrH2m1FegwkU+15shXZQ0DHwQCxa9KRHQSE7no6P8MiCqX4OE+Fy +2/x4QHWxyuAwPNPgIU0pJ7UAD0pYrRiANE3q3prDSRB5BUeP7P24OIV+VsOfFjRkv53 b3TrZb42Fw1yVbUYDUvgWg8yRZ4700Rxo1kePD1uLmXDmeIC+r1rlyE1pq5w304X7K7E fZAyYqCeSz2npZr4KtV8fU5s52ebqJ1/E9xGJsMDf/FYQg6gEsz3bRW8F9rcZFghui/z iP5g== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1706636969; x=1707241769; h=content-transfer-encoding:in-reply-to:from:references:cc:to :content-language:subject:user-agent:mime-version:date:message-id :x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=ELR7zHCCPfH1FtIz7LAHyE/HzjDEzixfdKr1qjQ9D4Y=; b=JLjnEhp062Bdpq3tmcBTxvWJMpYMf5UYWjzKfpQsZBFlC69AnreHR95cq02sq+kUnh VDF7HhzZFSMFtFxV2Bn8ITgQFPa3DENvsJWMdVABKnR+zeKykqCHObzyEW9zB5nz2vFN 5hWczUMaPK6aUFyhBAed0JKvrNGjgod283ZSVLlca64gU/G5/qNPdLbLXnx6b615/E72 YZ6jjeV3z3tBdSUFFBhudIpitmH925mSaW6VgOu8daY0n0ERq6bd/N15JjktwuaCg8/r qXei6GHpt6YUvpAVfCgcJ2ZpSPZB3o4l63ya8lMJglJjGNRXcdcQiHuCO6U4Mctp11Et hkwg== X-Gm-Message-State: AOJu0YwbY3YEjhnGW5S54rjjJquHpKVF3UuEnsH4Tkx9e2OQbpk5IrX+ ebFA3oNSIwX6+1rZiMa48s7nEDlIDaHaMiM+0ivXNS8PRynifYYy X-Google-Smtp-Source: AGHT+IFMeTJEgALTPGO91oHqLXTFvleiwbnR4ngooQcY55xavAcEM5y9SaXRXJArrVbUac5pzyLnlQ== X-Received: by 2002:a05:6902:1005:b0:dc2:3e3f:afd7 with SMTP id w5-20020a056902100500b00dc23e3fafd7mr8244401ybt.54.1706636968926; Tue, 30 Jan 2024 09:49:28 -0800 (PST) Received: from ?IPV6:2600:1700:6cf8:1240:3c9e:c544:4041:5027? ([2600:1700:6cf8:1240:3c9e:c544:4041:5027]) by smtp.gmail.com with ESMTPSA id 81-20020a250154000000b00dbed74597cesm3093886ybb.16.2024.01.30.09.49.27 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Tue, 30 Jan 2024 09:49:28 -0800 (PST) Message-ID: <845df264-adb3-4e00-bb8e-2a0ac1d331ae@gmail.com> Date: Tue, 30 Jan 2024 09:49:26 -0800 Precedence: bulk X-Mailing-List: netdev@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: [RFC PATCH v7 1/8] net_sched: Introduce eBPF based Qdisc Content-Language: en-US To: Martin KaFai Lau , Amery Hung Cc: bpf@vger.kernel.org, yangpeihao@sjtu.edu.cn, toke@redhat.com, jhs@mojatatu.com, jiri@resnulli.us, sdf@google.com, xiyou.wangcong@gmail.com, yepeilin.cs@gmail.com, netdev@vger.kernel.org, Kui-Feng Lee References: <232881645a5c4c05a35df4ff1f08a19ef9a02662.1705432850.git.amery.hung@bytedance.com> <0484f7f7-715f-4084-b42d-6d43ebb5180f@linux.dev> <01fdb720-c0dc-495d-a42d-756aa2bf4455@linux.dev> <8c00bd63-2d00-401e-af6d-1b6aebac4701@linux.dev> From: Kui-Feng Lee In-Reply-To: <8c00bd63-2d00-401e-af6d-1b6aebac4701@linux.dev> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit On 1/29/24 22:39, Martin KaFai Lau wrote: > On 1/26/24 5:17 PM, Amery Hung wrote: >> On Thu, Jan 25, 2024 at 6:22 PM Martin KaFai Lau >> wrote: >>> >>> On 1/23/24 9:22 PM, Amery Hung wrote: >>>>> I looked at the high level of the patchset. The major ops that it >>>>> wants to be >>>>> programmable in bpf is the ".enqueue" and ".dequeue" (+ ".init" and >>>>> ".reset" in >>>>> patch 4 and patch 5). >>>>> >>>>> This patch adds a new prog type BPF_PROG_TYPE_QDISC, four attach >>>>> types (each for >>>>> ".enqueue", ".dequeue", ".init", and ".reset"), and a new >>>>> "bpf_qdisc_ctx" in the >>>>> uapi. It is no long an acceptable way to add new bpf extension. >>>>> >>>>> Can the ".enqueue", ".dequeue", ".init", and ".reset" be completely >>>>> implemented >>>>> in bpf (with the help of new kfuncs if needed)? Then a struct_ops >>>>> for Qdisc_ops >>>>> can be created. The bpf Qdisc_ops can be loaded through the >>>>> existing struct_ops api. >>>>> >>>> Partially. If using struct_ops, I think we'll need another structure >>>> like the following in bpf qdisc to be implemented with struct_ops bpf: >>>> >>>> struct bpf_qdisc_ops { >>>>       int (*enqueue) (struct sk_buff *skb) >>>>       void (*dequeue) (void) >>>>       void (*init) (void) >>>>       void (*reset) (void) >>>> }; >>>> >>>> Then, Qdisc_ops will wrap around them to handle things that cannot be >>>> implemented with bpf (e.g., sch_tree_lock, returning a skb ptr). >>> >>> We can see how those limitations (calling sch_tree_lock() and >>> returning a ptr) >>> can be addressed in bpf. This will also help other similar use cases. >>> >> >> For kptr, I wonder if we can support the following semantics in bpf if >> they make sense: > > I think they are useful but they are not fully supported now. > > Some thoughts below. > >> 1. Passing a referenced kptr into a bpf program, which will also need >> to be released, or exchanged into maps or allocated objects. > > "enqueue" should be the one considering here: > > struct Qdisc_ops { >     /* ... */ >     int                     (*enqueue)(struct sk_buff *skb, >                        struct Qdisc *sch, >                        struct sk_buff **to_free); > > }; > > The verifier only marks the skb as a trusted kptr but does not mark its > reg->ref_obj_id. Take a look at btf_ctx_access(). In particular: > >     if (prog_args_trusted(prog)) >         info->reg_type |= PTR_TRUSTED; > > The verifier does not know the skb ownership is passed into the > ".enqueue" ops and does not know the bpf prog needs to release it or > store it in a map. > > The verifier tracks the reference state when a KF_ACQUIRE kfunc is > called (just an example, not saying we need to use KF_ACQUIRE kfunc). > Take a look at acquire_reference_state() which is the useful one here. > > Whenever the verifier is loading the ".enqueue" bpf_prog, the verifier > can always acquire_reference_state() for the "struct sk_buff *skb" > argument. > > Take a look at a recent RFC: > https://lore.kernel.org/bpf/20240122212217.1391878-1-thinker.li@gmail.com/ > which is tagging the argument of an ops (e.g. ".enqueue" here). That RFC > patch is tagging the argument could be NULL by appending "__nullable" to > the argument name. The verifier will enforce that the bpf prog must > check for NULL first. > > The similar idea can be used here but with a different tagging (for > example, "__must_release", admittedly not a good name). While the RFC > patch is in-progress, for now, may be hardcode for the ".enqueue" ops in > check_struct_ops_btf_id() and always acquire_reference_state() for the > skb. This part can be adjusted later once the RFC patch will be in shape. > > > Then one more thing is to track when the struct_ops bpf prog is actually > reading the value of the skb pointer. One thing is worth to mention > here, e.g. a struct_ops prog for enqueue: > > SEC("struct_ops") > int BPF_PROG(bpf_dropall_enqueue, struct sk_buff *skb, struct Qdisc *sch, >          struct sk_buff **to_free) > { >     return bpf_qdisc_drop(skb, sch, to_free); > } > > Take a look at the BPF_PROG macro, the bpf prog is getting a pointer to > an array of __u64 as the only argument. The skb is actually in ctx[0], > sch is in ctx[1]...etc. When ctx[0] is read to get the skb pointer (e.g. > r1 = ctx[0]), btf_ctx_access() marks the reg_type to PTR_TRUSTED. It > needs to also initialize the reg->ref_obj_id by the id obtained earlier > from acquire_reference_state() during check_struct_ops_btf_id() somehow. > > >> 2. Returning a kptr from a program and treating it as releasing the >> reference. > > e.g. for dequeue: > > struct Qdisc_ops { >     /* ... */ >     struct sk_buff *        (*dequeue)(struct Qdisc *); > }; > > > Right now the verifier should complain on check_reference_leak() if the > struct_ops bpf prog is returning a referenced kptr. > > Unlike an argument, the return type of a function does not have a name > to tag. It is the first case that a struct_ops bpf_prog returning a We may tag the stub functions instead, right? Is the purpose here to return a referenced pointer from a struct_ops operator without verifier complaining? > pointer. One idea is to assume it must be a trusted pointer > (PTR_TRUSTED) and the verifier should check it is indeed with > PTR_TRUSTED flag. > > May be release_reference_state() can be called to assume the kernel will > release it as long as the return pointer type is PTR_TRUSTED and the > type matches the return type of the ops. Take a look at > check_return_code(). >