From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <4a601c15-25d5-ec97-849a-97e54ace8f0d@linux.dev>
Date: Tue, 22 Aug 2023 17:18:40 -0700
X-Mailing-List: bpf@vger.kernel.org
MIME-Version: 1.0
Reply-To: yonghong.song@linux.dev
Subject: Re: [PATCH v2 bpf-next 5/7] bpf: Consider non-owning refs to refcounted nodes RCU protected
Content-Language: en-US
To: Alexei Starovoitov , David Marchevsky
Cc: Dave Marchevsky , bpf@vger.kernel.org, Alexei Starovoitov , Daniel Borkmann , Andrii Nakryiko , Martin KaFai Lau , Kernel Team
References: <20230821193311.3290257-1-davemarchevsky@fb.com>
 <20230821193311.3290257-6-davemarchevsky@fb.com>
 <21f00803-d20f-e584-6512-67e5107e3865@linux.dev>
 <20230822234529.z6ogvsptbivobdmg@MacBook-Pro-8.local>
From: Yonghong Song
In-Reply-To: <20230822234529.z6ogvsptbivobdmg@MacBook-Pro-8.local>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit

On 8/22/23 4:45 PM, Alexei Starovoitov wrote:
> On Tue, Aug 22, 2023 at 01:47:01AM -0400, David Marchevsky wrote:
>> On 8/21/23 10:37 PM, Yonghong Song wrote:
>>>
>>>
>>> On 8/21/23 12:33 PM, Dave Marchevsky wrote:
>>>> An earlier patch in the series ensures that the underlying memory of
>>>> nodes with bpf_refcount - which can have multiple owners - is not reused
>>>> until RCU grace period has elapsed. This prevents
>>>> use-after-free with non-owning references that may point to
>>>> recently-freed memory. While RCU read lock is held, it's safe to
>>>> dereference such a non-owning ref, as by definition RCU GP couldn't have
>>>> elapsed and therefore underlying memory couldn't have been reused.
>>>>
>>>> From the perspective of verifier "trustedness" non-owning refs to
>>>> refcounted nodes are now trusted only in RCU CS and therefore should no
>>>> longer pass is_trusted_reg, but rather is_rcu_reg. Let's mark them
>>>> MEM_RCU in order to reflect this new state.
>>>>
>>>> Signed-off-by: Dave Marchevsky
>>>> ---
>>>>   include/linux/bpf.h   |  3 ++-
>>>>   kernel/bpf/verifier.c | 13 ++++++++++++-
>>>>   2 files changed, 14 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
>>>> index eced6400f778..12596af59c00 100644
>>>> --- a/include/linux/bpf.h
>>>> +++ b/include/linux/bpf.h
>>>> @@ -653,7 +653,8 @@ enum bpf_type_flag {
>>>>       MEM_RCU            = BIT(13 + BPF_BASE_TYPE_BITS),
>>>>
>>>>       /* Used to tag PTR_TO_BTF_ID | MEM_ALLOC references which are non-owning.
>>>> -     * Currently only valid for linked-list and rbtree nodes.
>>>> +     * Currently only valid for linked-list and rbtree nodes. If the nodes
>>>> +     * have a bpf_refcount_field, they must be tagged MEM_RCU as well.
>>>>        */
>>>>       NON_OWN_REF        = BIT(14 + BPF_BASE_TYPE_BITS),
>>>>
>>>> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
>>>> index 8db0afa5985c..55607ab30522 100644
>>>> --- a/kernel/bpf/verifier.c
>>>> +++ b/kernel/bpf/verifier.c
>>>> @@ -8013,6 +8013,7 @@ int check_func_arg_reg_off(struct bpf_verifier_env *env,
>>>>       case PTR_TO_BTF_ID | PTR_TRUSTED:
>>>>       case PTR_TO_BTF_ID | MEM_RCU:
>>>>       case PTR_TO_BTF_ID | MEM_ALLOC | NON_OWN_REF:
>>>> +    case PTR_TO_BTF_ID | MEM_ALLOC | NON_OWN_REF | MEM_RCU:
>>>>           /* When referenced PTR_TO_BTF_ID is passed to release function,
>>>>            * its fixed offset must be 0. In the other cases, fixed offset
>>>>            * can be non-zero. This was already checked above.
So pass
>>>> @@ -10479,6 +10480,7 @@ static int process_kf_arg_ptr_to_btf_id(struct bpf_verifier_env *env,
>>>>   static int ref_set_non_owning(struct bpf_verifier_env *env, struct bpf_reg_state *reg)
>>>>   {
>>>>       struct bpf_verifier_state *state = env->cur_state;
>>>> +    struct btf_record *rec = reg_btf_record(reg);
>>>>
>>>>       if (!state->active_lock.ptr) {
>>>>           verbose(env, "verifier internal error: ref_set_non_owning w/o active lock\n");
>>>> @@ -10491,6 +10493,9 @@ static int ref_set_non_owning(struct bpf_verifier_env *env, struct bpf_reg_state
>>>>       }
>>>>
>>>>       reg->type |= NON_OWN_REF;
>>>> +    if (rec->refcount_off >= 0)
>>>> +        reg->type |= MEM_RCU;
>>>
>>> Should the above MEM_RCU marking be done unless reg access is in
>>> rcu critical section?
>>
>> I think it is fine, since non-owning references currently exist only within
>> spin_lock CS. Based on Alexei's comments on v1 of this series [0], preemption
>> disabled + spin_lock CS should imply RCU CS.
>>
>> [0]: https://lore.kernel.org/bpf/20230802230715.3ltalexaczbomvbu@MacBook-Pro-8.local/
>>
>>>
>>> I think we still have issues for state resetting
>>> with bpf_spin_unlock() and bpf_rcu_read_unlock(), both of which
>>> will try to convert the reg state to PTR_UNTRUSTED.
>>>
>>> Let us say reg state is
>>>   PTR_TO_BTF_ID | MEM_ALLOC | NON_OWN_REF | MEM_RCU
>>>
>>> (1). If hitting bpf_spin_unlock(), since MEM_RCU is in
>>> the reg state, the state should become
>>>   PTR_TO_BTF_ID | MEM_ALLOC | MEM_RCU
>>> some additional code might be needed so we wont have
>>> verifier complaints about ref_obj_id == 0.
>>>
>>> (2). If hitting bpf_rcu_read_unlock(), the state should become
>>>   PTR_TO_BTF_ID | MEM_ALLOC | NON_OWN_REF
>>> since register access still in bpf_spin_lock() region.
>>
>> I agree w/ your comment in side reply stating that this
>> case isn't possible since bpf_rcu_read_{lock,unlock} in spin_lock CS
>> is currently not allowed.
>>
>>>
>>> Does this make sense?
>>>
>>
>>
>> IIUC the specific reg state flow you're recommending is based on the convos
>> we've had over the past few weeks re: getting rid of special non-owning ref
>> lifetime rules, instead using RCU as much as possible. Specifically, this
>> recommended change would remove non-owning ref clobbering, instead just removing
>> NON_OWN_REF flag on bpf_spin_unlock so that such nodes can no longer be passed
>> to collection kfuncs (refcount_acquire, etc).
>
> Overall the patch set makes sense to me, but I want to clarify above.
> My understanding that after the patch set applied bpf_spin_unlock()
> will invalidate_non_owning_refs(), so what Yonghong is saying in (1)
> is not correct.
> Instead PTR_TO_BTF_ID | MEM_ALLOC | NON_OWN_REF | MEM_RCU will become mark_reg_invalid().

I said it 'should become ...', but you are right. Right now,
bpf_spin_unlock() will do mark_reg_invalid() on such a register.
So the current behavior is correct, just maybe a little conservative.

>
> Re: (2) even if/when bpf_rcu_read_unlock() will allowed inside spinlocked region
> it will convert PTR_TO_BTF_ID | MEM_ALLOC | NON_OWN_REF | MEM_RCU to
> PTR_TO_BTF_ID | MEM_ALLOC | NON_OWN_REF | PTR_UNTRUSTED
> which is a buggy combination which we would need to address if rcu_unlock is allowed eventually.
>
> Did I get it right?
> If so I think the whole set is good to do.
>
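
For readers following the flag combinations in this thread, here is a toy
userspace model of the register-type transitions being discussed. It is only
a sketch, not verifier code: the TOY_* names, their bit positions, and the
toy_* helpers are all made up for illustration, and TOY_INVALID merely stands
in for whatever mark_reg_invalid() leaves behind. It assumes the three
behaviors described above: the current invalidate-on-unlock behavior, the
more relaxed alternative from (1), and the hypothetical rcu_read_unlock
conversion from (2), which is not reachable today.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the verifier type flags; bit values are made up. */
#define TOY_PTR_TO_BTF_ID (1u << 0)
#define TOY_MEM_ALLOC     (1u << 1)
#define TOY_NON_OWN_REF   (1u << 2)
#define TOY_MEM_RCU       (1u << 3)
#define TOY_PTR_UNTRUSTED (1u << 4)
#define TOY_INVALID       0u /* models a register after mark_reg_invalid() */

/* Models the patch: always tag NON_OWN_REF, add MEM_RCU for refcounted nodes.
 * The assert mirrors the "w/o active lock" internal-error check.
 */
static inline uint32_t toy_set_non_owning(uint32_t type, bool refcounted,
					  bool lock_held)
{
	assert(lock_held);
	type |= TOY_NON_OWN_REF;
	if (refcounted)
		type |= TOY_MEM_RCU;
	return type;
}

/* Current behavior as described in the thread: unlock invalidates
 * every non-owning ref, refcounted or not.
 */
static inline uint32_t toy_spin_unlock_current(uint32_t type)
{
	return (type & TOY_NON_OWN_REF) ? TOY_INVALID : type;
}

/* Alternative from (1): an RCU-protected non-owning ref only loses
 * NON_OWN_REF; other non-owning refs are still invalidated.
 */
static inline uint32_t toy_spin_unlock_relaxed(uint32_t type)
{
	if (!(type & TOY_NON_OWN_REF))
		return type;
	if (type & TOY_MEM_RCU)
		return type & ~TOY_NON_OWN_REF;
	return TOY_INVALID;
}

/* Hypothetical case (2): rcu_read_unlock swaps MEM_RCU for PTR_UNTRUSTED,
 * leaving NON_OWN_REF | PTR_UNTRUSTED, the combination called buggy above.
 */
static inline uint32_t toy_rcu_unlock(uint32_t type)
{
	if (!(type & TOY_MEM_RCU))
		return type;
	return (type & ~TOY_MEM_RCU) | TOY_PTR_UNTRUSTED;
}
```

Starting from TOY_PTR_TO_BTF_ID | TOY_MEM_ALLOC on a refcounted node,
toy_set_non_owning() yields the four-flag state from the thread. Feeding that
state to each of the three helpers reproduces, respectively, the invalid
register, the state proposed in (1), and the NON_OWN_REF | PTR_UNTRUSTED
combination that would need to be addressed if rcu_unlock were ever allowed
inside the spin-locked region.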