From: Yonghong Song <yonghong.song@linux.dev>
To: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Hou Tao <houtao@huaweicloud.com>,
Martin KaFai Lau <martin.lau@linux.dev>,
Alexei Starovoitov <ast@kernel.org>,
Andrii Nakryiko <andrii@kernel.org>,
Daniel Borkmann <daniel@iogearbox.net>,
Kernel Team <kernel-team@fb.com>,
Martin KaFai Lau <martin.lau@kernel.org>,
bpf <bpf@vger.kernel.org>
Subject: Re: [PATCH bpf-next v4] bpf: Fix a race condition between btf_put() and map_free()
Date: Wed, 13 Dec 2023 22:30:53 -0800 [thread overview]
Message-ID: <50044158-c732-4ca5-8a3c-d06d693f1c85@linux.dev> (raw)
In-Reply-To: <CAADnVQJ3FiXUhZJwX_81sjZvSYYKCFB3BT6P8D59RS2Gu+0Z7g@mail.gmail.com>
On 12/13/23 8:17 PM, Alexei Starovoitov wrote:
> On Fri, Dec 8, 2023 at 9:07 AM Yonghong Song <yonghong.song@linux.dev> wrote:
>>
>> On 12/8/23 12:30 AM, Hou Tao wrote:
>>> Hi,
>>>
>>> On 12/8/2023 12:02 PM, Yonghong Song wrote:
>>>> On 12/7/23 7:59 PM, Yonghong Song wrote:
>>>>> On 12/7/23 5:23 PM, Martin KaFai Lau wrote:
>>>>>> On 12/6/23 1:09 PM, Yonghong Song wrote:
>>>>>>> When running `./test_progs -j` in my local vm with latest kernel,
>>>>>>> I once hit a kasan error like below:
>>>>>>>
>>>>>>>
>>> SNIP
>>>>>>> Here, 'value_rec' is assigned in btf_check_and_fixup_fields() with
>>>>>>> following code:
>>>>>>>
>>>>>>> meta = btf_find_struct_meta(btf, btf_id);
>>>>>>> if (!meta)
>>>>>>> return -EFAULT;
>>>>>>> rec->fields[i].graph_root.value_rec = meta->record;
>>>>>>>
>>>>>>> So basically, 'value_rec' is a pointer to the record in
>>>>>>> struct_metas_tab.
>>>>>>> And it is possible that that particular record has been freed by
>>>>>>> btf_struct_metas_free() and hence we have a kasan error here.
>>>>>>>
>>>>>>> Actually it is very hard to reproduce the failure with current
>>>>>>> bpf/bpf-next
>>>>>>> code, I only got the above error once. To increase reproducibility,
>>>>>>> I added
>>>>>>> a delay in bpf_map_free_deferred() to delay map->ops->map_free(),
>>>>>>> which
>>>>>>> significantly increased reproducibility.
>>>>>>>
>>>>>>> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
>>>>>>> index 5e43ddd1b83f..aae5b5213e93 100644
>>>>>>> --- a/kernel/bpf/syscall.c
>>>>>>> +++ b/kernel/bpf/syscall.c
>>>>>>> @@ -695,6 +695,7 @@ static void bpf_map_free_deferred(struct
>>>>>>> work_struct *work)
>>>>>>> struct bpf_map *map = container_of(work, struct bpf_map,
>>>>>>> work);
>>>>>>> struct btf_record *rec = map->record;
>>>>>>>
>>>>>>> + mdelay(100);
>>>>>>> security_bpf_map_free(map);
>>>>>>> bpf_map_release_memcg(map);
>>>>>>> /* implementation dependent freeing */
>>>>>>>
>>>>>>> To fix the problem, we need to have a reference on btf in order to
>>>>>>> safeguard accessing field->graph_root.value_rec in
>>>>>>> map->ops->map_free().
>>>>>>> The function btf_parse_graph_root() is the place to get a btf
>>>>>>> reference.
>>>>>>> The following are rough call stacks reaching bpf_parse_graph_root():
>>>>>>>
>>>>>>> btf_parse
>>>>>>> ...
>>>>>>> btf_parse_fields
>>>>>>> ...
>>>>>>> btf_parse_graph_root
>>>>>>>
>>>>>>> map_check_btf
>>>>>>> btf_parse_fields
>>>>>>> ...
>>>>>>> btf_parse_graph_root
>>>>>>>
>>>>>>> Looking at the above call stack, the btf_parse_graph_root() is
>>>>>>> indirectly
>>>>>>> called from btf_parse() or map_check_btf().
>>>>>>>
>>>>>>> We cannot take a reference in btf_parse() case since at that moment,
>>>>>>> btf is still in the middle to self-validation and initial reference
>>>>>>> (refcount_set(&btf->refcnt, 1)) has not been triggered yet.
>>>>>> Thanks for the details analysis and clear explanation. It helps a lot.
>>>>>>
>>>>>> Sorry for jumping in late.
>>>>>>
>>>>>> I am trying to avoid making a special case for "bool has_btf_ref;"
>>>>>> and "bool from_map_check". It seems to a bit too much to deal with
>>>>>> the error path for btf_parse().
>>> Maybe we could move the common btf used by kptr and graph_root into
>>> bpf_record and let the callers of btf_parse_fields() and
>>> btf_record_free() to decide the life cycle of btf in btf_record, so
>>> there will be less intrusive and less special case. The following is the
>> I didn't fully check the code but looks like we took a
>> btf reference at map_check_btf() and free it at the end
>> of bpf_map_free_deferred(). This is similar to my v1 patch,
>> not exactly the same but similar since they all do
>> btf_put() at the end of bpf_map_free_deferred().
>>
>> Through discussion, doing on-demand btf_get()/btf_put()
>> approach, similar to kptr approach, seems more favored.
>> This also has advantage to free btf at its earlist possible
>> point.
> Sorry. Looks like I recommended the wrong path.
>
> The approach of btf_parse_fields(... false | true)
> depending on where it's called and whether returned struct btf_record *
> will be kept within a type or within a map
> is pushing complexity too far.
> A year from now we'll forget these subtle details.
> There is an advantage to do btf_put() earli in bpf_map_put(),
> but in the common case it would be delayed just after queue_work.
> Which is a minor time delay.
> And for free_after_mult_rcu_gp much longer,
> but saving from freeing btf are minor compared to the map itself.
>
> I think it's cleaner to go back to v1 and simply move btf_put
> to bpf_map_free_deferred().
> A lot less things to worry about.
> Especially considering that BPF_RB_ROOT may not be the last such special
> record keeping type and every new type would need to think
> hard whether it's BPF_RB_ROOT-like or BPF_LIST_NODE-like.
> v1 avoids this future complexity.
No problem. Will send v6 tomorrow with v1.
next prev parent reply other threads:[~2023-12-14 6:31 UTC|newest]
Thread overview: 12+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-12-06 21:09 [PATCH bpf-next v4] bpf: Fix a race condition between btf_put() and map_free() Yonghong Song
2023-12-07 13:46 ` Hou Tao
2023-12-08 1:23 ` Martin KaFai Lau
2023-12-08 3:59 ` Yonghong Song
2023-12-08 4:02 ` Yonghong Song
2023-12-08 8:30 ` Hou Tao
2023-12-08 17:07 ` Yonghong Song
2023-12-14 4:17 ` Alexei Starovoitov
2023-12-14 6:30 ` Yonghong Song [this message]
2023-12-08 8:16 ` Martin KaFai Lau
2023-12-08 16:45 ` Yonghong Song
2023-12-08 18:26 ` Martin KaFai Lau
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=50044158-c732-4ca5-8a3c-d06d693f1c85@linux.dev \
--to=yonghong.song@linux.dev \
--cc=alexei.starovoitov@gmail.com \
--cc=andrii@kernel.org \
--cc=ast@kernel.org \
--cc=bpf@vger.kernel.org \
--cc=daniel@iogearbox.net \
--cc=houtao@huaweicloud.com \
--cc=kernel-team@fb.com \
--cc=martin.lau@kernel.org \
--cc=martin.lau@linux.dev \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox