From: Yonghong Song <yonghong.song@linux.dev>
To: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Hou Tao <houtao@huaweicloud.com>,
Martin KaFai Lau <martin.lau@linux.dev>,
Alexei Starovoitov <ast@kernel.org>,
Andrii Nakryiko <andrii@kernel.org>,
Daniel Borkmann <daniel@iogearbox.net>,
Kernel Team <kernel-team@fb.com>,
Martin KaFai Lau <martin.lau@kernel.org>,
bpf <bpf@vger.kernel.org>
Subject: Re: [PATCH bpf-next v4] bpf: Fix a race condition between btf_put() and map_free()
Date: Wed, 13 Dec 2023 22:30:53 -0800 [thread overview]
Message-ID: <50044158-c732-4ca5-8a3c-d06d693f1c85@linux.dev> (raw)
In-Reply-To: <CAADnVQJ3FiXUhZJwX_81sjZvSYYKCFB3BT6P8D59RS2Gu+0Z7g@mail.gmail.com>
On 12/13/23 8:17 PM, Alexei Starovoitov wrote:
> On Fri, Dec 8, 2023 at 9:07 AM Yonghong Song <yonghong.song@linux.dev> wrote:
>>
>> On 12/8/23 12:30 AM, Hou Tao wrote:
>>> Hi,
>>>
>>> On 12/8/2023 12:02 PM, Yonghong Song wrote:
>>>> On 12/7/23 7:59 PM, Yonghong Song wrote:
>>>>> On 12/7/23 5:23 PM, Martin KaFai Lau wrote:
>>>>>> On 12/6/23 1:09 PM, Yonghong Song wrote:
>>>>>>> When running `./test_progs -j` in my local vm with latest kernel,
>>>>>>> I once hit a kasan error like below:
>>>>>>>
>>>>>>>
>>> SNIP
>>>>>>> Here, 'value_rec' is assigned in btf_check_and_fixup_fields() with
>>>>>>> following code:
>>>>>>>
>>>>>>> meta = btf_find_struct_meta(btf, btf_id);
>>>>>>> if (!meta)
>>>>>>> return -EFAULT;
>>>>>>> rec->fields[i].graph_root.value_rec = meta->record;
>>>>>>>
>>>>>>> So basically, 'value_rec' is a pointer to the record in
>>>>>>> struct_metas_tab.
>>>>>>> And it is possible that that particular record has been freed by
>>>>>>> btf_struct_metas_free() and hence we have a kasan error here.
>>>>>>>
>>>>>>> Actually it is very hard to reproduce the failure with current
>>>>>>> bpf/bpf-next
>>>>>>> code, I only got the above error once. To increase reproducibility,
>>>>>>> I added
>>>>>>> a delay in bpf_map_free_deferred() to delay map->ops->map_free(),
>>>>>>> which
>>>>>>> significantly increased reproducibility.
>>>>>>>
>>>>>>> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
>>>>>>> index 5e43ddd1b83f..aae5b5213e93 100644
>>>>>>> --- a/kernel/bpf/syscall.c
>>>>>>> +++ b/kernel/bpf/syscall.c
>>>>>>> @@ -695,6 +695,7 @@ static void bpf_map_free_deferred(struct
>>>>>>> work_struct *work)
>>>>>>> struct bpf_map *map = container_of(work, struct bpf_map,
>>>>>>> work);
>>>>>>> struct btf_record *rec = map->record;
>>>>>>>
>>>>>>> + mdelay(100);
>>>>>>> security_bpf_map_free(map);
>>>>>>> bpf_map_release_memcg(map);
>>>>>>> /* implementation dependent freeing */
>>>>>>>
>>>>>>> To fix the problem, we need to have a reference on btf in order to
>>>>>>> safeguard accessing field->graph_root.value_rec in
>>>>>>> map->ops->map_free().
>>>>>>> The function btf_parse_graph_root() is the place to get a btf
>>>>>>> reference.
>>>>>>> The following are rough call stacks reaching bpf_parse_graph_root():
>>>>>>>
>>>>>>> btf_parse
>>>>>>> ...
>>>>>>> btf_parse_fields
>>>>>>> ...
>>>>>>> btf_parse_graph_root
>>>>>>>
>>>>>>> map_check_btf
>>>>>>> btf_parse_fields
>>>>>>> ...
>>>>>>> btf_parse_graph_root
>>>>>>>
>>>>>>> Looking at the above call stack, the btf_parse_graph_root() is
>>>>>>> indirectly
>>>>>>> called from btf_parse() or map_check_btf().
>>>>>>>
>>>>>>> We cannot take a reference in btf_parse() case since at that moment,
>>>>>>> btf is still in the middle to self-validation and initial reference
>>>>>>> (refcount_set(&btf->refcnt, 1)) has not been triggered yet.
>>>>>> Thanks for the details analysis and clear explanation. It helps a lot.
>>>>>>
>>>>>> Sorry for jumping in late.
>>>>>>
>>>>>> I am trying to avoid making a special case for "bool has_btf_ref;"
>>>>>> and "bool from_map_check". It seems to a bit too much to deal with
>>>>>> the error path for btf_parse().
>>> Maybe we could move the common btf used by kptr and graph_root into
>>> bpf_record and let the callers of btf_parse_fields() and
>>> btf_record_free() to decide the life cycle of btf in btf_record, so
>>> there will be less intrusive and less special case. The following is the
>> I didn't fully check the code but looks like we took a
>> btf reference at map_check_btf() and free it at the end
>> of bpf_map_free_deferred(). This is similar to my v1 patch,
>> not exactly the same but similar since they all do
>> btf_put() at the end of bpf_map_free_deferred().
>>
>> Through discussion, doing on-demand btf_get()/btf_put()
>> approach, similar to kptr approach, seems more favored.
>> This also has advantage to free btf at its earlist possible
>> point.
> Sorry. Looks like I recommended the wrong path.
>
> The approach of btf_parse_fields(... false | true)
> depending on where it's called and whether returned struct btf_record *
> will be kept within a type or within a map
> is pushing complexity too far.
> A year from now we'll forget these subtle details.
> There is an advantage to do btf_put() earli in bpf_map_put(),
> but in the common case it would be delayed just after queue_work.
> Which is a minor time delay.
> And for free_after_mult_rcu_gp much longer,
> but saving from freeing btf are minor compared to the map itself.
>
> I think it's cleaner to go back to v1 and simply move btf_put
> to bpf_map_free_deferred().
> A lot less things to worry about.
> Especially considering that BPF_RB_ROOT may not be the last such special
> record keeping type and every new type would need to think
> hard whether it's BPF_RB_ROOT-like or BPF_LIST_NODE-like.
> v1 avoids this future complexity.
No problem. Will send v6 tomorrow with v1.
next prev parent reply other threads:[~2023-12-14 6:31 UTC|newest]
Thread overview: 12+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-12-06 21:09 [PATCH bpf-next v4] bpf: Fix a race condition between btf_put() and map_free() Yonghong Song
2023-12-07 13:46 ` Hou Tao
2023-12-08 1:23 ` Martin KaFai Lau
2023-12-08 3:59 ` Yonghong Song
2023-12-08 4:02 ` Yonghong Song
2023-12-08 8:30 ` Hou Tao
2023-12-08 17:07 ` Yonghong Song
2023-12-14 4:17 ` Alexei Starovoitov
2023-12-14 6:30 ` Yonghong Song [this message]
2023-12-08 8:16 ` Martin KaFai Lau
2023-12-08 16:45 ` Yonghong Song
2023-12-08 18:26 ` Martin KaFai Lau
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=50044158-c732-4ca5-8a3c-d06d693f1c85@linux.dev \
--to=yonghong.song@linux.dev \
--cc=alexei.starovoitov@gmail.com \
--cc=andrii@kernel.org \
--cc=ast@kernel.org \
--cc=bpf@vger.kernel.org \
--cc=daniel@iogearbox.net \
--cc=houtao@huaweicloud.com \
--cc=kernel-team@fb.com \
--cc=martin.lau@kernel.org \
--cc=martin.lau@linux.dev \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.