From: Yonghong Song <yonghong.song@linux.dev>
To: Hou Tao <houtao@huaweicloud.com>,
Andrii Nakryiko <andrii.nakryiko@gmail.com>
Cc: bpf@vger.kernel.org, Alexei Starovoitov <ast@kernel.org>,
Andrii Nakryiko <andrii@kernel.org>,
Daniel Borkmann <daniel@iogearbox.net>,
kernel-team@fb.com, Martin KaFai Lau <martin.lau@kernel.org>
Subject: Re: [PATCH bpf] bpf: Fix a race condition between btf_put() and map_free()
Date: Mon, 4 Dec 2023 23:01:24 -0800 [thread overview]
Message-ID: <9a308dc5-6765-4dcb-ba2b-43d257534ca0@linux.dev> (raw)
In-Reply-To: <6e6feeef-9d81-38c3-4426-42ab12dc9ad3@huaweicloud.com>
On 12/5/23 1:30 AM, Hou Tao wrote:
> Hi,
>
> On 12/5/2023 12:15 PM, Yonghong Song wrote:
>> On 12/4/23 8:31 PM, Hou Tao wrote:
>>> Hi,
>>>
>>> On 12/5/2023 8:42 AM, Andrii Nakryiko wrote:
>>>> On Mon, Dec 4, 2023 at 9:40 AM Yonghong Song
>>>> <yonghong.song@linux.dev> wrote:
>>>>> When running `./test_progs -j` in my local vm with latest kernel,
>>>>> I once hit a kasan error like below:
>>> SNIP
>>>>> So the problem is at rec->refcount_off in the above.
>>>>>
>>>>> I did some source code analysis and find the reason.
>>>>> CPU A CPU B
>>>>> bpf_map_put:
>>>>> ...
>>>>> btf_put with rcu callback
>>>>> ...
>>>>> bpf_map_free_deferred
>>>>> with system_unbound_wq
>>>>> ... ... ...
>>>>> ... btf_free_rcu: ...
>>>>> ... ...
>>>>> bpf_map_free_deferred:
>>>>> ... ...
>>>>> ... ---------> btf_struct_metas_free()
>>>>> ... | race condition ...
>>>>> ... --------->
>>>>> map->ops->map_free()
>>>>> ...
>>>>> ... btf->struct_meta_tab = NULL
>>>>>
>>>>> In the above, map_free() corresponds to array_map_free() and
>>>>> eventually
>>>>> calling bpf_rb_root_free() which calls:
>>>>> ...
>>>>> __bpf_obj_drop_impl(obj, field->graph_root.value_rec, false);
>>>>> ...
>>>>>
>>>>> Here, 'value_rec' is assigned in btf_check_and_fixup_fields() with
>>>>> following code:
>>>>>
>>>>> meta = btf_find_struct_meta(btf, btf_id);
>>>>> if (!meta)
>>>>> return -EFAULT;
>>>>> rec->fields[i].graph_root.value_rec = meta->record;
>>>>>
>>>>> So basically, 'value_rec' is a pointer to the record in
>>>>> struct_metas_tab.
>>>>> And it is possible that that particular record has been freed by
>>>>> btf_struct_metas_free() and hence we have a kasan error here.
>>>>>
>>>>> Actually it is very hard to reproduce the failure with current
>>>>> bpf/bpf-next
>>>>> code, I only got the above error once. To increase reproducibility,
>>>>> I added
>>>>> a delay in bpf_map_free_deferred() to delay map->ops->map_free(),
>>>>> which
>>>>> significantly increased reproducibility.
>>> Also found the problem when developing the "fix the release of inner
>>> map" patch-set. I have written a selftest which could reliably reproduce
>>> the problem by using map-in-map + bpf_list_head. The reason of using
>>> map-in-map is to delay the release of inner map by using call_rcu() as
>>> well, so the free of bpf_map happens after the release of btf. Will post
>>> it later.
>>>>> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
>>>>> index 5e43ddd1b83f..aae5b5213e93 100644
>>>>> --- a/kernel/bpf/syscall.c
>>>>> +++ b/kernel/bpf/syscall.c
>>>>> @@ -695,6 +695,7 @@ static void bpf_map_free_deferred(struct
>>>>> work_struct *work)
>>>>> struct bpf_map *map = container_of(work, struct bpf_map,
>>>>> work);
>>>>> struct btf_record *rec = map->record;
>>>>>
>>>>> + mdelay(100);
>>>>> security_bpf_map_free(map);
>>>>> bpf_map_release_memcg(map);
>>>>> /* implementation dependent freeing */
>>>>>
>>>>> To fix the problem, I moved btf_put() after map->ops->map_free() to
>>>>> ensure
>>>>> struct_metas available during map_free(). Rerun './test_progs -j'
>>>>> with the
>>>>> above mdelay() hack for a couple of times and didn't observe the
>>>>> error.
>>>>>
>>>>> Fixes: 958cf2e273f0 ("bpf: Introduce bpf_obj_new")
>>>>> Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
>>>>> ---
>>>>> kernel/bpf/syscall.c | 6 +++++-
>>>>> 1 file changed, 5 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
>>>>> index 0ed286b8a0f0..9c6c3738adfe 100644
>>>>> --- a/kernel/bpf/syscall.c
>>>>> +++ b/kernel/bpf/syscall.c
>>>>> @@ -694,11 +694,16 @@ static void bpf_map_free_deferred(struct
>>>>> work_struct *work)
>>>>> {
>>>>> struct bpf_map *map = container_of(work, struct bpf_map,
>>>>> work);
>>>>> struct btf_record *rec = map->record;
>>>>> + struct btf *btf = map->btf;
>>>>>
>>>>> security_bpf_map_free(map);
>>>>> bpf_map_release_memcg(map);
>>>>> /* implementation dependent freeing */
>>>>> map->ops->map_free(map);
>>>>> + /* Delay freeing of btf for maps, as map_free callback may
>>>>> need
>>>>> + * struct_meta info which will be freed with btf_put().
>>>>> + */
>>>>> + btf_put(btf);
>>>> The change makes sense to me, but logically I'd put it after
>>>> btf_record_free(rec), just in case if some of btf records ever refer
>>>> back to map's BTF or something (and just in general to keep it the
>>>> very last thing that's happening).
>>>>
>>>>
>>>> But it also seems like CI is not happy ([0]), please take a look,
>>>> thanks!
>>>>
>>>> [0]
>>>> https://github.com/kernel-patches/bpf/actions/runs/7090474333/job/19297672532
>>> The patch delays the release of BTF id of bpf map, so test_btf_id
>>> failed. Can we fix the problem by optionally pinning the btf in
>>> btf_field_graph_root just like btf_field_kptr, so the map BTF will still
>>> be alive before the invocation of btf_record_free() ? We need to do the
>>> pinning optionally, because btf_record may be contained in btf directly
>>> (namely btf->struct_meta_tab) and is freed through btf_free().
>> Thanks for suggestion, I guess you want two cases:
>> - if map->record won't access any btf data (e.g.,
>> btf->struct_meta_tab),
>> we should keep current btf_put() workflow,
>> - if map->record accesses some btf data, we should call btf_put()
>> immediately before or after btf_record_free().
> Er, it is not what I want, although I have written a similar patch in
> which bpf_map_put() will call btf_put() and set map->btf as NULL if
> there is no BPF_LIST_HEAD and BPF_RB_ROOT fields in map->record,
> otherwise calling bpf_put() in bpf_put_free_deferred(). What I have
> suggested is to optionally pin btf in graph_root.btf just like
> btf_field_kptr does.
Okay, I see what you mean. This is actually what I kind of think
as well in below to identify *all* cases btf data might be accessed.
I didn't explicitly mention this approach in detail but the idea is
to get a reference count for btf and later release it during btf_record_free.
I think this should work. I need to do an audit then to find other potential
places, if exists, to do similar things. The current approach
is simpler but looks like we can do better with existing
btf_field_kptr approach.
>> This could be done but we need to be careful to find all cases
>> btf data might be accessed in map->record. The current approach
>> is simpler. I will post v2 with the change Andrii suggested and
>> fixed the failed test.
>>
>> If people really want to fine tune this like the above two cases, I can
>> investigate too.
>>
>>>>> /* Delay freeing of btf_record for maps, as map_free
>>>>> * callback usually needs access to them. It is better to
>>>>> do it here
>>>>> * than require each callback to do the free itself manually.
>>>>> @@ -727,7 +732,6 @@ void bpf_map_put(struct bpf_map *map)
>>>>> if (atomic64_dec_and_test(&map->refcnt)) {
>>>>> /* bpf_map_free_id() must be called first */
>>>>> bpf_map_free_id(map);
>>>>> - btf_put(map->btf);
>>>>> INIT_WORK(&map->work, bpf_map_free_deferred);
>>>>> /* Avoid spawning kworkers, since they all might
>>>>> contend
>>>>> * for the same mutex like slab_mutex.
>>>>> --
>>>>> 2.34.1
>>>>>
>>>> .
>> .
next prev parent reply other threads:[~2023-12-05 7:01 UTC|newest]
Thread overview: 9+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-12-04 17:39 [PATCH bpf] bpf: Fix a race condition between btf_put() and map_free() Yonghong Song
2023-12-05 0:42 ` Andrii Nakryiko
2023-12-05 1:31 ` Hou Tao
2023-12-05 4:15 ` Yonghong Song
2023-12-05 6:30 ` Hou Tao
2023-12-05 7:01 ` Yonghong Song [this message]
2023-12-05 21:13 ` Alexei Starovoitov
2023-12-05 22:50 ` Yonghong Song
2023-12-05 3:58 ` Yonghong Song
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=9a308dc5-6765-4dcb-ba2b-43d257534ca0@linux.dev \
--to=yonghong.song@linux.dev \
--cc=andrii.nakryiko@gmail.com \
--cc=andrii@kernel.org \
--cc=ast@kernel.org \
--cc=bpf@vger.kernel.org \
--cc=daniel@iogearbox.net \
--cc=houtao@huaweicloud.com \
--cc=kernel-team@fb.com \
--cc=martin.lau@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox