BPF List
 help / color / mirror / Atom feed
From: Yonghong Song <yonghong.song@linux.dev>
To: Hou Tao <houtao@huaweicloud.com>,
	Andrii Nakryiko <andrii.nakryiko@gmail.com>
Cc: bpf@vger.kernel.org, Alexei Starovoitov <ast@kernel.org>,
	Andrii Nakryiko <andrii@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	kernel-team@fb.com, Martin KaFai Lau <martin.lau@kernel.org>
Subject: Re: [PATCH bpf] bpf: Fix a race condition between btf_put() and map_free()
Date: Mon, 4 Dec 2023 20:15:39 -0800	[thread overview]
Message-ID: <513bafac-03fa-4c2f-ba7f-67de96f79a10@linux.dev> (raw)
In-Reply-To: <81d00866-7824-18e5-af71-e0a15a03e84f@huaweicloud.com>


On 12/4/23 8:31 PM, Hou Tao wrote:
> Hi,
>
> On 12/5/2023 8:42 AM, Andrii Nakryiko wrote:
>> On Mon, Dec 4, 2023 at 9:40 AM Yonghong Song <yonghong.song@linux.dev> wrote:
>>> When running `./test_progs -j` in my local vm with latest kernel,
>>> I once hit a kasan error like below:
> SNIP
>>>
>>> So the problem is at rec->refcount_off in the above.
>>>
>>> I did some source code analysis and find the reason.
>>>                                    CPU A                        CPU B
>>>    bpf_map_put:
>>>      ...
>>>      btf_put with rcu callback
>>>      ...
>>>      bpf_map_free_deferred
>>>        with system_unbound_wq
>>>      ...                          ...                           ...
>>>      ...                          btf_free_rcu:                 ...
>>>      ...                          ...                           bpf_map_free_deferred:
>>>      ...                          ...
>>>      ...         --------->       btf_struct_metas_free()
>>>      ...         | race condition ...
>>>      ...         --------->                                     map->ops->map_free()
>>>      ...
>>>      ...                          btf->struct_meta_tab = NULL
>>>
>>> In the above, map_free() corresponds to array_map_free() and eventually
>>> calling bpf_rb_root_free() which calls:
>>>    ...
>>>    __bpf_obj_drop_impl(obj, field->graph_root.value_rec, false);
>>>    ...
>>>
>>> Here, 'value_rec' is assigned in btf_check_and_fixup_fields() with following code:
>>>
>>>    meta = btf_find_struct_meta(btf, btf_id);
>>>    if (!meta)
>>>      return -EFAULT;
>>>    rec->fields[i].graph_root.value_rec = meta->record;
>>>
>>> So basically, 'value_rec' is a pointer to the record in struct_metas_tab.
>>> And it is possible that that particular record has been freed by
>>> btf_struct_metas_free() and hence we have a kasan error here.
>>>
>>> Actually it is very hard to reproduce the failure with current bpf/bpf-next
>>> code, I only got the above error once. To increase reproducibility, I added
>>> a delay in bpf_map_free_deferred() to delay map->ops->map_free(), which
>>> significantly increased reproducibility.
> Also found the problem when developing the "fix the release of inner
> map" patch-set. I have written a selftest which could reliably reproduce
> the problem by using map-in-map + bpf_list_head. The reason of using
> map-in-map is to delay the release of inner map by using call_rcu() as
> well, so the free of bpf_map happens after the release of btf. Will post
> it later.
>>>    diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
>>>    index 5e43ddd1b83f..aae5b5213e93 100644
>>>    --- a/kernel/bpf/syscall.c
>>>    +++ b/kernel/bpf/syscall.c
>>>    @@ -695,6 +695,7 @@ static void bpf_map_free_deferred(struct work_struct *work)
>>>          struct bpf_map *map = container_of(work, struct bpf_map, work);
>>>          struct btf_record *rec = map->record;
>>>
>>>    +     mdelay(100);
>>>          security_bpf_map_free(map);
>>>          bpf_map_release_memcg(map);
>>>          /* implementation dependent freeing */
>>>
>>> To fix the problem, I moved btf_put() after map->ops->map_free() to ensure
>>> struct_metas available during map_free(). Rerun './test_progs -j' with the
>>> above mdelay() hack for a couple of times and didn't observe the error.
>>>
>>> Fixes: 958cf2e273f0 ("bpf: Introduce bpf_obj_new")
>>> Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
>>> ---
>>>   kernel/bpf/syscall.c | 6 +++++-
>>>   1 file changed, 5 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
>>> index 0ed286b8a0f0..9c6c3738adfe 100644
>>> --- a/kernel/bpf/syscall.c
>>> +++ b/kernel/bpf/syscall.c
>>> @@ -694,11 +694,16 @@ static void bpf_map_free_deferred(struct work_struct *work)
>>>   {
>>>          struct bpf_map *map = container_of(work, struct bpf_map, work);
>>>          struct btf_record *rec = map->record;
>>> +       struct btf *btf = map->btf;
>>>
>>>          security_bpf_map_free(map);
>>>          bpf_map_release_memcg(map);
>>>          /* implementation dependent freeing */
>>>          map->ops->map_free(map);
>>> +       /* Delay freeing of btf for maps, as map_free callback may need
>>> +        * struct_meta info which will be freed with btf_put().
>>> +        */
>>> +       btf_put(btf);
>> The change makes sense to me, but logically I'd put it after
>> btf_record_free(rec), just in case if some of btf records ever refer
>> back to map's BTF or something (and just in general to keep it the
>> very last thing that's happening).
>>
>>
>> But it also seems like CI is not happy ([0]), please take a look, thanks!
>>
>>    [0] https://github.com/kernel-patches/bpf/actions/runs/7090474333/job/19297672532
> The patch delays the release of BTF id of bpf map, so test_btf_id
> failed. Can we fix the problem by optionally pinning the btf in
> btf_field_graph_root just like btf_field_kptr, so the map BTF will still
> be alive before the invocation of btf_record_free() ? We need to do the
> pinning optionally, because btf_record may be contained in btf directly
> (namely btf->struct_meta_tab) and is freed through btf_free().

Thanks for suggestion, I guess you want two cases:
   - if map->record won't access any btf data (e.g., btf->struct_meta_tab),
     we should keep current btf_put() workflow,
   - if map->record accesses some btf data, we should call btf_put()
     immediately before or after btf_record_free().

This could be done but we need to be careful to find all cases
btf data might be accessed in map->record. The current approach
is simpler. I will post v2 with the change Andrii suggested and
fixed the failed test.

If people really want to fine tune this like the above two cases, I can
investigate too.

>>
>>>          /* Delay freeing of btf_record for maps, as map_free
>>>           * callback usually needs access to them. It is better to do it here
>>>           * than require each callback to do the free itself manually.
>>> @@ -727,7 +732,6 @@ void bpf_map_put(struct bpf_map *map)
>>>          if (atomic64_dec_and_test(&map->refcnt)) {
>>>                  /* bpf_map_free_id() must be called first */
>>>                  bpf_map_free_id(map);
>>> -               btf_put(map->btf);
>>>                  INIT_WORK(&map->work, bpf_map_free_deferred);
>>>                  /* Avoid spawning kworkers, since they all might contend
>>>                   * for the same mutex like slab_mutex.
>>> --
>>> 2.34.1
>>>
>> .

  reply	other threads:[~2023-12-05  4:15 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-12-04 17:39 [PATCH bpf] bpf: Fix a race condition between btf_put() and map_free() Yonghong Song
2023-12-05  0:42 ` Andrii Nakryiko
2023-12-05  1:31   ` Hou Tao
2023-12-05  4:15     ` Yonghong Song [this message]
2023-12-05  6:30       ` Hou Tao
2023-12-05  7:01         ` Yonghong Song
2023-12-05 21:13           ` Alexei Starovoitov
2023-12-05 22:50             ` Yonghong Song
2023-12-05  3:58   ` Yonghong Song

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=513bafac-03fa-4c2f-ba7f-67de96f79a10@linux.dev \
    --to=yonghong.song@linux.dev \
    --cc=andrii.nakryiko@gmail.com \
    --cc=andrii@kernel.org \
    --cc=ast@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=daniel@iogearbox.net \
    --cc=houtao@huaweicloud.com \
    --cc=kernel-team@fb.com \
    --cc=martin.lau@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox