From: Yonghong Song <yonghong.song@linux.dev>
To: Hou Tao <houtao@huaweicloud.com>,
Martin KaFai Lau <martin.lau@linux.dev>
Cc: Alexei Starovoitov <ast@kernel.org>,
Andrii Nakryiko <andrii@kernel.org>,
Daniel Borkmann <daniel@iogearbox.net>,
kernel-team@fb.com, Martin KaFai Lau <martin.lau@kernel.org>,
bpf@vger.kernel.org
Subject: Re: [PATCH bpf-next v4] bpf: Fix a race condition between btf_put() and map_free()
Date: Fri, 8 Dec 2023 09:07:15 -0800 [thread overview]
Message-ID: <0e657fc3-d932-4bd6-9d74-54eff22d3641@linux.dev> (raw)
In-Reply-To: <0b3a96bd-4dfc-6d23-d473-f4351fbe84c2@huaweicloud.com>
On 12/8/23 12:30 AM, Hou Tao wrote:
> Hi,
>
> On 12/8/2023 12:02 PM, Yonghong Song wrote:
>> On 12/7/23 7:59 PM, Yonghong Song wrote:
>>> On 12/7/23 5:23 PM, Martin KaFai Lau wrote:
>>>> On 12/6/23 1:09 PM, Yonghong Song wrote:
>>>>> When running `./test_progs -j` in my local vm with latest kernel,
>>>>> I once hit a kasan error like below:
>>>>>
>>>>>
> SNIP
>>>>> Here, 'value_rec' is assigned in btf_check_and_fixup_fields() with
>>>>> following code:
>>>>>
>>>>> meta = btf_find_struct_meta(btf, btf_id);
>>>>> if (!meta)
>>>>> return -EFAULT;
>>>>> rec->fields[i].graph_root.value_rec = meta->record;
>>>>>
>>>>> So basically, 'value_rec' is a pointer to the record in
>>>>> struct_metas_tab.
>>>>> And it is possible that that particular record has been freed by
>>>>> btf_struct_metas_free() and hence we have a kasan error here.
>>>>>
>>>>> Actually it is very hard to reproduce the failure with current
>>>>> bpf/bpf-next
>>>>> code, I only got the above error once. To increase reproducibility,
>>>>> I added
>>>>> a delay in bpf_map_free_deferred() to delay map->ops->map_free(),
>>>>> which
>>>>> significantly increased reproducibility.
>>>>>
>>>>> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
>>>>> index 5e43ddd1b83f..aae5b5213e93 100644
>>>>> --- a/kernel/bpf/syscall.c
>>>>> +++ b/kernel/bpf/syscall.c
>>>>> @@ -695,6 +695,7 @@ static void bpf_map_free_deferred(struct
>>>>> work_struct *work)
>>>>> struct bpf_map *map = container_of(work, struct bpf_map,
>>>>> work);
>>>>> struct btf_record *rec = map->record;
>>>>>
>>>>> + mdelay(100);
>>>>> security_bpf_map_free(map);
>>>>> bpf_map_release_memcg(map);
>>>>> /* implementation dependent freeing */
>>>>>
>>>>> To fix the problem, we need to have a reference on btf in order to
>>>>> safeguard accessing field->graph_root.value_rec in
>>>>> map->ops->map_free().
>>>>> The function btf_parse_graph_root() is the place to get a btf
>>>>> reference.
>>>>> The following are rough call stacks reaching bpf_parse_graph_root():
>>>>>
>>>>> btf_parse
>>>>> ...
>>>>> btf_parse_fields
>>>>> ...
>>>>> btf_parse_graph_root
>>>>>
>>>>> map_check_btf
>>>>> btf_parse_fields
>>>>> ...
>>>>> btf_parse_graph_root
>>>>>
>>>>> Looking at the above call stack, the btf_parse_graph_root() is
>>>>> indirectly
>>>>> called from btf_parse() or map_check_btf().
>>>>>
>>>>> We cannot take a reference in btf_parse() case since at that moment,
>>>>> btf is still in the middle to self-validation and initial reference
>>>>> (refcount_set(&btf->refcnt, 1)) has not been triggered yet.
>>>> Thanks for the details analysis and clear explanation. It helps a lot.
>>>>
>>>> Sorry for jumping in late.
>>>>
>>>> I am trying to avoid making a special case for "bool has_btf_ref;"
>>>> and "bool from_map_check". It seems to a bit too much to deal with
>>>> the error path for btf_parse().
> Maybe we could move the common btf used by kptr and graph_root into
> bpf_record and let the callers of btf_parse_fields() and
> btf_record_free() to decide the life cycle of btf in btf_record, so
> there will be less intrusive and less special case. The following is the
I didn't fully check the code but looks like we took a
btf reference at map_check_btf() and free it at the end
of bpf_map_free_deferred(). This is similar to my v1 patch,
not exactly the same but similar since they all do
btf_put() at the end of bpf_map_free_deferred().
Through discussion, doing on-demand btf_get()/btf_put()
approach, similar to kptr approach, seems more favored.
This also has advantage to free btf at its earlist possible
point.
> code snippets for the idea (only simply tested):
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 1ad0ed623f50..a0c4d696a231 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -230,6 +230,7 @@ struct btf_record {
> int spin_lock_off;
> int timer_off;
> int refcount_off;
> + struct btf *btf;
> struct btf_field fields[];
> };
>
> diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
> index 966dace27fae..08b4a2784ee4 100644
> --- a/kernel/bpf/btf.c
> +++ b/kernel/bpf/btf.c
> @@ -3598,7 +3598,6 @@ static int btf_parse_kptr(struct btf *btf, struct
> btf_field *field,
> field->kptr.dtor = NULL;
> id = info->kptr.type_id;
> kptr_btf = btf;
> - btf_get(kptr_btf);
> goto found_dtor;
> }
> if (id < 0)
> @@ -3753,6 +3752,7 @@ struct btf_record *btf_parse_fields(struct btf
> *btf, const struct btf_type *t,
> if (!rec)
> return ERR_PTR(-ENOMEM);
>
> + rec->btf = btf;
> rec->spin_lock_off = -EINVAL;
> rec->timer_off = -EINVAL;
> rec->refcount_off = -EINVAL;
> diff --git a/kernel/bpf/map_in_map.c b/kernel/bpf/map_in_map.c
> index 8ef269e66ba5..d9f4198158b6 100644
> --- a/kernel/bpf/map_in_map.c
> +++ b/kernel/bpf/map_in_map.c
> @@ -56,6 +56,8 @@ struct bpf_map *bpf_map_meta_alloc(int inner_map_ufd)
> ret = PTR_ERR(inner_map_meta->record);
> goto free;
> }
> + if (inner_map_meta->record)
> + btf_get(inner_map_meta->record->btf);
> +
> /* Note: We must use the same BTF, as we also used
> btf_record_dup above
> * which relies on BTF being same for both maps, as some members
> like
> * record->fields.list_head have pointers like value_rec
> pointing into
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 30ed7756fc71..d2641e51a1a7 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -516,7 +516,8 @@ void btf_record_free(struct btf_record *rec)
> case BPF_KPTR_PERCPU:
> if (rec->fields[i].kptr.module)
> module_put(rec->fields[i].kptr.module);
> - btf_put(rec->fields[i].kptr.btf);
> + if (rec->fields[i].kptr.btf != rec->btf)
> + btf_put(rec->fields[i].kptr.btf);
> break;
> case BPF_LIST_HEAD:
> case BPF_LIST_NODE:
> @@ -537,8 +538,13 @@ void btf_record_free(struct btf_record *rec)
>
> void bpf_map_free_record(struct bpf_map *map)
> {
> + struct btf *btf = NULL;
> +
> + if (!IS_ERR_OR_NULL(map->record))
> + btf = map->record->btf;
> btf_record_free(map->record);
> map->record = NULL;
> + btf_put(btf);
> }
>
> struct btf_record *btf_record_dup(const struct btf_record *rec)
> @@ -561,7 +567,8 @@ struct btf_record *btf_record_dup(const struct
> btf_record *rec)
> case BPF_KPTR_UNREF:
> case BPF_KPTR_REF:
> case BPF_KPTR_PERCPU:
> - btf_get(fields[i].kptr.btf);
> + if (fields[i].kptr.btf != rec->btf)
> + btf_get(fields[i].kptr.btf);
> if (fields[i].kptr.module &&
> !try_module_get(fields[i].kptr.module)) {
> ret = -ENXIO;
> goto free;
> @@ -692,7 +699,10 @@ static void bpf_map_free_deferred(struct
> work_struct *work)
> {
> struct bpf_map *map = container_of(work, struct bpf_map, work);
> struct btf_record *rec = map->record;
> + struct btf *btf = NULL;
>
> + if (!IS_ERR_OR_NULL(rec))
> + btf = rec->btf;
> security_bpf_map_free(map);
> bpf_map_release_memcg(map);
> /* implementation dependent freeing */
> @@ -707,6 +717,7 @@ static void bpf_map_free_deferred(struct work_struct
> *work)
> * template bpf_map struct used during verification.
> */
> btf_record_free(rec);
> + btf_put(btf);
> }
>
> static void bpf_map_put_uref(struct bpf_map *map)
> @@ -1036,6 +1047,8 @@ static int map_check_btf(struct bpf_map *map,
> struct btf *btf,
> if (!IS_ERR_OR_NULL(map->record)) {
> int i;
>
> + btf_get(map->record->btf);
> +
> if (!bpf_capable()) {
> ret = -EPERM;
> goto free_map_tab;
>
>
>
> WDYT ?
>
>>>> Would doing the refcount_set(&btf->refcnt, 1) earlier in btf_parse
>>>> help?
>>> No, it does not. The core reason is what Hao is mentioned in
>>> https://lore.kernel.org/bpf/47ee3265-23f7-2130-ff28-27bfaf3f7877@huaweicloud.com/
>>>
>>> We simply cannot take btf reference if called from btf_parse().
>>> Let us say we move refcount_set(&btf->refcnt, 1) earlier in btf_parse()
>>> so we take ref for btf during btf_parse_fields(), then we have
>>> btf_put <=== expect refcount == 0 to start the destruction process
>>> ...
>>> btf_record_free <=== in which if graph_root, a btf reference
>>> will be hold
>>> so btf_put will never be able to actually free btf data.
>>> Yes, the kasan problem will be resolved but we leak memory.
>> Let me send another version with better commit message.
>
>> [...]
next prev parent reply other threads:[~2023-12-08 17:07 UTC|newest]
Thread overview: 12+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-12-06 21:09 [PATCH bpf-next v4] bpf: Fix a race condition between btf_put() and map_free() Yonghong Song
2023-12-07 13:46 ` Hou Tao
2023-12-08 1:23 ` Martin KaFai Lau
2023-12-08 3:59 ` Yonghong Song
2023-12-08 4:02 ` Yonghong Song
2023-12-08 8:30 ` Hou Tao
2023-12-08 17:07 ` Yonghong Song [this message]
2023-12-14 4:17 ` Alexei Starovoitov
2023-12-14 6:30 ` Yonghong Song
2023-12-08 8:16 ` Martin KaFai Lau
2023-12-08 16:45 ` Yonghong Song
2023-12-08 18:26 ` Martin KaFai Lau
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=0e657fc3-d932-4bd6-9d74-54eff22d3641@linux.dev \
--to=yonghong.song@linux.dev \
--cc=andrii@kernel.org \
--cc=ast@kernel.org \
--cc=bpf@vger.kernel.org \
--cc=daniel@iogearbox.net \
--cc=houtao@huaweicloud.com \
--cc=kernel-team@fb.com \
--cc=martin.lau@kernel.org \
--cc=martin.lau@linux.dev \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.