From: Yonghong Song <yonghong.song@linux.dev>
To: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Alexei Starovoitov <ast@kernel.org>,
Andrii Nakryiko <andrii@kernel.org>,
Daniel Borkmann <daniel@iogearbox.net>,
kernel-team@fb.com, Martin KaFai Lau <martin.lau@kernel.org>,
Hou Tao <houtao@huaweicloud.com>,
bpf@vger.kernel.org
Subject: Re: [PATCH bpf-next v4] bpf: Fix a race condition between btf_put() and map_free()
Date: Thu, 7 Dec 2023 20:02:49 -0800 [thread overview]
Message-ID: <ad71a99d-8b5f-44b4-99ee-5afb31c60bff@linux.dev> (raw)
In-Reply-To: <969852f3-34f8-45d9-bf2d-f6a4d5167e55@linux.dev>
On 12/7/23 7:59 PM, Yonghong Song wrote:
>
> On 12/7/23 5:23 PM, Martin KaFai Lau wrote:
>> On 12/6/23 1:09 PM, Yonghong Song wrote:
>>> When running `./test_progs -j` in my local vm with latest kernel,
>>> I once hit a kasan error like below:
>>>
>>> [ 1887.184724] BUG: KASAN: slab-use-after-free in
>>> bpf_rb_root_free+0x1f8/0x2b0
>>> [ 1887.185599] Read of size 4 at addr ffff888106806910 by task
>>> kworker/u12:2/2830
>>> [ 1887.186498]
>>> [ 1887.186712] CPU: 3 PID: 2830 Comm: kworker/u12:2 Tainted:
>>> G OEL 6.7.0-rc3-00699-g90679706d486-dirty #494
>>> [ 1887.188034] Hardware name: QEMU Standard PC (i440FX + PIIX,
>>> 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
>>> [ 1887.189618] Workqueue: events_unbound bpf_map_free_deferred
>>> [ 1887.190341] Call Trace:
>>> [ 1887.190666] <TASK>
>>> [ 1887.190949] dump_stack_lvl+0xac/0xe0
>>> [ 1887.191423] ? nf_tcp_handle_invalid+0x1b0/0x1b0
>>> [ 1887.192019] ? panic+0x3c0/0x3c0
>>> [ 1887.192449] print_report+0x14f/0x720
>>> [ 1887.192930] ? preempt_count_sub+0x1c/0xd0
>>> [ 1887.193459] ? __virt_addr_valid+0xac/0x120
>>> [ 1887.194004] ? bpf_rb_root_free+0x1f8/0x2b0
>>> [ 1887.194572] kasan_report+0xc3/0x100
>>> [ 1887.195085] ? bpf_rb_root_free+0x1f8/0x2b0
>>> [ 1887.195668] bpf_rb_root_free+0x1f8/0x2b0
>>> [ 1887.196183] ? __bpf_obj_drop_impl+0xb0/0xb0
>>> [ 1887.196736] ? preempt_count_sub+0x1c/0xd0
>>> [ 1887.197270] ? preempt_count_sub+0x1c/0xd0
>>> [ 1887.197802] ? _raw_spin_unlock+0x1f/0x40
>>> [ 1887.198319] bpf_obj_free_fields+0x1d4/0x260
>>> [ 1887.198883] array_map_free+0x1a3/0x260
>>> [ 1887.199380] bpf_map_free_deferred+0x7b/0xe0
>>> [ 1887.199943] process_scheduled_works+0x3a2/0x6c0
>>> [ 1887.200549] worker_thread+0x633/0x890
>>> [ 1887.201047] ? __kthread_parkme+0xd7/0xf0
>>> [ 1887.201574] ? kthread+0x102/0x1d0
>>> [ 1887.202020] kthread+0x1ab/0x1d0
>>> [ 1887.202447] ? pr_cont_work+0x270/0x270
>>> [ 1887.202954] ? kthread_blkcg+0x50/0x50
>>> [ 1887.203444] ret_from_fork+0x34/0x50
>>> [ 1887.203914] ? kthread_blkcg+0x50/0x50
>>> [ 1887.204397] ret_from_fork_asm+0x11/0x20
>>> [ 1887.204913] </TASK>
>>> [ 1887.204913] </TASK>
>>> [ 1887.205209]
>>> [ 1887.205416] Allocated by task 2197:
>>> [ 1887.205881] kasan_set_track+0x3f/0x60
>>> [ 1887.206366] __kasan_kmalloc+0x6e/0x80
>>> [ 1887.206856] __kmalloc+0xac/0x1a0
>>> [ 1887.207293] btf_parse_fields+0xa15/0x1480
>>> [ 1887.207836] btf_parse_struct_metas+0x566/0x670
>>> [ 1887.208387] btf_new_fd+0x294/0x4d0
>>> [ 1887.208851] __sys_bpf+0x4ba/0x600
>>> [ 1887.209292] __x64_sys_bpf+0x41/0x50
>>> [ 1887.209762] do_syscall_64+0x4c/0xf0
>>> [ 1887.210222] entry_SYSCALL_64_after_hwframe+0x63/0x6b
>>> [ 1887.210868]
>>> [ 1887.211074] Freed by task 36:
>>> [ 1887.211460] kasan_set_track+0x3f/0x60
>>> [ 1887.211951] kasan_save_free_info+0x28/0x40
>>> [ 1887.212485] ____kasan_slab_free+0x101/0x180
>>> [ 1887.213027] __kmem_cache_free+0xe4/0x210
>>> [ 1887.213514] btf_free+0x5b/0x130
>>> [ 1887.213918] rcu_core+0x638/0xcc0
>>> [ 1887.214347] __do_softirq+0x114/0x37e
>>>
>>> The error happens at bpf_rb_root_free+0x1f8/0x2b0:
>>>
>>> 00000000000034c0 <bpf_rb_root_free>:
>>> ; {
>>> 34c0: f3 0f 1e fa endbr64
>>> 34c4: e8 00 00 00 00 callq 0x34c9
>>> <bpf_rb_root_free+0x9>
>>> 34c9: 55 pushq %rbp
>>> 34ca: 48 89 e5 movq %rsp, %rbp
>>> ...
>>> ; if (rec && rec->refcount_off >= 0 &&
>>> 36aa: 4d 85 ed testq %r13, %r13
>>> 36ad: 74 a9 je 0x3658
>>> <bpf_rb_root_free+0x198>
>>> 36af: 49 8d 7d 10 leaq 0x10(%r13), %rdi
>>> 36b3: e8 00 00 00 00 callq 0x36b8
>>> <bpf_rb_root_free+0x1f8>
>>> <==== kasan function
>>> 36b8: 45 8b 7d 10 movl 0x10(%r13), %r15d
>>> <==== use-after-free load
>>> 36bc: 45 85 ff testl %r15d, %r15d
>>> 36bf: 78 8c js 0x364d
>>> <bpf_rb_root_free+0x18d>
>>>
>>> So the problem is at rec->refcount_off in the above.
>>>
>>> I did some source code analysis and find the reason.
>>> CPU A CPU B
>>> bpf_map_put:
>>> ...
>>> btf_put with rcu callback
>>> ...
>>> bpf_map_free_deferred
>>> with system_unbound_wq
>>> ... ... ...
>>> ... btf_free_rcu: ...
>>> ... ... bpf_map_free_deferred:
>>> ... ...
>>> ... ---------> btf_struct_metas_free()
>>> ... | race condition ...
>>> ... ---------> map->ops->map_free()
>>> ...
>>> ... btf->struct_meta_tab = NULL
>>>
>>> In the above, map_free() corresponds to array_map_free() and eventually
>>> calling bpf_rb_root_free() which calls:
>>> ...
>>> __bpf_obj_drop_impl(obj, field->graph_root.value_rec, false);
>>> ...
>>>
>>> Here, 'value_rec' is assigned in btf_check_and_fixup_fields() with
>>> following code:
>>>
>>> meta = btf_find_struct_meta(btf, btf_id);
>>> if (!meta)
>>> return -EFAULT;
>>> rec->fields[i].graph_root.value_rec = meta->record;
>>>
>>> So basically, 'value_rec' is a pointer to the record in
>>> struct_metas_tab.
>>> And it is possible that that particular record has been freed by
>>> btf_struct_metas_free() and hence we have a kasan error here.
>>>
>>> Actually it is very hard to reproduce the failure with current
>>> bpf/bpf-next
>>> code, I only got the above error once. To increase reproducibility,
>>> I added
>>> a delay in bpf_map_free_deferred() to delay map->ops->map_free(), which
>>> significantly increased reproducibility.
>>>
>>> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
>>> index 5e43ddd1b83f..aae5b5213e93 100644
>>> --- a/kernel/bpf/syscall.c
>>> +++ b/kernel/bpf/syscall.c
>>> @@ -695,6 +695,7 @@ static void bpf_map_free_deferred(struct
>>> work_struct *work)
>>> struct bpf_map *map = container_of(work, struct bpf_map,
>>> work);
>>> struct btf_record *rec = map->record;
>>>
>>> + mdelay(100);
>>> security_bpf_map_free(map);
>>> bpf_map_release_memcg(map);
>>> /* implementation dependent freeing */
>>>
>>> To fix the problem, we need to have a reference on btf in order to
>>> safeguard accessing field->graph_root.value_rec in
>>> map->ops->map_free().
>>> The function btf_parse_graph_root() is the place to get a btf
>>> reference.
>>> The following are rough call stacks reaching bpf_parse_graph_root():
>>>
>>> btf_parse
>>> ...
>>> btf_parse_fields
>>> ...
>>> btf_parse_graph_root
>>>
>>> map_check_btf
>>> btf_parse_fields
>>> ...
>>> btf_parse_graph_root
>>>
>>> Looking at the above call stack, the btf_parse_graph_root() is
>>> indirectly
>>> called from btf_parse() or map_check_btf().
>>>
>>> We cannot take a reference in btf_parse() case since at that moment,
>>> btf is still in the middle to self-validation and initial reference
>>> (refcount_set(&btf->refcnt, 1)) has not been triggered yet.
>>
>> Thanks for the details analysis and clear explanation. It helps a lot.
>>
>> Sorry for jumping in late.
>>
>> I am trying to avoid making a special case for "bool has_btf_ref;"
>> and "bool from_map_check". It seems to a bit too much to deal with
>> the error path for btf_parse().
>>
>> Would doing the refcount_set(&btf->refcnt, 1) earlier in btf_parse help?
>
> No, it does not. The core reason is what Hao is mentioned in
> https://lore.kernel.org/bpf/47ee3265-23f7-2130-ff28-27bfaf3f7877@huaweicloud.com/
> We simply cannot take btf reference if called from btf_parse().
> Let us say we move refcount_set(&btf->refcnt, 1) earlier in btf_parse()
> so we take ref for btf during btf_parse_fields(), then we have
> btf_put <=== expect refcount == 0 to start the destruction process
> ...
> btf_record_free <=== in which if graph_root, a btf reference
> will be hold
> so btf_put will never be able to actually free btf data.
> Yes, the kasan problem will be resolved but we leak memory.
Let me send another version with better commit message.
[...]
next prev parent reply other threads:[~2023-12-08 4:02 UTC|newest]
Thread overview: 12+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-12-06 21:09 [PATCH bpf-next v4] bpf: Fix a race condition between btf_put() and map_free() Yonghong Song
2023-12-07 13:46 ` Hou Tao
2023-12-08 1:23 ` Martin KaFai Lau
2023-12-08 3:59 ` Yonghong Song
2023-12-08 4:02 ` Yonghong Song [this message]
2023-12-08 8:30 ` Hou Tao
2023-12-08 17:07 ` Yonghong Song
2023-12-14 4:17 ` Alexei Starovoitov
2023-12-14 6:30 ` Yonghong Song
2023-12-08 8:16 ` Martin KaFai Lau
2023-12-08 16:45 ` Yonghong Song
2023-12-08 18:26 ` Martin KaFai Lau
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ad71a99d-8b5f-44b4-99ee-5afb31c60bff@linux.dev \
--to=yonghong.song@linux.dev \
--cc=andrii@kernel.org \
--cc=ast@kernel.org \
--cc=bpf@vger.kernel.org \
--cc=daniel@iogearbox.net \
--cc=houtao@huaweicloud.com \
--cc=kernel-team@fb.com \
--cc=martin.lau@kernel.org \
--cc=martin.lau@linux.dev \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.