From: Martin KaFai Lau <martin.lau@linux.dev>
To: Yonghong Song <yonghong.song@linux.dev>
Cc: Alexei Starovoitov <ast@kernel.org>,
Andrii Nakryiko <andrii@kernel.org>,
Daniel Borkmann <daniel@iogearbox.net>,
kernel-team@fb.com, Martin KaFai Lau <martin.lau@kernel.org>,
Hou Tao <houtao@huaweicloud.com>,
bpf@vger.kernel.org
Subject: Re: [PATCH bpf-next v4] bpf: Fix a race condition between btf_put() and map_free()
Date: Thu, 7 Dec 2023 17:23:50 -0800 [thread overview]
Message-ID: <d1c0232c-a41c-4cce-9bdf-3a1e8850ed05@linux.dev> (raw)
In-Reply-To: <20231206210959.1035724-1-yonghong.song@linux.dev>
On 12/6/23 1:09 PM, Yonghong Song wrote:
> When running `./test_progs -j` in my local vm with latest kernel,
> I once hit a kasan error like below:
>
> [ 1887.184724] BUG: KASAN: slab-use-after-free in bpf_rb_root_free+0x1f8/0x2b0
> [ 1887.185599] Read of size 4 at addr ffff888106806910 by task kworker/u12:2/2830
> [ 1887.186498]
> [ 1887.186712] CPU: 3 PID: 2830 Comm: kworker/u12:2 Tainted: G OEL 6.7.0-rc3-00699-g90679706d486-dirty #494
> [ 1887.188034] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
> [ 1887.189618] Workqueue: events_unbound bpf_map_free_deferred
> [ 1887.190341] Call Trace:
> [ 1887.190666] <TASK>
> [ 1887.190949] dump_stack_lvl+0xac/0xe0
> [ 1887.191423] ? nf_tcp_handle_invalid+0x1b0/0x1b0
> [ 1887.192019] ? panic+0x3c0/0x3c0
> [ 1887.192449] print_report+0x14f/0x720
> [ 1887.192930] ? preempt_count_sub+0x1c/0xd0
> [ 1887.193459] ? __virt_addr_valid+0xac/0x120
> [ 1887.194004] ? bpf_rb_root_free+0x1f8/0x2b0
> [ 1887.194572] kasan_report+0xc3/0x100
> [ 1887.195085] ? bpf_rb_root_free+0x1f8/0x2b0
> [ 1887.195668] bpf_rb_root_free+0x1f8/0x2b0
> [ 1887.196183] ? __bpf_obj_drop_impl+0xb0/0xb0
> [ 1887.196736] ? preempt_count_sub+0x1c/0xd0
> [ 1887.197270] ? preempt_count_sub+0x1c/0xd0
> [ 1887.197802] ? _raw_spin_unlock+0x1f/0x40
> [ 1887.198319] bpf_obj_free_fields+0x1d4/0x260
> [ 1887.198883] array_map_free+0x1a3/0x260
> [ 1887.199380] bpf_map_free_deferred+0x7b/0xe0
> [ 1887.199943] process_scheduled_works+0x3a2/0x6c0
> [ 1887.200549] worker_thread+0x633/0x890
> [ 1887.201047] ? __kthread_parkme+0xd7/0xf0
> [ 1887.201574] ? kthread+0x102/0x1d0
> [ 1887.202020] kthread+0x1ab/0x1d0
> [ 1887.202447] ? pr_cont_work+0x270/0x270
> [ 1887.202954] ? kthread_blkcg+0x50/0x50
> [ 1887.203444] ret_from_fork+0x34/0x50
> [ 1887.203914] ? kthread_blkcg+0x50/0x50
> [ 1887.204397] ret_from_fork_asm+0x11/0x20
> [ 1887.204913] </TASK>
> [ 1887.204913] </TASK>
> [ 1887.205209]
> [ 1887.205416] Allocated by task 2197:
> [ 1887.205881] kasan_set_track+0x3f/0x60
> [ 1887.206366] __kasan_kmalloc+0x6e/0x80
> [ 1887.206856] __kmalloc+0xac/0x1a0
> [ 1887.207293] btf_parse_fields+0xa15/0x1480
> [ 1887.207836] btf_parse_struct_metas+0x566/0x670
> [ 1887.208387] btf_new_fd+0x294/0x4d0
> [ 1887.208851] __sys_bpf+0x4ba/0x600
> [ 1887.209292] __x64_sys_bpf+0x41/0x50
> [ 1887.209762] do_syscall_64+0x4c/0xf0
> [ 1887.210222] entry_SYSCALL_64_after_hwframe+0x63/0x6b
> [ 1887.210868]
> [ 1887.211074] Freed by task 36:
> [ 1887.211460] kasan_set_track+0x3f/0x60
> [ 1887.211951] kasan_save_free_info+0x28/0x40
> [ 1887.212485] ____kasan_slab_free+0x101/0x180
> [ 1887.213027] __kmem_cache_free+0xe4/0x210
> [ 1887.213514] btf_free+0x5b/0x130
> [ 1887.213918] rcu_core+0x638/0xcc0
> [ 1887.214347] __do_softirq+0x114/0x37e
>
> The error happens at bpf_rb_root_free+0x1f8/0x2b0:
>
> 00000000000034c0 <bpf_rb_root_free>:
> ; {
> 34c0: f3 0f 1e fa endbr64
> 34c4: e8 00 00 00 00 callq 0x34c9 <bpf_rb_root_free+0x9>
> 34c9: 55 pushq %rbp
> 34ca: 48 89 e5 movq %rsp, %rbp
> ...
> ; if (rec && rec->refcount_off >= 0 &&
> 36aa: 4d 85 ed testq %r13, %r13
> 36ad: 74 a9 je 0x3658 <bpf_rb_root_free+0x198>
> 36af: 49 8d 7d 10 leaq 0x10(%r13), %rdi
> 36b3: e8 00 00 00 00 callq 0x36b8 <bpf_rb_root_free+0x1f8>
> <==== kasan function
> 36b8: 45 8b 7d 10 movl 0x10(%r13), %r15d
> <==== use-after-free load
> 36bc: 45 85 ff testl %r15d, %r15d
> 36bf: 78 8c js 0x364d <bpf_rb_root_free+0x18d>
>
> So the problem is at rec->refcount_off in the above.
>
> I did some source code analysis and find the reason.
> CPU A CPU B
> bpf_map_put:
> ...
> btf_put with rcu callback
> ...
> bpf_map_free_deferred
> with system_unbound_wq
> ... ... ...
> ... btf_free_rcu: ...
> ... ... bpf_map_free_deferred:
> ... ...
> ... ---------> btf_struct_metas_free()
> ... | race condition ...
> ... ---------> map->ops->map_free()
> ...
> ... btf->struct_meta_tab = NULL
>
> In the above, map_free() corresponds to array_map_free() and eventually
> calling bpf_rb_root_free() which calls:
> ...
> __bpf_obj_drop_impl(obj, field->graph_root.value_rec, false);
> ...
>
> Here, 'value_rec' is assigned in btf_check_and_fixup_fields() with following code:
>
> meta = btf_find_struct_meta(btf, btf_id);
> if (!meta)
> return -EFAULT;
> rec->fields[i].graph_root.value_rec = meta->record;
>
> So basically, 'value_rec' is a pointer to the record in struct_metas_tab.
> And it is possible that that particular record has been freed by
> btf_struct_metas_free() and hence we have a kasan error here.
>
> Actually it is very hard to reproduce the failure with current bpf/bpf-next
> code, I only got the above error once. To increase reproducibility, I added
> a delay in bpf_map_free_deferred() to delay map->ops->map_free(), which
> significantly increased reproducibility.
>
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 5e43ddd1b83f..aae5b5213e93 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -695,6 +695,7 @@ static void bpf_map_free_deferred(struct work_struct *work)
> struct bpf_map *map = container_of(work, struct bpf_map, work);
> struct btf_record *rec = map->record;
>
> + mdelay(100);
> security_bpf_map_free(map);
> bpf_map_release_memcg(map);
> /* implementation dependent freeing */
>
> To fix the problem, we need to have a reference on btf in order to
> safeguard accessing field->graph_root.value_rec in map->ops->map_free().
> The function btf_parse_graph_root() is the place to get a btf reference.
> The following are rough call stacks reaching bpf_parse_graph_root():
>
> btf_parse
> ...
> btf_parse_fields
> ...
> btf_parse_graph_root
>
> map_check_btf
> btf_parse_fields
> ...
> btf_parse_graph_root
>
> Looking at the above call stack, the btf_parse_graph_root() is indirectly
> called from btf_parse() or map_check_btf().
>
> We cannot take a reference in btf_parse() case since at that moment,
> btf is still in the middle to self-validation and initial reference
> (refcount_set(&btf->refcnt, 1)) has not been triggered yet.
Thanks for the details analysis and clear explanation. It helps a lot.
Sorry for jumping in late.
I am trying to avoid making a special case for "bool has_btf_ref;" and "bool
from_map_check". It seems to a bit too much to deal with the error path for
btf_parse().
Would doing the refcount_set(&btf->refcnt, 1) earlier in btf_parse help?
> It is also unnecessary to take a reference since the value_rec is
> referring to a record in struct_meta_tab.
If we optimize for not taking a refcnt, how about not taking a refcnt for all
cases and postpone the btf_put(), instead of taking refcnt in one case but not
another. Like your fix in v1. The failed selftest can be changed or even removed
if it does not make sense anymore.
>
> But we do need to take a reference for map_check_btf() case as
> the value_rec refers to a record in struct_meta_tab and we want
> to make sure btf is not freed until value_rec usage at map_free
> is done.
>
> So the fix is to get a btf reference for map_check_btf() case
> whenever a graph_root is found. A boolean field 'has_btf_ref' is added to
> struct btf_field_graph_root so later the btf reference can be properly
> released.
>
> Rerun './test_progs -j' with the above mdelay() hack for a couple
> of times and didn't observe the error.
> Running Hou's test ([1]) is also successful.
>
> [1] https://lore.kernel.org/bpf/20231206110625.3188975-1-houtao@huaweicloud.com/
>
next prev parent reply other threads:[~2023-12-08 1:23 UTC|newest]
Thread overview: 12+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-12-06 21:09 [PATCH bpf-next v4] bpf: Fix a race condition between btf_put() and map_free() Yonghong Song
2023-12-07 13:46 ` Hou Tao
2023-12-08 1:23 ` Martin KaFai Lau [this message]
2023-12-08 3:59 ` Yonghong Song
2023-12-08 4:02 ` Yonghong Song
2023-12-08 8:30 ` Hou Tao
2023-12-08 17:07 ` Yonghong Song
2023-12-14 4:17 ` Alexei Starovoitov
2023-12-14 6:30 ` Yonghong Song
2023-12-08 8:16 ` Martin KaFai Lau
2023-12-08 16:45 ` Yonghong Song
2023-12-08 18:26 ` Martin KaFai Lau
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=d1c0232c-a41c-4cce-9bdf-3a1e8850ed05@linux.dev \
--to=martin.lau@linux.dev \
--cc=andrii@kernel.org \
--cc=ast@kernel.org \
--cc=bpf@vger.kernel.org \
--cc=daniel@iogearbox.net \
--cc=houtao@huaweicloud.com \
--cc=kernel-team@fb.com \
--cc=martin.lau@kernel.org \
--cc=yonghong.song@linux.dev \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox