BPF List
 help / color / mirror / Atom feed
* [PATCH bpf-next v2 1/2] bpf: Fix a race condition between btf_put() and map_free()
@ 2023-12-05  6:04 Yonghong Song
  2023-12-05  6:04 ` [PATCH bpf-next v2 2/2] selftests/bpf: Fix flaky test_btf_id test Yonghong Song
  0 siblings, 1 reply; 4+ messages in thread
From: Yonghong Song @ 2023-12-05  6:04 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team,
	Martin KaFai Lau

When running `./test_progs -j` in my local vm with latest kernel,
I once hit a kasan error like below:

  [ 1887.184724] BUG: KASAN: slab-use-after-free in bpf_rb_root_free+0x1f8/0x2b0
  [ 1887.185599] Read of size 4 at addr ffff888106806910 by task kworker/u12:2/2830
  [ 1887.186498]
  [ 1887.186712] CPU: 3 PID: 2830 Comm: kworker/u12:2 Tainted: G           OEL     6.7.0-rc3-00699-g90679706d486-dirty #494
  [ 1887.188034] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
  [ 1887.189618] Workqueue: events_unbound bpf_map_free_deferred
  [ 1887.190341] Call Trace:
  [ 1887.190666]  <TASK>
  [ 1887.190949]  dump_stack_lvl+0xac/0xe0
  [ 1887.191423]  ? nf_tcp_handle_invalid+0x1b0/0x1b0
  [ 1887.192019]  ? panic+0x3c0/0x3c0
  [ 1887.192449]  print_report+0x14f/0x720
  [ 1887.192930]  ? preempt_count_sub+0x1c/0xd0
  [ 1887.193459]  ? __virt_addr_valid+0xac/0x120
  [ 1887.194004]  ? bpf_rb_root_free+0x1f8/0x2b0
  [ 1887.194572]  kasan_report+0xc3/0x100
  [ 1887.195085]  ? bpf_rb_root_free+0x1f8/0x2b0
  [ 1887.195668]  bpf_rb_root_free+0x1f8/0x2b0
  [ 1887.196183]  ? __bpf_obj_drop_impl+0xb0/0xb0
  [ 1887.196736]  ? preempt_count_sub+0x1c/0xd0
  [ 1887.197270]  ? preempt_count_sub+0x1c/0xd0
  [ 1887.197802]  ? _raw_spin_unlock+0x1f/0x40
  [ 1887.198319]  bpf_obj_free_fields+0x1d4/0x260
  [ 1887.198883]  array_map_free+0x1a3/0x260
  [ 1887.199380]  bpf_map_free_deferred+0x7b/0xe0
  [ 1887.199943]  process_scheduled_works+0x3a2/0x6c0
  [ 1887.200549]  worker_thread+0x633/0x890
  [ 1887.201047]  ? __kthread_parkme+0xd7/0xf0
  [ 1887.201574]  ? kthread+0x102/0x1d0
  [ 1887.202020]  kthread+0x1ab/0x1d0
  [ 1887.202447]  ? pr_cont_work+0x270/0x270
  [ 1887.202954]  ? kthread_blkcg+0x50/0x50
  [ 1887.203444]  ret_from_fork+0x34/0x50
  [ 1887.203914]  ? kthread_blkcg+0x50/0x50
  [ 1887.204397]  ret_from_fork_asm+0x11/0x20
  [ 1887.204913]  </TASK>
  [ 1887.204913]  </TASK>
  [ 1887.205209]
  [ 1887.205416] Allocated by task 2197:
  [ 1887.205881]  kasan_set_track+0x3f/0x60
  [ 1887.206366]  __kasan_kmalloc+0x6e/0x80
  [ 1887.206856]  __kmalloc+0xac/0x1a0
  [ 1887.207293]  btf_parse_fields+0xa15/0x1480
  [ 1887.207836]  btf_parse_struct_metas+0x566/0x670
  [ 1887.208387]  btf_new_fd+0x294/0x4d0
  [ 1887.208851]  __sys_bpf+0x4ba/0x600
  [ 1887.209292]  __x64_sys_bpf+0x41/0x50
  [ 1887.209762]  do_syscall_64+0x4c/0xf0
  [ 1887.210222]  entry_SYSCALL_64_after_hwframe+0x63/0x6b
  [ 1887.210868]
  [ 1887.211074] Freed by task 36:
  [ 1887.211460]  kasan_set_track+0x3f/0x60
  [ 1887.211951]  kasan_save_free_info+0x28/0x40
  [ 1887.212485]  ____kasan_slab_free+0x101/0x180
  [ 1887.213027]  __kmem_cache_free+0xe4/0x210
  [ 1887.213514]  btf_free+0x5b/0x130
  [ 1887.213918]  rcu_core+0x638/0xcc0
  [ 1887.214347]  __do_softirq+0x114/0x37e

The error happens at bpf_rb_root_free+0x1f8/0x2b0:

  00000000000034c0 <bpf_rb_root_free>:
  ; {
    34c0: f3 0f 1e fa                   endbr64
    34c4: e8 00 00 00 00                callq   0x34c9 <bpf_rb_root_free+0x9>
    34c9: 55                            pushq   %rbp
    34ca: 48 89 e5                      movq    %rsp, %rbp
  ...
  ;       if (rec && rec->refcount_off >= 0 &&
    36aa: 4d 85 ed                      testq   %r13, %r13
    36ad: 74 a9                         je      0x3658 <bpf_rb_root_free+0x198>
    36af: 49 8d 7d 10                   leaq    0x10(%r13), %rdi
    36b3: e8 00 00 00 00                callq   0x36b8 <bpf_rb_root_free+0x1f8>
                                        <==== kasan function
    36b8: 45 8b 7d 10                   movl    0x10(%r13), %r15d
                                        <==== use-after-free load
    36bc: 45 85 ff                      testl   %r15d, %r15d
    36bf: 78 8c                         js      0x364d <bpf_rb_root_free+0x18d>

So the problem is at rec->refcount_off in the above.

I did some source code analysis and find the reason.
                                  CPU A                        CPU B
  bpf_map_put:
    ...
    btf_put with rcu callback
    ...
    bpf_map_free_deferred
      with system_unbound_wq
    ...                          ...                           ...
    ...                          btf_free_rcu:                 ...
    ...                          ...                           bpf_map_free_deferred:
    ...                          ...
    ...         --------->       btf_struct_metas_free()
    ...         | race condition ...
    ...         --------->                                     map->ops->map_free()
    ...
    ...                          btf->struct_meta_tab = NULL

In the above, map_free() corresponds to array_map_free() and eventually
calling bpf_rb_root_free() which calls:
  ...
  __bpf_obj_drop_impl(obj, field->graph_root.value_rec, false);
  ...

Here, 'value_rec' is assigned in btf_check_and_fixup_fields() with following code:

  meta = btf_find_struct_meta(btf, btf_id);
  if (!meta)
    return -EFAULT;
  rec->fields[i].graph_root.value_rec = meta->record;

So basically, 'value_rec' is a pointer to the record in struct_metas_tab.
And it is possible that that particular record has been freed by
btf_struct_metas_free() and hence we have a kasan error here.

Actually it is very hard to reproduce the failure with current bpf/bpf-next
code, I only got the above error once. To increase reproducibility, I added
a delay in bpf_map_free_deferred() to delay map->ops->map_free(), which
significantly increased reproducibility.

  diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
  index 5e43ddd1b83f..aae5b5213e93 100644
  --- a/kernel/bpf/syscall.c
  +++ b/kernel/bpf/syscall.c
  @@ -695,6 +695,7 @@ static void bpf_map_free_deferred(struct work_struct *work)
        struct bpf_map *map = container_of(work, struct bpf_map, work);
        struct btf_record *rec = map->record;

  +     mdelay(100);
        security_bpf_map_free(map);
        bpf_map_release_memcg(map);
        /* implementation dependent freeing */

To fix the problem, I moved btf_put() after btf_record_free() to ensure
struct_metas available during map_free(). Rerun './test_progs -j' with the
above mdelay() hack for a couple of times and didn't observe the error.

Fixes: 958cf2e273f0 ("bpf: Introduce bpf_obj_new")
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
---
 kernel/bpf/syscall.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index ebaccf77d56e..1d40371bed24 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -694,6 +694,7 @@ static void bpf_map_free_deferred(struct work_struct *work)
 {
 	struct bpf_map *map = container_of(work, struct bpf_map, work);
 	struct btf_record *rec = map->record;
+	struct btf *btf = map->btf;
 
 	security_bpf_map_free(map);
 	bpf_map_release_memcg(map);
@@ -709,6 +710,10 @@ static void bpf_map_free_deferred(struct work_struct *work)
 	 * template bpf_map struct used during verification.
 	 */
 	btf_record_free(rec);
+	/* Delay freeing of btf for maps, as map_free callback may need
+	 * struct_meta info which will be freed with btf_put().
+	 */
+	btf_put(btf);
 }
 
 static void bpf_map_put_uref(struct bpf_map *map)
@@ -749,7 +754,6 @@ void bpf_map_put(struct bpf_map *map)
 	if (atomic64_dec_and_test(&map->refcnt)) {
 		/* bpf_map_free_id() must be called first */
 		bpf_map_free_id(map);
-		btf_put(map->btf);
 
 		WARN_ON_ONCE(atomic64_read(&map->sleepable_refcnt));
 		if (READ_ONCE(map->free_after_mult_rcu_gp))
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 4+ messages in thread

* [PATCH bpf-next v2 2/2] selftests/bpf: Fix flaky test_btf_id test
  2023-12-05  6:04 [PATCH bpf-next v2 1/2] bpf: Fix a race condition between btf_put() and map_free() Yonghong Song
@ 2023-12-05  6:04 ` Yonghong Song
  2023-12-05  6:39   ` Hou Tao
  0 siblings, 1 reply; 4+ messages in thread
From: Yonghong Song @ 2023-12-05  6:04 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team,
	Martin KaFai Lau

With previous patch, one of subtests in test_btf_id becomes
flaky and may fail. The following is a failing example:

  Error: #26 btf
  Error: #26/174 btf/BTF ID
    Error: #26/174 btf/BTF ID
    btf_raw_create:PASS:check 0 nsec
    btf_raw_create:PASS:check 0 nsec
    test_btf_id:PASS:check 0 nsec
    ...
    test_btf_id:PASS:check 0 nsec
    test_btf_id:FAIL:check BTF lingersdo_test_get_info:FAIL:check failed: -1

The test tries to prove a btf_id not available after the map is closed.
But btf_id is freed only after workqueue and a rcu grace period, compared
to previous case just after a rcu grade period.

To fix the flaky test, I added a kern_sync_rcu() after closing map and
before querying btf id availability, essentially ensuring a rcu grace
period in the kernel, which seems making the test happy.

Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
---
 tools/testing/selftests/bpf/prog_tests/btf.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tools/testing/selftests/bpf/prog_tests/btf.c b/tools/testing/selftests/bpf/prog_tests/btf.c
index 8fb4a04fbbc0..7feb4223bbac 100644
--- a/tools/testing/selftests/bpf/prog_tests/btf.c
+++ b/tools/testing/selftests/bpf/prog_tests/btf.c
@@ -4629,6 +4629,7 @@ static int test_btf_id(unsigned int test_num)
 
 	/* The map holds the last ref to BTF and its btf_id */
 	close(map_fd);
+	kern_sync_rcu();
 	map_fd = -1;
 	btf_fd[0] = bpf_btf_get_fd_by_id(map_info.btf_id);
 	if (CHECK(btf_fd[0] >= 0, "BTF lingers")) {
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: [PATCH bpf-next v2 2/2] selftests/bpf: Fix flaky test_btf_id test
  2023-12-05  6:04 ` [PATCH bpf-next v2 2/2] selftests/bpf: Fix flaky test_btf_id test Yonghong Song
@ 2023-12-05  6:39   ` Hou Tao
  2023-12-05  7:10     ` Yonghong Song
  0 siblings, 1 reply; 4+ messages in thread
From: Hou Tao @ 2023-12-05  6:39 UTC (permalink / raw)
  To: Yonghong Song, bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team,
	Martin KaFai Lau

Hi,

On 12/5/2023 2:04 PM, Yonghong Song wrote:
> With previous patch, one of subtests in test_btf_id becomes
> flaky and may fail. The following is a failing example:
>
>   Error: #26 btf
>   Error: #26/174 btf/BTF ID
>     Error: #26/174 btf/BTF ID
>     btf_raw_create:PASS:check 0 nsec
>     btf_raw_create:PASS:check 0 nsec
>     test_btf_id:PASS:check 0 nsec
>     ...
>     test_btf_id:PASS:check 0 nsec
>     test_btf_id:FAIL:check BTF lingersdo_test_get_info:FAIL:check failed: -1
>
> The test tries to prove a btf_id not available after the map is closed.
> But btf_id is freed only after workqueue and a rcu grace period, compared
> to previous case just after a rcu grade period.

It is not accurate. Before applying the patch, the btf_id will be
released in btf_put() and there is no RCU grace period involved. After
applying the patch, the btf_id will be released after the running of
bpf_map_free_deferred kworker.
>
> To fix the flaky test, I added a kern_sync_rcu() after closing map and
> before querying btf id availability, essentially ensuring a rcu grace
> period in the kernel, which seems making the test happy.

kern_sync_rcu() doesn't guarantee the bpf_map_free_deferred kworker will
complete, so why not remove the test case instead ?
>
> Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
> ---
>  tools/testing/selftests/bpf/prog_tests/btf.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/tools/testing/selftests/bpf/prog_tests/btf.c b/tools/testing/selftests/bpf/prog_tests/btf.c
> index 8fb4a04fbbc0..7feb4223bbac 100644
> --- a/tools/testing/selftests/bpf/prog_tests/btf.c
> +++ b/tools/testing/selftests/bpf/prog_tests/btf.c
> @@ -4629,6 +4629,7 @@ static int test_btf_id(unsigned int test_num)
>  
>  	/* The map holds the last ref to BTF and its btf_id */
>  	close(map_fd);
> +	kern_sync_rcu();
>  	map_fd = -1;
>  	btf_fd[0] = bpf_btf_get_fd_by_id(map_info.btf_id);
>  	if (CHECK(btf_fd[0] >= 0, "BTF lingers")) {


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH bpf-next v2 2/2] selftests/bpf: Fix flaky test_btf_id test
  2023-12-05  6:39   ` Hou Tao
@ 2023-12-05  7:10     ` Yonghong Song
  0 siblings, 0 replies; 4+ messages in thread
From: Yonghong Song @ 2023-12-05  7:10 UTC (permalink / raw)
  To: Hou Tao, bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, kernel-team,
	Martin KaFai Lau


On 12/5/23 1:39 AM, Hou Tao wrote:
> Hi,
>
> On 12/5/2023 2:04 PM, Yonghong Song wrote:
>> With previous patch, one of subtests in test_btf_id becomes
>> flaky and may fail. The following is a failing example:
>>
>>    Error: #26 btf
>>    Error: #26/174 btf/BTF ID
>>      Error: #26/174 btf/BTF ID
>>      btf_raw_create:PASS:check 0 nsec
>>      btf_raw_create:PASS:check 0 nsec
>>      test_btf_id:PASS:check 0 nsec
>>      ...
>>      test_btf_id:PASS:check 0 nsec
>>      test_btf_id:FAIL:check BTF lingersdo_test_get_info:FAIL:check failed: -1
>>
>> The test tries to prove a btf_id not available after the map is closed.
>> But btf_id is freed only after workqueue and a rcu grace period, compared
>> to previous case just after a rcu grade period.
> It is not accurate. Before applying the patch, the btf_id will be
> released in btf_put() and there is no RCU grace period involved. After

I missed it (and because I didn't double check the code).
Yes, btf_id is freed before going to rcu gp. So previously
reliable test now becomes not reliable due to workqueue.


> applying the patch, the btf_id will be released after the running of
> bpf_map_free_deferred kworker.
>> To fix the flaky test, I added a kern_sync_rcu() after closing map and
>> before querying btf id availability, essentially ensuring a rcu grace
>> period in the kernel, which seems making the test happy.
> kern_sync_rcu() doesn't guarantee the bpf_map_free_deferred kworker will
> complete, so why not remove the test case instead ?

Yes, I understand this. My hope is that kern_sync_rcu() can
make the test stable enough (that is why I am using 'seems making')
but no guarantees.

For this particular case, if I am doing refcount for btf as mentioned
in the comments of previous patch, we should be okay.

Will craft another version tomorrow with btf refcount approach.

>> Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
>> ---
>>   tools/testing/selftests/bpf/prog_tests/btf.c | 1 +
>>   1 file changed, 1 insertion(+)
>>
>> diff --git a/tools/testing/selftests/bpf/prog_tests/btf.c b/tools/testing/selftests/bpf/prog_tests/btf.c
>> index 8fb4a04fbbc0..7feb4223bbac 100644
>> --- a/tools/testing/selftests/bpf/prog_tests/btf.c
>> +++ b/tools/testing/selftests/bpf/prog_tests/btf.c
>> @@ -4629,6 +4629,7 @@ static int test_btf_id(unsigned int test_num)
>>   
>>   	/* The map holds the last ref to BTF and its btf_id */
>>   	close(map_fd);
>> +	kern_sync_rcu();
>>   	map_fd = -1;
>>   	btf_fd[0] = bpf_btf_get_fd_by_id(map_info.btf_id);
>>   	if (CHECK(btf_fd[0] >= 0, "BTF lingers")) {

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2023-12-05  7:11 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-12-05  6:04 [PATCH bpf-next v2 1/2] bpf: Fix a race condition between btf_put() and map_free() Yonghong Song
2023-12-05  6:04 ` [PATCH bpf-next v2 2/2] selftests/bpf: Fix flaky test_btf_id test Yonghong Song
2023-12-05  6:39   ` Hou Tao
2023-12-05  7:10     ` Yonghong Song

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox