* [PATCH bpf-next 0/2] Fix release of struct_ops map
@ 2024-11-08  8:26 Xu Kuohai
  2024-11-08  8:26 ` [PATCH bpf-next 1/2] bpf: Fix release of struct_ops map
  2024-11-08  8:26 ` [PATCH bpf-next 2/2] selftests/bpf: Add test for struct_ops map release

From: Xu Kuohai @ 2024-11-08  8:26 UTC
To: bpf, netdev
Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau,
    Eduard Zingerman, Yonghong Song, Kui-Feng Lee

From: Xu Kuohai <xukuohai@huawei.com>

This series fixes a bug I found while doing RCU waiting cleanup for the
struct_ops map: when a struct_ops map contains a sleepable prog, the map
risks being released while the prog is still running.

Xu Kuohai (2):
  bpf: Fix release of struct_ops map
  selftests/bpf: Add test for struct_ops map release

 kernel/bpf/bpf_struct_ops.c                   |  37 +++--
 kernel/bpf/syscall.c                          |   7 +-
 .../selftests/bpf/bpf_testmod/bpf_testmod.c   |  78 ++++++---
 .../bpf/bpf_testmod/bpf_testmod_kfunc.h       |   2 +-
 .../bpf/prog_tests/test_struct_ops_module.c   | 154 ++++++++++++++++++
 .../bpf/progs/struct_ops_map_release.c        |  30 ++++
 6 files changed, 267 insertions(+), 41 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/progs/struct_ops_map_release.c

-- 
2.39.5
* [PATCH bpf-next 1/2] bpf: Fix release of struct_ops map
  2024-11-08  8:26 ` Xu Kuohai
  2024-11-08 17:00   ` Alexei Starovoitov

From: Xu Kuohai @ 2024-11-08  8:26 UTC
To: bpf, netdev
Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau,
    Eduard Zingerman, Yonghong Song, Kui-Feng Lee

From: Xu Kuohai <xukuohai@huawei.com>

The test in the follow-up patch triggers the following kernel panic:

Oops: int3: 0000 [#1] PREEMPT SMP PTI
CPU: 0 UID: 0 PID: 465 Comm: test_progs Tainted: G OE 6.12.0-rc4-gd1d187515
Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-pr4
RIP: 0010:0xffffffffc0015041
Code: cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc ccc
RSP: 0018:ffffc9000187fd20 EFLAGS: 00000246
RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffffff82c54639 RDI: 0000000000000000
RBP: ffffc9000187fd48 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000001 R11: 000000004cba6383 R12: 0000000000000000
R13: 0000000000000002 R14: ffff88810438b7a0 R15: ffffffff8369d7a0
FS:  00007f05178006c0(0000) GS:ffff888236e00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f0508c94000 CR3: 0000000100d16003 CR4: 0000000000170ef0
Call Trace:
 <TASK>
 ? show_regs+0x68/0x80
 ? die+0x3b/0x90
 ? exc_int3+0xca/0xe0
 ? asm_exc_int3+0x3e/0x50
 run_struct_ops+0x58/0xb0 [bpf_testmod]
 param_attr_store+0xa2/0x100
 module_attr_store+0x25/0x40
 sysfs_kf_write+0x50/0x70
 kernfs_fop_write_iter+0x146/0x1f0
 vfs_write+0x27e/0x530
 ksys_write+0x75/0x100
 __x64_sys_write+0x1d/0x30
 x64_sys_call+0x1d30/0x1f30
 do_syscall_64+0x68/0x140
 entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7f051831727f
Code: 89 54 24 18 48 89 74 24 10 89 7c 24 08 e8 39 d5 f8 ff 48 8b 54 24 18 48 8b 74 24 108
RSP: 002b:00007f05177ffdf0 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 00007f05178006c0 RCX: 00007f051831727f
RDX: 0000000000000002 RSI: 00007f05177ffe30 RDI: 0000000000000004
RBP: 00007f05177ffe90 R08: 0000000000000000 R09: 000000000000000b
R10: 0000000000000000 R11: 0000000000000293 R12: ffffffffffffff30
R13: 0000000000000002 R14: 00007ffd926bfd50 R15: 00007f0517000000
 </TASK>

The panic happens because a sleepable prog is still running when the
struct_ops map is released.

To fix it, follow the approach used in bpf_tramp_image_put. First, before
releasing the struct_ops map, wait for an rcu_tasks_trace grace period so
that sleepable progs can finish. Then wait for an rcu_tasks grace period so
that non-sleepable progs and the remaining trampoline instructions can
finish.

Additionally, switch to call_rcu to remove the synchronous waiting, as
suggested by Alexei and Martin.
Fixes: b671c2067a04 ("bpf: Retire the struct_ops map kvalue->refcnt.")
Suggested-by: Alexei Starovoitov <ast@kernel.org>
Suggested-by: Martin KaFai Lau <martin.lau@linux.dev>
Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
---
 kernel/bpf/bpf_struct_ops.c | 37 +++++++++++++++++++------------------
 kernel/bpf/syscall.c        |  7 ++++++-
 2 files changed, 25 insertions(+), 19 deletions(-)

diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
index fda3dd2ee984..dd5f9bf12791 100644
--- a/kernel/bpf/bpf_struct_ops.c
+++ b/kernel/bpf/bpf_struct_ops.c
@@ -865,24 +865,6 @@ static void bpf_struct_ops_map_free(struct bpf_map *map)
 	 */
 	if (btf_is_module(st_map->btf))
 		module_put(st_map->st_ops_desc->st_ops->owner);
-
-	/* The struct_ops's function may switch to another struct_ops.
-	 *
-	 * For example, bpf_tcp_cc_x->init() may switch to
-	 * another tcp_cc_y by calling
-	 * setsockopt(TCP_CONGESTION, "tcp_cc_y").
-	 * During the switch, bpf_struct_ops_put(tcp_cc_x) is called
-	 * and its refcount may reach 0 which then free its
-	 * trampoline image while tcp_cc_x is still running.
-	 *
-	 * A vanilla rcu gp is to wait for all bpf-tcp-cc prog
-	 * to finish. bpf-tcp-cc prog is non sleepable.
-	 * A rcu_tasks gp is to wait for the last few insn
-	 * in the tramopline image to finish before releasing
-	 * the trampoline image.
-	 */
-	synchronize_rcu_mult(call_rcu, call_rcu_tasks);
-
 	__bpf_struct_ops_map_free(map);
 }
 
@@ -974,6 +956,25 @@ static struct bpf_map *bpf_struct_ops_map_alloc(union bpf_attr *attr)
 	mutex_init(&st_map->lock);
 	bpf_map_init_from_attr(map, attr);
 
+	/* The struct_ops's function may switch to another struct_ops.
+	 *
+	 * For example, bpf_tcp_cc_x->init() may switch to
+	 * another tcp_cc_y by calling
+	 * setsockopt(TCP_CONGESTION, "tcp_cc_y").
+	 * During the switch, bpf_struct_ops_put(tcp_cc_x) is called
+	 * and its refcount may reach 0 which then free its
+	 * trampoline image while tcp_cc_x is still running.
+	 *
+	 * Since struct_ops prog can be sleepable, a rcu_tasks_trace gp
+	 * is to wait for sleepable progs in the map to finish. Then a
+	 * rcu_tasks gp is to wait for the normal progs and the last few
+	 * insns in the tramopline image to finish before releasing the
+	 * trampoline image.
+	 *
+	 * Also see the comment in function bpf_tramp_image_put.
+	 */
+	WRITE_ONCE(map->free_after_mult_rcu_gp, true);
+
 	return map;
 
 errout_free:
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 8254b2973157..ae927f087f04 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -886,7 +886,12 @@ static void bpf_map_free_rcu_gp(struct rcu_head *rcu)
 
 static void bpf_map_free_mult_rcu_gp(struct rcu_head *rcu)
 {
-	if (rcu_trace_implies_rcu_gp())
+	struct bpf_map *map = container_of(rcu, struct bpf_map, rcu);
+
+	if (map->map_type == BPF_MAP_TYPE_STRUCT_OPS)
+		/* See comment in the end of bpf_struct_ops_map_alloc */
+		call_rcu_tasks(rcu, bpf_map_free_rcu_gp);
+	else if (rcu_trace_implies_rcu_gp())
 		bpf_map_free_rcu_gp(rcu);
 	else
 		call_rcu(rcu, bpf_map_free_rcu_gp);
-- 
2.39.5
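
For readers who want to see how the two grace periods end up chained by this
patch, the sketch below walks through the callback sequence. It is a
simplified illustration, not the kernel code itself: the sketch_* names are
invented, and it assumes (as the existing code referenced by the patch does)
that dropping the last map reference queues the map with
call_rcu_tasks_trace() whenever free_after_mult_rcu_gp is set.

static void sketch_do_free(struct bpf_map *map)
{
	/* release the map memory, trampoline image, etc. */
}

static void sketch_free_rcu_gp(struct rcu_head *rcu)
{
	struct bpf_map *map = container_of(rcu, struct bpf_map, rcu);

	/* Both grace periods have elapsed: no sleepable prog, non-sleepable
	 * prog, or trailing trampoline instruction can still be running, so
	 * it is now safe to free the map and its trampoline image.
	 */
	sketch_do_free(map);
}

static void sketch_free_mult_rcu_gp(struct rcu_head *rcu)
{
	struct bpf_map *map = container_of(rcu, struct bpf_map, rcu);

	if (map->map_type == BPF_MAP_TYPE_STRUCT_OPS)
		/* Stage 2: an rcu_tasks_trace GP has passed, so sleepable
		 * progs are done. Wait for an rcu_tasks GP so non-sleepable
		 * progs and the last trampoline insns can finish too.
		 */
		call_rcu_tasks(rcu, sketch_free_rcu_gp);
	else if (rcu_trace_implies_rcu_gp())
		sketch_free_rcu_gp(rcu);
	else
		call_rcu(rcu, sketch_free_rcu_gp);
}

static void sketch_map_put_last_ref(struct bpf_map *map)
{
	/* Stage 1: the struct_ops map was marked at allocation time, so
	 * queue it behind an rcu_tasks_trace GP first to let sleepable
	 * progs finish.
	 */
	if (READ_ONCE(map->free_after_mult_rcu_gp))
		call_rcu_tasks_trace(&map->rcu, sketch_free_mult_rcu_gp);
	else
		call_rcu(&map->rcu, sketch_free_rcu_gp);
}

Per the diff above, the patch itself only adds the struct_ops branch in
stage 2 and sets free_after_mult_rcu_gp when the map is allocated; stage 1
is pre-existing behavior the patch relies on.
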
* Re: [PATCH bpf-next 1/2] bpf: Fix release of struct_ops map
  2024-11-08 17:00 ` Alexei Starovoitov

From: Alexei Starovoitov @ 2024-11-08 17:00 UTC
To: Xu Kuohai
Cc: bpf, Network Development, Alexei Starovoitov, Daniel Borkmann,
    Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Yonghong Song,
    Kui-Feng Lee

On Fri, Nov 8, 2024 at 12:15 AM Xu Kuohai <xukuohai@huaweicloud.com> wrote:
>
> [...]
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 8254b2973157..ae927f087f04 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -886,7 +886,12 @@ static void bpf_map_free_rcu_gp(struct rcu_head *rcu)
>
>  static void bpf_map_free_mult_rcu_gp(struct rcu_head *rcu)
>  {
> -	if (rcu_trace_implies_rcu_gp())
> +	struct bpf_map *map = container_of(rcu, struct bpf_map, rcu);
> +
> +	if (map->map_type == BPF_MAP_TYPE_STRUCT_OPS)
> +		/* See comment in the end of bpf_struct_ops_map_alloc */

The fix makes sense, but pls copy paste a sentence here instead of the above
comment. Like:
"
rcu_tasks gp is necessary to wait for struct_ops bpf trampoline to finish.
Unlike all other bpf trampolines struct_ops trampoline is not protected by
percpu_ref.
"

> +		call_rcu_tasks(rcu, bpf_map_free_rcu_gp);
> +	else if (rcu_trace_implies_rcu_gp())

pw-bot: cr
* [PATCH bpf-next 2/2] selftests/bpf: Add test for struct_ops map release
  2024-11-08  8:26 ` Xu Kuohai
  2024-11-08 19:39   ` Martin KaFai Lau

From: Xu Kuohai @ 2024-11-08  8:26 UTC
To: bpf, netdev
Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau,
    Eduard Zingerman, Yonghong Song, Kui-Feng Lee

From: Xu Kuohai <xukuohai@huawei.com>

Add a test to ensure that the struct_ops map is released safely. Without the
previous fix, this test triggers a kernel panic due to the use-after-free.

The test runs multiple threads concurrently. Some threads create and destroy
struct_ops maps, while others execute progs within the maps. Each map has two
progs: one sleepable, the other non-sleepable.

Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
---
 .../selftests/bpf/bpf_testmod/bpf_testmod.c   |  78 ++++++---
 .../bpf/bpf_testmod/bpf_testmod_kfunc.h       |   2 +-
 .../bpf/prog_tests/test_struct_ops_module.c   | 154 ++++++++++++++++++
 .../bpf/progs/struct_ops_map_release.c        |  30 ++++
 4 files changed, 242 insertions(+), 22 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/progs/struct_ops_map_release.c

diff --git a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
index 987d41af71d2..72d21e8ba5d4 100644
--- a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
+++ b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
@@ -1016,6 +1016,11 @@ __bpf_kfunc int bpf_kfunc_st_ops_inc10(struct st_ops_args *args)
 	return args->a;
 }
 
+__bpf_kfunc void bpf_kfunc_msleep(u32 msecs)
+{
+	msleep(msecs);
+}
+
 BTF_KFUNCS_START(bpf_testmod_check_kfunc_ids)
 BTF_ID_FLAGS(func, bpf_testmod_test_mod_kfunc)
 BTF_ID_FLAGS(func, bpf_kfunc_call_test1)
@@ -1056,6 +1061,7 @@ BTF_ID_FLAGS(func, bpf_kfunc_st_ops_test_prologue, KF_TRUSTED_ARGS | KF_SLEEPABL
 BTF_ID_FLAGS(func, bpf_kfunc_st_ops_test_epilogue, KF_TRUSTED_ARGS | KF_SLEEPABLE)
 BTF_ID_FLAGS(func, bpf_kfunc_st_ops_test_pro_epilogue, KF_TRUSTED_ARGS | KF_SLEEPABLE)
 BTF_ID_FLAGS(func, bpf_kfunc_st_ops_inc10, KF_TRUSTED_ARGS)
+BTF_ID_FLAGS(func, bpf_kfunc_msleep, KF_SLEEPABLE)
 BTF_KFUNCS_END(bpf_testmod_check_kfunc_ids)
 
 static int bpf_testmod_ops_init(struct btf *btf)
@@ -1096,6 +1102,29 @@ static const struct bpf_verifier_ops bpf_testmod_verifier_ops = {
 	.is_valid_access = bpf_testmod_ops_is_valid_access,
 };
 
+static int bpf_testmod_test_1(void)
+{
+	return 0;
+}
+
+static void bpf_testmod_test_2(int a, int b)
+{
+}
+
+static int bpf_testmod_ops__test_maybe_null(int dummy,
+					    struct task_struct *task__nullable)
+{
+	return 0;
+}
+
+static struct bpf_testmod_ops __bpf_testmod_ops = {
+	.test_1 = bpf_testmod_test_1,
+	.test_2 = bpf_testmod_test_2,
+	.test_maybe_null = bpf_testmod_ops__test_maybe_null,
+};
+
+static struct bpf_testmod_ops *__bpf_dummy_ops = &__bpf_testmod_ops;
+
 static int bpf_dummy_reg(void *kdata, struct bpf_link *link)
 {
 	struct bpf_testmod_ops *ops = kdata;
@@ -1108,20 +1137,14 @@ static int bpf_dummy_reg(void *kdata, struct bpf_link *link)
 	if (ops->test_2)
 		ops->test_2(4, ops->data);
 
-	return 0;
-}
-
-static void bpf_dummy_unreg(void *kdata, struct bpf_link *link)
-{
-}
+	WRITE_ONCE(__bpf_dummy_ops, ops);
 
-static int bpf_testmod_test_1(void)
-{
 	return 0;
 }
 
-static void bpf_testmod_test_2(int a, int b)
+static void bpf_dummy_unreg(void *kdata, struct bpf_link *link)
 {
+	WRITE_ONCE(__bpf_dummy_ops, &__bpf_testmod_ops);
 }
 
 static int bpf_testmod_tramp(int value)
@@ -1129,18 +1152,6 @@ static int bpf_testmod_tramp(int value)
 	return 0;
 }
 
-static int bpf_testmod_ops__test_maybe_null(int dummy,
-					    struct task_struct *task__nullable)
-{
-	return 0;
-}
-
-static struct bpf_testmod_ops __bpf_testmod_ops = {
-	.test_1 = bpf_testmod_test_1,
-	.test_2 = bpf_testmod_test_2,
-	.test_maybe_null = bpf_testmod_ops__test_maybe_null,
-};
-
 struct bpf_struct_ops bpf_bpf_testmod_ops = {
 	.verifier_ops = &bpf_testmod_verifier_ops,
 	.init = bpf_testmod_ops_init,
@@ -1375,6 +1386,31 @@ static void bpf_testmod_exit(void)
 	unregister_bpf_testmod_uprobe();
 }
 
+static int run_struct_ops(const char *val, const struct kernel_param *kp)
+{
+	int ret;
+	unsigned int repeat;
+	struct bpf_testmod_ops *ops;
+
+	ret = kstrtouint(val, 10, &repeat);
+	if (ret)
+		return ret;
+
+	if (repeat > 10000)
+		return -ERANGE;
+
+	while (repeat-- > 0) {
+		ops = READ_ONCE(__bpf_dummy_ops);
+		if (ops->test_1)
+			ops->test_1();
+		if (ops->test_2)
+			ops->test_2(0, 0);
+	}
+
+	return 0;
+}
+
+module_param_call(run_struct_ops, run_struct_ops, NULL, NULL, 0200);
 
 module_init(bpf_testmod_init);
 module_exit(bpf_testmod_exit);
diff --git a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod_kfunc.h b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod_kfunc.h
index b58817938deb..260faebd5633 100644
--- a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod_kfunc.h
+++ b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod_kfunc.h
@@ -158,5 +158,5 @@ void bpf_kfunc_trusted_vma_test(struct vm_area_struct *ptr) __ksym;
 void bpf_kfunc_trusted_task_test(struct task_struct *ptr) __ksym;
 void bpf_kfunc_trusted_num_test(int *ptr) __ksym;
 void bpf_kfunc_rcu_task_test(struct task_struct *ptr) __ksym;
-
+void bpf_kfunc_msleep(__u32 msecs) __ksym;
 #endif /* _BPF_TESTMOD_KFUNC_H */
diff --git a/tools/testing/selftests/bpf/prog_tests/test_struct_ops_module.c b/tools/testing/selftests/bpf/prog_tests/test_struct_ops_module.c
index 75a0dea511b3..df744d51cade 100644
--- a/tools/testing/selftests/bpf/prog_tests/test_struct_ops_module.c
+++ b/tools/testing/selftests/bpf/prog_tests/test_struct_ops_module.c
@@ -9,6 +9,7 @@
 #include "struct_ops_nulled_out_cb.skel.h"
 #include "struct_ops_forgotten_cb.skel.h"
 #include "struct_ops_detach.skel.h"
+#include "struct_ops_map_release.skel.h"
 #include "unsupported_ops.skel.h"
 
 static void check_map_info(struct bpf_map_info *info)
@@ -246,6 +247,157 @@ static void test_struct_ops_forgotten_cb(void)
 	struct_ops_forgotten_cb__destroy(skel);
 }
 
+struct test_context {
+	pthread_mutex_t mutex;
+	pthread_cond_t cond;
+	int total_threads;
+	int wait_threads;
+	int dead_threads;
+	int repeat;
+	int loop;
+};
+
+static int wait_others(struct test_context *ctx)
+{
+	int ret = 0;
+
+	pthread_mutex_lock(&ctx->mutex);
+
+	if (ctx->dead_threads) {
+		pthread_cond_broadcast(&ctx->cond);
+		pthread_mutex_unlock(&ctx->mutex);
+		return -1;
+	}
+
+	++ctx->wait_threads;
+	if (ctx->wait_threads >= ctx->total_threads) {
+		pthread_cond_broadcast(&ctx->cond);
+		ctx->wait_threads = 0;
+	} else {
+		pthread_cond_wait(&ctx->cond, &ctx->mutex);
+		if (ctx->dead_threads)
+			ret = -1;
+	}
+
+	pthread_mutex_unlock(&ctx->mutex);
+
+	return ret;
+}
+
+static void mark_dead(struct test_context *ctx)
+{
+	pthread_mutex_lock(&ctx->mutex);
+	ctx->dead_threads++;
+	pthread_cond_broadcast(&ctx->cond);
+	pthread_mutex_unlock(&ctx->mutex);
+}
+
+static int load_release(struct test_context *ctx)
+{
+	int ret = 0;
+	struct bpf_link *link = NULL;
+	struct struct_ops_map_release *skel = NULL;
+
+	skel = struct_ops_map_release__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "open_and_load")) {
+		ret = -1;
+		mark_dead(ctx);
+		goto out;
+	}
+
+	link = bpf_map__attach_struct_ops(skel->maps.testmod_ops);
+	if (!ASSERT_OK_PTR(link, "attach_struct_ops")) {
+		ret = -1;
+		mark_dead(ctx);
+		goto out;
+	}
+
+	if (wait_others(ctx)) {
+		ret = -1;
+		goto out;
+	}
+
+out:
+	bpf_link__destroy(link);
+	struct_ops_map_release__destroy(skel);
+	return ret;
+}
+
+static void *thread_load_release(void *arg)
+{
+	struct test_context *ctx = (struct test_context *)arg;
+
+	for (int i = 0; i < ctx->loop; i++)
+		if (load_release(ctx))
+			break;
+	return NULL;
+}
+
+static void *thread_run_prog(void *arg)
+{
+	int fd;
+	int len;
+	char buf[8];
+	struct test_context *ctx = (struct test_context *)arg;
+
+	fd = open("/sys/module/bpf_testmod/parameters/run_struct_ops", O_WRONLY);
+	if (!ASSERT_OK_FD(fd, "open run_struct_ops for write")) {
+		mark_dead(ctx);
+		return NULL;
+	}
+
+	len = snprintf(buf, sizeof(buf), "%d", ctx->repeat);
+	if (!ASSERT_GT(len, 0, "snprintf repeat number")) {
+		mark_dead(ctx);
+		goto out;
+	}
+
+	for (int i = 0; i < ctx->loop; i++) {
+		if (wait_others(ctx))
+			goto out;
+		if (!ASSERT_EQ(write(fd, buf, len), len, "write file")) {
+			mark_dead(ctx);
+			goto out;
+		}
+	}
+
+out:
+	close(fd);
+	return NULL;
+}
+
+#define NR_REL_THREAD 2
+#define NR_RUN_THREAD 8
+#define NR_THREAD (NR_REL_THREAD + NR_RUN_THREAD)
+#define NR_REPEAT 4
+#define NR_LOOP 5
+
+static void test_struct_ops_map_release(void)
+{
+	int i, j;
+	pthread_t t[NR_THREAD];
+	struct test_context ctx = {
+		.loop = NR_LOOP,
+		.repeat = NR_REPEAT,
+		.total_threads = NR_THREAD,
+		.wait_threads = 0,
+		.dead_threads = 0,
+	};
+
+	pthread_mutex_init(&ctx.mutex, NULL);
+	pthread_cond_init(&ctx.cond, NULL);
+
+	j = 0;
+	for (i = 0; i < NR_REL_THREAD; i++)
+		pthread_create(&t[j++], NULL, thread_load_release, &ctx);
+
+	for (i = 0; i < NR_RUN_THREAD; i++)
+		pthread_create(&t[j++], NULL, thread_run_prog, &ctx);
+
+	for (i = 0; i < NR_THREAD; i++)
+		pthread_join(t[i], NULL);
+}
+
 /* Detach a link from a user space program */
 static void test_detach_link(void)
 {
@@ -310,6 +462,8 @@ void serial_test_struct_ops_module(void)
 		test_struct_ops_nulled_out_cb();
 	if (test__start_subtest("struct_ops_forgotten_cb"))
 		test_struct_ops_forgotten_cb();
+	if (test__start_subtest("struct_ops_map_release"))
+		test_struct_ops_map_release();
 	if (test__start_subtest("test_detach_link"))
 		test_detach_link();
 	RUN_TESTS(unsupported_ops);
diff --git a/tools/testing/selftests/bpf/progs/struct_ops_map_release.c b/tools/testing/selftests/bpf/progs/struct_ops_map_release.c
new file mode 100644
index 000000000000..78aa5e1875b6
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/struct_ops_map_release.c
@@ -0,0 +1,30 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (C) 2024 Huawei Technologies Co., Ltd */
+
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+
+#include "../bpf_testmod/bpf_testmod.h"
+#include "../bpf_testmod/bpf_testmod_kfunc.h"
+
+char _license[] SEC("license") = "GPL";
+
+SEC("struct_ops.s/test_1")
+int BPF_PROG(test_1_prog)
+{
+	bpf_kfunc_msleep(100);
+	return 0;
+}
+
+SEC("struct_ops/test_2")
+int BPF_PROG(test_2_prog, int a, int b)
+{
+	return 0;
+}
+
+SEC(".struct_ops")
+struct bpf_testmod_ops testmod_ops = {
+	.test_1 = (void *)test_1_prog,
+	.test_2 = (void *)test_2_prog
+};
-- 
2.39.5
* Re: [PATCH bpf-next 2/2] selftests/bpf: Add test for struct_ops map release
  2024-11-08 19:39 ` Martin KaFai Lau
  2024-11-09  8:40   ` Xu Kuohai

From: Martin KaFai Lau @ 2024-11-08 19:39 UTC
To: Xu Kuohai
Cc: bpf, netdev, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
    Eduard Zingerman, Yonghong Song, Kui-Feng Lee

On 11/8/24 12:26 AM, Xu Kuohai wrote:
> -static void bpf_testmod_test_2(int a, int b)
> +static void bpf_dummy_unreg(void *kdata, struct bpf_link *link)
>  {
> +	WRITE_ONCE(__bpf_dummy_ops, &__bpf_testmod_ops);
>  }

[ ... ]

> +static int run_struct_ops(const char *val, const struct kernel_param *kp)
> +{
> +	int ret;
> +	unsigned int repeat;
> +	struct bpf_testmod_ops *ops;
> +
> +	ret = kstrtouint(val, 10, &repeat);
> +	if (ret)
> +		return ret;
> +
> +	if (repeat > 10000)
> +		return -ERANGE;
> +
> +	while (repeat-- > 0) {
> +		ops = READ_ONCE(__bpf_dummy_ops);

I don't think it is the usual bpf_struct_ops implementation which only uses
READ_ONCE and WRITE_ONCE to protect the registered ops. tcp-cc uses a
refcnt+rcu. It seems hid uses synchronize_srcu(). sched_ext seems to also use
kthread_flush_work() to wait for all ops calling finished. Meaning I don't
think the current bpf_struct_ops unreg implementation will run into this
issue for sleepable ops.

The current synchronize_rcu_mult(call_rcu, call_rcu_tasks) is only needed for
the tcp-cc because a tcp-cc's ops (which uses refcnt+rcu) can decrement its
own refcnt. Looking back, this was a mistake (mine). A new tcp-cc ops should
have been introduced instead to return a new tcp-cc-ops to be used.

> +		if (ops->test_1)
> +			ops->test_1();
> +		if (ops->test_2)
> +			ops->test_2(0, 0);
> +	}
> +
> +	return 0;
> +}
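
The refcnt+rcu protection Martin refers to can be pictured with a small
sketch. The code below is a hypothetical subsystem-side pattern with invented
names (my_ops, registered_ops, my_subsys_unreg); it is not taken from tcp-cc,
HID, or sched_ext, but it shows the shape of the protection he describes:
readers look up the registered ops under RCU and pin it with a refcount
before calling into it, so unregistration cannot free the ops out from under
a running callback.

/* Hypothetical subsystem-side protection in the refcnt+rcu style mentioned
 * above. All names are invented for illustration.
 */
struct my_ops {
	int (*do_work)(void);	/* backed by a (possibly sleepable) BPF prog */
	refcount_t refcnt;
};

static struct my_ops __rcu *registered_ops;

static void my_ops_release(struct my_ops *ops)
{
	/* Last user is gone: drop whatever reference was taken at reg time
	 * (e.g. via bpf_try_module_get()) so the map can really go away.
	 */
}

static int call_registered_op(void)
{
	struct my_ops *ops;
	int ret;

	rcu_read_lock();
	ops = rcu_dereference(registered_ops);
	if (!ops || !refcount_inc_not_zero(&ops->refcnt)) {
		rcu_read_unlock();
		return -ENOENT;
	}
	rcu_read_unlock();

	/* The refcount keeps ops alive even if do_work() sleeps and the
	 * struct_ops map is unregistered concurrently.
	 */
	ret = ops->do_work();

	if (refcount_dec_and_test(&ops->refcnt))
		my_ops_release(ops);

	return ret;
}

static void my_subsys_unreg(struct my_ops *ops)
{
	RCU_INIT_POINTER(registered_ops, NULL);
	synchronize_rcu();	/* no new lookups can find ops after this */

	if (refcount_dec_and_test(&ops->refcnt))
		my_ops_release(ops);
}

In this pattern the unreg path itself guarantees that no callback is still
running once the last reference is dropped, which is why Martin argues the
generic map-free path does not need to wait for sleepable progs on behalf of
such subsystems.
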
* Re: [PATCH bpf-next 2/2] selftests/bpf: Add test for struct_ops map release
  2024-11-09  8:40 ` Xu Kuohai
  2024-11-11 21:30   ` Martin KaFai Lau

From: Xu Kuohai @ 2024-11-09  8:40 UTC
To: Martin KaFai Lau
Cc: bpf, netdev, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
    Eduard Zingerman, Yonghong Song, Kui-Feng Lee

On 11/9/2024 3:39 AM, Martin KaFai Lau wrote:
> On 11/8/24 12:26 AM, Xu Kuohai wrote:
>
> [ ... ]
>
> I don't think it is the usual bpf_struct_ops implementation which only uses
> READ_ONCE and WRITE_ONCE to protect the registered ops. tcp-cc uses a
> refcnt+rcu. It seems hid uses synchronize_srcu(). sched_ext seems to also
> use kthread_flush_work() to wait for all ops calling finished. Meaning I
> don't think the current bpf_struct_ops unreg implementation will run into
> this issue for sleepable ops.

Thanks for the explanation.

Are you saying that it's not the struct_ops framework's responsibility to
ensure the struct_ops map is not released while it may be still in use? And
the "bug" in this series should be "fixed" in the test, namely this patch?

> The current synchronize_rcu_mult(call_rcu, call_rcu_tasks) is only needed
> for the tcp-cc because a tcp-cc's ops (which uses refcnt+rcu) can decrement
> its own refcnt. Looking back, this was a mistake (mine). A new tcp-cc ops
> should have been introduced instead to return a new tcp-cc-ops to be used.

Not quite clear, but from the description, it seems that the
synchronize_rcu_mult(call_rcu, call_rcu_tasks) could be just removed in some
way, no need to do a cleanup to switch it to call_rcu.
* Re: [PATCH bpf-next 2/2] selftests/bpf: Add test for struct_ops map release
  2024-11-11 21:30 ` Martin KaFai Lau
  2024-11-12 12:22   ` Xu Kuohai

From: Martin KaFai Lau @ 2024-11-11 21:30 UTC
To: Xu Kuohai
Cc: bpf, netdev, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
    Eduard Zingerman, Yonghong Song, Kui-Feng Lee

On 11/9/24 12:40 AM, Xu Kuohai wrote:
> On 11/9/2024 3:39 AM, Martin KaFai Lau wrote:
>
> [ ... ]
>
> Thanks for the explanation.
>
> Are you saying that it's not the struct_ops framework's responsibility to
> ensure the struct_ops map is not released while it may be still in use? And
> the "bug" in this series should be "fixed" in the test, namely this patch?

Yeah, it is what I was trying to say. I don't think there is anything to fix.
Think about extending a subsystem by a kernel module. The subsystem will also
do the needed protection itself during the unreg process. There is already a
bpf_try_module_get() to help the subsystem.

>> The current synchronize_rcu_mult(call_rcu, call_rcu_tasks) is only needed
>> for the tcp-cc because a tcp-cc's ops (which uses refcnt+rcu) can decrement
>> its own refcnt. Looking back, this was a mistake (mine). A new tcp-cc ops
>> should have been introduced instead to return a new tcp-cc-ops to be used.
>
> Not quite clear, but from the description, it seems that the
> synchronize_rcu_mult(call_rcu, call_rcu_tasks) could

This synchronize_rcu_mult is only needed for tcp_congestion_ops
(bpf_tcp_ca.c). Maybe it is cleaner to just make a special case for
"tcp_congestion_ops" in st_ops->name in map_alloc and only set
free_after_mult_rcu_gp to TRUE for this one case; then it won't slow down the
freeing of other struct_ops maps either.

imo, the test in this patch is also not needed in its current form, since it
is not how a kernel subsystem implements unreg in struct_ops.

> be just removed in some way, no need to do a cleanup to
> switch it to call_rcu.
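
A concrete reading of the suggestion above would restrict the flag to the one
struct_ops type that actually needs it. The snippet below is an untested
illustration only, not a proposed patch; the st_ops_desc/st_ops/name field
names are assumed from the patch context earlier in the thread.

	/* Sketch of the suggested special case in bpf_struct_ops_map_alloc():
	 * only tcp_congestion_ops (bpf_tcp_ca.c) needs the multi-GP free, so
	 * set free_after_mult_rcu_gp for that st_ops only, instead of for
	 * every struct_ops map.
	 */
	if (!strcmp(st_map->st_ops_desc->st_ops->name, "tcp_congestion_ops"))
		WRITE_ONCE(map->free_after_mult_rcu_gp, true);

This keeps the extra grace periods off the release path of every other
struct_ops map, which is the "won't slow down other struct_ops map freeing"
part of the suggestion.
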
* Re: [PATCH bpf-next 2/2] selftests/bpf: Add test for struct_ops map release
  2024-11-12 12:22 ` Xu Kuohai

From: Xu Kuohai @ 2024-11-12 12:22 UTC
To: Martin KaFai Lau
Cc: bpf, netdev, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
    Eduard Zingerman, Yonghong Song, Kui-Feng Lee

On 11/12/2024 5:30 AM, Martin KaFai Lau wrote:
> On 11/9/24 12:40 AM, Xu Kuohai wrote:
>
> [ ... ]
>
> Yeah, it is what I was trying to say. I don't think there is anything to
> fix. Think about extending a subsystem by a kernel module. The subsystem
> will also do the needed protection itself during the unreg process. There
> is already a bpf_try_module_get() to help the subsystem.

Got it.

> This synchronize_rcu_mult is only needed for tcp_congestion_ops
> (bpf_tcp_ca.c). Maybe it is cleaner to just make a special case for
> "tcp_congestion_ops" in st_ops->name in map_alloc and only set
> free_after_mult_rcu_gp to TRUE for this one case; then it won't slow down
> the freeing of other struct_ops maps either.

OK, will give it a try, thanks.

> imo, the test in this patch is also not needed in its current form, since
> it is not how a kernel subsystem implements unreg in struct_ops.